Scheduling decisions took too long, and encoding jobs were therefore processed more slowly, due to an inefficient database query and slow inserts in the encoding cloud scheduler.
The issue occurred between November 1, 16:00, and November 2, 15:30. All times are UTC.
The encoding cloud scheduler is responsible for managing the encoding queue and starting encoding tasks on individual instances. It does this for both managed and cloud connect encoding jobs across all regions and cloud providers. All queued and running encoding jobs are stored in a database and processed from there.
The encoding cloud scheduler implements a fair scheduling algorithm that prevents any single customer from being starved of progress. To this end, the scheduler needs to query the database for finished and long-running encoding jobs.
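For illustration only, here is a minimal sketch of what such a fair pick could look like over an in-memory view of the queue; the function, the field names (`customer_id`, `created_at`), and the tie-breaking rule are assumptions, not Bitmovin's actual implementation. The real scheduler derives this information from database queries, which is where the slowdown described below originated.

```python
from collections import defaultdict

def pick_next_job(queued_jobs, running_jobs):
    """Pick the next queued job from the customer with the fewest running
    jobs, so no single customer can starve the others. Field names are
    illustrative, not the scheduler's actual schema."""
    running_per_customer = defaultdict(int)
    for job in running_jobs:
        running_per_customer[job["customer_id"]] += 1

    # Prefer the customer currently using the least capacity; break ties
    # by queuing time (oldest job first).
    return min(
        queued_jobs,
        key=lambda job: (running_per_customer[job["customer_id"]], job["created_at"]),
        default=None,
    )
```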
Due to increased database load, this query, and with it every scheduling decision, slowed down significantly. Additionally, the query was executed repeatedly for each scaling decision, which further increased the load on the database and slowed down overall scheduling decisions.
Monitoring for database load is in place, but it did not trigger because the threshold is evaluated per individual query, and no single query exceeded it. Additional alerts for database load did not trigger either, as the overall CPU load also remained slightly below its threshold.
The query responsible for picking the next encoding task could not be executed within the expected time. As a result, queuing each individual encoding job took longer, and fewer encoding jobs were running in parallel than expected.
The Bitmovin platform engineering team identified a high load on the database and traced it to the aggregation query that caused the increase. The query was terminated and prevented from being executed again.
Besides that, several other optimizations were put in place to further improve scaling. The timeline of the incident was as follows:
Nov 1st 16:00 - Encodings started to queue up and fewer encodings ran in parallel than expected
Nov 1st 19:30 - Bitmovin’s platform engineering team was notified
Nov 2nd 09:30 - A first fix was deployed that sped up scheduling decisions by avoiding the repeated long-running query (see the sketch after this timeline). Queuing times went back to normal in most cases.
Nov 2nd 15:30 - A second fix was deployed that sped up insert operations in the database during high load. All queuing times went back to normal.
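The first fix roughly corresponds to not re-running the expensive aggregation for every single decision. Below is a minimal sketch of that idea, caching the aggregation result for the duration of one scheduling cycle; the class, the TTL value, and the structure are assumptions, not the actual fix.

```python
import time

class SchedulingCycleCache:
    """Cache the result of an expensive aggregation query for the duration
    of one scheduling cycle instead of re-running it for every individual
    decision. Purely illustrative."""

    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self, run_aggregation_query):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self.ttl:
            # Run the expensive query once; reuse the result for all
            # decisions made within the TTL window.
            self._value = run_aggregation_query()
            self._fetched_at = now
        return self._value
```

A short TTL keeps decisions based on reasonably fresh data while collapsing many executions of the same query into one per cycle.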
Although alerts for high database load are in place, these specific circumstances did not trigger them, as both CPU load and individual query times were within their thresholds. The Bitmovin platform engineering team will investigate better means of identifying and alerting on such scenarios, both at the database level and at the application level.
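As an illustration of the gap, an alert that is evaluated per query can stay silent while the total time spent on a single query pattern keeps growing. Here is a hedged sketch of an aggregate check; the function name and threshold value are hypothetical.

```python
def aggregate_query_time_exceeded(query_durations_ms, window_threshold_ms=60_000):
    """Check whether the total time spent on one query pattern within a
    monitoring window exceeds a threshold, regardless of how fast each
    individual execution was. Threshold value is illustrative only."""
    # During the incident, every single execution stayed within its
    # per-query limit; only the sum across many repeated executions
    # revealed the load on the database.
    return sum(query_durations_ms) > window_threshold_ms
```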
Investigating certain bottlenecks in the system was not straightforward. The Bitmovin platform engineering team will add observability metrics for time-critical components to identify such bottlenecks proactively in the future. Additionally, alerts will be added so that countermeasures can be taken early and such incidents avoided.
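A minimal sketch of the kind of metric that could surface such bottlenecks: timing each time-critical step and warning when it runs long. The helper and the threshold below are hypothetical, not the metrics the team will actually add.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("scheduler.metrics")

@contextmanager
def timed(step_name, warn_after_ms=1000):
    """Measure the duration of a time-critical step and log a warning when
    it exceeds an (illustrative) threshold, so slow scheduling decisions
    surface before encodings pile up in the queue."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        logger.info("%s took %.1f ms", step_name, elapsed_ms)
        if elapsed_ms > warn_after_ms:
            logger.warning("%s exceeded %d ms", step_name, warn_after_ms)

# Hypothetical usage:
# with timed("pick_next_encoding_job"):
#     job = pick_next_job(queued_jobs, running_jobs)
```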