Encoding queuing times are longer than expected
Incident Report for Bitmovin Inc
Postmortem

Summary

Due to an inefficient database query in the scheduler combined with high load on the database, not all available encoding slots were used and encodings were processed more slowly than expected.

Date

The issue occurred on September 14, 2022, between 14:30 and 18:00. All times are UTC.

Root Cause

The cloud scheduler is responsible for managing the encoding queue and starting encoding tasks on the individual instances. It does this for both managed and Cloud Connect encodings. All queued and running encoding tasks are stored in a database and processed from there.

To pick the next encoding task, the cloud scheduler queries the database. During the incident window the database was under unusually high load, as it was running long-running aggregation queries in the background. Additionally, the query to pick the next encoding task was not optimized. As a result, this query took far longer than expected, and queued encodings were picked up very slowly.
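For illustration only, the following is a minimal sketch of how such a "pick the next queued task" query can degrade once the task table grows and the filtered columns are not indexed, and how an index addresses it. It uses SQLite and an assumed encoding_tasks schema; the actual scheduler, schema, and query are not described in this report.

    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE encoding_tasks (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL,        -- e.g. 'QUEUED', 'RUNNING', 'FINISHED'
            created_at INTEGER NOT NULL
        )
    """)
    # Populate with mostly finished tasks and a few queued ones.
    conn.executemany(
        "INSERT INTO encoding_tasks (status, created_at) VALUES (?, ?)",
        [("FINISHED" if i % 1000 else "QUEUED", i) for i in range(500_000)],
    )

    def pick_next_task():
        # Without a suitable index this requires a full table scan.
        return conn.execute(
            "SELECT id FROM encoding_tasks "
            "WHERE status = 'QUEUED' ORDER BY created_at LIMIT 1"
        ).fetchone()

    start = time.perf_counter()
    pick_next_task()
    print("without index: %.4f s" % (time.perf_counter() - start))

    # An index on (status, created_at) turns the scan into an index lookup,
    # so the query stays fast even when the database is otherwise busy.
    conn.execute(
        "CREATE INDEX idx_tasks_status_created ON encoding_tasks (status, created_at)"
    )
    start = time.perf_counter()
    pick_next_task()
    print("with index:    %.4f s" % (time.perf_counter() - start))

The same effect is amplified when the database is already busy with aggregation queries, which is the combination described above.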

Monitoring for database load is in place, but it did not trigger because the alert threshold was set too high.

Implications

The query responsible for picking the next encoding task could not be executed within the expected time. As a result, fewer encodings ran in parallel than expected and queued encodings experienced longer wait times.

Remediation

As soon as Bitmovin’s engineering team was notified, it investigated the high database load. Once the team identified the background aggregation query as the cause, the query was terminated and prevented from being executed again.

Timeline

14:30 - Encodings started to queue up and fewer encodings ran in parallel than expected

17:00 - Bitmovin’s engineering team was notified

17:51 - System returned to normal state

Prevention

Although alerts for high database load are in place, the threshold was too high to catch this event. Alerts also exist for encodings stuck in the queued state; however, because some encodings were still being processed, those alerts did not fire.

Bitmovin’s engineering team will adjust these alerts to fire earlier so that they are triggered for this scenario in the future.
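As a rough illustration of the kind of change involved (the metric names, thresholds, and values below are assumptions for the sketch, not Bitmovin's actual monitoring configuration), an earlier-firing check could combine a lower database-load threshold with an alert on queue wait time, so that it triggers even while some encodings are still being processed:

    from dataclasses import dataclass

    @dataclass
    class Thresholds:
        # Hypothetical values: lowered database-load threshold and a maximum
        # acceptable queue wait time.
        db_cpu_percent: float = 70.0
        max_queue_wait_seconds: float = 600.0

    def check_alerts(db_cpu_percent: float,
                     oldest_queued_wait_seconds: float,
                     t: Thresholds = Thresholds()) -> list[str]:
        alerts = []
        if db_cpu_percent > t.db_cpu_percent:
            alerts.append(f"database load {db_cpu_percent:.0f}% above "
                          f"{t.db_cpu_percent:.0f}% threshold")
        # Alerting on wait time catches slow pickup even when encodings are
        # not fully stuck in the queued state.
        if oldest_queued_wait_seconds > t.max_queue_wait_seconds:
            alerts.append(f"oldest queued encoding waiting "
                          f"{oldest_queued_wait_seconds:.0f}s")
        return alerts

    print(check_alerts(db_cpu_percent=85.0, oldest_queued_wait_seconds=1200.0))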

Furthermore, Bitmovin’s engineering team will optimize queries for data aggregation and for picking encoding tasks to reduce load on the database.

Posted Sep 19, 2022 - 15:56 UTC

Resolved
This incident has been resolved. We will provide a detailed report of this incident once we have completed our internal investigation.
Posted Sep 14, 2022 - 19:11 UTC
Monitoring
Encodings have been processing successfully across all regions again since 2022-09-14 17:51 UTC, and all systems are beginning to recover.
Posted Sep 14, 2022 - 18:12 UTC
Identified
Since 2022-09-14 16:00 UTC we have been observing issues with the encoding scheduler no longer picking up new encoding jobs properly. Encodings may experience high queue times.

We identified the underlying issue and are working on a fix.
Posted Sep 14, 2022 - 18:09 UTC
Update
We are continuing to investigate this issue.
Posted Sep 14, 2022 - 17:54 UTC
Investigating
We are currently experiencing long queuing times across all regions.
Posted Sep 14, 2022 - 17:41 UTC
This incident affected: Bitmovin API (Encoding Service).