Encodings were processed more slowly than usual because not all available encoding slots were used. The cause was an inefficient database query in the scheduler combined with high load on the database.
The issue occurred on September 14, 2022, between 14:30 and 18:00. All times are UTC.
The cloud scheduler is responsible for managing the encoding queue and starting encoding tasks on the individual instances. It does that for both managed and cloud connect encodings. All queued and running encoding tasks are stored in a database and processed.
To pick the next encoding task, the cloud scheduler queries the database. During the incident, the database was under higher load than usual because it was running long-running aggregation queries in the background. Additionally, the query that fetches the next encoding task was not optimized. As a result, this query took much longer than expected, and queued encodings were picked up very slowly.
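As an illustration of what an optimized "pick the next task" lookup can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, status values, and index are assumptions made for this example only; the actual schema, database engine, and query used by the cloud scheduler are not described in this report.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical task table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: names are illustrative, not the real scheduler's.
    conn.execute(
        "CREATE TABLE encoding_tasks ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " queued_at INTEGER NOT NULL)"
    )
    # Composite index matching the filter (status) and the sort (queued_at),
    # so the lookup below does not need to scan and sort the whole table.
    conn.execute(
        "CREATE INDEX idx_tasks_status_queued_at"
        " ON encoding_tasks (status, queued_at)"
    )
    return conn


def pick_next_task(conn: sqlite3.Connection):
    """Return the id of the oldest queued task, or None if nothing is queued."""
    row = conn.execute(
        "SELECT id FROM encoding_tasks"
        " WHERE status = 'QUEUED'"
        " ORDER BY queued_at"
        " LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

The design idea is that an index covering both the filter and the ordering keeps the lookup cheap as the table grows. An index does not remove contention from other heavy queries running on the same database, so it addresses only one of the two contributing factors described above.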
There is monitoring for the database load in place, but it didn’t trigger as the threshold was set too high.
The query responsible for picking the next encoding task could not be executed within the expected time and therefore fewer encodings were running in parallel than expected.
As soon as Bitmovin’s engineering team was notified, it investigated the high database load. Once the team identified the aggregation query as the cause of the load, the query was terminated and prevented from running again.
14:30 - Encodings started to queue up and fewer encodings ran in parallel than expected
17:00 - Bitmovin’s engineering team was notified
17:51 - System returned to normal state
Although alerts for high database load are in place, the threshold was too high to catch this event. Additionally, there are alerts in place for encodings stuck in queued state. However, as some encodings were still processed, those alerts did not fire.
Bitmovin’s engineering team will adjust the alerts to fire earlier so that they are triggered in this scenario in the future.
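One way to catch a "slow but not stuck" queue is to alert on slot utilization and queue wait time rather than only on encodings that make no progress at all. The sketch below is a hypothetical check, not the alert Bitmovin implemented: the `QueueSnapshot` fields, the `should_alert` function, and both default thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time view of the encoding queue."""
    queued: int                 # encodings waiting for a slot
    running: int                # encodings currently using a slot
    total_slots: int            # slots available to run encodings
    oldest_queued_age_s: float  # wait time of the oldest queued encoding


def should_alert(
    snapshot: QueueSnapshot,
    min_utilization: float = 0.8,  # assumed threshold, tune per system
    max_wait_s: float = 600.0,     # assumed threshold, tune per system
) -> bool:
    """Alert when work is waiting but slots are idle or waits are long."""
    if snapshot.queued == 0:
        # Nothing is waiting, so idle slots are not a problem.
        return False
    if snapshot.total_slots > 0:
        utilization = snapshot.running / snapshot.total_slots
    else:
        utilization = 0.0
    return utilization < min_utilization or snapshot.oldest_queued_age_s > max_wait_s
```

With these assumed thresholds, a snapshot with 50 queued encodings and only 3 of 10 slots in use would raise an alert even though some encodings are still being processed, which is the situation the existing stuck-in-queue alerts missed.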
Furthermore, Bitmovin’s engineering team will optimize queries for data aggregation and for picking encoding tasks to reduce load on the database.