Scheduling decisions took too long, and encoding jobs were therefore processed more slowly, due to an inefficient database query and slow inserts in the encoding cloud scheduler.
The issue occurred between November 1, 16:00, and November 2, 15:30. All times are UTC.
The encoding cloud scheduler is responsible for managing the encoding queue and starting encoding tasks on individual instances. It does this for both managed and cloud connect encoding jobs across all regions and cloud providers. All queued and running encoding jobs are stored in a database and processed from there.
The encoding cloud scheduler implements a fair scheduling algorithm that prevents any single customer from being starved of progress. To this end, the scheduler needs to query the database for finished and long-running encoding jobs.
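For illustration only, here is a minimal sketch of what such a fair pick could look like over an in-memory view of the queue; the function, the field names (`customer_id`, `created_at`), and the tie-breaking rule are assumptions, not Bitmovin's actual implementation. The real scheduler derives this information from database queries, which is where the slowdown described below originated.

```python
from collections import defaultdict

def pick_next_job(queued_jobs, running_jobs):
    """Pick the next queued job from the customer with the fewest running
    jobs, so no single customer can starve the others. Field names are
    illustrative, not the scheduler's actual schema."""
    running_per_customer = defaultdict(int)
    for job in running_jobs:
        running_per_customer[job["customer_id"]] += 1

    # Prefer the customer currently using the least capacity; break ties
    # by queuing time (oldest job first).
    return min(
        queued_jobs,
        key=lambda job: (running_per_customer[job["customer_id"]], job["created_at"]),
        default=None,
    )
```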
Due to increased database load, this query, and with it every scheduling decision, slowed down significantly. Additionally, the query was executed repeatedly for each scaling decision, which further increased the load on the database and slowed down overall scheduling decisions.
Monitoring for database load is in place, but it did not trigger because the threshold is evaluated per individual query, and no single query exceeded it. Additional alerts for database load did not trigger either, as the overall CPU load also remained slightly below its threshold.
The query responsible for picking the next encoding task could not be executed within the expected time. As a result, queuing each individual encoding job took longer, and fewer encoding jobs were running in parallel than expected.
The Bitmovin platform engineering team identified a high load on the database and traced it to the aggregation query that caused the increase. The query was terminated and prevented from being executed again.
Besides that, several other optimizations were put in place to further improve scaling. The timeline of the incident was as follows:
Nov 1st 16:00 - Encodings started to queue up and fewer encodings ran in parallel than expected
Nov 1st 19:30 - Bitmovin’s platform engineering team was notified
Nov 2nd 09:30 - A first fix was deployed that sped up scheduling decisions by avoiding the repeated long-running query (see the sketch after this timeline). Queuing times went back to normal in most cases.
Nov 2nd 15:30 - A second fix was deployed that sped up insert operations in the database during high load. All queuing times went back to normal.
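The first fix roughly corresponds to not re-running the expensive aggregation for every single decision. Below is a minimal sketch of that idea, caching the aggregation result for the duration of one scheduling cycle; the class, the TTL value, and the structure are assumptions, not the actual fix.

```python
import time

class SchedulingCycleCache:
    """Cache the result of an expensive aggregation query for the duration
    of one scheduling cycle instead of re-running it for every individual
    decision. Purely illustrative."""

    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get(self, run_aggregation_query):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self.ttl:
            # Run the expensive query once; reuse the result for all
            # decisions made within the TTL window.
            self._value = run_aggregation_query()
            self._fetched_at = now
        return self._value
```

A short TTL keeps decisions based on reasonably fresh data while collapsing many executions of the same query into one per cycle.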
Although alerts for high database load are in place, these specific circumstances did not trigger them, as both CPU load and individual query times were within their thresholds. The Bitmovin platform engineering team will investigate better means of identifying and alerting on such scenarios, both at the database level and at the application level.
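As an illustration of the gap, an alert that is evaluated per query can stay silent while the total time spent on a single query pattern keeps growing. Here is a hedged sketch of an aggregate check; the function name and threshold value are hypothetical.

```python
def aggregate_query_time_exceeded(query_durations_ms, window_threshold_ms=60_000):
    """Check whether the total time spent on one query pattern within a
    monitoring window exceeds a threshold, regardless of how fast each
    individual execution was. Threshold value is illustrative only."""
    # During the incident, every single execution stayed within its
    # per-query limit; only the sum across many repeated executions
    # revealed the load on the database.
    return sum(query_durations_ms) > window_threshold_ms
```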
Investigating certain bottlenecks in the system was not straightforward. The Bitmovin platform engineering team will add observability metrics for time-critical components to identify such bottlenecks proactively in the future. Additionally, alerts will be added so that countermeasures can be taken early and such incidents avoided.
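A minimal sketch of the kind of metric that could surface such bottlenecks: timing each time-critical step and warning when it runs long. The helper and the threshold below are hypothetical, not the metrics the team will actually add.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("scheduler.metrics")

@contextmanager
def timed(step_name, warn_after_ms=1000):
    """Measure the duration of a time-critical step and log a warning when
    it exceeds an (illustrative) threshold, so slow scheduling decisions
    surface before encodings pile up in the queue."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        logger.info("%s took %.1f ms", step_name, elapsed_ms)
        if elapsed_ms > warn_after_ms:
            logger.warning("%s exceeded %d ms", step_name, warn_after_ms)

# Hypothetical usage:
# with timed("pick_next_encoding_job"):
#     job = pick_next_job(queued_jobs, running_jobs)
```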