Due to high load on the database that handles instance management, encodings could not be started and remained in the queued state.
The issue occurred on July 29, 2022 between 16:40 and 18:25. All times are UTC.
The services responsible for starting and managing instances write their tasks to a database. All instance information is stored there so that instances can be cleaned up after use. Although instances are shut down as soon as they are no longer required, their database entries persist for some time and are automatically cleaned up by a curator service.
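To illustrate the curator's role, the following is a minimal sketch of a retention-based cleanup. All names, the retention period, and the in-memory list are hypothetical; the actual implementation is not described in this report and operates against the database itself.

```python
import time

RETENTION_SECONDS = 3600  # hypothetical: keep entries one hour after shutdown

def cleanup_expired(instances, now=None):
    """Drop instance records whose retention window has elapsed.

    Each record is assumed to carry a 'shut_down_at' Unix timestamp.
    Returns the surviving records and the number removed.
    """
    now = now if now is not None else time.time()
    kept = [i for i in instances if now - i["shut_down_at"] < RETENTION_SECONDS]
    return kept, len(instances) - len(kept)
```

In the real system this step is a periodic database job; the point here is only that entries outlive their instances and are reaped later, which is why a stalled cleanup can accumulate work.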
Due to a data inconsistency caused by a previously unseen edge case, the cleanup task stalled and crashed multiple times. As a result, the curator began running a large number of queries in parallel against the database. A mechanism is in place to kill long-running queries in order to reduce database load, and this usually keeps the load down. However, because the curator has retry logic, the killed queries were issued again and again without ever finishing.
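The failure mode here is unbounded retries: every killed query was simply reissued, keeping the database saturated. A common mitigation is to cap the number of attempts and back off exponentially between them. This is a hedged sketch of that pattern, not Bitmovin's actual code; `query_fn` and all parameters are illustrative.

```python
import time

def run_with_retry(query_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run query_fn, retrying on failure with exponential backoff.

    Gives up after max_attempts and surfaces the error, rather than
    retrying forever and keeping the database under sustained load.
    """
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # last attempt: propagate instead of looping forever
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

With a bounded attempt count, a query that is repeatedly killed eventually fails loudly and can trigger an alert, instead of silently converting the kill mechanism into a retry storm.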
This kept the database under high load, and no other queries could be executed successfully during this time. Instance acquisition was therefore considerably slowed, which left many encodings stuck.
The system is monitored by multiple internal and external systems. However, the system responsible for reporting issues with queued encodings in real time had crashed, so only the second-level monitoring system, which is intentionally delayed, began to report at 18:00.
The services responsible for acquiring and managing instances could not access the database within a reasonable time. This caused transactions to fail, and new instances could not be assigned to encoding tasks.
By the time Bitmovin’s engineering team started an investigation, the database load had already reduced and the system had recovered by itself.
16:40 - Database load increased and new encodings were stuck in queued state
18:00 - The second-level internal monitoring system triggered an alarm that too many encodings were in the queued state
18:20 - Database load reduced on its own, allowing new instances to be acquired and encodings started again
18:25 - System returned back to a normal state and encodings proceeded as expected
Although there is internal monitoring in place and the Bitmovin engineering team received alerts for queued encodings, there was no alerting in place for that specific database load scenario. Bitmovin’s engineering team will set up further monitoring to cover this new use case in addition to the existing ones.
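As an illustration of the kind of additional check described above, a threshold-based alert could combine the queued-encoding count with the database load metric. The thresholds, metric names, and function below are purely hypothetical; the actual monitoring configuration is internal.

```python
QUEUED_THRESHOLD = 50   # hypothetical: max acceptable queued encodings
LOAD_THRESHOLD = 0.8    # hypothetical: max acceptable database load fraction

def check_alerts(queued_count, db_load):
    """Return alert messages for metrics that exceed their thresholds."""
    alerts = []
    if queued_count > QUEUED_THRESHOLD:
        alerts.append(
            f"{queued_count} encodings queued (limit {QUEUED_THRESHOLD})"
        )
    if db_load > LOAD_THRESHOLD:
        alerts.append(
            f"database load at {db_load:.0%} (limit {LOAD_THRESHOLD:.0%})"
        )
    return alerts
```

Alerting on database load directly, in addition to queued encodings, would have flagged this incident even while the first-level encoding monitor was down.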
Bitmovin’s engineering team will also investigate how to improve the stability of the first-level monitoring system so that such crashes do not recur, or so that alerts are surfaced sooner if they do.
Furthermore, the Bitmovin engineering team has already implemented a fix to handle the new data inconsistency edge case correctly and will investigate what can be done to improve the retry logic of the curator service.