Bitmovin’s player license service experienced an extended period of high load on its database. This created a backlog of jobs that could not be executed and blocked internal access to the license database, so licenses could not be found and customers received playback-denied responses.
The incident occurred on 2022-01-10 between 00:20 UTC and 10:46 UTC.
The Bitmovin player database started lagging under load from a batch job that calculates customer impression usage, and request retries increased the load further. The resulting database read timeouts led to rejected frontend license requests, because those requests also require database reads. This situation was unforeseen, so the automatic load mitigations did not work as expected.
The Bitmovin player license database was unresponsive and unable to handle all incoming requests, which led to ~10% of all license requests being wrongly rejected between 2022-01-10 02:40 UTC and 2022-01-10 10:40 UTC.
While the batch job that calculates customer impression usage was disabled, billing statistics were not updated. All missing statistics were backfilled after the service was restored.
Bitmovin’s engineering team isolated the service that was causing high database load and disabled it to quickly stabilize the Bitmovin player license service. A hotfix to the licensing service was deployed on 2022-01-11; the service now returns “granted” license responses if database issues are encountered.
2022-01-10 00:20 UTC - load on database started to increase
2022-01-10 02:40 UTC - first failing requests
2022-01-10 03:50 - 05:00 UTC - system has calmed down again
2022-01-10 05:00 UTC - investigation
2022-01-10 10:40 UTC - root cause identified and the failing service was stopped
2022-01-10 10:46 UTC - problem was solved and service fully restored
2022-01-11 10:00 UTC - safer database error handling was deployed to the licensing system
The Bitmovin license service will now return "granted" responses to clients in case of internal database read errors to reduce external dependencies for playback, and thus customer impact.
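The fail-open behavior described above can be illustrated with a minimal sketch. This is not Bitmovin's implementation: `check_license`, `lookup`, and `DatabaseReadError` are hypothetical names, and the sketch assumes the lookup either reports whether a license is valid or raises when the database cannot be read.

```python
from typing import Callable


class DatabaseReadError(Exception):
    """Hypothetical error raised when the license database cannot be read."""


def check_license(key: str, lookup: Callable[[str], bool]) -> str:
    """Return "granted" or "denied" for a license key.

    `lookup` is an assumed callable that returns True for a valid license,
    False for an invalid one, and raises DatabaseReadError on read failures.
    """
    try:
        is_valid = lookup(key)
    except DatabaseReadError:
        # Fail open: an internal read error must not block playback.
        return "granted"
    return "granted" if is_valid else "denied"
```

The trade-off in such a design is that invalid licenses are also granted while the database is unreadable, which is accepted here in exchange for uninterrupted playback.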
Additionally, internal logic was changed to decrease database load during batch job processing and to perform fewer retries on timeouts, and alerting thresholds were lowered to surface potential issues sooner. The service relying on batch job processing will soon be retired.
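One common way to keep retries from amplifying load on a struggling database is to cap the number of attempts and wait between them. The sketch below shows that general technique under stated assumptions; the function name, the default of two attempts, and the exponential backoff with jitter are illustrative choices, not details taken from the actual fix.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_bounded_retries(
    operation: Callable[[], T],
    max_attempts: int = 2,        # assumed cap; the real value is not stated
    base_delay_s: float = 0.1,    # assumed base delay for the backoff
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TimeoutError at most `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                # Give up instead of adding more load to the database.
                raise
            # Exponential backoff with full jitter spreads retries out in time.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays.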