Elevated Error Rates in Licensing Calls
Incident Report for Bitmovin Inc
Postmortem

Summary

Bitmovin’s player license service experienced an extended period of high load on its database. This created a backlog of jobs that could not be executed properly and blocked internal access to the license database. Because licenses could not be found, customers received playback-denied responses.

Date

The incident occurred on 9 January 2023 between 00:20 UTC and 10:47 UTC.

Root Cause

The Bitmovin player database started lagging under load from a batch job that calculates customer impression usage, and request retries increased the load further. The resulting database read timeouts led to rejected frontend license requests, because those requests also require database reads. This situation was unforeseen, so the automatic load mitigations did not work as expected.

Implications

The Bitmovin player license database was unresponsive and unable to handle all incoming requests, which led to ~10% of all license requests being falsely rejected between 2023-01-09 02:40 UTC and 2023-01-09 10:40 UTC.

While the batch job that calculates customer impression usage was disabled, billing statistics were not updated. All missing statistics were backfilled after the service was restored.

Remediation

Bitmovin’s engineering team isolated the service that was causing high database load and disabled it to quickly stabilize the Bitmovin player license service. A hotfix to the licensing service was deployed on 2023-01-11; it returns “granted” license responses when database issues are encountered.

Timeline

2023-01-09 00:20 UTC - load on the database started to increase

2023-01-09 02:40 UTC - first failing requests

2023-01-09 03:50 - 05:00 UTC - system calmed down again

2023-01-09 05:00 UTC - investigation started

2023-01-09 10:40 UTC - root cause identified and the failing service was stopped

2023-01-09 10:46 UTC - problem was solved and service fully restored

2023-01-11 10:00 UTC - safer database error handling was deployed to the licensing system

Prevention

The Bitmovin license service will now return "granted" responses to clients when internal database read errors occur. This reduces playback's dependency on the database and therefore limits customer impact.
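
The fail-open behavior described above can be illustrated with a minimal sketch. This is not Bitmovin's implementation: the names `check_license`, `LicenseDecision`, `DatabaseReadError` and the `lookup` callable are hypothetical and exist only for illustration. The sketch grants the license when the database read fails, and still denies it when the read succeeds but no license is found.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class DatabaseReadError(Exception):
    """Hypothetical error raised when the license store cannot be read."""


@dataclass
class LicenseDecision:
    granted: bool
    reason: str


def check_license(
    license_key: str,
    lookup: Callable[[str], Optional[dict]],
) -> LicenseDecision:
    """Decide a license request, failing open on database read errors.

    `lookup` is an assumed function that returns the license record,
    returns None if no license exists, or raises on a read failure.
    """
    try:
        record = lookup(license_key)
    except (DatabaseReadError, TimeoutError):
        # The database could not answer: grant instead of blocking playback.
        return LicenseDecision(True, "fail-open: database unavailable")

    if record is None:
        # The read succeeded and there really is no license: deny.
        return LicenseDecision(False, "license not found")
    return LicenseDecision(True, "license found")
```

The trade-off in this kind of design is that some unlicensed requests may be granted during a database outage, in exchange for not rejecting valid ones.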

Additionally, internal logic was changed to decrease database load during batch job processing and to perform fewer retries after timeouts. Alerting thresholds were lowered so that potential issues are detected sooner. The service relying on batch job processing will soon be retired.
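
As an illustration of limiting retries after timeouts, the following sketch caps the number of attempts and backs off exponentially between them, so that a slow database is not hit with an unbounded stream of repeated reads. The function name, the default of two attempts, and the delay values are assumptions made for this example and are not taken from the Bitmovin services.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_bounded_retries(
    read: Callable[[], T],
    max_attempts: int = 2,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `read`, retrying on TimeoutError at most `max_attempts` times in total.

    The delay doubles after each failed attempt (exponential backoff).
    All parameter names and defaults are illustrative assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return read()
        except TimeoutError:
            if attempt == max_attempts:
                # Retry budget exhausted: surface the error to the caller.
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))

    raise AssertionError("unreachable")
```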

Posted Jan 13, 2023 - 08:02 UTC

Resolved
The incident has been resolved.
We have confirmed that the measures we took to stop the database errors have worked and the service is fully restored.
Our team is currently performing a thorough root cause analysis and we will post the post-mortem once done.
Posted Jan 09, 2023 - 13:03 UTC
Update
We are continuing to monitor for any further issues.
The search for the root cause is currently underway and we will post a full post-mortem in due time.
Posted Jan 09, 2023 - 12:00 UTC
Monitoring
We have been experiencing an elevated level of database errors in our player licensing back-end. The errors started at 2023/01/09 00:00 UTC.
This caused our licensing API to sometimes return errors when a licensing call was made, which prevented the player from playing back content in clients across all platforms.

At 2023/01/09 10:47 UTC we restored service, and we are still monitoring the situation.
Posted Jan 09, 2023 - 11:29 UTC
This incident affected: Player Licensing.