Elevated API Errors
Incident Report for Bitmovin Inc
Postmortem

Summary

A manual cleanup routine stalled and caused a lock on certain database tables that are required to manage encoding jobs. The affected API endpoints returned HTTP 500 errors during that time. Customers relying on those endpoints (either directly via the API or indirectly via the dashboard) could not use them. After the cause was identified and fixed, the affected endpoints returned to normal operation.

Date

The issue occurred on December 1, 2023, between 07:40 and 08:15. All times are in UTC.

Root Cause

A routine manual cleanup procedure caused a lock on certain database tables and then stalled, so the lock could not be released. Services depending on this database resource were impacted and unable to process API requests.

Implications

Customers were not able to start encodings. Some encoding jobs had longer-than-expected turnaround times. API requests targeting the encoding endpoint returned HTTP 500 errors.

Remediation

The faulty database operation was identified and terminated.

Timeline

07:40 - Internal alerts notified the team about failures.

07:50 - The team began investigating.

08:00 - The faulty component was identified. The team began investigating the involved operations.

08:15 - The faulty operation was identified and terminated. The affected service recovered.

08:20 - The team kept monitoring and verifying the proper operation of the service.

Prevention

The cleanup process has been updated so that it no longer uses the procedure that caused this incident.

The team will analyze this procedure in detail to understand why it caused a lock on the database and stalled. Measures to prevent this procedure from stalling will be taken.

As soon as the procedure has been updated and is safe to use again, the team will resume using it for the required maintenance tasks.

Posted Dec 05, 2023 - 16:15 UTC

Resolved
All services are working normally again. The incident is resolved. The team will follow up with an RCA at the beginning of next week.
Posted Dec 01, 2023 - 08:38 UTC
Monitoring
Error rates are back to normal and encoding jobs are processing normally again. The team is monitoring and further investigating the root cause.
Posted Dec 01, 2023 - 08:19 UTC
Investigating
We are currently investigating an elevated error rate on our API. Starting encoding jobs seems to be impacted. We will come back with more information as soon as we have it.
Posted Dec 01, 2023 - 08:04 UTC
This incident affected: Bitmovin API (Encoding Service).