Elevated API Errors

Incident Report for Bitmovin Inc

Postmortem

Summary

A manual cleanup routine stalled while holding a lock on certain database tables that are required to manage encoding jobs. The related API endpoints returned HTTP 500 errors during that time. Customers depending on these endpoints (either directly via the API or indirectly via the dashboard) could not use them. After the cause was identified and fixed, the affected endpoints returned to normal operation.

Date

The issue occurred on December 1, 2023, between 07:40 and 08:15. All times are in UTC.

Root Cause

A routine manual cleanup procedure locked certain database tables and then stalled, so the locks could not be released. Services depending on these tables were impacted and unable to process API requests.

Implications

Customers were not able to start encodings. Some encoding jobs had longer than expected turnaround times. API requests targeting the encoding endpoint returned HTTP 500 errors.

Remediation

The faulty database operation was identified and terminated.

Timeline

07:40 - Internal alerts notified the team about failures.

07:50 - The team began investigating.

08:00 - The faulty component was identified. The team began investigating the involved operations.

08:15 - The faulty operation was identified and terminated. The affected service recovered.

08:20 - The team kept monitoring and verifying the proper operation of the service.

Prevention

The cleanup process has been updated so that it no longer uses the procedure that caused this incident.

The team will analyze this procedure in detail to understand why it caused a lock on the database and stalled. Measures to prevent this procedure from stalling will be taken.

Once the procedure has been updated and is safe to use again, the team will resume using it for the required maintenance tasks.

Posted Dec 05, 2023 - 16:15 UTC

Resolved

All services are working normally again. The incident is resolved. The team will follow up with an RCA at the beginning of next week.

Posted Dec 01, 2023 - 08:38 UTC

Monitoring

Error rates are back to normal and encoding jobs are processing normally again. The team is monitoring and further investigating the root cause.

Posted Dec 01, 2023 - 08:19 UTC

Investigating

We are currently investigating an elevated error rate on our API. Starting encoding jobs seems to be impacted. We will come back with more information as soon as we have it.

Posted Dec 01, 2023 - 08:04 UTC

This incident affected: Bitmovin API (Encoding Service).