A manual cleanup routine stalled and held locks on database tables that are required to manage encoding jobs. The affected API endpoints returned HTTP 500 errors during that time, so customers depending on those endpoints (either directly via the API or indirectly via the dashboard) could not use them. After the cause was identified and resolved, the affected endpoints returned to normal operation.
The issue occurred on December 1, 2023, between 07:40 and 08:15. All times in UTC.
A routine manual cleanup procedure acquired locks on certain database tables and then stalled, so the locks were never released. Services depending on these tables were impacted and unable to process API requests.
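The failure mode can be illustrated with a minimal sketch. This uses SQLite purely as a stand-in, since the report does not name the production database; the table and file names are likewise illustrative:

```python
import sqlite3

# Two connections to the same database file simulate the cleanup job
# and an API request handler competing for the same table.
cleanup = sqlite3.connect("jobs.db", isolation_level=None)  # manual transactions
api = sqlite3.connect("jobs.db", timeout=1)  # give up after 1 s instead of stalling

cleanup.execute("CREATE TABLE IF NOT EXISTS encoding_jobs (id INTEGER, state TEXT)")

# The cleanup routine opens a write transaction and then "stalls"
# without committing, so its lock on the table is never released.
cleanup.execute("BEGIN IMMEDIATE")
cleanup.execute("DELETE FROM encoding_jobs WHERE state = 'done'")

# Any write from the API side now blocks until its timeout and then fails;
# a request handler would surface this kind of error as an HTTP 500.
try:
    api.execute("INSERT INTO encoding_jobs VALUES (1, 'queued')")
except sqlite3.OperationalError as exc:
    print(f"request failed: {exc}")
```

As long as the stalled transaction stays open, every write against the table fails the same way, which matches the sustained window of failed requests described above.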
Customers were unable to start encodings, and some encoding jobs had longer-than-expected turnaround times. API requests targeting the encoding endpoint returned HTTP 500 errors.
The faulty database operation was identified and terminated.
07:40 - Internal alerts notified the team about failures.
07:50 - The team began investigating.
08:00 - The faulty component was identified. The team began investigating the involved operations.
08:15 - The faulty operation was identified and terminated. The affected service recovered.
08:20 - The team kept monitoring and verifying the proper operation of the service.
The cleanup process has been updated to no longer use the procedure that caused this incident.
The team will analyze the procedure in detail to understand why it locked the database tables and stalled, and will take measures to prevent it from stalling again.
Once the updated procedure is confirmed safe, the team will resume using it for the required maintenance tasks.
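One common safeguard of the kind mentioned above is to give each cleanup step a hard time budget so a stall is detected and handled instead of holding locks indefinitely. The sketch below assumes this approach; the helper name and timeout value are illustrative, not taken from the actual system:

```python
import threading

def run_with_timeout(step, timeout):
    """Run one cleanup step in a worker thread; raise if it stalls
    past its time budget instead of waiting forever."""
    done = threading.Event()
    errors = []

    def worker():
        try:
            step()
        except Exception as exc:  # surface failures from the step itself
            errors.append(exc)
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout):
        # In the real system, this is the point where the stalled database
        # operation would be terminated so that its table locks are released.
        raise TimeoutError("cleanup step exceeded its time budget")
    if errors:
        raise errors[0]

# A step that finishes within its budget passes through normally.
run_with_timeout(lambda: None, timeout=1.0)
```

Many databases also support an equivalent server-side limit (a per-session lock or statement timeout), which releases locks automatically without relying on the client to notice the stall.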