Long queue times and “Scheduling failed” errors for some encoding jobs
Incident Report for Bitmovin Inc
Postmortem

Summary

A component responsible for provisioning infrastructure resources became overloaded, which caused long queue times and, in some cases, scheduling errors for encoding jobs on AWS. To stabilize the system, we scaled down job processing and then gradually scaled it back up in a controlled way.

Date

The issue occurred on September 12, 2024, between 12:04 and 15:03. All times in UTC.

Root Cause

An unusual spike in encoding job submissions that was not smoothed out by our scheduling algorithm overloaded the component responsible for requesting instances for encoding job processing in AWS. The component could not handle the amount of work and was unable to recover on its own, which affected all other encoding jobs on AWS.

Implications

Encoding jobs that were started remained in the queued state for longer than usual. Some jobs failed to start and transitioned to the error state with a “Scheduling failed” message.

Remediation

The engineering team quickly identified the component causing the long queue times and “Scheduling failed” errors. The load on this component was reduced by delaying the processing of encoding jobs, which allowed the overloaded component to recover. Once it had recovered, job processing was gradually ramped back up to normal operations. Note that the reduction in job processing also delayed encoding jobs not running on AWS.
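For illustration only, the throttle-and-ramp-up mitigation described above can be sketched roughly as follows. This is a minimal, hypothetical Python example; the names (JobDispatcher, throttle, ramp_up) and all values are assumptions made for this sketch and do not reflect Bitmovin's actual implementation.

    import time

    class JobDispatcher:
        """Sketch of a dispatcher whose concurrency limit can be lowered
        during an incident and raised back up gradually once the downstream
        provisioning component has recovered. Names and values are illustrative."""

        def __init__(self, normal_limit: int):
            self.normal_limit = normal_limit
            self.current_limit = normal_limit
            self.in_flight = 0

        def throttle(self, reduced_limit: int) -> None:
            # Shed load: stop handing new jobs to the overloaded component.
            self.current_limit = reduced_limit

        def ramp_up(self, step: int, interval_s: float) -> None:
            # Restore capacity gradually instead of all at once, so the
            # recovering component is not overwhelmed by the queued backlog.
            while self.current_limit < self.normal_limit:
                self.current_limit = min(self.current_limit + step, self.normal_limit)
                time.sleep(interval_s)

        def try_dispatch(self, job_id: str) -> bool:
            # Only forward a job to the provisioning component if we are
            # below the current limit; otherwise the job stays queued.
            if self.in_flight >= self.current_limit:
                return False
            self.in_flight += 1
            print(f"requesting instance for job {job_id}")
            return True

        def on_job_finished(self) -> None:
            self.in_flight -= 1

    dispatcher = JobDispatcher(normal_limit=100)
    dispatcher.throttle(reduced_limit=10)        # shed load during the incident
    dispatcher.ramp_up(step=30, interval_s=1.0)  # in practice the interval would be minutes, not seconds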

Timeline

12:04 - The monitoring systems alerted the engineering team about an overloaded system component, and the team started investigating.

12:15 - The engineering team closely monitored the affected component to assess the impact.

12:32 - The engineering team started evaluating different approaches to let the affected component recover.

13:30 - The engineering team identified that customer job processing on AWS was impacted and reduced the number of jobs being processed in the system.

14:00 - The component recovered and the engineering team started to scale encoding job processing back up.

14:24 - The full processing capacity was restored and the system continued to process the queued jobs normally.

15:03 - The incident was resolved. The engineering team continued to closely monitor the systems.

Prevention

Based on its initial investigation, the engineering team will take the following actions to prevent a similar overload of this component in the future:

  • Scale the underlying database to a bigger instance type
  • Improve the scheduling algorithm of the system to smooth out peak load patterns (a sketch of one possible approach follows this list)
  • Review data access patterns to avoid high load on the component
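To illustrate what smoothing out peak load patterns can mean in practice, the sketch below places a simple token-bucket limiter in front of the instance-provisioning component, so a burst of job submissions is converted into a bounded request rate. This is a hypothetical Python example; the class name, method name, and rates are assumptions for this sketch and are not the actual scheduling algorithm.

    import time

    class InstanceRequestLimiter:
        """Token-bucket limiter: bursts of incoming encoding jobs are turned
        into a bounded rate of instance requests, so a spike in submissions
        cannot overload the provisioning component. Values are illustrative."""

        def __init__(self, rate_per_s: float, burst: int):
            self.rate = rate_per_s      # steady-state requests per second
            self.capacity = burst       # short bursts allowed up to this size
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow_request(self) -> bool:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # job stays queued and is retried later

    limiter = InstanceRequestLimiter(rate_per_s=10.0, burst=20)
    queued_jobs = [f"job-{i}" for i in range(100)]
    while queued_jobs:
        if limiter.allow_request():
            job = queued_jobs.pop(0)
            print(f"requesting instance for {job}")
        else:
            time.sleep(0.05)  # back off briefly instead of hammering the component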

Finally, the specific scenario that led to the overload will be reproduced in a separate environment to validate that the prevention measures work as expected.

Posted Sep 19, 2024 - 07:53 UTC

Resolved
Everything is back to normal. The team is still monitoring the situation and will provide an RCA in the next few days.
Posted Sep 12, 2024 - 15:03 UTC
Monitoring
The system is starting to recover and encoding jobs are being picked up again. It will still take a while until the available slots are fully utilized.
Posted Sep 12, 2024 - 14:24 UTC
Update
The team has identified the problem and is working on recovering the system.
Posted Sep 12, 2024 - 13:40 UTC
Investigating
The team is investigating long queue times and “Scheduling failed” errors for encoding jobs running on AWS.
Posted Sep 12, 2024 - 13:31 UTC
This incident affected: Bitmovin API (Encoding Service).