A component responsible for provisioning infrastructure resources became overloaded, which caused long queue times and, in some cases, scheduling errors on AWS. To stabilize the system we scaled down job processing and then gradually scaled it up again in a controlled way.
The issue occurred on September 12, 2024, between 12:04 and 15:03. All times in UTC.
An unusual spike in encoding job processing that was not smoothed out by our scheduling algorithm overloaded the component responsible for requesting instances for encoding jobs on AWS. The component could not handle the amount of work or recover on its own, which affected all other jobs on AWS.
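For illustration only, the sketch below shows one common way to smooth such a spike: a token-bucket limiter placed in front of the component that requests instances, so a burst of jobs is spread out into a bounded request rate. The function names (request_instance, schedule_encoding_job) and the rate values are hypothetical and do not reflect our internal implementation.

    import time

    # Hypothetical sketch of spike smoothing with a token bucket.
    # request_instance and schedule_encoding_job are illustrative only.
    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec      # tokens added per second
            self.burst = burst            # maximum burst size
            self.tokens = float(burst)
            self.last = time.monotonic()

        def acquire(self) -> None:
            """Block until one token is available, then consume it."""
            while True:
                now = time.monotonic()
                self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                time.sleep((1.0 - self.tokens) / self.rate)

    def request_instance(job_id: str) -> None:
        # Placeholder for the call that provisions an instance on AWS.
        print(f"requesting instance for job {job_id}")

    limiter = TokenBucket(rate_per_sec=5, burst=20)

    def schedule_encoding_job(job_id: str) -> None:
        limiter.acquire()      # a burst of jobs is spread out over time
        request_instance(job_id)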
Encoding jobs that were started remained in the queued state. Some jobs failed to start and transitioned to the error state with a “Scheduling failed” message.
The engineering team quickly identified the affected component causing the long queue times and “Scheduling failed” errors. The load on this component was reduced by delaying the processing of encoding jobs, which allowed it to recover. Once it had recovered, job processing was ramped back up to normal operation. The reduction in job processing also delayed non-AWS encoding jobs.
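To illustrate the mitigation described above, the sketch below shows a generic gradual ramp-up: dispatch concurrency is capped at a low value while the component recovers and is then raised in steps until normal capacity is reached. The limits, step size, and callback functions are assumptions made for the example, not the values or APIs used during the incident.

    import time

    # Hypothetical sketch of the throttle-and-ramp-up mitigation.
    # REDUCED_LIMIT, NORMAL_LIMIT, and the callbacks are illustrative assumptions.
    REDUCED_LIMIT = 20
    NORMAL_LIMIT = 200

    def ramp_up(set_concurrency_limit, component_is_healthy,
                step: int = 20, interval_s: float = 60.0) -> None:
        """Gradually restore job-processing capacity after a throttle."""
        limit = REDUCED_LIMIT
        set_concurrency_limit(limit)
        while limit < NORMAL_LIMIT:
            time.sleep(interval_s)
            if not component_is_healthy():
                continue    # hold the current limit until the component is stable
            limit = min(NORMAL_LIMIT, limit + step)
            set_concurrency_limit(limit)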
12:04 - The monitoring systems alerted the engineering team about an overloaded system component, and the team started investigating.
12:15 - The engineering team closely monitored the impacted component to determine the scope of the impact.
12:32 - The engineering team started investigating different approaches to allow the impacted component to recover.
13:30 - The engineering team identified that customer job processing on AWS was impacted and reduced the number of jobs processed in the system.
14:00 - The component recovered and the engineering team started to scale up the encoding job processing again.
14:24 - The full processing capacity was restored and the system continued to process the queued jobs normally.
15:03 - The engineering team continued closely monitoring the systems.
Following the initial investigation, the engineering team will take the following actions to prevent a similar overload of this component in the future:
Finally, the specific scenario that led to the overload will be simulated in a separate environment to validate that the prevention measures work as expected.