Encoding jobs seem to be stuck - finished messages are not processed quickly enough
Incident Report for Bitmovin Inc
Postmortem

Summary

Bitmovin uses a messaging system to handle asynchronous communication between its services. On February 20, 2023, this messaging system failed, and many encoding jobs could not be processed properly; they were stuck in either a processing or a queued state. Because these jobs were in an invalid state and would never have finished, the Engineering team set them to ERROR. Encoding jobs stuck in the processing state also prevented some customers from using all of their encoding slots during the incident. The system was fully operational again by 18:30 on February 20, 2023, and the cleanup of stuck encoding jobs was completed at 22:00.
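For readers unfamiliar with this kind of setup, the following minimal sketch (illustrative only, not Bitmovin's actual implementation; the job IDs, states, and helper functions are invented) shows how an asynchronous "finished" message drives the job state, and why a message that is never delivered leaves a job stuck in the processing state.

```python
import queue

# In-memory stand-ins for the messaging system and the job store.
status_messages: "queue.Queue[dict]" = queue.Queue()
jobs = {"enc-1": "PROCESSING", "enc-2": "PROCESSING"}

def publish_finished(job_id: str) -> None:
    """Encoder side: announce completion via the messaging system."""
    status_messages.put({"job_id": job_id, "event": "finished"})

def consume_status_messages() -> None:
    """API side: drain delivered messages and update the job records."""
    while not status_messages.empty():
        message = status_messages.get()
        if message["event"] == "finished":
            jobs[message["job_id"]] = "FINISHED"

publish_finished("enc-1")  # delivered normally
# enc-2 also finished, but its message was never accepted by the broker,
# which is the failure mode this incident describes.
consume_status_messages()
print(jobs)  # {'enc-1': 'FINISHED', 'enc-2': 'PROCESSING'} -> enc-2 looks stuck
```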

Date

The issue occurred on February 20, 2023, between 15:00 and 18:30. All times in UTC.

Root Cause

The majority of the messaging components restarted, which led to an uneven traffic distribution inside the messaging system and resulted in many messages not being accepted or delivered successfully. This left various API services unable to communicate with one another, which prevented encoding statuses from being tracked. As a result, encoding jobs stuck in the processing state continued to count towards customers' parallel encoding slots, preventing new encoding jobs from being started.
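The slot-accounting aspect can be illustrated with a short sketch (hypothetical names, not the actual scheduler): if slot usage is derived from the number of jobs in the processing state, a job whose status is never updated occupies a slot indefinitely and blocks new work.

```python
from dataclasses import dataclass
from enum import Enum, auto

class JobState(Enum):
    QUEUED = auto()
    PROCESSING = auto()
    FINISHED = auto()
    ERROR = auto()

@dataclass
class EncodingJob:
    job_id: str
    state: JobState

def can_start_new_job(jobs: list[EncodingJob], max_parallel_slots: int) -> bool:
    """A new job may only start while fewer than max_parallel_slots jobs
    are in the PROCESSING state."""
    active = sum(1 for job in jobs if job.state is JobState.PROCESSING)
    return active < max_parallel_slots

# During the incident the "finished" messages were lost, so jobs stayed in
# PROCESSING and kept blocking slots even though no work was running.
stuck = [EncodingJob("enc-1", JobState.PROCESSING),
         EncodingJob("enc-2", JobState.PROCESSING)]
print(can_start_new_job(stuck, max_parallel_slots=2))  # False -> queue appears full
```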

Implications

Newly submitted, already queued, and already running encoding jobs were not managed correctly by the scheduling logic and therefore could not run successfully, and new and existing encoding jobs could not make full use of the available encoding slots. The stuck encoding jobs were set to ERROR by the Engineering team. Encoding jobs that were cleaned up by engineers did not trigger the configured notifications. During this time frame, an elevated error rate for API services was observed.

Remediation

Once the Engineering team became aware of the missing status updates, it identified the responsible component and began monitoring it.

The failure of the affected component was identified, and the component was restarted together with all services using it. During this time, the team monitored both system performance and activity in customer accounts. As a precautionary measure, the usable encoding slots for selected customers were also limited throughout the system recovery process.

Once the component had recovered, the faulty encoding jobs were cleaned up to free up encoding slots: stuck encoding jobs were set to ERROR status, which cleaned up the system and restored the available encoding slots to their initial values.
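A simplified version of such a cleanup pass might look like the following (field names, the cutoff handling, and the helper are hypothetical; the actual cleanup was performed by the Engineering team with internal tooling): any job still queued or processing from before the recovery point is assumed stuck and moved to ERROR, which releases the slot it was holding.

```python
from datetime import datetime, timezone

# Approximate point at which the messaging system was healthy again.
INCIDENT_CUTOFF = datetime(2023, 2, 20, 18, 30, tzinfo=timezone.utc)

def clean_up_stuck_jobs(jobs: list[dict]) -> int:
    """Set jobs that were still QUEUED or PROCESSING before the cutoff to
    ERROR and return how many processing slots were freed."""
    freed_slots = 0
    for job in jobs:
        stuck = job["state"] in ("QUEUED", "PROCESSING")
        stale = job["last_update"] < INCIDENT_CUTOFF
        if stuck and stale:
            if job["state"] == "PROCESSING":
                freed_slots += 1  # this job was blocking a parallel slot
            job["state"] = "ERROR"
            # Jobs cleaned up this way did not trigger the customer's
            # configured notifications (see Implications above).
    return freed_slots

jobs = [{"id": "enc-1", "state": "PROCESSING",
         "last_update": datetime(2023, 2, 20, 15, 30, tzinfo=timezone.utc)}]
print(clean_up_stuck_jobs(jobs), jobs[0]["state"])  # 1 ERROR
```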

The Engineering team continued to monitor the situation following the incident.

Timeline

Feb 20, 15:10 - Multiple components of the messaging system restarted. This shifted load to the remaining components and led to an uneven load distribution. During that time, the Engineering team began to investigate the incident.

Feb 20, 16:30 - The Engineering team identified services that showed errors when communicating with the messaging system. As a remediation step, the team restarted the affected services, which allowed them to communicate with the system again.

Feb 20, 17:50 - As the communication issues persisted, the Engineering team identified the messaging system itself as the faulty component, due to its uneven workload distribution, and restarted it.

Feb 20, 18:30 - Encoding jobs were processed again successfully. The Engineering team continued to monitor the process and began to clean up the faulty encoding jobs.

Feb 20, 20:00 - The Engineering team continued to monitor system performance closely. No further errors were observed.

Feb 20, 22:00 - The Engineering team finished the cleanup and set all faulty encoding jobs to ERROR status. The system was fully operational again.

Prevention

The Engineering team will: 

  • Take measures to improve the monitoring and logging of the Bitmovin messaging system.
  • Make efforts to identify similar issues with the messaging system and alert on them before they affect customer workflows (a monitoring sketch follows this list).
  • Update the messaging system to allow for better load distribution and more resilience. 
  • Run additional stress tests for the most critical components of the encoding system, in an isolated environment, and make further optimizations based on the findings.
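As a rough illustration of the alerting idea in the second bullet (node names, the metric, and the threshold are invented for this sketch and are not Bitmovin's actual monitoring rules), an alert could fire when one messaging node handles a disproportionate share of the message throughput:

```python
def check_load_skew(messages_per_node: dict[str, float], max_skew: float = 2.0) -> list[str]:
    """Return alert messages for nodes handling more than max_skew times
    the per-node average message rate."""
    if not messages_per_node:
        return []
    average = sum(messages_per_node.values()) / len(messages_per_node)
    alerts = []
    for node, rate in messages_per_node.items():
        if average > 0 and rate > max_skew * average:
            alerts.append(f"{node}: {rate:.0f} msg/s vs. {average:.0f} msg/s average")
    return alerts

# Example: after several nodes restart, traffic piles onto one survivor.
print(check_load_skew({"node-a": 900.0, "node-b": 40.0,
                       "node-c": 30.0, "node-d": 30.0}))
```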
Posted Feb 22, 2023 - 18:08 UTC

Resolved
The team continued monitoring the system and everything is processing normally. A few more stuck encoding jobs were found and set to the "Error" state. The team will work on an RCA for this incident, which will be published in the next few days.
Posted Feb 20, 2023 - 21:49 UTC
Monitoring
The team has cleaned up almost all stuck encoding jobs and is double-checking the remaining ones. So far, everything is running smoothly. The team keeps monitoring the situation.
Posted Feb 20, 2023 - 19:31 UTC
Update
Our engineering team applied an update to the messaging infrastructure and we see the system recovering. Newly added encoding jobs are picked up and processed successfully.
Some encoding jobs are stuck in the "Queued" or "In Progress" status. The team will set all of those stuck encoding jobs to "Error". This will also resolve the situation for customers whose queues are full of stuck encodings.
The team keeps monitoring the situation and we will update this incident if there is any news.
Posted Feb 20, 2023 - 18:17 UTC
Update
The team is still investigating problems with our messaging infrastructure where messages from encoding jobs are not being processed. Both starting and finalizing encoding jobs are impacted. We will post updates as soon as they are available.
Posted Feb 20, 2023 - 16:52 UTC
Investigating
We are investigating encoding jobs that are not being finalized because processing of the finished message is taking too long. The affected encoding jobs appear to be stuck in the processing state. We will post regular updates on this incident here.
Posted Feb 20, 2023 - 15:46 UTC
This incident affected: Bitmovin API (Encoding Service).