Bitmovin uses a messaging system to handle asynchronous communication between its services. This messaging system failed, and many encoding jobs could not be processed, remaining stuck in either a processing or a queued state. Because these jobs were in an invalid state and would never have finished, the Engineering team set them to ERROR. The encoding jobs stuck in the processing state left some customers unable to use all of their encoding slots during the time frame of the incident. The system was fully operational again by 18:30 on February 20, 2023, and the cleanup of stuck encoding jobs was completed at 22:00.
The issue occurred on February 20, 2023, between 15:00 and 18:30. All times in UTC.
The majority of the messaging components restarted, which led to an uneven traffic distribution inside the messaging system; as a result, many messages were not accepted or delivered successfully. Various API services were therefore unable to communicate with one another, which prevented encoding statuses from being tracked. Encoding jobs stuck in the processing state consequently counted towards customers' parallel encoding slots, preventing new encoding jobs from being started.
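The slot-accounting behavior described above can be sketched as follows. This is a hypothetical illustration, not Bitmovin's actual implementation: the status names, the `free_slots` function, and the job structure are assumptions made for clarity.

```python
# Hypothetical sketch of parallel encoding slot accounting.
# Status names and data layout are illustrative assumptions.
ACTIVE_STATUSES = {"QUEUED", "PROCESSING"}

def free_slots(jobs, max_parallel_slots):
    """Count the slots left after subtracting jobs in an active state.

    A job stuck in PROCESSING still counts as active, so it keeps
    occupying a slot even though it will never finish.
    """
    active = sum(1 for job in jobs if job["status"] in ACTIVE_STATUSES)
    return max(max_parallel_slots - active, 0)

jobs = [
    {"id": 1, "status": "PROCESSING"},  # stuck: no status update arrives
    {"id": 2, "status": "PROCESSING"},  # stuck
    {"id": 3, "status": "FINISHED"},
]
free_slots(jobs, 2)  # 0 -> no new encoding job can start
```

Because status updates travel through the messaging system, a lost update leaves a job permanently in an active state, and the slot it occupies is never returned to the customer until the job is cleaned up.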
Newly queued, already queued, and already running encoding jobs were not handled correctly by the scheduling logic and therefore could not run successfully, and new and existing encoding jobs could not fully use the available encoding slots. The Engineering team set the stuck encoding jobs to ERROR. Encoding jobs that were cleaned up by engineers did not trigger their configured notifications. During this time frame, an elevated error rate for API services was observed.
Once the Engineering team became aware of the missing status updates, the responsible component was identified and monitored.
The failing component was identified and restarted together with all services that used it. During this time, the team monitored both system performance and activity in customer accounts. As a precautionary measure, the usable encoding slots for selected customers were also limited throughout the system recovery process.
Once the component recovered, the faulty encoding jobs were cleaned up to free up encoding slots: stuck encoding jobs were set to ERROR status, which cleaned up the system and restored the available encoding slots to their initial values.
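A cleanup of this kind might look like the sketch below. The two-hour threshold, the function name, and the job fields are assumptions for illustration only; the actual cleanup was performed by the Engineering team against the production system.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold for treating an active job as stuck (illustrative only).
STUCK_AFTER = timedelta(hours=2)

def clean_up_stuck_jobs(jobs, now):
    """Mark jobs as ERROR when they have been active too long without progress.

    Moving a job to the terminal ERROR status removes it from the active
    set, which frees its parallel encoding slot for new jobs.
    """
    cleaned = []
    for job in jobs:
        stuck = (job["status"] in {"QUEUED", "PROCESSING"}
                 and now - job["last_update"] > STUCK_AFTER)
        if stuck:
            job["status"] = "ERROR"
            cleaned.append(job["id"])
    return cleaned

now = datetime(2023, 2, 20, 18, 30, tzinfo=timezone.utc)
stuck_jobs = [
    {"id": 1, "status": "PROCESSING", "last_update": now - timedelta(hours=3)},
    {"id": 2, "status": "QUEUED", "last_update": now - timedelta(minutes=5)},
]
clean_up_stuck_jobs(stuck_jobs, now)  # [1]: job 1 set to ERROR, slot freed
```

Only jobs that have been inactive past the threshold are touched, so recently queued jobs that are still progressing normally are left alone.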
The engineering team continued to monitor the situation following the incident.
Feb 20, 15:10 - Multiple components of the messaging system restarted. This resulted in a load shift to the remaining components, which led to an uneven load distribution. At that time, the Engineering team began to investigate the incident.
Feb 20, 16:30 - The Engineering team identified services that showed communication errors within the messaging system. As remediation, the team restarted the affected services, which allowed them to communicate with the system again.
Feb 20, 17:50 - As the communication issues persisted, the Engineering team identified the messaging system itself, with its uneven workload distribution, as the faulty component and restarted it.
Feb 20, 18:30 - Encoding jobs were processed again successfully. The Engineering team continued to monitor the process and began to clean up the faulty encoding jobs.
Feb 20, 20:00 - The Engineering team continued to monitor system performance closely. No further errors were observed.
Feb 20, 22:00 - The Engineering team finished the cleanup and set all faulty encoding jobs to ERROR status. The system was fully operational again.
The Engineering team will: