At 16:15 our system was flooded with an unusually high number of encoding jobs in a very short amount of time. The engineering team was alerted by API errors exceeding set thresholds and started to investigate. On inspection, the team found an unusually high load on one of the database instances, along with an equally high number of active connections. As a result, the underlying services were unable to acquire new database connections and failed calls with an API server error.
With the API issue resolved, a number of enqueued encoding jobs remained in a faulty state within the scheduling service (which is responsible for starting new encoding jobs in the Bitmovin encoding system). The team resolved the faulty state, and encoding jobs started to be enqueued again at 19:10.
Please note: For some encoding jobs that started during the incident, the finished message was lost; these jobs became stuck and were set to an error state by the engineering team.
The issue occurred on April 13, 2023, between 16:15 and 19:10. All times in UTC.
Due to the unusually high number of created and enqueued encoding jobs, the pool of available database connections was exhausted. Some services therefore failed to acquire new database connections, which led to the following problems:
- Many queued and already running encoding jobs were not managed correctly by the scheduling logic and therefore could not run successfully; new and existing encoding jobs could not fully use the available encoding slots. The stuck encoding jobs were set to ERROR by the engineering team.
- Encoding jobs that were cleaned up by engineers did not trigger their configured notifications.
- Between 16:15 and 17:00 an elevated error rate for API services was observed. We therefore recommend having proper retry handling with exponential backoff in place, as described in our best practice guide (see the sketch below).
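As an illustration of that recommendation, the following minimal sketch retries transient API failures with exponential backoff and jitter. The endpoint URL, API key handling, and retry limits are placeholders for this sketch, not prescribed values:

```python
import random
import time

import requests

# Placeholders for this sketch; substitute your actual endpoint and limits.
API_URL = "https://api.example.com/v1/encoding/encodings"
MAX_RETRIES = 5


def create_encoding_with_retry(payload, api_key):
    """POST with exponential backoff and jitter on transient server errors."""
    for attempt in range(MAX_RETRIES):
        response = requests.post(
            API_URL,
            json=payload,
            headers={"X-Api-Key": api_key},
            timeout=10,
        )
        # Retry only transient failures (429 and 5xx); surface everything else.
        if response.status_code not in (429, 500, 502, 503, 504):
            response.raise_for_status()
            return response.json()
        # Exponential backoff: 1s, 2s, 4s, ... plus jitter to avoid
        # synchronized retry bursts from many clients at once.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"Request failed after {MAX_RETRIES} retries")
```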
Once the engineering team was made aware of the issues through the monitoring and alerting systems as well as customer feedback, it began investigating the problems.
Once the unusually high number of database connections was discovered, the services connecting to the database were restarted, which brought the number of connections back down to manageable levels. The team then fixed the inconsistent state in the scheduling service so that all encoding jobs could return to a running state.
Some encoding jobs were found to be stuck during the incident, and the engineering team set them to the ERROR state. Our recommendation is to rerun those encoding jobs (see the sketch below for finding them).
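As a starting point for finding those jobs, the sketch below lists encodings by status. It assumes the Bitmovin REST endpoint GET /v1/encoding/encodings supports a status filter; please verify the exact parameters and response shape against the current API reference before use:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# List encodings currently in ERROR; paging and time filtering are omitted
# for brevity, and the response envelope should be checked against the
# current Bitmovin API reference.
response = requests.get(
    "https://api.bitmovin.com/v1/encoding/encodings",
    headers={"X-Api-Key": API_KEY},
    params={"status": "ERROR", "limit": 100},
    timeout=10,
)
response.raise_for_status()
items = response.json()["data"]["result"]["items"]

for encoding in items:
    # An errored encoding cannot simply be resumed; rerun it by creating
    # and starting a new encoding with the same configuration.
    print(encoding["id"], encoding.get("name"), encoding.get("createdAt"))
```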
Apr 13, 16:15 - Internal monitoring and first customer reports indicated an elevated error rate for our API. Exhausted database connections were identified as the root cause.
Apr 13, 17:10 - The number of used database connections was recovered back to normal levels. API error rates were back to normal levels.
Apr 13, 19:10 - The faulty state in the scheduling service was fixed, and queued encoding jobs started normally again. Stuck encoding jobs could still reduce the number of available encoding slots for some customers.
Apr 13, 21:45 - Cleanup of stuck encoding jobs was finished, and affected customers were able to leverage all their encoding slots again.
The team will rework the limits on the number of allowed create requests for Bitmovin encoding resources and rate-limit excessive usage of the service. This will be done in a way that does not affect the normal operation of our current customers' integrations with the Bitmovin encoding service.
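For illustration, a token-bucket limiter along these lines would throttle only bursts far above normal usage; the capacity and refill rate shown are placeholders, not the limits the team will configure:

```python
import time


class TokenBucket:
    """Minimal token-bucket sketch of this kind of rate limiting."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_per_second,
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Normal integrations stay well under the bucket capacity; only bursts
# far above typical usage are rejected (e.g., with HTTP 429).
bucket = TokenBucket(capacity=100, refill_per_second=10)
if not bucket.allow():
    print("429 Too Many Requests")
```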
One root cause of the incident was that services ran out of database connections, leading to failures in those services. The team will review the limits on database connections available in the overall system and in the affected services, and make appropriate optimizations.
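As a sketch of the kind of bounding under review, a per-service connection pool with explicit limits keeps any single service from exhausting the database's global connection budget; the connection URL, pool values, and the choice of SQLAlchemy here are illustrative only:

```python
from sqlalchemy import create_engine, text

# Illustrative values; not the actual configuration of the Bitmovin services.
engine = create_engine(
    "postgresql://user:password@db-host:5432/encoding",
    pool_size=10,       # steady-state connections held by this service
    max_overflow=5,     # short-lived extra connections allowed under load
    pool_timeout=30,    # seconds to wait for a free connection before failing
    pool_recycle=1800,  # recycle idle connections to avoid stale ones
)

# With a bounded pool, a traffic spike queues inside the service instead of
# exhausting the database's global connection limit.
with engine.connect() as connection:
    connection.execute(text("SELECT 1"))
```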
In addition, the causes and implications of the faulty state of the scheduling service will be analyzed and fixed.