At 16:15 our system was flooded with an unusually high number of encoding jobs in a very short amount of time. The engineering team was alerted by API errors exceeding set thresholds and started to investigate. On inspection, the team found an unusually high load on one of the database instances, along with an equally high number of active connections. As a result, the underlying services were unable to acquire new database connections and failed calls with an API server error.
With the API issue resolved, a number of enqueued encoding jobs remained in a faulty state within the scheduling service (which is responsible for starting new encoding jobs in the Bitmovin encoding system). The team resolved the faulty state, and encoding jobs started to be enqueued again at 19:10.
Please note: For some encoding jobs that started during the incident, the finished message was lost; these jobs became stuck and were set to an error state by the engineering team.
The issue occurred on April 13, 2023, between 16:15 and 19:10. All times in UTC.
Due to the unusually high number of created and enqueued encoding jobs, the pool of available database connections was exhausted. Some services therefore failed to acquire new database connections, which led to the following problems:
- Many queued and already running encoding jobs were not managed correctly by the scheduling logic and therefore could not run successfully; new and existing encoding jobs could not fully use the available encoding slots. The stuck encoding jobs were set to ERROR by the engineering team.
- Encoding jobs that were cleaned up by engineers did not trigger their configured notifications.
- Between 16:15 and 17:00 an elevated error rate for API services was observed. We therefore recommend having proper retry handling with exponential backoff in place, as described in our best practice guide (see the sketch below).
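As an illustration of that recommendation, the following minimal sketch retries transient API failures with exponential backoff and jitter. The endpoint URL, API key handling, and retry limits are placeholders for this sketch, not prescribed values:

```python
import random
import time

import requests

# Placeholders for this sketch; substitute your actual endpoint and limits.
API_URL = "https://api.example.com/v1/encoding/encodings"
MAX_RETRIES = 5


def create_encoding_with_retry(payload, api_key):
    """POST with exponential backoff and jitter on transient server errors."""
    for attempt in range(MAX_RETRIES):
        response = requests.post(
            API_URL,
            json=payload,
            headers={"X-Api-Key": api_key},
            timeout=10,
        )
        # Retry only transient failures (429 and 5xx); surface everything else.
        if response.status_code not in (429, 500, 502, 503, 504):
            response.raise_for_status()
            return response.json()
        # Exponential backoff: 1s, 2s, 4s, ... plus jitter to avoid
        # synchronized retry bursts from many clients at once.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"Request failed after {MAX_RETRIES} retries")
```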
Once the engineering team was made aware of the issues through the monitoring and alerting systems as well as customer feedback, it began investigating the problems.
Once the unusually high number of database connections was discovered, the services connecting to the database were restarted, which brought the number of connections back down to manageable levels. The team then fixed the inconsistent state in the scheduling service so that all encoding jobs could return to a running state.
Some encoding jobs were found to be stuck during the incident, and the engineering team set them to the ERROR state. Our recommendation is to rerun those encoding jobs (see the sketch below for finding them).
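As a starting point for finding those jobs, the sketch below lists encodings by status. It assumes the Bitmovin REST endpoint GET /v1/encoding/encodings supports a status filter; please verify the exact parameters and response shape against the current API reference before use:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# List encodings currently in ERROR; paging and time filtering are omitted
# for brevity, and the response envelope should be checked against the
# current Bitmovin API reference.
response = requests.get(
    "https://api.bitmovin.com/v1/encoding/encodings",
    headers={"X-Api-Key": API_KEY},
    params={"status": "ERROR", "limit": 100},
    timeout=10,
)
response.raise_for_status()
items = response.json()["data"]["result"]["items"]

for encoding in items:
    # An errored encoding cannot simply be resumed; rerun it by creating
    # and starting a new encoding with the same configuration.
    print(encoding["id"], encoding.get("name"), encoding.get("createdAt"))
```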
Apr 13, 16:15 - Internal monitoring and first customer reports indicated an elevated error rate for our API. Exhausted database connections were identified as the root cause.
Apr 13, 17:10 - The number of used database connections was recovered back to normal levels. API error rates were back to normal levels.
Apr 13, 19:10 - The faulty state in the scheduling service was fixed, and queued encoding jobs started normally again. Stuck encoding jobs could still reduce the number of available encoding slots for some customers.
Apr 13, 21:45 - Cleanup of stuck encoding jobs was finished, and affected customers were able to leverage all their encoding slots again.
The team will rework the limits on the number of allowed create requests for Bitmovin encoding resources and rate-limit excessive usage of the service. This will be done in a way that does not affect the normal operation of our current customers' integrations with the Bitmovin encoding service.
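For illustration, a token-bucket limiter along these lines would throttle only bursts far above normal usage; the capacity and refill rate shown are placeholders, not the limits the team will configure:

```python
import time


class TokenBucket:
    """Minimal token-bucket sketch of this kind of rate limiting."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_per_second,
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Normal integrations stay well under the bucket capacity; only bursts
# far above typical usage are rejected (e.g., with HTTP 429).
bucket = TokenBucket(capacity=100, refill_per_second=10)
if not bucket.allow():
    print("429 Too Many Requests")
```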
One root cause of the incident was that services ran out of database connections, leading to failures in those services. The team will review the limits on database connections available in the overall system and in the affected services, and make appropriate optimizations.
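As a sketch of the kind of bounding under review, a per-service connection pool with explicit limits keeps any single service from exhausting the database's global connection budget; the connection URL, pool values, and the choice of SQLAlchemy here are illustrative only:

```python
from sqlalchemy import create_engine, text

# Illustrative values; not the actual configuration of the Bitmovin services.
engine = create_engine(
    "postgresql://user:password@db-host:5432/encoding",
    pool_size=10,       # steady-state connections held by this service
    max_overflow=5,     # short-lived extra connections allowed under load
    pool_timeout=30,    # seconds to wait for a free connection before failing
    pool_recycle=1800,  # recycle idle connections to avoid stale ones
)

# With a bounded pool, a traffic spike queues inside the service instead of
# exhausting the database's global connection limit.
with engine.connect() as connection:
    connection.execute(text("SELECT 1"))
```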
In addition, the causes and implications of the faulty state of the scheduling service will be analyzed and fixed.