Increased Error Rate Bitmovin API
Incident Report for Bitmovin Inc
Postmortem

Summary

An inefficiency in one of the API gateway routing components handling https://api.bitmovin.com overloaded a backend database. As a result, the gateway lacked the information it needs to route HTTP requests properly within the Bitmovin system and returned HTTP errors for valid requests. As soon as the overload on that backend database was mitigated, the gateway resumed handling requests properly.

During that time the dashboard was unavailable, no new Encoding jobs could be scheduled, and a small number of existing Encoding jobs were not processed properly.

An initial recovery of the system was only temporary, but it provided more insight into the underlying issue. After some time under monitoring, the API Gateway failed again. With the additional insight gained, the team was able to fully recover the API Gateway.

After the API gateway was fully functional again, any stuck jobs were moved to an ‘Error’ state. Please see the ‘What to do as a customer’ section below if you have an encoding in Error state and have not already retried it.

Date

The issue occurred on May 4th, 2023, from 13:35 to 14:10 and again from 15:00 to 18:05. All times in UTC.

Root Cause

Inefficient queries caused high resource consumption on the database used as the backend for one of the gateway components. These queries are part of the internal logic of the software component we use as a gateway and serve to synchronize state across the replicated gateway instances. This led first to very slow query results and later to query timeouts on the gateway component. Because of the gateway's high degree of parallelism, the database load increased unusually quickly across all instances at the same time. The query timeouts ultimately caused the gateway component to fail.

Implications

The affected API gateway component caused an outage in the communication between customers and the Bitmovin API. Since the API Gateway has caching logic in place, not all requests failed during the outage.
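
For clients calling the API directly, short-lived 5xx responses of this kind can usually be absorbed by retrying with exponential backoff. The snippet below is a minimal Python sketch of that pattern; the helper function is hypothetical and not part of any Bitmovin SDK, and only the X-Api-Key header and the https://api.bitmovin.com base URL are taken from the public API.

    import time
    import requests

    API_KEY = "YOUR_BITMOVIN_API_KEY"  # placeholder, replace with your own key

    def get_with_retries(url, max_attempts=5, base_delay=1.0):
        """Hypothetical helper: retry GET requests that fail with HTTP 5xx,
        backing off exponentially between attempts."""
        for attempt in range(max_attempts):
            response = requests.get(url, headers={"X-Api-Key": API_KEY}, timeout=30)
            if response.status_code < 500:
                return response  # success, or a client error a retry will not fix
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
        return response  # give up and return the last response

    # Example: list encodings while tolerating short gateway hiccups.
    resp = get_with_retries("https://api.bitmovin.com/v1/encoding/encodings")
    resp.raise_for_status()
    print(resp.json())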

As the Bitmovin Dashboard uses the same Bitmovin API, it was also unavailable during the incidents.

Remediation

The initial attempts to fix the problem provided only temporary relief and shortly afterwards led to a similar situation.

After the routing gateway component was identified as the root cause, the team removed the load from the database by stopping the gateway component. The resource limits and the number of instances were then adjusted according to the values observed during the outage, which resulted in lower resource consumption on the database cluster.

The gradual startup of the API gateway was then carefully monitored to prevent overwhelming the database with queries again.
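
As a rough illustration of what such a gradual, monitored ramp-up can look like (a hypothetical sketch only; the actual tooling, replica counts, and thresholds used by the team are not part of this report), instances are added step by step while the error rate is watched:

    import time

    TARGET_REPLICAS = 8          # hypothetical target instance count
    ERROR_RATE_THRESHOLD = 0.01  # hypothetical limit: pause above 1% HTTP 5xx

    def scale_gateway(replicas):
        """Placeholder: would call the orchestrator to set the instance count."""
        print(f"scaling gateway to {replicas} replica(s)")

    def current_error_rate():
        """Placeholder: would query the monitoring system for the 5xx rate."""
        return 0.0

    for replicas in range(1, TARGET_REPLICAS + 1):
        scale_gateway(replicas)
        time.sleep(60)  # give caches and state synchronization time to settle
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            scale_gateway(replicas - 1)  # step back and investigate before continuing
            break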

Afterwards, the team continued to monitor the gateway and cleaned up Encoding jobs that were stuck.

Timeline

May 4, 13:00 - Backend database CPU utilization increased.

May 4, 13:35 - API gateway requests started failing as the gateway was unable to get data from the backend database.

May 4, 13:50 - The gateway component instances and one affected backend database node were restarted.

May 4, 14:10 - The gateway component became responsive again and HTTP 5xx error rates returned to normal levels.

May 4, 14:30 - 14:50 - The remaining nodes of the backend database were restarted.

May 4, 15:00 - HTTP error rates increased again.

May 4, 16:00 - Gateway was shut down, adjusted, and gradually started again.

May 4, 16:15 - HTTP error rates started to decrease again and the startup was further monitored.

May 4, 18:05 - HTTP error rates back to normal. The incident was resolved.

Prevention

As a first measure, the backend database resources were increased to better cope with load spikes, which covers similar situations in the future. Based on an investigation of the gateway components in use, the team is evaluating options for additional redundancy to create a more robust setup and will focus on implementing them as part of our technology roadmap in Q2 and Q3.

We are confident that the backend database adjustments have resolved the issue. However, we are continuing to investigate the inefficiencies observed in the backend database and to add further redundancy through an improved setup.

What to do as a customer

  • Please check any of your encoding jobs that ran during the incident time window; a minimal script for doing this via the REST API is sketched after this list.
  • If they are in the FINISHED state, they were processed correctly and you can use the output files as normal.
  • If your encoding jobs are in the ERROR state, please restart them. Please note that we do not bill encoding jobs that ended in the ERROR state.
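
For example, encodings can be listed through the public GET /v1/encoding/encodings endpoint and filtered to the incident window. The sketch below assumes the standard response wrapper (data.result.items) and the status and createdAt fields of the list response; pagination beyond the first 100 items and error handling are omitted.

    import requests

    API_KEY = "YOUR_BITMOVIN_API_KEY"  # placeholder, replace with your own key
    BASE_URL = "https://api.bitmovin.com/v1"

    # Fetch the first page of encodings (use offset/limit to page through more).
    resp = requests.get(
        f"{BASE_URL}/encoding/encodings",
        headers={"X-Api-Key": API_KEY},
        params={"limit": 100, "offset": 0},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["data"]["result"]["items"]

    # Flag encodings created on the incident day that ended up in ERROR state.
    incident_day = "2023-05-04"
    for encoding in items:
        if encoding.get("createdAt", "").startswith(incident_day) and encoding.get("status") == "ERROR":
            print(f"{encoding['id']}  {encoding.get('name', '')}  -> please restart this encoding")
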
Posted May 10, 2023 - 09:42 UTC

Resolved
HTTP error rates are back to normal. An RCA will be provided in the coming days.
Posted May 04, 2023 - 14:41 UTC
Monitoring
The error rates are back to normal. The team is continuing to monitor the situation.
Posted May 04, 2023 - 14:12 UTC
Identified
The team has identified the faulty component and is working on fixing the issue.
Posted May 04, 2023 - 14:07 UTC
Investigating
We see our API returning an increased number of HTTP 503 Errors. The team is investigating.
Posted May 04, 2023 - 13:55 UTC
This incident affected: Bitmovin API (Account Service, Input Service, Encoding Service, Output Service, Statistics Service, Infrastructure Service, Configuration Service, Manifest Service, Player Service) and Bitmovin Dashboard.