Elevated API Errors
Incident Report for Bitmovin Inc
Postmortem

Summary

Some Encoding Start calls - especially for encodings with lots of configured streams and muxings - were taking much longer than expected. Calls that exceeded 60 seconds resulted in 504 Gateway Timeout responses.

Date

The issue occurred between

  • November 01, 16:00 until November 04, 12:00, and
  • November 04, 19:55 until November 06, 15:15 (due to a required rollback of previous speed optimizations).

All times are UTC.

Root Cause

When starting an encoding job, the Bitmovin API services have to gather all of the encoding's configured data and combine it into a single configuration for the whole encoding. This is done synchronously so that we can validate that the encoding is set up correctly and inform the customer immediately in case of any misconfiguration.

This is especially slow for encodings with lots of configured streams and muxings, as a lot of information for each entity has to be retrieved from the database (for example Input Streams, Muxings, Filters, Sprites, Thumbnails, etc.). Additionally, we have to perform some updates at the stream level before scheduling an encoding to ensure data consistency.
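
To illustrate why this step is so query-heavy, the following is a simplified, purely hypothetical sketch of a synchronous gather-and-validate routine. The entity names, database helpers, and validation logic are illustrative assumptions only and do not reflect our actual service implementation.

  # Purely illustrative sketch of the synchronous "gather and validate" step.
  # All names (EncodingDb, the list_* helpers, validate, etc.) are hypothetical.

  def build_encoding_configuration(encoding_id: str, db: "EncodingDb") -> dict:
      """Combine all configured entities of an encoding into a single configuration."""
      configuration = {
          "encoding": db.get_encoding(encoding_id),
          "muxings": db.list_muxings(encoding_id),
          "streams": [],
      }

      # One set of queries per stream (input streams, filters, sprites, thumbnails).
      # This per-entity fan-out is what makes the call database-query-heavy for
      # encodings with many streams and muxings.
      for stream in db.list_streams(encoding_id):
          configuration["streams"].append({
              "stream": stream,
              "input_streams": db.list_input_streams(stream.id),
              "filters": db.list_filters(stream.id),
              "sprites": db.list_sprites(stream.id),
              "thumbnails": db.list_thumbnails(stream.id),
          })
          # Stream-level update performed before scheduling to keep data consistent.
          db.mark_stream_scheduled(stream.id)

      # Fail fast so any misconfiguration is reported to the customer immediately.
      validate(configuration)
      return configuration

  def validate(configuration: dict) -> None:
      # Minimal placeholder check; the real validation is far more extensive.
      if not configuration["streams"]:
          raise ValueError("encoding has no streams configured")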

During the mentioned time frame, the load on the Bitmovin API - and its database - was also higher than usual due to increased customer activity, resulting in an overall reduction in retrieval speed from the database. This was rather insignificant for individual queries (e.g., retrieving a single Stream) but had a noticeable impact on query-heavy workflows like the Encoding Start call.

Together, these factors led to individual Encoding Start calls exceeding 60 seconds more frequently and hitting the timeout enforced by the Bitmovin API Gateway, which resulted in 504 Gateway Timeout responses for customers.

Implications

Customers - especially those with lots of streams and muxings - received a much higher number of 504 Gateway Timeout responses than usual on Encoding Start calls, indicating that the affected encodings might not have been started properly.
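
Because a 504 on an Encoding Start call leaves the outcome ambiguous, callers can confirm the encoding's status before retrying. The following is a minimal client-side sketch of that pattern; the endpoint paths, X-Api-Key header, and response fields should be verified against the current API reference, and the retry and backoff values are arbitrary examples.

  # Minimal sketch: treat a 504 on Encoding Start as "outcome unknown" and check
  # the encoding status before retrying. Verify endpoint paths and response fields
  # against the current Bitmovin API reference; retry/backoff values are examples.
  import time
  import requests

  API_BASE = "https://api.bitmovin.com/v1"

  def start_encoding_safely(encoding_id: str, api_key: str, max_attempts: int = 3) -> str:
      headers = {"X-Api-Key": api_key}
      for attempt in range(1, max_attempts + 1):
          resp = requests.post(
              f"{API_BASE}/encoding/encodings/{encoding_id}/start",
              headers=headers,
              timeout=90,
          )
          if resp.status_code != 504:
              resp.raise_for_status()
              return "STARTED"

          # The gateway timed out, but the start may still have gone through
          # behind it, so check the encoding status before retrying.
          status_resp = requests.get(
              f"{API_BASE}/encoding/encodings/{encoding_id}/status",
              headers=headers,
              timeout=30,
          )
          status_resp.raise_for_status()
          status = status_resp.json()["data"]["result"]["status"]
          if status not in ("CREATED", "ERROR"):
              return status  # e.g. QUEUED or RUNNING: the start did succeed

          time.sleep(10 * attempt)  # back off before retrying the start call

      raise RuntimeError(f"Encoding {encoding_id} could not be started after {max_attempts} attempts")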

Remediation

The Bitmovin platform engineering team identified the root cause of the elevated Gateway Timeouts on November 3rd at about 20:00 UTC and promoted a fix to production on November 4th at 12:00 UTC.

The fix primarily addressed poorly performing database queries as well as inefficient query and update patterns, and reduced the average Encoding Start call time by a factor of about four. Additionally, it includes improved observability of the Encoding Start process to make further improvements possible.

After applying these fixes, the engineering teams proactively monitored encoding status, successful job completion, and encoding queue levels across customers to confirm they were within expected ranges, and updated the support teams on the successful resolution.

Timeline

  • Nov 1st 15:00: Bitmovin’s monitoring system started indicating higher-than-usual numbers of 504 Gateway Timeout responses.
  • Nov 1st 19:30: Bitmovin’s platform engineering team was notified.
  • Nov 3rd 18:00: Bitmovin’s platform engineering team started the in-depth investigation.
  • Nov 4th 12:00: The team promoted a fix to production, which brought the 504 Gateway Timeouts down to an expected level.
  • Nov 4th 19:55: The engineering team had to perform a rollback, as the implemented fix was suspected of causing other issues.
  • Nov 6th 15:15: The fix was promoted to production once again, as it did not appear to have caused those issues.

Note that during the same time frame another related issue was ongoing that caused the high database load. This led to the gap between notifying the engineering team and starting an in-depth investigation. See our status page for RCAs of the other, related incidents.

Prevention

Although there are alerts for high database load in place, these specific circumstances did not trigger them, as CPU load and individual query times stayed within range. The Bitmovin platform engineering team will investigate better means of identifying and alerting on such specific scenarios, both at the database and at the application level.

The Bitmovin engineering team has already added additional measures to improve the observability of Encoding Start calls. This will allow the team to remove further performance bottlenecks in the future.

Posted Nov 08, 2022 - 15:09 UTC

Resolved
Error responses have returned to a normal level.
Posted Nov 04, 2022 - 13:43 UTC
Monitoring
A fix has been deployed at around 12:00 UTC and the number of API errors has dropped significantly.
We will continue to closely monitor API error rates.
Posted Nov 04, 2022 - 12:21 UTC
Identified
The team has identified the issue and is working on a fix.
Posted Nov 04, 2022 - 09:53 UTC
Update
Bitmovin’s engineering team deployed a version with additional logging and metrics to better diagnose the issue.
As a stopgap solution, we have scaled up the instances to make this problem less likely to occur.
Posted Nov 04, 2022 - 08:13 UTC
Update
The Bitmovin engineering team is still investigating the reason for the elevated timeout issues. In the meantime, the backend nodes have been scaled up to distribute the load and minimize the number of timeouts.
Posted Nov 03, 2022 - 18:40 UTC
Investigating
We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Nov 03, 2022 - 14:10 UTC
This incident affected: Bitmovin API (Account Service, Input Service, Encoding Service, Output Service, Statistics Service, Infrastructure Service, Configuration Service, Manifest Service, Player Service).