Some Encoding Start calls - especially for encodings with many configured streams and muxings - were taking much longer than expected. When this time exceeded 60 seconds, the calls resulted in 504 Gateway Timeout responses.
The issue occurred between
All times are UTC.
When an encoding job is started, the Bitmovin API services have to gather all of the encoding's configured data and combine it into a single configuration for the whole encoding. This is done synchronously so that we can validate that the encoding setup is correct and immediately inform the customer of any misconfiguration.
This step is especially slow for encodings with many configured streams and muxings, as a lot of information for each entity has to be retrieved from the database (for example Input Streams, Muxings, Filters, Sprites, Thumbnails, etc.). Additionally, we have to perform some updates at the stream level before scheduling an encoding to ensure data consistency.
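To illustrate why per-entity retrieval scales poorly, here is a minimal sketch of the classic "1 + N queries" pattern versus a single batched query. The schema, table names, and functions are hypothetical simplifications and not the actual Bitmovin data model:

```python
import sqlite3

# Hypothetical, simplified schema -- the real Bitmovin data model differs.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE streams (id INTEGER PRIMARY KEY, encoding_id INTEGER);
    CREATE TABLE muxings (id INTEGER PRIMARY KEY, stream_id INTEGER, type TEXT);
""")
db.executemany("INSERT INTO streams VALUES (?, ?)", [(i, 1) for i in range(100)])
db.executemany("INSERT INTO muxings VALUES (?, ?, ?)",
               [(i, i % 100, "fmp4") for i in range(300)])

def gather_muxings_n_plus_one(encoding_id):
    """One query per stream: 1 + N database round trips."""
    streams = db.execute(
        "SELECT id FROM streams WHERE encoding_id = ?", (encoding_id,)
    ).fetchall()
    muxings = []
    for (stream_id,) in streams:  # N extra queries -- slow under database load
        muxings += db.execute(
            "SELECT id, type FROM muxings WHERE stream_id = ?", (stream_id,)
        ).fetchall()
    return muxings

def gather_muxings_batched(encoding_id):
    """A single join fetches the same data in one round trip."""
    return db.execute(
        "SELECT m.id, m.type FROM muxings m"
        " JOIN streams s ON m.stream_id = s.id"
        " WHERE s.encoding_id = ?", (encoding_id,)
    ).fetchall()
```

Both functions return the same rows, but the batched variant issues one query instead of 101, so its latency barely grows when the per-query cost of a loaded database rises.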
During the mentioned time frame the load on the Bitmovin API - and its database - was also higher than usual due to increased customer activity, reducing overall retrieval speed from the database. This was barely noticeable for individual queries (e.g., retrieving a Stream) but had a clear impact on query-heavy workflows like the Encoding Start call.
Together, these factors caused individual Encoding Start calls to exceed 60 seconds more frequently, triggering the timeout enforced by the Bitmovin API Gateway and producing 504 Gateway Timeout responses for customers.
Customers - especially those with many streams and muxings - received a much higher number of 504 Gateway Timeout responses than usual on Encoding Start calls, indicating that encodings might not have been started properly.
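A 504 means the gateway stopped waiting, not necessarily that the backend failed, so the encoding may in fact have been started. A hedged client-side sketch of how to handle this safely: check the encoding's status before retrying the start. The `start` and `get_status` callables stand in for real API calls and are assumptions here, not part of the Bitmovin SDK:

```python
def safe_start(start, get_status, max_attempts=3):
    """Start an encoding, treating a 504 as 'outcome unknown'.

    `start()` returns an HTTP status code; `get_status()` returns the
    encoding's current state. Both are hypothetical stand-ins for API calls.
    """
    status_code = None
    for _ in range(max_attempts):
        status_code = start()
        if status_code < 500:
            return status_code  # definite answer from the API
        # Gateway timeout / server error: the start may still have succeeded.
        if get_status() in ("QUEUED", "RUNNING", "FINISHED"):
            return 200          # already started -- do not start it twice
    return status_code
```

The key design point is idempotence from the client's perspective: blindly retrying after a 504 could start the same encoding twice, while checking status first avoids duplicate jobs.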
The Bitmovin platform engineering team identified the root cause of the elevated Gateway Timeouts on November 3rd at about 20:00 UTC and promoted a fix to production on November 4th at 12:00 UTC.
The fix primarily addressed poorly performing database queries and improved query and update behavior, reducing the average Encoding Start call time by roughly a factor of four. Additionally, the fix includes improved observability of the Encoding Start process to make further improvements possible.
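Such observability could, for instance, take the shape of per-phase timing around each step of the start process. A minimal sketch; the phase names below are hypothetical and not the actual instrumentation points:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(phase):
    """Accumulate the wall-clock duration of one phase of the start call."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = timings.get(phase, 0.0) + time.perf_counter() - t0

# Hypothetical phases of an Encoding Start call, simulated with sleeps.
with timed("load_streams"):
    time.sleep(0.01)
with timed("load_muxings"):
    time.sleep(0.01)
with timed("validate_configuration"):
    pass
```

With durations broken down per phase, the slowest step of a long-running start call can be identified directly instead of inferred from the total request time.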
After applying these fixes, the engineering teams proactively monitored encoding status, successful job completion, and encoding queue lengths across customers to confirm they were within expected ranges, and updated the support teams once the resolution was confirmed.
Note that another related issue, which caused the high database load, was ongoing during the same time frame. This explains the gap between the engineering team being notified and the start of the in-depth investigation. See our status page for the RCAs of the other, related incidents.
Although alerts for high database load are in place, these specific circumstances did not trigger them, as CPU load and individual query times were within normal ranges. The Bitmovin platform engineering team will investigate better means of identifying and alerting on such scenarios, both at the database and at the application level.
The Bitmovin Engineering team has already added measures to improve the observability of Encoding Start calls. This will allow the team to identify and remove further performance bottlenecks in the future.