On 2025-07-10 between 00:06 UTC and 00:26 UTC, and again from 01:19 UTC to 01:39 UTC, our API experienced elevated latency and an increased rate of 503 Service Unavailable and 504 Gateway Timeout errors. The issue was detected by our monitoring systems during the first incident window, prompting an immediate investigation.
The issue was traced to a customer workflow that generated an unexpectedly high volume of traffic, causing excessive load on our API gateway. The gateway was unable to keep up with the volume of rate-limiting decisions required, leading to memory pressure on the gateway nodes. This memory pressure resulted in longer request processing times and upstream timeouts.
Once the cause was identified, we added API gateway capacity, which alleviated the memory pressure on the gateway nodes. Response times and error rates returned to normal levels as of 01:39 UTC.
The high traffic volume during the incident resulted in a significant backlog of encoding jobs. Queue times and processing throughput were affected and are gradually returning to normal levels as the system works through the accumulated backlog.
We have already adjusted our API gateway configuration to increase the number of available nodes and allocate more memory per node, so that the gateway can better handle traffic spikes. Additionally, we will implement auto-scaling capabilities over the coming weeks to further reduce the risk of similar incidents.
This incident also triggered a comprehensive review of our rate-limiting configuration. As a result of this review, we have adjusted our rate limits to better balance system protection with customer workflow requirements. To see our current API rate limits, please check the following documentation: https://developer.bitmovin.com/encoding/reference/introduction-of-api-rate-limits
As a general best practice, we recommend implementing retries with exponential backoff in workflows that depend on our API, so that occasional transient errors such as 503 and 504 responses are handled gracefully.
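To illustrate the retry recommendation, here is a minimal Python sketch of exponential backoff with random jitter. It is not an official client or a prescribed implementation. The names `call_with_backoff`, `backoff_delay`, and `send`, as well as the default attempt count and delay values, are hypothetical and chosen only for illustration. The sketch assumes `send` is any callable that performs one request and returns a `(status_code, body)` pair.

```python
import random
import time
from typing import Callable, Tuple

# Transient statuses named in this report; extend as needed for your workflow.
RETRYABLE_STATUS = {503, 504}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound in seconds for the wait after failed attempt `attempt` (0-based).

    The bound doubles with each attempt and is capped at `cap`.
    """
    return min(cap, base * (2 ** attempt))


def call_with_backoff(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,          # illustrative default, not a recommendation
    base: float = 0.5,              # illustrative default, in seconds
    cap: float = 30.0,              # illustrative default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call `send` until it returns a non-retryable status or attempts run out.

    Returns the last (status_code, body) pair received.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status, body = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE_STATUS:
            break
        # "Full jitter": wait a random time between 0 and the current bound,
        # so that many clients retrying at once do not stay synchronized.
        sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
        status, body = send()
    return status, body
```

In a workflow, `send` would wrap a single HTTP call made with whichever client library is already in use. Retrying is generally safest for idempotent requests; for requests that create resources, it is worth checking whether the earlier attempt succeeded before sending again.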
We apologize for any inconvenience this may have caused.