Elevated 504 Gateway Timeout Errors (25.03.2025 – 27.03.2025)

Incident Report for Bitmovin Inc

Postmortem

Summary

Between 25 March 2025, 11:36 UTC and 27 March 2025, 11:12 UTC, our API experienced a slightly elevated rate of 504 Gateway Timeout errors. The issue was reported by customers at 10:30 UTC on 27 March, prompting an investigation.

Impact

  • A small number of API requests (~0.02%) resulted in 504 errors.
  • These errors originated from our muxing service, which typically operates at a 0% error rate.
  • The elevated rate was low enough that it did not trigger our automated monitoring or alerting systems.

Root Cause

The issue was traced to a DELETE query that became stuck in the preparing state. This led to lock contention in the database, occasionally blocking other queries and resulting in timeouts.

Resolution

Once we identified the problematic query, we terminated it, which cleared the lock contention. Error rates returned to normal levels as of 11:12 UTC on 27 March.

Next Steps

  • Review and improve monitoring and alerting thresholds so that low-frequency but persistent errors trigger a timely investigation.
  • Investigate why the DELETE query was stuck in the preparing state and why it caused lock contention affecting unrelated queries.
  • Review and optimize database operations to minimize the risk of similar issues.
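
To illustrate the first item, here is a minimal sketch of a check that flags an error rate which is small but stays above a threshold for many consecutive intervals. It is not our actual monitoring configuration: the class name, the 0.01% threshold, and the six-interval window are illustrative assumptions.

```python
from collections import deque


class PersistentErrorCheck:
    """Flags an error rate that stays above a small threshold for N intervals in a row.

    All defaults are illustrative assumptions, not production values.
    """

    def __init__(self, rate_threshold: float = 0.0001, min_intervals: int = 6):
        self.rate_threshold = rate_threshold
        # Remember only the most recent `min_intervals` observations.
        self.recent = deque(maxlen=min_intervals)

    def observe(self, requests: int, errors: int) -> bool:
        """Record one interval; return True if the alert condition is met."""
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.rate_threshold)
        # Alert only when the window is full and every interval exceeded the threshold.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

A check of this shape trades detection speed for sensitivity: a single noisy interval does not alert, but a rate of roughly 0.02% sustained across the whole window does.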

Recommendations for Customers

As a general best practice, we recommend implementing retries with exponential backoff in workflows that depend on our API, so that occasional transient errors such as 504 timeouts are handled gracefully.
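
The retry recommendation above could be sketched as follows. This is a minimal illustration, not an official client: `send` is a hypothetical zero-argument callable that wraps your HTTP request and returns its status code, and the set of retryable statuses, the attempt limit, and the delay values are assumptions to adjust for your workflow.

```python
import random
import time
from typing import Callable

# Assumption: gateway-style errors that are usually transient.
RETRYABLE_STATUS = {502, 503, 504}


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    Returns the last status code received.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUS:
            break
        # The delay ceiling doubles on each retry (0.5s, 1s, 2s, ...), capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        # Full jitter: a random wait up to the ceiling spreads out simultaneous retries.
        sleep(random.uniform(0, ceiling))
        status = send()
    return status
```

Retrying is safest for idempotent requests. For requests that create resources, a retry after a timeout may repeat an operation that already succeeded, so it is worth checking the resulting state before resending.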

We apologize for any inconvenience this may have caused and appreciate the customers who reported the issue.

Posted Mar 27, 2025 - 16:08 UTC

Resolved

Between 25 March 2025, 11:36 UTC and 27 March 2025, 11:12 UTC, our API experienced a slightly elevated rate of 504 Gateway Timeout errors.

Impact

  • A small number of API requests (~0.02%) resulted in 504 errors.
  • These errors originated from our muxing service, which typically operates at a 0% error rate.
  • The elevated rate was low enough that it did not trigger our automated monitoring or alerting systems.

Posted Mar 25, 2025 - 10:30 UTC