Date: 2025-10-16
Duration: 07:49–08:30 UTC (41 minutes)
Customer Impact: Calls to the https://api.bitmovin.com/v1/encoding/encodings/{encoding_id}/muxings/ endpoints failed, leading to workflow stoppages. The failure rate remained below monitoring thresholds, so the issue was not immediately detected.
At 07:49 UTC, our encoding cleanup process (responsible for deleting old encodings that have passed their retention period) triggered a gap lock on one of our database tables. While the long-running delete transaction was active, the API was unable to insert new records into that table.
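To make the blocking mechanism concrete, here is a minimal sketch of the pattern; the table name (muxings), the column names, and the 90-day retention window are hypothetical stand-ins, not our actual schema.

```sql
-- Session 1: the cleanup job deletes expired rows in one large transaction.
-- If the optimizer picks a poor plan (e.g. a full scan instead of a narrow
-- index range), InnoDB's next-key locking under REPEATABLE READ locks the
-- scanned rows *and* the gaps between them for the life of the transaction.
START TRANSACTION;
DELETE FROM muxings
 WHERE created_at < NOW() - INTERVAL 90 DAY;
-- ... transaction still open, locks still held ...

-- Session 2: an API request tries to insert a new muxing. Because the new
-- row falls into a locked gap, the INSERT waits on the gap lock and the API
-- call fails once the lock wait (or the request itself) times out.
INSERT INTO muxings (encoding_id, type, created_at)
VALUES ('abc-123', 'FMP4', NOW());
COMMIT;
```

Until session 1 commits, inserts into the locked ranges simply wait, which is how a single long-running cleanup transaction can stall an otherwise healthy API.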
This directly impacted the muxings endpoints, which failed during this period and caused encoding workflows to stop. Because the error rate was below our monitoring alert thresholds, the incident went unnoticed until the team was alerted at 08:20 UTC.
The issue was caused by outdated MySQL table statistics, which led the MySQL query optimizer to select an inefficient execution plan for the cleanup delete. The resulting chain of effects: calls to /encoding/encodings/{encoding_id}/muxings/ fail, workflows stop, and monitoring does not alert because errors stay below the configured threshold.
After resolving the incident, we immediately stopped the cleanup process and investigated why it had suddenly started causing issues despite having run successfully for years.
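As a hedged illustration of what that investigation looks like from the optimizer's side, the sketch below checks the plan chosen for the cleanup delete and the persisted statistics it is based on; the table name (muxings) and the retention predicate are hypothetical.

```sql
-- Show the execution plan MySQL would use for the cleanup delete. With stale
-- statistics, the "rows" estimate and the chosen key can differ wildly from
-- reality (e.g. a full scan where an index range was expected).
EXPLAIN
DELETE FROM muxings
 WHERE created_at < NOW() - INTERVAL 90 DAY;

-- Inspect the persisted InnoDB statistics the estimate is based on; an n_rows
-- value far from the real row count, or a last_update far in the past, points
-- at outdated statistics.
SELECT table_name, n_rows, last_update
  FROM mysql.innodb_table_stats
 WHERE table_name = 'muxings';
```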
Since then, we have:
Updated the MySQL statistics so the query optimizer selects the correct execution plan (a minimal sketch follows below).
Reworked our alerting strategy to detect low-rate but workflow-blocking errors earlier.
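A minimal sketch of the statistics refresh, again with a hypothetical table name; ANALYZE TABLE re-samples the index statistics the optimizer relies on, after which the plan can be re-checked with EXPLAIN as shown earlier.

```sql
-- Refresh the optimizer statistics for the affected table so subsequent
-- cleanup deletes are planned against up-to-date cardinality estimates.
ANALYZE TABLE muxings;
```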