Bitmovin - Elevated API Error Rate on Muxing creation – Incident details

Elevated API Error Rate on Muxing creation

Resolved
Major outage
Started about 2 months ago · Lasted 40 minutes

Affected

Encoding

Major outage from 5:50 AM to 6:30 AM

Encoding API

Major outage from 5:50 AM to 6:30 AM

Updates
  • Postmortem

    Incident Post-Mortem – Encoding Cleanup Gap-Lock

    Date: 2025-10-16
    Duration: 07:49–08:30 UTC (41 minutes)
    Customer Impact: Calls to the https://api.bitmovin.com/v1/encoding/encodings/{encoding_id}/muxings/ endpoints failed, leading to workflow stoppages. The failure rate remained below monitoring thresholds, so the issue was not immediately detected.

    Summary

    At 07:49 UTC, our encoding cleanup process (responsible for deleting encodings that have passed their retention period) triggered gap-locks on one of our database tables. While the long-running delete transaction was active, the API was unable to insert new records into that table.

    This directly impacted the muxings endpoints, which failed during this period and caused encoding workflows to stop. Because the error rate was below our monitoring alert thresholds, the incident went unnoticed until the team was alerted at 08:20 UTC.

    Root Cause

    The issue was caused by outdated MySQL table statistics, which led the MySQL query optimizer to select an inefficient execution plan:

    • Instead of using an index scan on the large table, the query optimizer chose a full table scan.

    • Under REPEATABLE READ isolation (MySQL's default), this resulted in gap-locks across wide portions of the table.

    • In practice, this behaved like a table-level lock, blocking all inserts until the cleanup query was stopped (see the sketch after this list).
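
    The behaviour can be sketched with two database connections: one holding the long-running cleanup DELETE open in a REPEATABLE READ transaction, the other attempting the INSERT that a muxing-creation call would issue. The table name (encodings_muxings), column names, retention window, credentials, and the PyMySQL driver below are illustrative assumptions, not details taken from our production schema:

      # Minimal reproduction sketch of the gap-lock behaviour described above.
      # Assumptions (not confirmed by the post-mortem): table and column names,
      # the 90-day retention window, credentials, and the PyMySQL driver.
      import pymysql

      def connect():
          return pymysql.connect(host="127.0.0.1", user="root", password="secret",
                                 database="encoding", autocommit=False)

      cleanup = connect()
      api = connect()

      with cleanup.cursor() as cur:
          # REPEATABLE READ is MySQL's default; stated explicitly for clarity.
          cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")
          cur.execute("BEGIN")
          # With stale statistics the optimizer may pick a full table scan here,
          # taking next-key/gap locks across effectively the whole scanned range.
          cur.execute("DELETE FROM encodings_muxings "
                      "WHERE created_at < NOW() - INTERVAL 90 DAY")
          # Transaction is intentionally left open to simulate the long-running cleanup.

      with api.cursor() as cur:
          # This INSERT models the muxings endpoint writing a new row. While the
          # cleanup transaction above is open, it blocks on the gap-locks and
          # eventually fails with "Lock wait timeout exceeded" (MySQL error 1205).
          try:
              cur.execute("INSERT INTO encodings_muxings (encoding_id, created_at) "
                          "VALUES (%s, NOW())", ("enc-123",))
              api.commit()
          except pymysql.err.OperationalError as exc:
              print("insert blocked by cleanup gap-locks:", exc)

      cleanup.rollback()  # ending the cleanup transaction releases the locks

    While the first transaction is open, the second connection's INSERT waits on the gap-locks and eventually times out, which mirrors the insert failures observed on the muxings endpoints during the incident.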

    Timeline

    • 07:49 UTC – Encoding cleanup process starts. Optimizer chooses full table scan → gap-locks prevent inserts.

    • 07:49–08:20 UTC – Calls to /encoding/encodings/{encoding_id}/muxings/ fail, workflows stop. Monitoring does not alert as errors are below configured threshold.

    • 08:20 UTC – Engineering team is alerted.

    • 08:20–08:30 UTC – Cleanup process identified as root cause. Process is stopped, lock released.

    • 08:30 UTC – Incident resolved, API resumes normal operation.

    Immediate Actions

    To resolve the incident, we immediately stopped the cleanup process and then investigated why it suddenly caused issues despite having run successfully for years.

    Since then, we have:

    • Tightened timeouts for deletion commands in the cleanup process.

    • Changed to a more forgiving isolation level to prevent broad blocking locks.

    • Adjusted the cleanup process to use smaller batch sizes and higher concurrency, which reduced database impact and increased throughput (see the sketch after this list).

    • Added significantly more monitoring to the cleanup process to better track query performance and catch anomalies early.
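
    As a rough, combined illustration of the timeout, isolation-level, batching, and monitoring changes above, the sketch below deletes expired rows in small chunks, commits each chunk in its own short transaction with a low lock wait timeout, and logs the duration of every batch. READ COMMITTED, the batch size, the timeout value, the retention window, and the table name are assumptions for illustration, and the higher-concurrency aspect is omitted for brevity:

      # Sketch of a batched cleanup loop under a less strict isolation level.
      # Assumed details: table/column names, batch size, timeout, READ COMMITTED,
      # and the PyMySQL driver.
      import logging
      import time

      import pymysql

      logging.basicConfig(level=logging.INFO)
      log = logging.getLogger("encoding-cleanup")

      BATCH_SIZE = 1_000          # small batches keep each transaction (and its locks) short
      LOCK_WAIT_TIMEOUT_S = 5     # assumed value: give up quickly instead of blocking others

      def cleanup_expired(conn):
          with conn.cursor() as cur:
              # Assumed change: READ COMMITTED avoids the broad gap-locks taken
              # under REPEATABLE READ.
              cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")
              cur.execute("SET SESSION innodb_lock_wait_timeout = %s",
                          (LOCK_WAIT_TIMEOUT_S,))
              while True:
                  started = time.monotonic()
                  # LIMIT bounds the rows (and locks) touched per transaction.
                  cur.execute(
                      "DELETE FROM encodings_muxings "
                      "WHERE created_at < NOW() - INTERVAL 90 DAY LIMIT %s",
                      (BATCH_SIZE,),
                  )
                  conn.commit()  # commit each batch so its locks are released immediately
                  log.info("deleted %d rows in %.2fs",
                           cur.rowcount, time.monotonic() - started)
                  if cur.rowcount < BATCH_SIZE:
                      break      # nothing (or only a partial batch) left to delete

      conn = pymysql.connect(host="127.0.0.1", user="cleanup", password="secret",
                             database="encoding", autocommit=False)
      cleanup_expired(conn)
      conn.close()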

    Next Steps

    • Update MySQL statistics so the query optimizer selects the correct execution plan (see the sketch after this list).

      • Expected completion: by end of this week (2025-10-24)

    • Rework alerting strategy to detect low-rate but workflow-blocking errors earlier.

      • Expected completion: within 3 weeks (by 2025-11-06)
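
    For the statistics update, the intended outcome is that EXPLAIN shows the cleanup DELETE using an index range scan instead of a full table scan (access type ALL). A minimal verification sketch, again assuming the illustrative encodings_muxings table and the PyMySQL driver:

      # Sketch: refresh table statistics and verify the resulting execution plan.
      # Table name, retention window, credentials, and driver are assumptions.
      import pymysql

      conn = pymysql.connect(host="127.0.0.1", user="admin", password="secret",
                             database="encoding", autocommit=True)
      with conn.cursor() as cur:
          # ANALYZE TABLE recomputes the index cardinality estimates the optimizer uses.
          cur.execute("ANALYZE TABLE encodings_muxings")
          print(cur.fetchall())

          # The plan should now show an index/range access type rather than ALL
          # (full table scan) for the cleanup DELETE.
          cur.execute(
              "EXPLAIN DELETE FROM encodings_muxings "
              "WHERE created_at < NOW() - INTERVAL 90 DAY"
          )
          for row in cur.fetchall():
              print(row)
      conn.close()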

  • Resolved

    We experienced elevated API error rates on the encoding endpoints between 07:50 and 08:30 UTC. The issue, which affected muxing creation in the encoding service, was identified and promptly resolved. The engineering team is preparing a post-mortem and will share it here once available.

  • Investigating
    We are currently investigating this incident.