Bitmovin - Increased Error Rates on Encoding API – Incident details

Increased Error Rates on Encoding API

Resolved
Partial outage 60 %
Started 18 days ago · Lasted about 5 hours

Affected

Encoding

Degraded performance from 5:07 PM to 5:39 PM, partial outage from 5:39 PM to 10:24 PM, degraded performance from 10:24 PM to 10:31 PM (aggregated across the sub-components below)

Encoding API

Degraded performance from 5:07 PM to 5:39 PM, Partial outage from 5:39 PM to 10:24 PM, Degraded performance from 10:24 PM to 10:31 PM

Encoding Scheduler

Operational from 5:07 PM to 5:39 PM, Degraded performance from 5:39 PM to 10:31 PM

Cloud Provisioning

Operational from 5:07 PM to 7:24 PM, Partial outage from 7:24 PM to 10:24 PM, Degraded performance from 10:24 PM to 10:31 PM

Amazon Web Services

Operational from 5:07 PM to 7:24 PM, Partial outage from 7:24 PM to 10:24 PM, Degraded performance from 10:24 PM to 10:31 PM

Google Cloud

Operational from 5:07 PM to 7:24 PM, Partial outage from 7:24 PM to 10:24 PM, Degraded performance from 10:24 PM to 10:31 PM

Updates
  • Postmortem

    Root Cause Analysis – Encoding Service Outage 23 Nov 2025

    Summary
    On 23 Nov 2025, the Bitmovin Encoding Platform experienced a service outage affecting our VoD and Live encoding pipelines. For the duration of the outage, encoding operations were unavailable: encodings could not be started or stopped, and status updates for running encodings were not delivered.

    Root Cause

    A bug in one of our encoding services caused memory usage and database (DB) data transfer volumes to increase slowly but steadily over the course of approximately one month.

    This resulted in:

    • Gradually escalating DB read traffic from the encoding services, as object sizes grew over time

    • Growing message queues for the affected service as its throughput decreased, eventually overwhelming it

    • Data transfer reaching ~900 MB/s outbound from the database to the service instances

    As memory usage kept rising, multiple encoding service instances ultimately crashed simultaneously, abruptly halting processing from 18:00 CET and further degrading the service as roughly 300,000 unprocessed service messages accumulated during the outage.
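    For illustration only, the following minimal Python sketch (not our actual service code; all names and numbers are invented) shows how this kind of read amplification plays out: each message re-reads the full, ever-growing state object from the DB, so the bytes read per message grow over time until throughput drops and the queue backs up.

    ```python
    # Purely illustrative sketch of the read-amplification pattern described
    # above (not the actual encoding service code). Each message triggers a
    # read of the full, ever-growing state object instead of only the fields
    # it needs, so bytes read per message grow over time, throughput drops,
    # and the message queue backs up.

    class FakeDb:
        """Stand-in for the real database: holds a state blob that only grows."""
        def __init__(self):
            self.blob = bytearray()

        def append(self, chunk: bytes) -> None:
            self.blob.extend(chunk)        # the bug: state is never pruned

        def read_full_object(self) -> bytes:
            return bytes(self.blob)        # every read transfers the whole object


    def handle_message(db: FakeDb, payload: bytes) -> int:
        db.append(payload)                 # each message adds to the stored object
        state = db.read_full_object()      # read amplification: full re-read per message
        return len(state)                  # bytes transferred for this single message


    if __name__ == "__main__":
        db = FakeDb()
        for i in range(1, 5_001):
            transferred = handle_message(db, b"x" * 1_024)   # 1 KiB of new data per message
            if i % 1_000 == 0:
                # Bytes read per message scale with the total accumulated state,
                # which is how a slow leak can turn into ~900 MB/s of DB egress.
                print(f"message {i}: {transferred / 1_048_576:.1f} MiB read for one message")
    ```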

    Impact

    • Encoding jobs failed to start after being queued

    • Running encoding jobs failed to update their status to “In Progress” or “Finished” during the incident

    Detection

    • 18:06 CET – Investigation initiated based on internal monitoring and alerts.

    • 18:07 CET – Outage confirmed and posted on status page.

    Internally, teams observed service pods crashing due to memory limit exhaustion, causing message processing interruptions and queue growth.
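
    As a minimal, hedged illustration of how such OOM-killed pods can be spotted (using the official Kubernetes Python client; the namespace and label selector below are placeholders, not our actual cluster configuration):

    ```python
    # Sketch: list pods whose containers were last terminated with reason
    # "OOMKilled", i.e. they exhausted their memory limit. Namespace and
    # label selector are illustrative placeholders only.
    from kubernetes import client, config


    def find_oom_killed_pods(namespace="encoding", label_selector="app=encoding-service"):
        config.load_kube_config()          # use load_incluster_config() inside the cluster
        v1 = client.CoreV1Api()
        oom_killed = []
        for pod in v1.list_namespaced_pod(namespace, label_selector=label_selector).items:
            for status in pod.status.container_statuses or []:
                terminated = status.last_state.terminated
                if terminated is not None and terminated.reason == "OOMKilled":
                    oom_killed.append((pod.metadata.name, status.name, terminated.finished_at))
        return oom_killed


    if __name__ == "__main__":
        for pod_name, container, finished_at in find_oom_killed_pods():
            print(f"{pod_name}/{container} was OOM-killed at {finished_at}")
    ```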

    Investigation, Mitigation & Recovery

    Following the service crashes, we initiated a controlled recovery process:

    1. 18:25 CET – Increased service memory limits on our Kubernetes cluster

    2. 19:02 CET – Reduced the parallelism of the affected service's message queue workers, cutting DB read load by about 30% (see the sketch after this list)

    3. 19:38 CET – Added further monitoring and tracing to the service to increase visibility into the causes of the massive DB data streaming

    4. 19:51 CET – Re-routed several messages on our messaging system to further decrease DB read load

    5. 20:23 CET – Sped up processing of the messages causing excessive read operations by skipping them in a new service version

    6. 20:56 CET – Deployed the new service version, which brought DB read load back down to normal levels

    7. 21:30 CET – Increased message throughput for the services so the backlog of about 350k messages in RabbitMQ could be processed and the system state restored

    8. 23:00 CET – Restored all service configuration and gradually increased encoding throughput

    9. 23:31 CET – Full service restored
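
    As a rough illustration of the kind of change made in step 2, the sketch below lowers a RabbitMQ consumer's prefetch count so each instance works on fewer messages in parallel and therefore issues fewer concurrent DB reads. It assumes a consumer built with the pika library; the queue name, host and prefetch value are illustrative, not our actual service configuration.

    ```python
    # Sketch: throttle a RabbitMQ consumer by lowering its prefetch count so
    # fewer messages are processed in parallel, reducing the DB read load
    # generated by each service instance. Names and values are placeholders.
    import pika

    QUEUE_NAME = "encoding-status-updates"     # hypothetical queue name


    def handle_message(channel, method, properties, body):
        # ... process the message, which triggers DB reads ...
        channel.basic_ack(delivery_tag=method.delivery_tag)


    def run_consumer(prefetch_count: int) -> None:
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE_NAME, durable=True)
        # prefetch_count caps the number of unacknowledged messages handed to
        # this consumer at once; lowering it throttles parallel work and, with
        # it, the concurrent DB reads issued by this instance.
        channel.basic_qos(prefetch_count=prefetch_count)
        channel.basic_consume(queue=QUEUE_NAME, on_message_callback=handle_message)
        channel.start_consuming()


    if __name__ == "__main__":
        run_consumer(prefetch_count=5)         # reduced from a higher value during mitigation
    ```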

    Next Steps and Preventive Actions

    To prevent recurrence, we are implementing the following:

    • Implement a holistic fix for the bug causing excessive DB read load

    • Expand message queue and memory usage monitoring with stricter alerts for read amplification

    • Conduct a comprehensive review of encoding service code paths related to heavy DB reads

    • Add message queue pressure safeguards, including throttling for struggling services (see the sketch below)
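
    As a minimal sketch of such a safeguard (assuming a RabbitMQ deployment with the management API enabled; the host, credentials, queue name and threshold below are placeholders, not our production setup):

    ```python
    # Sketch: poll queue depth via the RabbitMQ management API and flag when a
    # backlog threshold is exceeded, at which point intake would be throttled.
    # Host, credentials, queue name and threshold are illustrative only.
    import time

    import requests

    QUEUE_API_URL = "http://localhost:15672/api/queues/%2F/encoding-status-updates"
    AUTH = ("guest", "guest")                  # placeholder credentials
    BACKLOG_THRESHOLD = 10_000                 # messages; illustrative value


    def queue_depth() -> int:
        response = requests.get(QUEUE_API_URL, auth=AUTH, timeout=5)
        response.raise_for_status()
        return response.json()["messages"]     # ready + unacknowledged messages


    def monitor(poll_interval_s: int = 30) -> None:
        while True:
            depth = queue_depth()
            if depth > BACKLOG_THRESHOLD:
                # A real safeguard would reduce producer concurrency or pause
                # accepting new work here instead of only logging.
                print(f"backlog {depth} > {BACKLOG_THRESHOLD}: throttle intake")
            else:
                print(f"backlog {depth}: normal operation")
            time.sleep(poll_interval_s)


    if __name__ == "__main__":
        monitor()
    ```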

  • Resolved

    Our systems are stable, and we are no longer observing any errors or delays.

    A Root Cause Analysis (RCA) will follow once our internal review is completed. Thank you for your patience throughout this disruption.

  • Monitoring

    We are increasing parallel encoding slots every 10 minutes, and the queued encoding backlog that built up during the incident will be finished soon. We encourage our customers to restart any workflows that may have been stopped.

  • Update

    Our internal system state has recovered and we are slowly increasing parallel processing throughput for the queued encoding backlog.

  • Identified

    We have identified the root cause of the issue and implemented a fix. Encoding backlog processing has sped up, but it will take more time for the overall system state to fully recover.

  • Update

    Our team has likely identified the root cause of the high DB load that is causing delays in encoding status updates in our API and dashboard. We are deploying a new version of the affected service and will provide an update in a few minutes.

  • Investigating

    Our team is continuing to investigate the issue causing slow API requests, delayed encoding status updates, and delayed manifest generation (for manifests generated with the encoding start calls). Encoding output is not affected.

    At this time, we have not yet identified the root cause. Diagnostic efforts are ongoing, and all necessary teams are engaged.

    We sincerely apologize for the inconvenience and will provide updates as soon as we have more information.

    Thank you for your patience and understanding.

  • Update

    We have successfully reduced the Encoding API error rate; however, our services are still struggling to process the backlog of queued service messages.

    To help stabilise the system, we have further lowered the VOD processing limits and continue to actively work on clearing the queue and restoring full service performance.

    Webhook notifications are also lagging behind.

    Further updates will follow as we make progress.

  • Identified

    We are currently experiencing a partial outage of the Encoding API. Error rates remain elevated and queued encodings continue to accumulate.

    To stabilise the platform, we have temporarily reduced maximum processing concurrency. While this measure helps improve system stability, it also impacts throughput and leads to longer wait times for queued jobs.

    Our engineering team is actively working on restoring normal performance. We will provide further updates as we make progress.

  • Investigating
    We are currently investigating this incident.