Root Cause Analysis – Encoding Service Outage 23 Nov 2025
Summary
On 23 Nov 2025, the Bitmovin Encoding Platform experienced a service outage affecting our VoD and Live encoding pipelines. For the duration of the outage, encoding operations were unavailable: encodings could not be started or stopped, and status updates for running encodings were not delivered.
Root Cause
A bug in one of our encoding services caused memory usage and database (DB) data transfer volumes to increase slowly but steadily over the course of approximately one month.
This resulted in:
Gradually escalating DB read traffic from the encoding services, as the objects being read grew over time
Growing message queues for the affected service, as its throughput decreased and it could no longer keep up
Outbound data transfer from the database to the service instances reaching ~900 MB/s
As memory usage kept rising, multiple encoding service instances ultimately crashed simultaneously. Processing stopped abruptly, and roughly 300,000 unprocessed service messages accumulated from 18:00 CET onward, further degrading the service.
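The failure mode can be sketched with simple arithmetic. This is an illustrative assumption, not the actual service code: we assume each queue message triggers a full read of a service object from the DB, so that total DB egress scales with object size times message rate. The figures below are made up, except for the ~900 MB/s peak taken from the incident data.

```python
def db_read_rate_mb_s(object_size_mb: float, messages_per_s: float) -> float:
    """Outbound DB transfer caused by re-reading the full object per message.

    Hypothetical model: if a bug lets the object grow steadily, the read
    volume per message grows with it, even at a constant message rate.
    """
    return object_size_mb * messages_per_s

# Illustrative: a 3 MB object read 100x per second -> 300 MB/s.
# After unbounded growth to ~9 MB at the same rate -> ~900 MB/s,
# the peak outbound transfer observed during the incident.
assert db_read_rate_mb_s(3.0, 100.0) == 300.0
assert db_read_rate_mb_s(9.0, 100.0) == 900.0
```

The point of the sketch: read load escalates even though the message rate never changes, which is why the growth went unnoticed for about a month.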
Impact
Detection
Internally, teams observed service pods crashing due to memory limit exhaustion, which interrupted message processing and caused queues to grow.
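A crash of this kind is foreshadowed by steady memory growth toward the pod limit. The following is a minimal sketch, with an assumed helper name and made-up numbers, of the kind of early-warning extrapolation the expanded monitoring (see Next Steps) targets:

```python
def minutes_until_oom(memory_mb: float, limit_mb: float,
                      growth_mb_per_min: float) -> float:
    """Linearly extrapolate the time until a pod hits its memory limit.

    A sketch under assumed linear growth; real memory curves need a
    fitted trend, but the alerting idea is the same.
    """
    if growth_mb_per_min <= 0:
        return float("inf")  # not growing: no predicted OOM
    return (limit_mb - memory_mb) / growth_mb_per_min

# Illustrative: 3000 MB used of a 4000 MB limit, growing 10 MB/min
# -> about 100 minutes until the pod is OOM-killed.
assert minutes_until_oom(3000.0, 4000.0, 10.0) == 100.0
```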
Investigation, Mitigation & Recovery
After the service crashes, we initiated a controlled recovery process:
Increased service memory limits in our Kubernetes cluster (18:25 CET)
Reduced the number of parallel message queue workers for the affected service, cutting DB read load by about 30% (19:02 CET)
Added further monitoring and tracing to the service to gain visibility into the cause of the massive DB data streaming (19:38 CET)
Re-routed several messages in our messaging system to further decrease DB read load (19:51 CET)
Released a new service version that skips the messages causing excessive read operations, speeding up their processing (20:23 CET)
Deployed the new service, which brought DB read load back to normal levels (20:56 CET)
Increased message throughput for the services to restore system state by processing the backlog of about 350k messages in RabbitMQ (21:30 CET)
Restored all service configuration and gradually increased encoding throughput (23:00 CET)
Full service restored (23:31 CET)
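The backlog-drain step in the timeline can be sanity-checked with a short sketch. The consume and publish rates below are assumptions for illustration; only the ~350k backlog and the roughly two-hour window from 21:30 to 23:31 CET come from the incident.

```python
def drain_time_minutes(backlog: int, consume_rate: float,
                       publish_rate: float) -> float:
    """Minutes to empty a queue when consumers outpace publishers.

    Only the net rate (consume minus publish) drains the backlog,
    which is why throughput had to be increased deliberately.
    """
    net = consume_rate - publish_rate
    if net <= 0:
        raise ValueError("backlog never drains unless consumers outpace publishers")
    return backlog / net / 60.0

# Illustrative: a 350k backlog, consumers boosted to 60 msg/s against
# 10 msg/s of incoming traffic -> roughly two hours, consistent with
# the 21:30 -> 23:31 CET recovery window.
assert round(drain_time_minutes(350_000, 60.0, 10.0)) == 117
```

The design point is that raising consumer throughput only helps while the root cause is fixed; otherwise the publish rate rises with it and the net rate stays near zero.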
Next Steps and Preventive Actions
To prevent recurrence, we are implementing the following:
Implement a holistic fix for the bug causing excessive DB read load
Expand message queue and memory usage monitoring with stricter alerts for read amplification
Comprehensively review encoding service code paths that perform heavy DB reads
Add message queue pressure safeguards, including throttling for struggling services
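To make the last safeguard concrete, here is a minimal sketch of a pressure-based throttle. All function names and thresholds are assumptions, not our production code: the idea is to scale worker concurrency down when queue depth or memory use crosses a limit, rather than letting pods run into their memory limits and crash.

```python
def safe_worker_count(current_workers: int,
                      queue_depth: int,
                      memory_used_ratio: float,
                      queue_limit: int = 100_000,
                      memory_limit_ratio: float = 0.85) -> int:
    """Halve concurrency under pressure, keeping at least one worker.

    Illustrative policy: either signal (deep queue or high memory use)
    triggers throttling; a healthy service keeps its current workers.
    """
    if queue_depth > queue_limit or memory_used_ratio > memory_limit_ratio:
        return max(1, current_workers // 2)
    return current_workers

assert safe_worker_count(8, 300_000, 0.50) == 4  # queue pressure -> throttle
assert safe_worker_count(8, 10_000, 0.90) == 4   # memory pressure -> throttle
assert safe_worker_count(8, 10_000, 0.50) == 8   # healthy -> unchanged
```

Throttling a struggling consumer looks counterintuitive, but a slower service that stays up drains its backlog faster than one that crashes and restarts repeatedly.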