On Saturday, September 27, 2025, between 11:27 and 15:20 UTC, Bitmovin experienced a disruption in the Manifest Generation service. During this window, manifest generation jobs initiated via the dedicated “start manifest generation” call remained stuck in the QUEUED state and were not processed further, resulting in delayed or unavailable manifests for affected workflows until service recovery.
11:36 - Detection: Initial report received; incident triaged.
12:15 - Investigation: Noted high message backlog in the queue; immediate deep-dive initiated.
14:38 - Verification: Confirmed that manifest generation triggered via a manifest /start endpoint was not processing (jobs stuck in QUEUED).
14:50 - Mitigation: Identified a bug in the messaging cluster and began remediation: manually resized the cluster and restarted nodes to re-sync.
15:20 - Recovery: Service returned to operational status; new manifest generation jobs were processing normally. The team began restarting the queued manifests that were affected.
15:57 - Restoration complete: All remaining QUEUED manifests were restarted and finished successfully.
After the planned database maintenance was completed, the system began to recover and to process its queued requests again.
The recovery process and the load during that time triggered unexpected behaviour in the messaging system, which led to improper propagation of event messages. As a result, post-encoding manifest creation jobs were not handled properly.
The incident response team was monitoring the system load and behaviour after a planned migration and maintenance of the core database.
Manifests remaining in the QUEUED state were observed, and the team started investigating further.
A failure in the messaging component was identified as the core problem blocking manifest generation, and the team started the mitigation: the messaging cluster was manually resized and its nodes restarted to re-sync.
In parallel, the team took manual steps to recover the queued manifest generations by implementing and running the necessary scripts and automation, as sketched below.
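To give a sense of what such recovery automation can look like, the following is a minimal, hypothetical sketch from the API consumer's perspective: it polls a manifest's status and re-issues the dedicated start call if the manifest is still stuck in QUEUED. The endpoint paths, the X-Api-Key header, and the data.result.status response envelope reflect the public Bitmovin REST API as we understand it; the restart_if_stuck helper, its timing parameters, and the assumption that re-issuing /start is safe for a stuck manifest are illustrative, not the actual internal tooling used during the incident.

    import time
    import requests

    API_BASE = "https://api.bitmovin.com/v1"
    HEADERS = {"X-Api-Key": "<YOUR_API_KEY>"}  # Bitmovin API authentication header

    def manifest_status(manifest_id: str) -> str:
        # Poll the manifest's task status; expected values include
        # CREATED, QUEUED, RUNNING, FINISHED, ERROR.
        resp = requests.get(
            f"{API_BASE}/encoding/manifests/dash/{manifest_id}/status",
            headers=HEADERS,
        )
        resp.raise_for_status()
        return resp.json()["data"]["result"]["status"]

    def restart_if_stuck(manifest_id: str, stuck_after_s: int = 600) -> None:
        # Hypothetical recovery helper: if the manifest is still QUEUED
        # after `stuck_after_s` seconds, re-issue the start call once.
        deadline = time.time() + stuck_after_s
        while time.time() < deadline:
            if manifest_status(manifest_id) != "QUEUED":
                return  # the job progressed on its own
            time.sleep(30)
        requests.post(
            f"{API_BASE}/encoding/manifests/dash/{manifest_id}/start",
            headers=HEADERS,
        ).raise_for_status()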
All manifests that were created independently (i.e., not as part of an encoding workflow) during the timeframe 2025-09-27 11:27 UTC to 2025-09-27 15:20 UTC were affected.
These manifests were triggered using the dedicated “Manifest Start” API endpoints and did not progress beyond the QUEUED state; they remained blocked in the manifest generation queue and were not processed further until the issue was resolved.
Manifests that are created by providing their configuration directly as part of the “Encoding /start” API call were not affected.
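For illustration, the two invocation paths differ as in the sketch below. This is a simplified example using plain HTTP calls; the endpoint paths and the vodDashManifests field of the encoding start request follow the public Bitmovin REST API as we understand it and should be verified against the current API reference.

    import requests

    API_BASE = "https://api.bitmovin.com/v1"
    HEADERS = {"X-Api-Key": "<YOUR_API_KEY>"}  # Bitmovin API authentication header

    def start_manifest_standalone(manifest_id: str) -> None:
        # Affected path: start a previously configured DASH manifest via the
        # dedicated "Manifest Start" endpoint. Jobs triggered this way were
        # stuck in QUEUED during the incident window.
        resp = requests.post(
            f"{API_BASE}/encoding/manifests/dash/{manifest_id}/start",
            headers=HEADERS,
        )
        resp.raise_for_status()

    def start_encoding_with_manifest(encoding_id: str, manifest_id: str) -> None:
        # Unaffected path: reference the manifest in the "Encoding /start"
        # request so it is generated as part of the encoding workflow itself.
        body = {"vodDashManifests": [{"manifestId": manifest_id}]}
        resp = requests.post(
            f"{API_BASE}/encoding/encodings/{encoding_id}/start",
            headers=HEADERS,
            json=body,
        )
        resp.raise_for_status()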
The messaging system component will be reviewed with regard to the high-load conditions that arise during maintenance events of this kind, and alternative options better suited to such events will be evaluated.
We apologise for the inconvenience caused and appreciate your understanding as we continue to improve our service reliability.