Manifest processing remains in queued state

Incident Report for Bitmovin Inc

Postmortem

Incident Summary

On Saturday, September 27, 2025, between 11:27 and 15:20 UTC, Bitmovin experienced a disruption in the Manifest Generation service. During this window, manifest jobs initiated via the separate “start manifest generation” call remained stuck in the QUEUED state and were not processed further, resulting in delayed or unavailable manifest generation for affected workflows until service recovery.
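
For reference, the affected path is the dedicated manifest start call. Below is a minimal sketch of that flow in Python using the requests library; the DASH endpoint paths, response envelope, and MANIFEST_ID placeholder follow Bitmovin's public REST conventions but are shown here as illustrative assumptions, so please consult the API reference for authoritative routes and schemas.

    import os
    import time

    import requests

    API_KEY = os.environ["BITMOVIN_API_KEY"]   # your Bitmovin API key
    MANIFEST_ID = "my-dash-manifest-id"        # illustrative placeholder
    BASE = "https://api.bitmovin.com/v1/encoding/manifests/dash"
    HEADERS = {"X-Api-Key": API_KEY}

    # Trigger generation via the separate "start manifest generation" call,
    # the path affected during this incident.
    requests.post(f"{BASE}/{MANIFEST_ID}/start", headers=HEADERS).raise_for_status()

    # Poll the manifest status. During the incident window, affected manifests
    # never left QUEUED; under normal operation they advance to RUNNING and
    # then FINISHED.
    while True:
        resp = requests.get(f"{BASE}/{MANIFEST_ID}/status", headers=HEADERS)
        resp.raise_for_status()
        state = resp.json()["data"]["result"]["status"]
        print(state)
        if state in ("FINISHED", "ERROR"):
            break
        time.sleep(10)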

Timeline (UTC)

11:36 - Detection: Initial report received; incident triaged.

12:15 - Investigation: Noted high message backlog in the queue; immediate deep-dive initiated.

14:38 - Verification: Confirmed that manifest creation triggered via a manifest /start endpoint was not processing (stuck in QUEUED).

14:50 - Mitigation: Identified a bug in the messaging cluster; began remediation - manually resized the cluster and restarted nodes to re-sync.

15:20 - Recovery: Service returned to operational status; new manifest generation jobs were processing normally. The team began restarting the queued manifests that were affected.

15:57 - Restoration complete: All remaining QUEUED manifests were restarted and finished successfully.

Detailed Root Cause

After the planned database maintenance actions were completed, the system started to recover and to process its queued requests again.

The recovery process and the load during that time triggered unexpected behaviour in the messaging system, which led to improper event message propagation. As a result, post-encoding manifest creation jobs were not handled properly.
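
To make the failure mode concrete, here is a minimal, hypothetical sketch of an event-driven trigger chain of the kind described above. The queue, message shape, and handler are illustrative assumptions, not Bitmovin internals: a manifest job advances out of QUEUED only when its start event is consumed, so if event propagation fails, the job stays QUEUED indefinitely.

    import json
    import queue

    events = queue.Queue()                # stand-in for the messaging cluster
    jobs = {"manifest-123": "QUEUED"}     # manifest jobs start in QUEUED

    def publish_start_event(manifest_id):
        # Under the incident conditions, propagation effectively failed at this
        # point: events were not delivered properly, so the handler never ran.
        events.put(json.dumps({"type": "MANIFEST_START", "id": manifest_id}))

    def consume_one():
        msg = json.loads(events.get())
        if msg["type"] == "MANIFEST_START":
            jobs[msg["id"]] = "RUNNING"   # the transition that never happened

    publish_start_event("manifest-123")
    consume_one()
    print(jobs)  # {'manifest-123': 'RUNNING'}; during the incident: stuck at QUEUED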

Incident Response and Recovery

The incident response team was monitoring the system load and behaviour after a planned migration and maintenance of the core database.

Manifests remaining in the QUEUED state were observed, and the team started investigating further.

A failure in the messaging component was identified as the core problem blocking manifest generation, and the team began mitigation:

  • Implementing a fix for the affected component and confirming proper operation
  • Identifying the stuck manifest jobs

In parallel, the team initiated manual steps to recover queued manifest generations by implementing and running the necessary automation scripts.

Immediate Corrective Actions

  • Resized the cluster and nodes of the messaging system
  • Initiated re-synchronisation of messages
  • Manually re-processed the manifest generations remaining in the QUEUED state (see the sketch below)
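
Below is a minimal sketch of such a re-processing pass, assuming an operator-side script built on the public REST API. The list and start routes follow Bitmovin's documented URL scheme for DASH manifests, but the status field on list items, the filtering logic, and the pagination details are illustrative assumptions.

    import os

    import requests

    API_KEY = os.environ["BITMOVIN_API_KEY"]
    BASE = "https://api.bitmovin.com/v1/encoding/manifests/dash"
    HEADERS = {"X-Api-Key": API_KEY}

    # Page through existing manifests and collect those still QUEUED.
    stuck, offset = [], 0
    while True:
        page = requests.get(BASE, headers=HEADERS,
                            params={"offset": offset, "limit": 100})
        page.raise_for_status()
        items = page.json()["data"]["result"].get("items", [])
        if not items:
            break
        # Assumption for illustration: each list item reports its status.
        stuck += [m["id"] for m in items if m.get("status") == "QUEUED"]
        offset += len(items)

    # Re-trigger generation for each stuck manifest.
    for manifest_id in stuck:
        requests.post(f"{BASE}/{manifest_id}/start", headers=HEADERS).raise_for_status()
        print(f"restarted {manifest_id}")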

Impact and Improvements

All manifests that were created independently (i.e., not as part of an encoding workflow) during the timeframe 2025-09-27 11:27 UTC to 2025-09-27 15:20 UTC were affected.

Impact Description

These manifests did not progress beyond the QUEUED state when they were triggered using the dedicated “Manifest Start” API endpoints. As a result, such manifests remained blocked in the manifest generation queue and were not processed further until the issue was resolved.

Manifests created by providing their configuration directly as part of the “Encoding /start” API call were not affected.
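
The two trigger paths can be contrasted as follows. The request shapes are simplified, and the vodDashManifests field is an assumption based on Bitmovin's documented start-encoding request; refer to the API reference for the authoritative schema.

    import os

    import requests

    API_KEY = os.environ["BITMOVIN_API_KEY"]
    HEADERS = {"X-Api-Key": API_KEY}
    API = "https://api.bitmovin.com/v1/encoding"

    # Affected path: the dedicated "Manifest Start" endpoint, called
    # separately after the encoding has finished.
    requests.post(f"{API}/manifests/dash/MY-MANIFEST-ID/start", headers=HEADERS)

    # Unaffected path: manifest generation requested inline as part of the
    # "Encoding /start" call itself.
    start_request = {"vodDashManifests": [{"manifestId": "MY-MANIFEST-ID"}]}
    requests.post(f"{API}/encodings/MY-ENCODING-ID/start",
                  headers=HEADERS, json=start_request)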

Further Improvements:

  • The messaging system component will be reviewed against the specific high-load situation that can follow such maintenance events, and alternative options better suited to these events will be considered.

    • By 15 October, we will have reviewed our messaging system configuration under high-load situations, potentially including consultation with external advisors.
    • By 31 October, we will have implemented configuration improvements to prevent future message loss in high-load scenarios (one hypothetical illustration is sketched below).
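
This report does not name the messaging technology, so as a purely hypothetical example: if the cluster were Kafka-based, durability-oriented producer settings that trade throughput for delivery guarantees would look like the following (standard librdkafka configuration keys via the confluent-kafka Python client).

    from confluent_kafka import Producer

    # Hypothetical illustration only; broker addresses are placeholders.
    producer = Producer({
        "bootstrap.servers": "broker-1:9092,broker-2:9092",
        "acks": "all",                  # wait for all in-sync replicas
        "enable.idempotence": True,     # avoid duplicates on retry
        "delivery.timeout.ms": 120000,  # bound the total time per message
    })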

We apologise for the inconvenience caused and appreciate your understanding as we continue to improve our service reliability.

Posted Oct 02, 2025 - 10:03 UTC

Resolved

The issue affecting manifest processing has been resolved, and full functionality has been restored.
Our systems are stable, and we are no longer observing any errors or delays.
A Root Cause Analysis (RCA) will follow once our internal review is completed. Thank you for your patience throughout this disruption.
Posted Sep 27, 2025 - 17:04 UTC

Monitoring

All queued manifests have been rescheduled and generated successfully.
Posted Sep 27, 2025 - 15:58 UTC

Identified

We have identified the root cause of some manifests stalling in the QUEUED state and are working on resolving the issue. New manifests should already be generating properly.
Posted Sep 27, 2025 - 14:51 UTC

Investigating

We are currently investigating an issue where certain manifest creation requests remain stuck in the QUEUED state. Further updates will be provided here as soon as they become available.
Posted Sep 27, 2025 - 11:36 UTC
This incident affected: Bitmovin API (Manifest Service).