Increased queue times in Encoding Service

Incident Report for Bitmovin Inc

Postmortem

Root Cause Analysis – Encoding Queue Slowdown Incident

Date: Tuesday, October 21, 2025
Impact Window: 14:40 – 15:48 UTC
Status: Resolved

Summary

On October 21, 2025, our encoding service experienced increased queue times due to an external connectivity issue in Microsoft Azure. The issue affected our ability to establish connections to encoding machines in one of Azure’s regions. While all encoding jobs were ultimately processed successfully, some customers experienced longer-than-usual queue delays.

Impact

  • Encoding jobs were delayed between 14:40 and 15:48 UTC.
  • The encoding system continued to process jobs, but overall throughput was reduced.
  • All queued jobs were eventually completed successfully, though some customers experienced significant queue wait times.

Root Cause

A Microsoft Azure network connectivity incident caused our scheduler service to encounter long connection timeouts when attempting to reach instances in one affected Azure region.

Because these timeouts were lengthy, a significant portion of our scheduling system’s concurrent worker processes became occupied waiting for failed connections. This led to a temporary reduction in the scheduler’s effective job processing capacity, resulting in longer queue times across the system.
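
To make this failure mode concrete, the sketch below shows how a small, fixed pool of connection workers can be exhausted by dials that hang. It is illustrative Go only, not our scheduler code: the pool size, the target addresses (10.255.255.1 is typically unroutable and simulates a hanging connection), and the use of net.Dial without a deadline are all assumptions chosen to reproduce the behavior described above.

    // Illustrative sketch only: a fixed pool of "worker slots" that each dial
    // an encoding instance. Dials to the unreachable address hang until the OS
    // gives up (often more than a minute), so those slots stay occupied and the
    // healthy targets queue behind them, reducing effective capacity.
    package main

    import (
        "fmt"
        "net"
        "sync"
        "time"
    )

    func main() {
        const workers = 2 // hypothetical concurrency limit of the scheduler

        targets := []string{
            "10.255.255.1:443", // affected region: dial usually blackholes and hangs
            "10.255.255.1:443",
            "example.com:443", // healthy region
            "example.com:443",
        }

        var wg sync.WaitGroup
        sem := make(chan struct{}, workers) // worker slots

        start := time.Now()
        for _, addr := range targets {
            wg.Add(1)
            sem <- struct{}{} // blocks here once both slots are held by hanging dials
            go func(addr string) {
                defer wg.Done()
                defer func() { <-sem }() // release the slot when the dial finishes

                // No dial timeout: a hanging connection ties up this slot for
                // as long as the OS-level timeout allows.
                conn, err := net.Dial("tcp", addr)
                if err != nil {
                    fmt.Printf("%s failed after %v: %v\n", addr, time.Since(start), err)
                    return
                }
                conn.Close()
                fmt.Printf("%s connected after %v\n", addr, time.Since(start))
            }(addr)
        }
        wg.Wait()
    }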

Timeline (all times in UTC)

  • 14:40 – Engineering team was alerted to increased queue times in the encoding service.
  • 14:50 – The issue was identified as connection timeouts to Azure instances. The team scaled up the scheduling system to improve throughput and reduce the backlog.
  • 15:48 – A fix was deployed that mitigated the timeout impact, and queue processing returned to normal speeds.

Resolution

The applied fix reduced the impact of the long Azure connection timeouts, restoring the scheduler’s job processing capacity. Once this was in place, the backlog quickly cleared, and encoding performance returned to normal levels.
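
As an illustration of the general shape of such a mitigation (not the deployed fix), the sketch below caps each connection attempt with a context deadline so that a worker hitting an unreachable instance fails fast and returns to the pool. The dialWithDeadline helper, the 3-second cap, and the test address are hypothetical.

    // Illustrative sketch: bound each connection attempt so a worker that hits
    // an unreachable instance gives up quickly instead of blocking for minutes.
    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    // dialWithDeadline is a hypothetical helper that caps the connection attempt
    // at `timeout`, regardless of how long the TCP handshake would otherwise hang.
    func dialWithDeadline(addr string, timeout time.Duration) (net.Conn, error) {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()
        var d net.Dialer
        return d.DialContext(ctx, "tcp", addr)
    }

    func main() {
        start := time.Now()
        // 10.255.255.1 is typically unroutable, simulating the affected region.
        conn, err := dialWithDeadline("10.255.255.1:443", 3*time.Second)
        if err == nil {
            conn.Close()
        }
        fmt.Printf("attempt finished after %v (err: %v)\n", time.Since(start), err)
    }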

Next Steps

  • We are evaluating the timeout handling logic in the scheduling service and reviewing the retry and back-off policies that allowed the long connection timeouts to tie up scheduler capacity (a generic sketch follows this list).
  • Our Q4 roadmap already included a plan to separate scheduling services by cloud vendor, so that connectivity issues with one provider do not impact scheduling for the others.
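
For reference, the sketch below shows a generic capped exponential backoff with jitter, the kind of retry policy referred to in the first item above. The retryWithBackoff function and all of its parameters are illustrative and do not reflect our production configuration.

    // Generic sketch of capped exponential backoff with jitter.
    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // retryWithBackoff retries op up to maxAttempts times, waiting
    // base*2^attempt plus random jitter between attempts, capped at maxDelay.
    func retryWithBackoff(op func() error, maxAttempts int, base, maxDelay time.Duration) error {
        var err error
        for attempt := 0; attempt < maxAttempts; attempt++ {
            if err = op(); err == nil {
                return nil
            }
            delay := base << attempt // exponential growth
            if delay > maxDelay {
                delay = maxDelay
            }
            jitter := time.Duration(rand.Int63n(int64(delay / 2)))
            time.Sleep(delay + jitter)
        }
        return fmt.Errorf("all %d attempts failed, last error: %w", maxAttempts, err)
    }

    func main() {
        attempts := 0
        err := retryWithBackoff(func() error {
            attempts++
            if attempts < 3 {
                return errors.New("simulated connection timeout")
            }
            return nil
        }, 5, 200*time.Millisecond, 5*time.Second)
        fmt.Printf("finished after %d attempts, err=%v\n", attempts, err)
    }
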
Posted Oct 24, 2025 - 10:59 UTC

Resolved

The incident has been resolved. All customer encodings were processed successfully, though some jobs experienced longer-than-usual queue times.

We are still investigating the cause of the slowdown, which appears to be related to an Azure network connectivity incident that caused long timeouts when connecting to instances.
We will post a full RCA once we have identified the root cause of the queue slowdowns.
Posted Oct 21, 2025 - 16:52 UTC

Update

The encoding service is currently processing jobs normally. However, we are still investigating the root cause of the earlier slowdown. Our engineering team continues to monitor closely.
Posted Oct 21, 2025 - 15:55 UTC

Update

We are still investigating the issue. The system continues to process jobs, but not fast enough to keep up with the queue. Our engineering team is actively working on this, and we will share another update once we know more.
Posted Oct 21, 2025 - 15:34 UTC

Investigating

We are currently observing increased queue times in our encoding service.
The engineering team is investigating the issue, and we will provide the next update at 15:30 UTC.
Posted Oct 21, 2025 - 14:57 UTC
This incident affected: Bitmovin API (Encoding Service).