Scheduling Failed for Encoding Jobs in Azure

Incident Report for Bitmovin Inc

Postmortem

Summary

Bitmovin’s engineering team observed failing encoding jobs configured to run on Azure. The team was also informed of suspicious activity in Bitmovin’s Microsoft Azure subscription used for Bitmovin Managed Encoding in Azure regions. Launching infrastructure on this subscription was deactivated without prior notification. This prevented Bitmovin from launching new compute infrastructure, causing encoding jobs to fail with “Scheduling failed” error messages. Encoding jobs configured to run in AWS or Google cloud regions were not affected at any time, and customers were instructed to fall back to those regions. To unblock customers running encoding jobs in Azure regions, Bitmovin moved all compute to an alternative Azure subscription. Microsoft later acknowledged that the detection was incorrect and that the resource block on Bitmovin’s Azure subscription had been applied in error.

Date

The issue occurred between September 11, 2023, 14:12 and September 13, 2023, 17:09. All times are in UTC.

Root Cause

Microsoft Azure's Suspicious Activity Detector incorrectly identified Bitmovin's request for additional resources as suspicious, leading to the deactivation of Bitmovin’s main subscription. Microsoft has since determined that the detection logic was too stringent in evaluating abuse patterns, has adjusted the detection, and has applied further quality controls to avoid resource blocks being applied incorrectly.

Because of the resource block that Azure applied incorrectly, our scheduling logic received missing-capacity errors when requesting new instances in our main Azure subscription. This led to “Scheduling failed” error messages for customers running encoding jobs in Azure regions.

Implications

Workloads scheduled by customers using Managed Encoding in Azure could not be processed. The encoding jobs immediately transitioned to the error state. Other cloud vendors were not affected. The Cloud Connect feature for Azure Infrastructure was partially and temporarily impacted.

Remediation

The affected customers were notified and advised to change their encoding job configuration to use another cloud provider. Customers were contacted directly by the Bitmovin Customer Experience team and kept informed via the Status Page. The Bitmovin Engineering team switched the managed Azure subscription to an alternative one that was not affected by the resource blocks.

Timeline

Sep 11, 14:12 - The Engineering team observed failing encoding jobs configured to run on Azure and started investigating.

Sep 11, 14:16 - The Engineering team identified a resource block on the Bitmovin Azure subscription as the root cause of the failure.

Sep 11, 15:30 - A Support case with our Azure partner was opened. The support case was escalated via our partner contacts.

Sep 11, 15:40 - The Bitmovin support team started contacting customers running on Azure regions and advised them to switch the configuration of the encoding jobs to run on an alternative cloud provider.

Sep 12, 07:00 - The engineering team started working on a solution to switch the Azure encoding workloads to another Azure subscription.

Sep 12, 10:04 - The engineering team updated the scheduling logic to make a limited set of Azure regions available again for encoding workloads running on the prepared Azure subscription. Turnaround times were longer than usual as they did not run at full capacity yet.

Sep 12, 16:09 - The remaining Azure regions were also made available for Azure encoding workloads using the same strategy.

Sep 12, 21:17 - The Azure support ticket to remove the resource blocks was manually escalated directly with Microsoft.

Sep 13, 13:15 - Engineering rolled out an update that enabled normal turnaround times for Azure encoding jobs again.

Sep 13, 17:09 - The Bitmovin incident was resolved - all Bitmovin customers could run encoding jobs on Azure again.

Sep 14, 08:00 - The Engineering team started working with the Partner and Microsoft to get the original Azure subscription reactivated, to understand why the subscription was disabled, and to find a solution that prevents this in the future.

Sep 14, 17:30 - Microsoft provided Bitmovin with an RCA stating that the newly added compromise detection logic was too stringent and that a response analyst inaccurately validated the subscription, leading to the resource blocks being applied incorrectly.

Prevention

The Engineering team will work with Microsoft Azure and our partner to prevent such situations in the future. The Engineering team will keep the Azure subscription failover implemented as a temporary solution and adapt tooling to make switching between Azure subscriptions easier.
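The subscription failover described above can be illustrated with a minimal sketch. It is not Bitmovin's actual scheduling code: the names `Subscription`, `SubscriptionBlockedError`, `launch_with_failover`, and `fake_launch` are hypothetical, and the real provisioning call is replaced by a stand-in function. The idea is simply to try the subscriptions in order and move on when one rejects the request.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class SubscriptionBlockedError(Exception):
    """Hypothetical error: a subscription rejects new infrastructure requests."""


@dataclass(frozen=True)
class Subscription:
    # Hypothetical label for an Azure subscription, e.g. "primary".
    name: str


def launch_with_failover(
    subscriptions: Sequence[Subscription],
    launch: Callable[[Subscription], str],
) -> str:
    """Try each subscription in order; return the first successful instance ID."""
    errors = []
    for sub in subscriptions:
        try:
            return launch(sub)
        except SubscriptionBlockedError as exc:
            # Remember why this subscription failed, then try the next one.
            errors.append(f"{sub.name}: {exc}")
    # Every subscription refused the request.
    raise RuntimeError("Scheduling failed: " + "; ".join(errors))


def fake_launch(sub: Subscription) -> str:
    # Stand-in for a real provisioning call; "primary" simulates a resource block.
    if sub.name == "primary":
        raise SubscriptionBlockedError("resource block")
    return f"instance-on-{sub.name}"


if __name__ == "__main__":
    subs = [Subscription("primary"), Subscription("secondary")]
    print(launch_with_failover(subs, fake_launch))
```

Keeping the ordered list of subscriptions as plain configuration would make switching between subscriptions a data change rather than a code change, which is the kind of tooling improvement mentioned above.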

Microsoft confirmed they have adjusted their incorrect detection and applied further quality controls to avoid resource blocks being applied incorrectly.

Posted Sep 15, 2023 - 13:21 UTC

Resolved

Encoding turnaround times are back to normal levels on all Azure regions. Our team has now moved the entire workload to an alternative Azure subscription.
The team will come back with a post-mortem in the coming days.

Posted Sep 13, 2023 - 17:09 UTC

Update

Encoding times in Azure across all regions are not yet up to normal. The team is investigating multiple ways to solve this and will provide an update once this is fully resolved. The recommendation is still to fall back to other cloud providers in the meantime.

Posted Sep 12, 2023 - 16:08 UTC

Update

Encoding on Azure can be scheduled again as usual.
The team is monitoring the system closely and will mark the incident as resolved if no further problems occur.

Posted Sep 12, 2023 - 12:00 UTC

Monitoring

Encoding on Azure can be scheduled again in the following regions:
* Australia East
* Europe North
* Europe West

The team is monitoring the system closely and working on adding additional regions. We will provide updates in the future as we add additional regions.

Posted Sep 12, 2023 - 11:08 UTC

Identified

There is a problem with our Azure subscription. Our engineering team is in contact with Azure support to resolve the issue with the subscription. We advise all customers to use a different cloud in the meantime.
Customers encoding with other cloud providers are not affected by this incident.

Posted Sep 11, 2023 - 15:57 UTC

Update

We are continuing to investigate this issue.

Posted Sep 11, 2023 - 15:18 UTC

Investigating

The team is investigating failed scheduling for encoding jobs on Azure Cloud.

Posted Sep 11, 2023 - 15:15 UTC

This incident affected: Bitmovin API (Encoding Service).