Bitmovin’s engineering team observed failing encoding jobs configured to run on Azure. They were also informed of suspicious activity flagged on the Microsoft Azure subscription used for Bitmovin Managed Encoding in Azure regions. Launching infrastructure on this subscription was blocked without prior notification. This prevented Bitmovin from launching new compute infrastructure, causing encoding jobs to fail with “Scheduling failed” error messages. Encoding jobs configured to run on other cloud providers such as AWS or Google Cloud were not affected at any time. Customers were instructed to fall back to cloud regions in AWS and Google Cloud. Bitmovin moved all compute to an alternative Azure subscription to unblock customers running encoding jobs in Azure regions. Microsoft acknowledged that the detection, and the resulting resource block on Bitmovin’s Azure subscription, was incorrect.
The issue lasted from September 11, 2023, 14:12 until September 13, 2023, 17:09. All times are in UTC.
Microsoft Azure's Suspicious Activity Detector incorrectly identified Bitmovin's request for additional resources as suspicious, leading to the deactivation of Bitmovin’s main subscription. Microsoft has since determined that the detection logic was too stringent in matching abuse patterns, has adjusted the detection, and has applied further quality controls to avoid resource blocks being applied incorrectly.
Because of the incorrectly applied resource block, our scheduling logic received missing-capacity errors whenever it requested new instances in our main Azure subscription. This surfaced as “Scheduling failed” error messages for customers running encoding jobs in Azure regions.
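To make the failure mode concrete, here is a minimal sketch of how a scheduler can surface provider capacity errors as a terminal job state. The names (`CapacityError`, `request_instance`, `Job`) are hypothetical illustrations, not Bitmovin’s actual scheduler code.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    ERROR = "error"


class CapacityError(Exception):
    """Raised when the cloud provider cannot allocate the requested instances."""


@dataclass
class Job:
    job_id: str
    region: str
    state: JobState = JobState.QUEUED
    message: str = ""


def request_instance(region: str) -> str:
    # Hypothetical provisioning call. During the incident, every request
    # against the blocked subscription failed as if capacity were missing.
    raise CapacityError(f"no capacity available in {region}")


def schedule(job: Job) -> Job:
    try:
        request_instance(job.region)
        job.state = JobState.RUNNING
    except CapacityError:
        # With the whole subscription blocked, this branch fires for every
        # Azure job, which customers saw as "Scheduling failed".
        job.state = JobState.ERROR
        job.message = "Scheduling failed"
    return job


print(schedule(Job("job-1", "azure-westeurope")))
```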
Workloads scheduled by customers using Managed Encoding in Azure could not be processed. The encoding jobs immediately transitioned to the error state. Other cloud vendors were not affected. The Cloud Connect feature for Azure Infrastructure was partially and temporarily impacted.
The affected customers were notified and advised to change their encoding job configuration to process the encoding jobs with another cloud provider. Customers were informed directly by the Bitmovin Customer Experience team and via the Bitmovin Status Page. The Bitmovin Engineering team switched the managed Azure workloads to an alternative subscription that was not affected by the resource blocks.
Sep 11, 14:12 - The Engineering team observed failing encoding jobs configured to run on Azure and started investigating.
Sep 11, 14:16 - The Engineering team identified a resource block on the Bitmovin Azure subscription as the root cause of the failure.
Sep 11, 15:30 - A support case was opened with our Azure partner and escalated via our partner contacts.
Sep 11, 15:40 - The Bitmovin support team started contacting customers running on Azure regions and advised them to switch their encoding job configuration to an alternative cloud provider.
Sep 12, 07:00 - The engineering team started working on a solution to switch the Azure encoding workloads to another Azure subscription.
Sep 12, 10:04 - The engineering team updated the scheduling logic to make a limited set of Azure regions available again for encoding workloads running on the prepared Azure subscription. Turnaround times were longer than usual because the new subscription was not yet running at full capacity.
Sep 12, 16:09 - The remaining Azure regions were also made available for Azure encoding workloads using the same strategy.
Sep 12, 21:17 - The Azure support ticket to remove the resource blocks was manually escalated directly with Microsoft.
Sep 13, 13:15 - Engineering rolled out an update that restored normal turnaround times for Azure encoding jobs.
Sep 13, 17:09 - The Bitmovin incident was resolved - all Bitmovin customers could run encoding jobs on Azure again.
Sep 14, 08:00 - The Engineering team started working with the partner and Microsoft to get the original Azure subscription reactivated. They also began investigating the root cause of why the subscription was disabled and working on a solution to prevent this in the future.
Sep 14, 17:30 - Microsoft provided Bitmovin with an RCA stating that the newly added compromise-detection logic was too stringent and that a response analyst had inaccurately validated the subscription, leading to the resource blocks being applied incorrectly.
The Engineering team will work with Microsoft Azure and our partner to prevent such situations in the future. They will keep the Azure subscription failover that was implemented as a temporary solution in place and adapt tooling to make switching between Azure subscriptions easier, as sketched below.
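As an illustration of what such failover tooling could look like, the sketch below tries a prioritized list of Azure subscriptions and falls back when one is blocked. The subscription identifiers and the `launch_in_subscription` helper are hypothetical placeholders, not Bitmovin’s actual implementation.

```python
class SubscriptionBlockedError(Exception):
    """Raised when a subscription rejects resource requests (e.g. a resource block)."""


# Hypothetical ordering: main subscription first, alternatives as fallback.
SUBSCRIPTIONS = ["main-subscription-id", "alternative-subscription-id"]


def launch_in_subscription(subscription_id: str, region: str) -> str:
    # Placeholder for the real provisioning call against a given subscription.
    # Here the main subscription is simulated as blocked, as in the incident.
    if subscription_id == "main-subscription-id":
        raise SubscriptionBlockedError(f"resource block active on {subscription_id}")
    return f"instance launched in {region} via {subscription_id}"


def launch_with_failover(region: str) -> str:
    last_error = None
    for subscription_id in SUBSCRIPTIONS:
        try:
            return launch_in_subscription(subscription_id, region)
        except SubscriptionBlockedError as err:
            # Record the failure and try the next configured subscription.
            last_error = err
    raise RuntimeError(f"all subscriptions exhausted for {region}") from last_error


print(launch_with_failover("azure-westeurope"))
```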
Microsoft confirmed that they have adjusted the faulty detection and applied further quality controls to prevent resource blocks from being applied incorrectly.