API returning 404 errors from Google Load Balancer
Incident Report for Bitmovin Inc
Postmortem

Summary

Due to an outage of Google Cloud Networking (https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh), the API returned 404 errors with a Google Load Balancer HTML error page.

Date

The issue occurred on Nov 16th, 2021 between 17:35 and 18:10. All times are UTC.

Root Cause

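Bitmovin’s API uses the Google Load Balancer in front of the ingress to the API, both for proper SSL termination and to provide a stable ingress. Because of the Google Cloud Networking outage, the Load Balancer answered requests to the API with its own 404 error page.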

Implications

All API calls returned a 404 error with a Google Load Balancer HTML error page. Running encodings may also have stalled or failed to finish, because the encoder could not send updates back to the API during the incident.

Remediation

Once Bitmovin's engineering team suspected that Google’s Load Balancer was the cause, the team decided to set up a fallback using Bitmovin’s DNS provider to bypass the Load Balancer for the duration of the service disruption. These changes take some time, however, because the DNS for api.bitmovin.com also needs to be updated. In the meantime, Google’s Load Balancer returned to normal operation and the incident was resolved.

Timeline

17:35 - Google Load Balancer issue occurred

17:48 - Investigation started

17:55 - Issue was found and mitigation work was underway

18:10 - Google Load Balancer recovered and the API was fully operational again
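
The intervals implied by the timeline above can be worked out with a short sketch. The event names in the dictionary are labels chosen for illustration; only the times come from this report.

```python
from datetime import datetime

# Times (UTC, Nov 16th, 2021) taken from the timeline; the keys are illustrative labels.
EVENTS = {
    "issue_start": "17:35",
    "investigation_start": "17:48",
    "issue_found": "17:55",
    "recovered": "18:10",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_investigation = minutes_between(EVENTS["issue_start"], EVENTS["investigation_start"])
time_to_diagnosis = minutes_between(EVENTS["issue_start"], EVENTS["issue_found"])
total_duration = minutes_between(EVENTS["issue_start"], EVENTS["recovered"])

print(time_to_investigation, time_to_diagnosis, total_duration)  # 13 20 35
```

In other words, 13 minutes passed before the investigation started, and the full disruption lasted 35 minutes.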

Prevention

As Bitmovin’s API is built on cloud services, preventing issues in the underlying infrastructure is hard. However, Bitmovin’s team identified possible improvements that should lead to faster reactions in similar situations:

  1. Internal alerts were not triggered, because both internal and external monitoring rely on Google’s infrastructure. Bitmovin’s engineering team will investigate how to set up internal and external monitoring that does not rely solely on Google’s infrastructure.
  2. Bitmovin’s engineering team will also establish a process to make failovers from Google’s Load Balancer to Bitmovin’s DNS provider easier and faster.
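
As an illustration of the first point, a probe running outside Google’s infrastructure could tell an error page served in front of the API apart from an error returned by the API itself. The sketch below is not Bitmovin’s monitoring. It assumes that the API normally answers with JSON, so an HTML body on an error status points at something in front of it, and the names `ProbeResult`, `classify`, and `should_alert` as well as the threshold of three probes are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one HTTP probe against the API (hypothetical structure)."""
    status: int
    content_type: str


def classify(result: ProbeResult) -> str:
    """Classify one probe response.

    Assumption: the API itself answers with JSON, so an HTML body on an
    error status suggests the response came from a component in front of
    the API, such as a load balancer error page.
    """
    if result.status < 400:
        return "ok"
    if result.content_type.lower().startswith("text/html"):
        return "upstream_error_page"
    return "api_error"


def should_alert(history: list[str], threshold: int = 3) -> bool:
    """Alert once the last `threshold` probes all saw an upstream error page."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(r == "upstream_error_page" for r in recent)
```

Requiring several consecutive failures before alerting is a design choice that trades a slightly slower alert for fewer false alarms from a single dropped request.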
Posted Nov 17, 2021 - 14:13 UTC

Resolved
Google has reported that they have mitigated the problem and our systems work normally now.
Posted Nov 16, 2021 - 18:57 UTC
Update
Google has acknowledged the problem (https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh). Mitigation seems to be underway. Our systems have recovered and we continue to monitor the situation.
Posted Nov 16, 2021 - 18:22 UTC
Monitoring
Bitmovin's API is up and running again. We are monitoring the situation.
Posted Nov 16, 2021 - 18:12 UTC
Investigating
Bitmovin's API is currently returning 404 errors with a Google Load Balancer page. We believe the issue is on Google's side. We are investigating the issue.
Posted Nov 16, 2021 - 18:01 UTC
This incident affected: Bitmovin API (Account Service, Input Service, Encoding Service, Output Service, Statistics Service, Infrastructure Service, Configuration Service, Manifest Service, Player Service).