Bitmovin website and dashboard are down - Services work properly
Incident Report for Bitmovin Inc
Postmortem

Summary

A Kubernetes API version incompatibility caused a proxy component to fail, which made the public website, dashboard, demos and documentation pages unavailable. The engineering team identified the problem and restored service by bringing up a replacement proxy.

Date

The incident occurred on February 27, 2023 between 15:40 and 18:54. All times are in UTC.

Root Cause

Bitmovin uses a reverse proxy in front of the public website, dashboard, docs and demo pages. The proxy consists of multiple instances of an Nginx Kubernetes Ingress Controller. The Kubernetes cluster on which these instances were running was automatically upgraded to a newer version by our cloud provider. This newer version deprecated APIs that the proxy (Nginx-Ingress-Controller) required to operate. As a result, the proxy failed to restart and no traffic could be served to properties residing under https://bitmovin.com/*
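
To illustrate the kind of incompatibility described above, here is a minimal Python sketch that compares the API versions a component depends on with the versions a cluster serves. It is not the tooling used during the incident. The function name and the example data are assumptions for illustration only: the report does not name the affected APIs, and `networking.k8s.io/v1beta1` is used simply because it is a well-known Ingress API version that stopped being served in Kubernetes 1.22. In a real cluster, the served list could be taken from the output of `kubectl api-versions`.

```python
from typing import Iterable, List


def unserved_api_versions(required: Iterable[str], served: Iterable[str]) -> List[str]:
    """Return the API group/versions in `required` that are absent from `served`."""
    served_set = set(served)
    # Deduplicate and sort so the result is stable regardless of input order.
    return sorted(v for v in set(required) if v not in served_set)


# Hypothetical example data (not taken from the incident report).
required_by_proxy = ["networking.k8s.io/v1beta1", "v1"]
served_after_upgrade = ["v1", "apps/v1", "networking.k8s.io/v1"]

missing = unserved_api_versions(required_by_proxy, served_after_upgrade)
print(missing)  # ['networking.k8s.io/v1beta1']
```

A non-empty result before a cluster upgrade would indicate that a component needs to be updated first.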

Implications

The public website https://bitmovin.com and the web applications hosted on that domain (dashboard, docs and demo pages) were not reachable and could not be used by customers. Our API, analytics and player licensing were not affected.

Remediation

After identifying the issue, the engineering team explored several approaches to fix the problem. The most viable solution was quickly selected, and the team began bringing up a new proxy instance outside the cluster. The applications were then configured in the new proxy one by one, and each was checked for correct operation and monitored.
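
The step-by-step verification described above could be sketched roughly as follows. This is an illustrative example, not the procedure the team actually ran. The function names are assumptions, and of the paths shown only `/dashboard` appears in this report; `/docs` and `/demos` are hypothetical. The status lookup is passed in as a function so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def check_routes(base_url: str, paths: List[str],
                 fetch_status: Callable[[str], int]) -> Dict[str, bool]:
    """Map each path to True if it answers with a 2xx or 3xx status."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            status = fetch_status(url)
        except OSError:
            # Connection errors and timeouts count as "not reachable".
            results[path] = False
            continue
        results[path] = 200 <= status < 400
    return results


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return its HTTP status code (standard library only)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their code instead.
        return err.code


# Offline demonstration with a stand-in fetcher (hypothetical responses).
fake_responses = {
    "https://example.invalid/dashboard": 200,
    "https://example.invalid/docs": 502,
}


def fake_fetch(url: str) -> int:
    if url not in fake_responses:
        raise OSError("unreachable")
    return fake_responses[url]


report = check_routes("https://example.invalid/",
                      ["/dashboard", "/docs", "/demos"], fake_fetch)
print(report)  # {'/dashboard': True, '/docs': False, '/demos': False}
```

Against a live proxy, `http_status` would be passed in place of `fake_fetch`.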

Timeline

2023-02-27 15:40
Internal monitoring alerted the teams about the incident.

2023-02-27 15:48
The engineering team started investigating the incident.

2023-02-27 16:00
The faulty proxy component was identified and the engineering team began attempting to upgrade to a compatible version of the Nginx Kubernetes Ingress Controller.

2023-02-27 17:13
Attempts to upgrade the existing proxy component were abandoned and the engineering team started provisioning a bare-metal Nginx reverse proxy to resolve the situation.

2023-02-27 17:48
The Bitmovin Dashboard was online again.

2023-02-27 18:20
The Bitmovin Website and the demo pages were available again.

2023-02-27 18:41
The Bitmovin documentation was available again. The engineering team kept monitoring.

2023-02-27 18:54
The incident was resolved.

Prevention

The engineering team will adjust the monitoring, alerting and internal processes to better handle Kubernetes version incompatibilities on the affected cluster.

The new proxy setup will also be extended to increase its resilience, and it will be fully integrated into the existing configuration, deployment, alerting and monitoring tooling. This work is part of an already planned, ongoing improvement of the proxy component and will lead to a simpler, more maintainable setup with no external cloud provider dependencies.

Posted Mar 01, 2023 - 12:58 UTC

Resolved
This incident has been resolved.
Posted Feb 27, 2023 - 18:54 UTC
Monitoring
The documentation is back online as well and we are moving into a monitoring stage.
Posted Feb 27, 2023 - 18:41 UTC
Update
Our website and demos are back online. We are working on getting our documentation back online as well.
Posted Feb 27, 2023 - 18:20 UTC
Update
We managed to bring bitmovin.com/dashboard back online and are continuing to fix the issue for all affected properties.
Posted Feb 27, 2023 - 17:48 UTC
Identified
The team is working on getting the website and dashboard back online. Bitmovin API, Player Licensing, and Analytics Service are unaffected and function properly.
Posted Feb 27, 2023 - 17:13 UTC