A Kubernetes API version incompatibility caused a proxy component to fail, which led to an outage of the public website, dashboard, demos and documentation pages. The engineering team identified the problem and fixed it by bringing up a replacement proxy.
The incident occurred on February 27, 2023 between 15:40 and 18:54. All times are in UTC.
Bitmovin uses a reverse proxy in front of the public website, dashboard, docs and demo pages. The proxy consists of multiple instances of an Nginx Kubernetes Ingress Controller. The Kubernetes cluster on which these instances were running was automatically upgraded to a newer version by our cloud provider. The newer version deprecated APIs that the proxy (Nginx-Ingress-Controller) required to operate. As a result, the proxy failed to restart and no traffic could be served to properties residing under https://bitmovin.com/*.
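The failure mode described above, a resource that depends on an API version the upgraded cluster no longer accepts, can be checked for ahead of an upgrade. The following is a minimal sketch, not Bitmovin's tooling. The helper names and the sample manifests are hypothetical. This report does not name the affected APIs, so the removal table uses the upstream removal of the beta Ingress APIs in Kubernetes 1.22 purely as an illustration.

```python
import re

# Illustrative table only (assumption: the report does not name the APIs).
# Upstream Kubernetes stopped serving both beta Ingress APIs in version 1.22.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
}


def parse_version(text):
    """Turn 'v1.22.3' or '1.22' into a (major, minor) tuple."""
    match = re.match(r"v?(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def find_removed_apis(manifests, target_version):
    """Return (name, apiVersion, kind) for every manifest that uses an
    API version no longer served by target_version."""
    target = parse_version(target_version)
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and target >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[0], key[1]))
    return findings


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress",
     "metadata": {"name": "website"}},
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
     "metadata": {"name": "docs"}},
]

for name, api_version, kind in find_removed_apis(manifests, "v1.22.3"):
    print(f"{kind} '{name}' uses {api_version}, which the target version does not serve")
```

Run against the manifests of a cluster before its version changes, a check of this kind would flag the `website` Ingress while leaving `docs` alone. In practice the table would need to be kept in step with the Kubernetes deprecation guide.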
The public website https://bitmovin.com and the web applications hosted on that domain (dashboard, docs and demo pages) were not reachable and could not be used by customers. Our API, analytics and player licensing were not affected.
After identifying the issue, the engineering team explored several approaches to fix the problem. The most viable solution was quickly selected, and the team started to bring up a new proxy instance outside the cluster. The applications were then configured in the new proxy one by one, and each was checked for correct operation and monitored.
2023-02-27 15:40
Internal monitoring alerted the teams about the incident.
2023-02-27 15:48
The engineering team started investigating the incident.
2023-02-27 16:00
The faulty proxy component was identified, and the engineering team started attempts to upgrade to a compatible version of the Nginx Kubernetes Ingress Controller.
2023-02-27 17:13
Attempts to upgrade the existing proxy component were abandoned, and the engineering team started provisioning a bare-metal Nginx reverse proxy to resolve the situation.
2023-02-27 17:48
The Bitmovin Dashboard was online again.
2023-02-27 18:20
The Bitmovin Website and the demo pages were available again.
2023-02-27 18:41
The Bitmovin documentation was available again. The engineering team kept monitoring.
2023-02-27 18:54
The incident was resolved.
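The timeline above implies a different downtime for each property. The short sketch below derives those durations from the timestamps in this report. It assumes the outage began at the first monitoring alert (15:40 UTC); the variable and function names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Assumption: the outage is counted from the first monitoring alert.
START = datetime.strptime("2023-02-27 15:40", FMT)

# Recovery timestamps taken from the timeline (all UTC).
RESTORED = {
    "dashboard": "2023-02-27 17:48",
    "website and demo pages": "2023-02-27 18:20",
    "documentation": "2023-02-27 18:41",
    "incident resolved": "2023-02-27 18:54",
}


def minutes_since_start(timestamp):
    """Whole minutes between the first alert and the given timestamp."""
    delta = datetime.strptime(timestamp, FMT) - START
    return int(delta.total_seconds() // 60)


downtime = {name: minutes_since_start(ts) for name, ts in RESTORED.items()}

for name, minutes in downtime.items():
    print(f"{name}: {minutes // 60}h {minutes % 60:02d}m")
# dashboard: 2h 08m
# website and demo pages: 2h 40m
# documentation: 3h 01m
# incident resolved: 3h 14m
```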
The engineering team will adjust the monitoring, alerting and internal processes to better handle Kubernetes version incompatibilities in the affected cluster.
The new proxy setup will also be extended to increase its resilience, and it will be fully integrated into the existing configuration, deployment, alerting and monitoring tooling. This work is part of an ongoing, previously planned improvement to the proxy component and will lead to a simpler, more maintainable setup with no external cloud provider dependencies.