Incident Overview
Date: 2024-11-06
Start Time: 07:30 UTC
Resolution Time: 16:21 UTC
Backfill Completion: 19:14 UTC
Impact: Service downtime preventing data insertion and querying on the Bitmovin Analytics Database Cluster for all users. Buffered data was backfilled post-resolution to prevent data loss.
Root Cause: Overly rapid decommissioning of nodes due to a tweak in the semi-automated upgrade process, leading the remaining nodes to falsely detect a split-brain scenario. The issue was further complicated by an unrelated software issue on recovering nodes during a specialized quorum recovery procedure.
1. What Happened?
During a planned upgrade to enhance the Bitmovin Analytics Database Cluster, a critical issue arose, resulting in a prolonged outage. Here’s a breakdown of events:
- Start of Incident: The incident began on 2024-11-06 at 07:30 UTC. A recent tweak to our semi-automated upgrade process had sped up node decommissioning. The automation removed nodes so quickly that the remaining nodes interpreted the sudden loss of peers as a split-brain scenario and, left without a leader, refused to accept writes.
- Initial Mitigation Attempt: The team attempted to reboot the cluster at 07:43 UTC to re-elect a leader. However, the database retained the split-brain state, blocking leader election and preventing cluster recovery.
- Vendor Collaboration: After realizing that a standard recovery approach wouldn’t suffice, we engaged our database vendor, initiating a specialized cluster quorum recovery procedure.
- Additional Software Issue: During the recovery process, we encountered an unrelated software issue on the recovering nodes that caused the newly elected leader to crash repeatedly due to memory mapping issues. Diagnosing and resolving this issue took an extended amount of time and significantly delayed the overall recovery.
- Resolution of Software Issue: After identifying and resolving this software problem by adjusting virtual and physical memory limits (see the diagnostic sketch after this list), we resumed the quorum recovery process at 13:50 UTC, which restored the cluster’s functionality by 16:21 UTC.
- Buffering and Backfilling: All incoming data that couldn’t be written during the incident was buffered to a secondary system (buffering cloud queue) and successfully backfilled post-resolution, completing at 19:14 UTC.
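The report does not name the exact limits involved in the memory mapping crashes; as a hypothetical diagnostic sketch only, the following Python snippet shows how one might inspect the Linux settings that commonly underlie mmap-related crashes in database processes (vm.max_map_count and the virtual/locked memory rlimits). The threshold below is an illustrative placeholder, not vendor guidance.

```python
# Hypothetical diagnostic sketch: inspect Linux memory-mapping limits that
# commonly underlie mmap-related crashes in database processes.
# MIN_MAX_MAP_COUNT is an illustrative placeholder, not vendor guidance.
import resource

MIN_MAX_MAP_COUNT = 262144  # placeholder threshold


def read_max_map_count() -> int:
    # vm.max_map_count caps the number of memory-mapped regions per process.
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read().strip())


def describe_rlimit(name: str, limit: int) -> str:
    soft, hard = resource.getrlimit(limit)
    show = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
    return f"{name}: soft={show(soft)} hard={show(hard)}"


if __name__ == "__main__":
    max_map_count = read_max_map_count()
    status = " (below placeholder threshold)" if max_map_count < MIN_MAX_MAP_COUNT else ""
    print(f"vm.max_map_count = {max_map_count}{status}")
    print(describe_rlimit("RLIMIT_AS (virtual memory)", resource.RLIMIT_AS))
    print(describe_rlimit("RLIMIT_MEMLOCK (locked physical memory)", resource.RLIMIT_MEMLOCK))
```

Raising such limits is typically done via sysctl and the database service’s unit or init configuration, following the vendor’s recommendations.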
2. Impact Analysis
- Service Unavailability: The outage affected both insert and query operations, impacting all users dependent on the Bitmovin Analytics Database Cluster.
- Downtime: Total duration of the primary outage was approximately 8 hours and 51 minutes (from 07:30 UTC to 16:21 UTC).
- Customer Impact: All data interactions were halted; however, buffered data was successfully backfilled post-resolution and no data was lost.
- Data Integrity: Buffered incoming data was fully restored through backfilling.
3. Root Cause
The root cause of this incident was an overly aggressive tweak to the semi-automated upgrade process, which removed nodes too quickly during a routine upgrade, leading to:
- Quorum Loss and Split-Brain Condition: The rapid node decommissioning triggered split-brain protections in the remaining nodes, which are designed to prevent data loss. With quorum lost, the cluster could not elect a leader (see the illustration after this list).
- Persistent State Behavior: The cluster state persisted after the reboot, preventing standard leader election and requiring specialized recovery procedures.
- Unrelated Software Issue: During the quorum recovery, an unrelated memory mapping issue caused repeated crashes of the new leader, delaying the resolution.
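As a minimal illustration of the quorum loss described above (node counts are hypothetical; the report does not state the actual cluster size), a majority quorum requires floor(N/2) + 1 voting members, so removing nodes faster than the cluster membership can be reconfigured leaves the survivors unable to form a majority:

```python
# Minimal illustration of majority quorum; node counts are hypothetical and
# do not reflect the actual cluster size, which is not stated in this report.

def quorum(cluster_size: int) -> int:
    # A majority quorum needs floor(N/2) + 1 voting members.
    return cluster_size // 2 + 1

original_members = 5                 # hypothetical pre-upgrade membership
removed_before_reconfig = 3          # nodes taken offline before membership caught up
remaining = original_members - removed_before_reconfig

# If nodes disappear faster than the membership configuration is updated,
# the survivors still need a majority of the old membership (3 of 5 here),
# so they cannot elect a leader and refuse writes.
print(f"quorum({original_members}) = {quorum(original_members)}, remaining = {remaining}")
print("can elect leader:", remaining >= quorum(original_members))
```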
4. Immediate Response and Actions Taken
- Incident Detection: The team was alerted at 07:30 UTC and quickly identified the loss of cluster leadership.
- Initial Mitigation: A cluster reboot was attempted to re-establish a leader, but the split-brain state persisted, prompting collaboration with our database vendor.
- Quorum Recovery: Working with the vendor, we initiated a specialized cluster quorum recovery process.
- Discovery and Resolution of Additional Issue: Midway through the recovery, an unrelated memory mapping issue caused repeated leader crashes. After the team identified and fixed this issue, the quorum recovery completed successfully.
- Data Backfill: Post-recovery, buffered data was backfilled from the buffering cloud queue, restoring full data integrity by 19:14 UTC (a sketch of the backfill flow follows this list).
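As a sketch of the backfill flow only (the queue and database clients below are stand-ins; the actual services and APIs used during the incident are not named in this report), the key property is that a buffered batch is acknowledged on the queue only after the database confirms the insert, so a failed write can be retried rather than lost:

```python
# Hypothetical backfill sketch: drain buffered events from the cloud queue and
# replay them into the analytics database. The clients are stand-ins, not the
# actual services used during the incident.
from typing import Iterable, Protocol, Sequence


class BufferQueue(Protocol):
    def receive_batch(self, max_messages: int) -> Sequence[dict]: ...
    def ack_batch(self, messages: Sequence[dict]) -> None: ...


class AnalyticsDB(Protocol):
    def insert_rows(self, rows: Iterable[dict]) -> None: ...


def backfill(queue: BufferQueue, db: AnalyticsDB, batch_size: int = 500) -> int:
    """Replay buffered writes in order; acknowledge only after a successful insert."""
    total = 0
    while True:
        batch = queue.receive_batch(max_messages=batch_size)
        if not batch:
            break                   # buffer drained, backfill complete
        db.insert_rows(batch)       # if this raises, the batch stays unacked and can be retried
        queue.ack_batch(batch)      # ack only after the database confirmed the write
        total += len(batch)
    return total
```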
5. Lessons Learned
- Refine Semi-Automated Upgrade Process: Further adjustments to the semi-automated upgrade process will add rate limits and quorum checks to prevent overly rapid node decommissioning, keeping decommissioning controlled so quorum is never put at risk (a sketch of such a guard follows this list).
- Enhanced Cluster Recovery Protocols: Situations that could trigger split-brain conditions will have specific documented recovery protocols, including collaboration steps with the vendor, to streamline future responses.
- Unrelated Issue Identification: Monitoring of nodes during recovery will be enhanced to detect unrelated software issues early, allowing faster troubleshooting without delaying recovery.
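A sketch of the kind of guard intended for the upgrade tooling (the cluster client and its methods are hypothetical, not the actual Bitmovin automation): decommission one node at a time, refuse to proceed if the removal would drop the voting membership below quorum, and wait for the membership to settle before continuing.

```python
# Illustrative decommissioning guard; the cluster client and its methods are
# hypothetical and do not represent the actual upgrade automation.
import time


def quorum(cluster_size: int) -> int:
    return cluster_size // 2 + 1


def safe_decommission(cluster, nodes_to_remove, settle_seconds: int = 60) -> None:
    """Remove nodes one at a time, pausing until the membership is healthy again."""
    for node in nodes_to_remove:
        members = cluster.voting_members()          # hypothetical API
        if len(members) - 1 < quorum(len(members)):
            raise RuntimeError(
                f"refusing to remove {node}: would drop below quorum "
                f"({len(members) - 1} < {quorum(len(members))})"
            )
        cluster.decommission(node)                  # hypothetical API
        deadline = time.monotonic() + settle_seconds
        while time.monotonic() < deadline:
            if cluster.has_leader() and cluster.membership_settled():
                break                               # safe to move on to the next node
            time.sleep(5)
        else:
            raise RuntimeError(f"membership did not settle after removing {node}")
```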
6. Action Items