Summary
On April 14, 2026, starting at approximately 06:00 UTC, historic data in Bitmovin Observability became partially unavailable. API queries for timeframes before 06:00 UTC returned empty or incomplete results, and minute-level granularity in the Dashboard was unavailable for affected timeframes. Hourly granularity for historic data in the Dashboard remained unaffected. API queries for data after 06:00 UTC were fully operational throughout the incident.
Root Cause
On April 13, 2026, we upgraded our database software to a new version. This version had been running successfully in our QA environment for over one month prior to the production rollout. On April 14, our daily retention job, which removes data that has exceeded its retention period, executed normally but triggered a pre-existing bug in the new database version. This bug caused a metadata corruption that led the database to misidentify the storage location of existing data. As a result, queries against historic data returned empty results even though the underlying data was still intact on disk.
The bug was specific to tables that had been created on an older version of the database and subsequently upgraded, which is why it did not surface in our QA environment. We worked with our database vendor to identify the root cause and determine the best recovery path. The vendor has confirmed the issue and developed a fix, which we expect to be released today (April 15, 2026).
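To illustrate the operation that triggered the bug: a daily retention job conceptually drops data partitions older than the retention window. The sketch below is a minimal Python model with hypothetical names (`run_retention_job`, daily partitions keyed by date, a 30-day window); it is not our actual implementation, and the real deletion is a database-level partition drop rather than a set removal.

```python
from datetime import date, timedelta

# Illustrative retention window; the real system runs 30-day and 90-day stores.
RETENTION_DAYS = 30

def expired_partitions(partitions, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions whose data has exceeded its retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [day for day in partitions if day < cutoff]

def run_retention_job(partitions, today):
    """Drop expired daily partitions.

    On the buggy database version, this kind of partition deletion corrupted
    partition metadata for tables created on an older version and later
    upgraded, so queries could no longer locate intact on-disk data.
    """
    for day in expired_partitions(sorted(partitions), today):
        partitions.remove(day)  # stands in for a DROP PARTITION statement
    return partitions
```

For example, running the job on April 14 with a 30-day window drops the March 1 partition while keeping anything newer than March 15.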
Impact
- API queries for timeframes before 06:00 UTC on April 14, 2026 returned empty or incomplete results.
- Minute-level granularity in the Observability Dashboard was unavailable for affected timeframes.
- Session details and unique viewers metrics were unavailable for affected timeframes.
- Data exports for timeframes before 06:00 UTC were also affected.
- During the final recovery steps, brief data inconsistencies were visible in the Dashboard while recovered data was being reconciled with live data.
- No data was lost.
All real-time monitoring, alerting, and data ingestion remained fully operational throughout the incident. Hourly granularity for historic data in the Dashboard was unaffected. All functionality related to data after 06:00 UTC, including API queries, exports, and Dashboard views at all granularities, was fully operational. The impact was limited to minute-level granularity, session details, unique viewers, and exports for timeframes before 06:00 UTC.
Timeline (all times UTC)
| Time | Event |
|------|-------|
| April 13 | Database software upgraded to new version |
| April 14, 06:00 | Daily retention job runs, triggering the metadata corruption |
| April 14, 06:35 | Routine data integrity check triggers an alarm; investigation begins |
| April 14, 07:08 | Issue escalated to database vendor support |
| April 14, 07:18 | Database vendor joins investigation call |
| April 14, 09:20 | Retention job identified as the trigger for the issue |
| April 14, 10:00 | Issue successfully reproduced |
| April 14, 12:19 | Root cause identified and fix developed by the vendor. Analysis concludes that recovery via a code change alone is not feasible; recovery plan using backups and secondary data stores initiated |
| April 14, 13:00 | Data recovery begins for the 30-day retention data store |
| April 14, 19:10 | 30-day retention data store recovery completed; 90-day retention data store recovery begins |
| April 15, 12:20 | 90-day retention data store recovery completed; incident resolved |
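The routine integrity check that raised the alarm at 06:35 can be thought of as comparing what ingestion recorded against what queries return for each timeframe. The sketch below is purely illustrative (`integrity_alarm` and its per-timeframe row counts are hypothetical, not our production tooling):

```python
def integrity_alarm(expected_counts, queried_counts):
    """Flag timeframes where a query returns fewer rows than ingestion recorded.

    expected_counts: hypothetical mapping of timeframe -> rows ingested
    queried_counts:  mapping of timeframe -> rows a query actually returned
    Returns the list of timeframes that should trigger an alarm.
    """
    alarms = []
    for timeframe, expected in expected_counts.items():
        got = queried_counts.get(timeframe, 0)
        if expected > 0 and got < expected:
            alarms.append(timeframe)
    return alarms
```

A check of this shape fires as soon as historic timeframes start returning empty results even though the underlying data was ingested, which is what surfaced the corruption 35 minutes after the retention job ran.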
Mitigation & Recovery
Once the root cause was identified, we worked with our database vendor, who developed a fix to prevent future occurrences. Because the corrupted metadata was already present in the system, recovery via a code change alone was not feasible. We therefore performed a full data recovery using backups and secondary data stores, carried out in two phases: first the 30-day retention data store, then the 90-day retention data store. Both phases completed successfully, and all historic data was fully restored.
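The reconciliation step that produced the brief Dashboard inconsistencies can be pictured as overlaying live rows on top of rows restored from backup, with live data taking precedence for any overlapping minute. This is a minimal sketch under those assumptions (minute-keyed rows and the precedence rule are illustrative, not the actual recovery tooling):

```python
def recover_and_reconcile(backup_rows, live_rows):
    """Restore historic rows from backup, then overlay live rows.

    backup_rows: hypothetical mapping of minute timestamp -> row restored
                 from backups / secondary data stores
    live_rows:   rows written by live ingestion, which stay authoritative
    Returns the merged view served once reconciliation is complete.
    """
    restored = dict(backup_rows)  # stands in for the bulk restore phase
    restored.update(live_rows)    # live data wins for overlapping minutes
    return restored
```

While the restore and the overlay run, readers can briefly observe either version of an overlapping row, which matches the transient inconsistencies noted above.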
Preventive Measures
- The database vendor has developed a fix that prevents the metadata corruption from occurring during partition deletions. We expect the fix to be released today (April 15, 2026).
- We will apply the fix to our production systems once it is available.
- We will ensure our QA environment includes tables that mirror production conditions, including tables originally created on older database versions and subsequently upgraded, to catch similar edge cases before production rollouts.
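The QA coverage measure above can be sketched as a check that, for every database version our upgraded production tables originate from, QA holds at least one table with the same origin. Table names and version labels below are hypothetical:

```python
def qa_covers_upgrade_paths(qa_tables, required_origin_versions):
    """Check QA coverage of upgrade paths.

    qa_tables: set of (table_name, created_on_version) pairs describing the
               QA environment (hypothetical records, not our real inventory)
    required_origin_versions: set of database versions that production tables
               were originally created on
    Returns True only if every required origin version is represented in QA.
    """
    present = {created_on for _name, created_on in qa_tables}
    return required_origin_versions <= present
```

Run as a gate before production rollouts, a check like this would have flagged that no QA table shared the created-on-an-older-version history that triggered the bug.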