Bitmovin - Delayed Observability Data Ingestion (US Region) – Incident details

Delayed Observability Data Ingestion (US Region)

Resolved
Major outage
Started about 1 hour ago
Lasted about 1 hour

Affected

Observability/Analytics

Degraded performance from 4:37 PM to 5:35 PM

Data Ingress

Degraded performance from 4:37 PM to 5:35 PM

Query Service

Degraded performance from 4:37 PM to 5:35 PM

Updates
  • Resolved

    This incident is now resolved. As of 17:32 UTC, all buffered data has been successfully backfilled into our central Observability data store, and ingestion has been operating normally on the alternative transport protocol since the failover at 16:52 UTC.

    There was no data loss. Records buffered during the connectivity issue are now fully available for querying.

    The incident began at 16:26 UTC and was fully resolved at 17:32 UTC. We apologize for any inconvenience and will follow up with a post-incident review once we have performed a full root cause analysis.

  • Identified

    We have identified the cause of the connectivity issue between our US datacenter and our central Observability data store and have failed over to an alternative transport protocol. Real-time ingestion of US analytics data has been fully restored.

    We are now draining the backlog of buffered data into the database. Some recently buffered records may appear with a short delay until this process completes. No data loss is expected.

    We will continue to monitor the system and will post a final update once all buffered data has been backfilled.

  • Investigating

    We are investigating connectivity issues between our US datacenter and our central Observability data store that began at 16:26 UTC. As a result, approximately 20% of analytics requests originating in the US are not being ingested in real time.

    Affected data is being buffered and will be inserted into the database once connectivity is restored, so no data is being lost. Querying of previously ingested data is unaffected.

    We are working with our cloud providers to identify the root cause and will provide an update as soon as we have more information.