Bitmovin - Notice history

100% - uptime

Encoding API - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 99.91%

Encoding Scheduler - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 100.0%

Notification Service - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 100.0%

Data Ingress - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 100.0%

Query Service - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 100.0%

Export Service - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 100.0%

Alerting Service - Operational

100% - uptime
Aug 2025 · 100.0% · Sep 2025 · 100.0% · Oct 2025 · 100.0%

Notice history

Oct 2025

Encoding Failures in us-east-1 Due to EC2 Instance Creation Issues
  • Resolved

    The AWS incident affecting EC2 instance creation in us-east-1 has been resolved by AWS as of Oct 20, 21:48 UTC.

  • Update

    The AWS incident affecting EC2 instance creation in us-east-1 is still ongoing. As a result, encoding jobs in this region may continue to fail to start.

    There are currently no actions we can take on our side until AWS resolves the underlying issue. Therefore, we will pause posting further updates until AWS has marked the incident as resolved.

    For the latest information, we recommend monitoring the AWS Health Dashboard: https://health.aws.amazon.com/health/status

    In the meantime, we continue to recommend switching to another unaffected cloud region or using the fallbackRegion setting as described in our Encoding Incident Operational Playbook: https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings

  • Update

    The incident is still ongoing. Our encoding service in us-east-1 continues to be impacted by AWS EC2 instance creation issues.

    We are waiting for further updates from AWS and will continue to monitor their recovery efforts. For more detailed and up-to-date information, we recommend customers check the AWS Health Dashboard: https://health.aws.amazon.com/health/status

    In the meantime, we continue to recommend switching to another unaffected cloud region or using the fallbackRegion setting as described in our Encoding Incident Operational Playbook: https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings

  • Monitoring

    Since 12:30 UTC on October 20, our encoding service in the us-east-1 region has again been impacted by an AWS issue preventing the creation of new EC2 instances.

    As a result, new encoding jobs in this region may fail to start. Encodings already running on existing capacity are not affected.

    Workaround:
    We recommend that all customers switch to another unaffected cloud region or make use of the fallbackRegion setting when creating encodings, as described in our documentation: https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings (a request sketch appears at the end of this update).

    We are monitoring AWS’s recovery efforts closely and will provide further updates as more information becomes available.
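
    For reference, the request below is a minimal sketch of the fallbackRegion workaround described above, using the REST endpoint for creating encodings. The field names cloudRegion and fallbackCloudRegions, the region identifiers, and the response parsing are assumptions for illustration; the Encoding Incident Operational Playbook linked above is the authoritative reference for the exact request schema.

      # Minimal sketch (not a drop-in example): create an encoding whose preferred
      # region is AWS_US_EAST_1 but which may fall back to other regions.
      # Field names and region identifiers below are illustrative assumptions.
      import os
      import requests

      API_KEY = os.environ["BITMOVIN_API_KEY"]

      payload = {
          "name": "vod-job-with-fallback",
          "cloudRegion": "AWS_US_EAST_1",   # preferred region (currently impacted)
          "fallbackCloudRegions": [         # tried if the preferred region cannot start instances
              "AWS_US_WEST_2",
              "AWS_EU_WEST_1",
          ],
      }

      response = requests.post(
          "https://api.bitmovin.com/v1/encoding/encodings",
          json=payload,
          headers={"X-Api-Key": API_KEY},
          timeout=30,
      )
      response.raise_for_status()
      # Assumes the standard response envelope; adjust parsing to your API version.
      print(response.json()["data"]["result"]["id"])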

Elevated API Error Rate on Muxing Creation
  • Postmortem

    Incident Post-Mortem – Encoding Cleanup Gap-Lock

    Date: 2025-10-16
    Duration: 07:49–08:30 UTC (41 minutes)
    Customer Impact: Calls to the https://api.bitmovin.com/v1/encoding/encodings/{encoding_id}/muxings/ endpoints failed, leading to workflow stoppages. The failure rate remained below monitoring thresholds, so the issue was not immediately detected.

    Summary

    At 07:49 UTC, our encoding cleanup process (responsible for deleting old encodings that are outside their retention period) triggered a gap-lock on one of our database tables. While the long-running delete transaction was active, the API was unable to insert new records into the table.

    This directly impacted the muxings endpoints, which failed during this period and caused encoding workflows to stop. Because the error rate was below our monitoring alert thresholds, the incident went unnoticed until the team was alerted at 08:20 UTC.

    Root Cause

    The issue was caused by outdated MySQL table statistics, which led the MySQL query optimizer to select an inefficient execution plan (a diagnostic sketch follows the list below):

    • Instead of using an index scan on the large table, the query optimizer chose a full table scan.

    • Under REPEATABLE_READ isolation, this resulted in gap-locks across wide portions of the table.

    • In practice, this behaved like a table-level lock, blocking all inserts until the cleanup query was stopped.
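
    As an illustration of the diagnostic and the statistics fix (the table name, retention window, and connection details below are hypothetical placeholders, not our actual schema or cleanup query):

      # Illustrative sketch: inspect the execution plan of a cleanup-style DELETE and
      # refresh the statistics the optimizer relies on. A plan showing a full table
      # scan (type = ALL, no index used) is the warning sign that, under REPEATABLE
      # READ, the DELETE will hold gap locks over wide key ranges.
      import mysql.connector

      conn = mysql.connector.connect(
          host="db.internal", user="ops", password="...", database="encoding"
      )
      cur = conn.cursor()

      cur.execute(
          "EXPLAIN DELETE FROM encoding_muxings "
          "WHERE created_at < NOW() - INTERVAL 90 DAY"
      )
      for row in cur.fetchall():
          print(row)

      # Refreshing the table statistics lets the optimizer choose the index scan again.
      cur.execute("ANALYZE TABLE encoding_muxings")
      print(cur.fetchall())

      cur.close()
      conn.close()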

    Timeline

    • 07:49 UTC – Encoding cleanup process starts. Optimizer chooses full table scan → gap-locks prevent inserts.

    • 07:49–08:20 UTC – Calls to /encoding/encodings/{encoding_id}/muxings/ fail, workflows stop. Monitoring does not alert because the error rate is below the configured threshold.

    • 08:20 UTC – Engineering team is alerted.

    • 08:20–08:30 UTC – Cleanup process identified as root cause. Process is stopped, lock released.

    • 08:30 UTC – Incident resolved, API resumes normal operation.

    Immediate Actions

    After resolving the incident, we immediately stopped the cleanup process and investigated why it suddenly caused issues despite having run successfully for years.

    Since then, we have:

    • Tightened timeouts for deletion commands in the cleanup process.

    • Switched the cleanup transactions to a less strict isolation level to prevent broad blocking locks.

    • Adjusted the cleanup process to use smaller batch sizes and higher concurrency, which reduced database impact and increased throughput (a sketch of this batching approach follows this list).

    • Added significantly more monitoring to the cleanup process to better track query performance and catch anomalies early.
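
    The adjusted cleanup loop looks roughly like the sketch below (table name, batch size, timeout, and connection details are illustrative values, not our production configuration):

      # Illustrative sketch of the adjusted cleanup: small batches that commit (and
      # therefore release their locks) frequently, READ COMMITTED to avoid broad gap
      # locks, and a tight lock-wait timeout so a blocked DELETE fails fast instead
      # of stalling API inserts. Production runs several such workers concurrently.
      import mysql.connector

      BATCH_SIZE = 1000

      conn = mysql.connector.connect(
          host="db.internal", user="ops", password="...", database="encoding"
      )
      cur = conn.cursor()
      cur.execute("SET SESSION transaction_isolation = 'READ-COMMITTED'")
      cur.execute("SET SESSION innodb_lock_wait_timeout = 5")  # seconds

      while True:
          cur.execute(
              "DELETE FROM encoding_muxings "
              "WHERE created_at < NOW() - INTERVAL 90 DAY "
              "LIMIT %s",
              (BATCH_SIZE,),
          )
          conn.commit()  # releases the batch's locks before the next iteration
          if cur.rowcount < BATCH_SIZE:
              break

      cur.close()
      conn.close()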

    Next Steps

    • Update MySQL statistics so the query optimizer selects the correct execution plan.

      • Expected completion: by end of this week (2025-10-24)

    • Rework alerting strategy to detect low-rate but workflow-blocking errors earlier.

      • Expected completion: within 3 weeks (by 2025-11-06)

  • Resolved

    We experienced elevated API error rates on the encoding endpoints between 07:50 and 08:30 UTC. The issue, which caused failures during muxing creation in the encoding service, was identified and promptly resolved. The engineering team is preparing a post-mortem and will share it here once available.

  • Investigating
    We are currently investigating this incident.

Sep 2025

Encoding failures on AWS
  • Update

    We have identified the root cause of the encoding failures: a misconfiguration in our S3 bucket.
    The S3 configuration has been corrected, and encoding jobs on AWS are now recovering. We are seeing encoding tasks completing successfully again.
    We are actively monitoring the system to confirm full recovery.

  • Resolved

    Root Cause Analysis: Encoding Failures on AWS

    Summary

    On September 4, 2025, between 06:31 AM and 08:10 AM CEST, encoding jobs running on AWS failed due to an issue accessing our S3 storage.

    Root Cause

    The incident was caused by an error during a routine S3 key rotation. The old access key was deleted before the new key was in use, which temporarily prevented our encoding service from accessing storage.

    Impact

    • Only encoding jobs running on AWS were affected.

    • Encodings on other cloud providers and all other Bitmovin services were not impacted.

    Resolution

    The configuration was corrected at 08:10 AM CEST, restoring access to S3. Encoding operations on AWS recovered immediately and have been stable since.

    Preventive Measures

    To prevent this from happening again, we are:

    • Updating our key rotation procedure to ensure keys are not deleted prematurely.

    • Automating the key rotation process to reduce the chance of operator error (a sketch of the intended rotation order follows this list).
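
    As an illustration of the intended ordering (the IAM user name and the two helper stubs are placeholders; the real rollout and verification steps are specific to our infrastructure):

      # Sketch of a safe access-key rotation: create the new key, roll it out,
      # verify storage access, and only then deactivate the old key. Deleting the
      # old key is deferred until after a soak period so a rollback stays possible.
      import boto3

      iam = boto3.client("iam")
      USER = "encoding-service"  # placeholder IAM user


      def deploy_new_credentials(access_key_id: str, secret: str) -> None:
          """Placeholder: roll the new key out to the encoding service configuration."""


      def verify_storage_access() -> bool:
          """Placeholder: confirm the service can read and write the S3 bucket."""
          return True


      old_keys = iam.list_access_keys(UserName=USER)["AccessKeyMetadata"]
      new_key = iam.create_access_key(UserName=USER)["AccessKey"]

      deploy_new_credentials(new_key["AccessKeyId"], new_key["SecretAccessKey"])

      if verify_storage_access():
          for key in old_keys:
              iam.update_access_key(
                  UserName=USER, AccessKeyId=key["AccessKeyId"], Status="Inactive"
              )
      else:
          raise RuntimeError("New key is not usable; keep the old key active and investigate")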

  • Investigating

    We are currently experiencing failures with encodings on AWS.
    Our engineering team is investigating the issue.

Aug 2025

No notices reported this month

