Per-Title encoding jobs failing for certain customers
Incident Report for Bitmovin Inc
Postmortem

Summary

Some Per-Title encoding jobs started to fail due to a race condition in our logic. The race condition surfaced after we put an optimization in place in response to other incidents that were ongoing at the same time.

Date

The issue occurred between

  • November 04, 12:00, and November 06, 19:30. All times are UTC.

Root Cause

An optimization that sped up message processing, introduced to address how the finished encoding status propagates back to the Bitmovin API, exposed a race condition in resolving the analysis details of the Per-Title input file. When the race condition triggered, two entries of the analysis details were apparently written to our database. Later processing expected exactly one entry and therefore failed.
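
As an illustration of this failure mode only, the sketch below shows one common way to make such a write idempotent, so that two handlers racing on the same encoding cannot produce two entries. It is not the fix that was deployed. The table name `analysis_details`, its columns, and the function `save_analysis_details` are hypothetical, and SQLite stands in for the actual database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical analysis table."""
    conn = sqlite3.connect(":memory:")
    # The primary key on encoding_id makes a second entry for the same
    # encoding impossible at the database level.
    conn.execute(
        "CREATE TABLE analysis_details ("
        " encoding_id TEXT PRIMARY KEY,"
        " details TEXT NOT NULL)"
    )
    return conn


def save_analysis_details(conn: sqlite3.Connection, encoding_id: str, details: str) -> bool:
    """Write the analysis details once; return True if this call wrote the row."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO analysis_details (encoding_id, details) VALUES (?, ?)",
        (encoding_id, details),
    )
    conn.commit()
    # rowcount is 0 when the insert was ignored because the row already existed.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = make_db()
    # Two handlers processing the same message both try to write.
    print(save_analysis_details(conn, "enc-1", "first writer"))
    print(save_analysis_details(conn, "enc-1", "second writer"))
    print(conn.execute("SELECT COUNT(*) FROM analysis_details").fetchone()[0])
```

With a uniqueness constraint like this, a duplicate write becomes a no-op instead of a second entry, so later processing that expects exactly one entry keeps working even if the race occurs.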

Implications

Per-Title encoding jobs started to fail due to a race condition. Encoding jobs without a Per-Title configuration continued to work normally.

Remediation

On Nov 4th at around 16:00, the Bitmovin platform engineering team was notified of suddenly failing Per-Title encoding jobs. The team investigated and traced the failures to an optimization that had been deployed to address the performance issues of the other ongoing incidents. The optimization was rolled back, which reduced the number of failing Per-Title encoding jobs. On Nov 6th at 19:30, the Bitmovin platform team deployed a final fix that resolved the race condition.

Timeline

  • Nov 4th 12:00 Bitmovin was alerted to some failed Per-Title encoding jobs
  • Nov 4th 16:00 Bitmovin’s platform engineering team was notified
  • Nov 4th 17:00 Bitmovin’s platform engineering team started the in-depth investigation
  • Nov 4th 19:55 The engineering team rolled back an optimization that was suspected to be responsible for the failing Per-Title encoding jobs. After the rollback, the number of successful jobs increased, but some jobs were still failing.
  • Nov 6th 19:30 The engineering team deployed the final fix which solved the race condition for good.

Prevention

Bitmovin’s platform engineering team will put additional monitoring and alerting in place to detect newly introduced errors in existing workflows faster and to limit customer impact. Specifically, changes in the encoding job error rate after optimizations are deployed during incidents will be monitored and alerted on more rigorously, so that such regressions are detected sooner.
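
A minimal sketch of the kind of check described above, comparing the job error rate in a window before a deployment with the window after it. It does not describe the actual monitoring setup. The `Window` type, the `should_alert` function, and both thresholds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Job counts observed in one time window (hypothetical shape)."""
    total: int
    failed: int

    @property
    def error_rate(self) -> float:
        # An empty window has no failures to report.
        return self.failed / self.total if self.total else 0.0


def should_alert(before: Window, after: Window,
                 max_increase: float = 0.05, min_jobs: int = 20) -> bool:
    """Alert when the error rate rises by more than max_increase after a deploy.

    min_jobs avoids alerting on a post-deploy window that is too small
    to be meaningful. Both defaults are illustrative assumptions.
    """
    if after.total < min_jobs:
        return False
    return after.error_rate - before.error_rate > max_increase


if __name__ == "__main__":
    before = Window(total=200, failed=2)
    print(should_alert(before, Window(total=100, failed=15)))  # large jump in failures
    print(should_alert(before, Window(total=100, failed=2)))   # roughly unchanged
```

Scoping such a check to a single workflow, for example Per-Title jobs only, would keep a regression in one job type from being hidden by the healthy volume of all other jobs.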

Posted Nov 08, 2022 - 15:17 UTC

Resolved
Some Per-Title encodings started to fail due to a race condition in our logic that manifested after putting an optimization in place due to other incidents ongoing at the same time.
Posted Nov 04, 2022 - 11:00 UTC