Some Per-Title encoding jobs started to fail due to a race condition in our logic. The race condition surfaced after we deployed an optimization in response to other incidents that were ongoing at the same time.
The issue occurred between
An optimization that sped up message processing, intended to help the status of finished encodings propagate back to the Bitmovin API quickly, exposed a race condition in resolving the analysis details of the Per-Title input file. When the race condition triggered, two entries of the analysis details appear to have been written to our database. Later processing expected exactly one entry and therefore failed.
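The failure mode described above, two handlers each writing the same record, can be illustrated with a minimal sketch. It is not the fix that was deployed. It only shows one common way to make such a write safe, assuming a hypothetical in-memory `AnalysisStore` in place of the real database and a hypothetical `save_once` method. In a real database, a unique constraint on the encoding identifier would typically play the role of the lock.

```python
import threading


class AnalysisStore:
    """Hypothetical in-memory stand-in for the analysis-details table."""

    def __init__(self):
        self._rows = []  # list of (encoding_id, details) tuples
        self._lock = threading.Lock()

    def save_once(self, encoding_id, details):
        """Store details unless an entry for encoding_id already exists.

        The existence check and the insert run under one lock, so two
        concurrent handlers cannot both see "missing" and both write.
        Returns True if this call wrote the entry, False otherwise.
        """
        with self._lock:
            if any(eid == encoding_id for eid, _ in self._rows):
                return False
            self._rows.append((encoding_id, details))
            return True

    def entries(self, encoding_id):
        """Return all stored details for the given encoding."""
        with self._lock:
            return [d for eid, d in self._rows if eid == encoding_id]


def run_concurrent_writers(store, encoding_id, writers=8):
    """Start several threads that all try to write the same analysis entry."""
    threads = [
        threading.Thread(target=store.save_once, args=(encoding_id, {"writer": i}))
        for i in range(writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.entries(encoding_id)
```

Without the lock, two threads could both pass the existence check before either appends, which would leave the duplicate entries that downstream processing does not expect.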
Per-Title encoding jobs started to fail due to a race condition. Encoding jobs without a Per-Title configuration worked normally.
On Nov 4th at around 16:00, the Bitmovin platform engineering team was notified of suddenly failing Per-Title encoding jobs. The team investigated and traced the root cause to an optimization that had been deployed to address the performance issues of the other ongoing incidents. The optimization was rolled back, which reduced the number of failed Per-Title encoding jobs. On Nov 6th at 19:30, the Bitmovin platform team deployed a final fix that resolved the race condition.
Bitmovin’s platform engineering team will put additional monitoring and alerting in place to detect newly introduced errors in existing workflows faster and to limit customer impact. Specifically, changes in the encoding job error rate after optimizations are deployed during incidents will be monitored and alerted on more rigorously, so that such regressions are detected sooner.
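As an illustration of the kind of check described above, the following sketch compares the encoding job error rate in a window before a deployment with the rate in a window after it. It is a hedged example, not Bitmovin's actual alerting. The names `JobWindow` and `should_alert` and the thresholds `max_increase` and `min_jobs` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class JobWindow:
    """Job counts observed in one time window (hypothetical structure)."""
    total: int
    failed: int

    @property
    def error_rate(self) -> float:
        # An empty window is treated as having no errors.
        return self.failed / self.total if self.total else 0.0


def should_alert(before: JobWindow, after: JobWindow,
                 max_increase: float = 0.05, min_jobs: int = 20) -> bool:
    """Return True if the error rate rose by more than max_increase.

    Windows with fewer than min_jobs jobs after the deployment are
    ignored so that a handful of failures does not trigger an alert.
    Both thresholds are illustrative assumptions.
    """
    if after.total < min_jobs:
        return False
    return after.error_rate - before.error_rate > max_increase
```

For example, `should_alert(JobWindow(200, 2), JobWindow(100, 20))` compares a 1% error rate before the deployment with a 20% rate after it and reports that an alert is warranted.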