Team Edition systems stuck in Updating due to issues with the underlying hardware of databases
Incident Report for VidiNet Platform
Postmortem

Incident Summary

On Wednesday 15/2 at 12:00 CET, some VidiCore as a Service systems were forced into the updating state outside the maintenance window. The update took longer than expected.

We were unable to stop the updates for all systems, which led to some systems being updated outside the maintenance window.

Leadup

Our scheme of applying infrastructure changes step by step, to avoid being throttled by AWS, was too slow. We could not stop some systems from being updated outside the maintenance window.

The maintenance involved rotating hardware and database maintenance that upgraded RDS Aurora Postgres from version 10 to version 11.

Pre-studies and dry runs performed on databases of all sizes before the planned maintenance window were successful.
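
For context, a major version upgrade of an Aurora PostgreSQL cluster is typically requested through the RDS API. The snippet below is a minimal, illustrative sketch of such a request using boto3; the cluster identifier, target version, and region are hypothetical and do not describe the actual maintenance pipeline used here.

    # Illustrative sketch only: requesting an Aurora PostgreSQL major version
    # upgrade through the RDS API. All identifiers and versions below are
    # hypothetical placeholders.
    import boto3

    rds = boto3.client("rds", region_name="eu-west-1")  # region is an assumption

    response = rds.modify_db_cluster(
        DBClusterIdentifier="example-team-edition-cluster",  # hypothetical cluster
        EngineVersion="11.16",                               # hypothetical 11.x target
        AllowMajorVersionUpgrade=True,   # required for a 10 -> 11 major upgrade
        ApplyImmediately=False,          # defer the change to the next maintenance window
    )
    print(response["DBCluster"]["Status"])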

Impact

Most systems were updated as expected during the maintenance window.

Some systems were not updated during the window and were instead forcefully moved into the updating state later that day, leaving them unresponsive for the duration of the update.

Detection

Our alarms triggered because multiple systems had been in the updating state beyond our threshold value (15 minutes).

We identified that these systems had been forcefully moved into the updating state to apply the changes that should have been applied earlier that morning.
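
The monitoring details are not part of this report, but the alarm amounts to a simple threshold check over system state, of the kind sketched below. The record structure and field names are hypothetical.

    # Minimal sketch of a "stuck in updating" threshold check. The system
    # records and their fields are hypothetical placeholders.
    from datetime import datetime, timedelta, timezone

    UPDATING_THRESHOLD = timedelta(minutes=15)

    def systems_stuck_updating(systems, now=None):
        """Return systems that have been in the UPDATING state longer than the threshold."""
        now = now or datetime.now(timezone.utc)
        return [
            s for s in systems
            if s["state"] == "UPDATING"
            and now - s["state_changed_at"] > UPDATING_THRESHOLD
        ]

    # Example usage with fabricated in-memory records:
    systems = [
        {"name": "team-edition-a", "state": "UPDATING",
         "state_changed_at": datetime.now(timezone.utc) - timedelta(minutes=40)},
        {"name": "team-edition-b", "state": "RUNNING",
         "state_changed_at": datetime.now(timezone.utc) - timedelta(hours=2)},
    ]
    for s in systems_stuck_updating(systems):
        print(f"ALARM: {s['name']} has been updating for more than 15 minutes")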

Response

Updating is a normal state for all VidiCore SaaS systems, and it is not unusual for updates to run longer than 15 minutes.

Support then contacted us to report that customers were facing outages on their systems.

Recovery

The updating systems could not be stopped, so we monitored their progress until the updates finished.

All systems recovered after the update was done, but on-premise VSA clients had to be manually restarted to restore a fully functional solution.

Timeline

All times are CET.

Wednesday 15/2.

08:00 - 08:45 - Pipelines with instructions to rotate hardware and run database maintenance were initiated.

08:45 - We communicated that the maintenance was over.

09:30 - We noticed that not all systems had been updated. We started planning another maintenance window for applying the changes to the rest of the systems.

12:35 - Some remaining systems were forced into the updating state, applying the previously planned changes.

16:00 - All systems finished updating.

Root Cause

Our step-by-step execution of the infrastructure changes ran for longer than expected. This resulted in some updates being triggered outside the maintenance window.

Lessons Learned and Corrective Actions

  • The VidiCore as a Service maintenance pipeline must update systems sequentially to avoid being throttled by AWS; a sketch of this approach follows this list.
  • Individual systems will be split across several maintenance windows so we have a better idea of when the actual downtime will occur.
  • Operations on AWS RDS Aurora consistently take longer than expected. Vidispine will collaborate with AWS engineers to gain a better understanding of best practices and avoid unnecessarily long or slow operations.
  • Extended maintenance windows for critical system components must be planned more carefully.
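
As a rough illustration of the first corrective action, the sketch below applies changes to one system at a time and backs off when AWS throttles a request. The function names and the throttling exception are placeholders, not the actual pipeline code.

    # Illustrative sketch of sequential updates with exponential backoff on
    # throttling. apply_change and ThrottlingError are placeholders.
    import time

    class ThrottlingError(Exception):
        """Placeholder for an AWS throttling error (e.g. a 'Throttling' client error)."""

    def apply_change(system_id):
        """Placeholder for applying one infrastructure change to one system."""
        print(f"applying change to {system_id}")

    def update_sequentially(system_ids, max_retries=5):
        for system_id in system_ids:
            delay = 1.0
            for _attempt in range(max_retries):
                try:
                    apply_change(system_id)
                    break
                except ThrottlingError:
                    time.sleep(delay)  # back off before retrying the same system
                    delay *= 2
            else:
                raise RuntimeError(f"could not update {system_id} after {max_retries} attempts")

    update_sequentially(["team-edition-a", "team-edition-b", "team-edition-c"])
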
Posted Feb 16, 2023 - 14:49 CET

Resolved
This incident has been resolved.
Posted Feb 15, 2023 - 16:46 CET
Update
Issues resolved. Post mortem in progress.
Posted Feb 15, 2023 - 16:26 CET
Update
We are continuing to investigate this issue.
Posted Feb 15, 2023 - 16:25 CET
Update
We are continuing to investigate this issue.
Posted Feb 15, 2023 - 15:49 CET
Update
We have found the root cause and are continuously monitoring the updating systems.
Posted Feb 15, 2023 - 14:27 CET
Update
We are continuing to investigate this issue.
Posted Feb 15, 2023 - 14:26 CET
Investigating
Issues with the underlying hardware powering databases on some Team Edition systems are causing those systems to be stuck in “Updating”.
We are monitoring this and expect it to be resolved soon.
Posted Feb 15, 2023 - 14:25 CET
This incident affected: Core service orchestrator.