On Wednesday 15/2 at 12:00 CET, some VidiCore as a Service systems were forced into the updating state outside the maintenance window. The update took longer than expected.
We were unsuccessful in stopping the updates for all systems, which led to some systems getting updated outside the maintenance window.
Our scheme of applying infrastructural changes step by step, to avoid being throttled by AWS, was too slow. As a result, we could not stop some systems from being updated outside the maintenance window.
The maintenance involved rotating hardware and upgrading RDS Aurora Postgres from version 10 to 11.
Pre-studies and dry runs, performed on databases of all sizes before the planned maintenance window, were successful.
Most systems were updated as expected during the maintenance window.
Some systems were not updated during the window and were instead forcefully moved into the updating state later that day, leaving them unresponsive for the duration of the update.
Our alarms reported that multiple systems had been in the updating state beyond our threshold value (15 minutes).
We identified that these systems had been forced into the updating state to apply the update that should have happened earlier that morning.
Updating is a natural state for all VidiCore SaaS systems, and it is not unusual for an update to run longer than 15 minutes.
Support then informed us that customers were facing outages on their systems.
The updates in progress could not be stopped, so we monitored them until they finished.
All systems recovered once the update was done, but on-premise VSA clients had to be restarted manually to restore full functionality.
All times are CET.
Wednesday 15/2.
08:00 - 08:45 - Pipelines with instructions to rotate hardware and run database maintenance were initiated.
08:45 - We communicated that the maintenance was over.
09:30 - We noticed that not all systems had been updated. We started planning another maintenance window for applying the changes to the rest of the systems.
12:35 - Some of the remaining systems were forced into the updating state, which applied the previously planned changes.
16:00 - All systems finished updating.
Our step-by-step execution of infrastructural changes ran longer than expected. This resulted in some updates being triggered outside the maintenance window.
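As an illustration only, one way to keep a step-by-step rollout inside a maintenance window is to check the remaining time before each step and defer whatever does not fit. The sketch below is not our actual pipeline: the function name `apply_in_steps`, its parameters, and the per-step time estimate are all assumptions made for the example.

```python
import time
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Tuple


def apply_in_steps(
    systems: Iterable[str],
    apply_change: Callable[[str], None],   # hypothetical: applies the change to one system
    window_end: datetime,                  # end of the maintenance window
    step_estimate: timedelta,              # assumed worst-case duration of one step
    pause_seconds: float = 0.0,            # pause between steps to avoid API throttling
    now: Callable[[], datetime] = datetime.now,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Apply a change one system at a time and stop before the window closes.

    Returns (applied, deferred). Deferred systems are left untouched so they
    can be scheduled for a later maintenance window.
    """
    pending = list(systems)
    applied: List[str] = []
    deferred: List[str] = []
    for index, system in enumerate(pending):
        # Only start a step that is expected to finish inside the window.
        if now() + step_estimate > window_end:
            deferred = pending[index:]
            break
        apply_change(system)
        applied.append(system)
        # Space out the steps, except after the last one.
        if pause_seconds > 0 and index < len(pending) - 1:
            sleep(pause_seconds)
    return applied, deferred
```

The `now` and `sleep` parameters exist only so the clock can be replaced in a dry run. The important design choice is that the deadline check happens before a step starts, so a slow rollout ends with a list of deferred systems rather than with changes that are applied later, outside the window.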