us-east-1 region, SOLR and Zookeeper issues
Incident Report for VidiNet Platform
Postmortem

The incident

Zookeeper instances failed to form a cluster during maintenance, and quorum was lost.

Zookeeper tasks were restarted one at a time so that 2 Zookeeper tasks stayed up throughout the maintenance. However, the current setup only allows for the failure of 1 task without losing quorum. When a second task entered an unhealthy state while one was already down for its update, the Zookeeper quorum was lost and SOLR tasks could no longer communicate with Zookeeper.
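
The quorum arithmetic behind this can be sketched as follows. Zookeeper needs a strict majority of its voting members to be up, so the sketch computes how many failures an ensemble survives and how much headroom is left while one task is down for an update. The function names are illustrative, and the 3-task size is an assumption: this report only states that 1 failure is tolerated.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of voting members that still forms a quorum."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many members can be down before quorum is lost."""
    return ensemble_size - majority(ensemble_size)


def spare_failures(ensemble_size: int, down_for_update: int = 1) -> int:
    """Unplanned failures still survivable while some tasks are down for an update."""
    return tolerated_failures(ensemble_size) - down_for_update


if __name__ == "__main__":
    # 3 is an assumed current size; 7 is the target named in this report.
    for size in (3, 7):
        print(
            f"ensemble={size} majority={majority(size)} "
            f"tolerated={tolerated_failures(size)} "
            f"spare_during_update={spare_failures(size)}"
        )
```

Under that assumption, a 3-task ensemble has no spare capacity once one task is down for its update, so a single unhealthy task is enough to lose quorum. A 7-task ensemble would still survive 2 further failures in the same situation.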

Although the error log was completely new to us, the issue was detected quickly thanks to the improved logging and monitoring implemented during the maintenance.

Mitigation steps:

  • Restarted the Zookeeper tasks.
  • Monitored the systems reporting errors and scaled SOLR to 0 tasks.
  • Slowly scaled SOLR back up while monitoring the logs for errors.
  • Implemented new log-based alarms for the newly detected error log.
  • Improved internal documentation to better act on this kind of error and avoid losing quorum.
  • Triggered a re-index on all systems.

How to avoid this in the future:

  • Scale Zookeeper to 7 tasks. This will allow the failure of 3 tasks without losing quorum (Ticket 208166).
  • Update Zookeeper to a newer version.
Posted Aug 18, 2022 - 17:04 CEST

Resolved
The incident has been resolved, and we have triggered a re-index on all systems. Should you have any further issues, please reach out to our support team.
Posted Aug 18, 2022 - 14:22 CEST
Monitoring
Services are now stable and we are actively monitoring the results. Unfortunately, a re-index might be required for systems in this region.
Posted Aug 18, 2022 - 13:47 CEST
Update
We are continuing to work on mitigating the issue and restoring stability to services. System performance and search results might be affected for the time being.
Posted Aug 18, 2022 - 12:39 CEST
Identified
The issue has been identified and we are currently applying measures to mitigate it.
Posted Aug 18, 2022 - 12:31 CEST
This incident affected: Core service orchestrator.