us-west-1 region, SOLR and Zookeeper issues
Incident Report for VidiNet Platform
Postmortem

The incident

Zookeeper instances failed to cluster during maintenance and the quorum was lost.

During the update, all remaining Zookeeper instances stopped working as soon as the first one was restarted.

The Zookeeper tasks were restarted one at a time and 2 Zookeeper tasks were kept up at all times, because the current setup only tolerates the failure of 1 task without losing the quorum. However, both remaining instances entered an unhealthy state while one was already down for the update. As a result, the Zookeeper quorum was lost and the SOLR tasks could no longer communicate with Zookeeper.
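
The quorum arithmetic behind this can be sketched as follows. This is a minimal illustration only: it assumes Zookeeper's default majority quorum and an ensemble of 3 tasks (inferred from the description above, not stated explicitly), and the function names are hypothetical helpers, not part of Zookeeper or of our tooling.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of tasks that can be down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, healthy_tasks: int) -> bool:
    """True if the healthy tasks still form a majority."""
    return healthy_tasks >= quorum_size(ensemble_size)


# Assumed ensemble of 3: one task down for the update still leaves a majority,
# but the quorum is gone once the two remaining tasks become unhealthy.
ASSUMED_ENSEMBLE_SIZE = 3
one_down_for_update = has_quorum(ASSUMED_ENSEMBLE_SIZE, healthy_tasks=2)
remaining_two_unhealthy = has_quorum(ASSUMED_ENSEMBLE_SIZE, healthy_tasks=0)
```

Under these assumptions, an ensemble of 3 needs 2 healthy tasks and tolerates 1 failure, so a single restart is safe only as long as the other two tasks stay healthy.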

Unfortunately, in this particular case there was no prior indication that both tasks could fail; this lack of visibility was the reason for the maintenance.

Zookeeper has been updated to a newer image with more verbose logging, and all systems are now logging adequately.

Mitigation steps:

  • Restarted the Zookeeper tasks.
  • Monitored systems with errors and scaled SOLR to 0 tasks.
  • Slowly scaled SOLR back up and monitored the logs for errors.
  • Implemented new log-based alarms for the newly detected error logs.
  • Improved internal documentation to better act on this kind of error and avoid losing quorum.
  • Triggered a re-index on all systems.

How to avoid this in the future:

  • Scale Zookeeper to 7 tasks. This will allow the failure of 3 tasks without losing the quorum (Ticket 208166).
  • Update Zookeeper to a newer version.
Posted Aug 18, 2022 - 17:11 CEST

Resolved
The incident has been resolved and a re-index has started on all services. Should you find any further issues, please reach out to our support team.
Posted Aug 18, 2022 - 17:03 CEST
Identified
The issue has been identified and we are working on bringing services back up.
Posted Aug 18, 2022 - 15:56 CEST