eu-west-1 region, SOLR and Zookeeper issues
Incident Report for VidiNet Platform
Postmortem

The incident

The OPS engineer failed to notice that one of the Zookeeper tasks was not healthy before conducting maintenance on Aug 15th.

Even though the Zookeeper tasks were restarted one at a time and 2 Zookeeper tasks were kept up at all times, the current setup only tolerates the failure of 1 task without losing the quorum. Because one task was already unhealthy, restarting one of the healthy tasks lost the quorum, and the SOLR tasks could no longer communicate with Zookeeper.
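The quorum arithmetic behind this can be sketched as follows. This is only an illustration: the function names are hypothetical, and the ensemble size of 3 is an assumption inferred from the statement that the setup tolerates the failure of 1 task.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many members can be down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


def safe_to_restart(ensemble_size: int, healthy: int) -> bool:
    """True if taking one healthy member down still leaves a quorum."""
    return healthy - 1 >= quorum_size(ensemble_size)


# Assumed 3-task ensemble with one task already unhealthy (2 healthy):
# restarting a healthy task would leave 1, which is below the quorum of 2.
print(safe_to_restart(ensemble_size=3, healthy=2))

# The planned 7-task ensemble needs 4 for quorum and tolerates 3 failures.
print(quorum_size(7), tolerated_failures(7))
```

Under these assumptions, a rolling restart is only safe when every task is healthy beforehand, which is why the unnoticed unhealthy task mattered.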

There were no alarms for this kind of issue, which delayed its detection.
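As an illustration of the kind of check that could catch an unhealthy task before maintenance, the sketch below uses Zookeeper's `ruok` four-letter command, which a running server answers with `imok`. It is not the alarm that was implemented for this incident. The function names are hypothetical, and the sketch assumes the default client port 2181 and that `ruok` is enabled on the servers (newer Zookeeper versions require it to be listed in `4lw.commands.whitelist`). Note that an `imok` reply only shows the server process is running without error; it does not prove the server has joined the quorum.

```python
import socket


def reply_is_ok(reply: bytes) -> bool:
    """A healthy server answers the 'ruok' command with exactly 'imok'."""
    return reply.strip() == b"imok"


def zk_is_ok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Send 'ruok' to one Zookeeper server; False on any error or bad reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            # The server closes the connection after replying, so read to EOF.
            while True:
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
    return reply_is_ok(b"".join(chunks))


def all_members_ok(members: list[tuple[str, int]]) -> bool:
    """True only if every listed ensemble member answers 'imok'."""
    return all(zk_is_ok(host, port) for host, port in members)
```

A pre-maintenance step could call `all_members_ok` with the ensemble's host and port pairs and refuse to proceed on `False`.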

Mitigation steps:

  • Restarted the Zookeeper tasks.
  • Monitored the systems with errors and scaled SOLR down to 0 tasks.
  • Slowly scaled the SOLR tasks back up while monitoring the logs for errors.
  • Implemented new log-based alarms.

How to avoid this in the future:

  • Scale Zookeeper to 7 tasks, which will allow the failure of 3 tasks without losing the quorum (Ticket 208166).
  • Update Zookeeper to a newer version.
Posted Aug 16, 2022 - 15:30 CEST

Resolved
The incident has been resolved. Please reach out to our support team if you still find any issues.
Posted Aug 16, 2022 - 12:17 CEST
Monitoring
A fix has been implemented and the region seems to be stable again. We are monitoring the results.
Posted Aug 16, 2022 - 11:59 CEST
Identified
We have identified the root cause and are working on bringing services back to a stable state.
Posted Aug 16, 2022 - 11:31 CEST
Investigating
We are currently investigating the issue.
Posted Aug 16, 2022 - 11:30 CEST
This incident affected: Core service orchestrator.