The incident

The OPS engineer failed to notice that one of the ZooKeeper tasks was unhealthy before conducting maintenance on Aug 15th.

Even though the ZooKeeper tasks were restarted one at a time and 2 ZooKeeper tasks were kept up at all times, the current setup tolerates the failure of only 1 task without losing the quorum. Because one task was already unhealthy, restarting one of the healthy tasks caused the quorum to be lost, and the Solr tasks could no longer communicate with ZooKeeper.
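
The quorum reasoning above can be sketched as a pre-restart guard. This is a minimal illustration, not the tooling used during the maintenance: the function names are hypothetical, and the ensemble size of 3 is an assumption inferred from the statement that only 1 failure is tolerated.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def safe_to_restart(ensemble_size: int, healthy: int) -> bool:
    """True if taking one healthy task down still leaves a majority."""
    return healthy - 1 >= quorum_size(ensemble_size)


# Assumed ensemble of 3 with one task already unhealthy (2 healthy):
# restarting another task would leave 1 healthy task, below the quorum of 2.
print(safe_to_restart(ensemble_size=3, healthy=2))
# With all 3 tasks healthy, a rolling restart of one task keeps the quorum.
print(safe_to_restart(ensemble_size=3, healthy=3))
```

The guard counts healthy tasks rather than running tasks, which is the distinction that mattered in this incident.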

There were no alarms for this kind of issue, which delayed its detection.
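
One way to detect an unhealthy task earlier is to probe each ZooKeeper server before maintenance. The sketch below uses ZooKeeper's "ruok" four-letter command, to which a running server replies "imok". The function name, default port, and timeout are assumptions for illustration, and this is not the log-based alarm that was actually implemented. Newer ZooKeeper versions may require the command to be enabled through the 4lw.commands.whitelist setting.

```python
import socket


def zk_is_ok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Send 'ruok' to a ZooKeeper server and report whether it answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = sock.recv(16)
    except OSError:
        # Connection refused, timeout, or reset all count as "not healthy".
        return False
    return reply == b"imok"
```

Note that "imok" only indicates that the server process is running and bound to its client port; it does not prove the server has joined the quorum. A stricter check would also inspect the server mode reported by the "srvr" or "mntr" commands.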

Mitigation steps:

Restarted the ZooKeeper tasks.
Monitored the systems reporting errors and scaled Solr down to 0 tasks.
Slowly scaled Solr back up while monitoring the logs for errors.
Implemented new log-based alarms.

How to avoid this in the future:

Scale ZooKeeper to 7 tasks, which will allow the failure of 3 tasks without losing the quorum (Ticket 208166).
Update ZooKeeper to a newer version.
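
The fault tolerance gained by scaling to 7 tasks follows from majority arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the sizes 3 and 5 are included for comparison rather than taken from the actual setup.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of tasks that can fail while a strict majority remains."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


for size in (3, 5, 7):
    quorum = size // 2 + 1
    print(f"{size} tasks: quorum {quorum}, tolerates {tolerated_failures(size)} failure(s)")
```

Even ensemble sizes add no tolerance over the next smaller odd size (for example, 8 tasks still tolerate only 3 failures), which is why odd sizes are the usual choice.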

Posted Aug 16, 2022 - 15:30 CEST

Resolved

The incident has been resolved. Please reach out to our support team if you still experience any issues.

Posted Aug 16, 2022 - 12:17 CEST

Monitoring

A fix has been implemented and the region seems to be stable again. We are monitoring the results.

Posted Aug 16, 2022 - 11:59 CEST

Identified

We have identified the root cause and are working on bringing services back to a stable state.

Posted Aug 16, 2022 - 11:31 CEST

Investigating

We are currently investigating the issue.

Posted Aug 16, 2022 - 11:30 CEST

This incident affected: Core service orchestrator.