
Review and Document Findings for Redis cluster not starting gracefully when there is disruption in openshift #2382

Closed
4 tasks done
guru-aot opened this issue Oct 5, 2023 · 5 comments

@guru-aot
Collaborator

guru-aot commented Oct 5, 2023

The Redis cluster is not starting gracefully when there is a disruption in OpenShift. The main cause we found: after a restart, the replica (slave) nodes get stuck in an intermediate state because they cannot find the master node.

https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover

Additional Context

  • The only option we are following right now is deleting the existing Redis cluster using the make delete-redis command and then re-initializing and starting a new one.
  • The problem with the above approach is that it deletes the Redis PVC, so any jobs waiting in the queues are all lost.

Acceptance Criteria

  • Find a solution, either by running Redis standalone or with another configuration, to restart Redis gracefully during any disruption in the environment. (review and document)
  • Investigate in the community, as this is a common problem (Rocket.Chat).
  • Also check why a restart of the workers and API is needed when Redis restarts; find out whether we can add a dependency so that a Redis restart automatically restarts the API and workers.
  • Investigate automatic recovery of the Redis cluster if the cluster is set up with only 3 masters (3 pods), unlike the current setup with 3 masters and 3 replicas (6 pods).
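For the masters-only investigation above, redis-cli's built-in cluster tooling supports creating a cluster with zero replicas. A minimal sketch, assuming placeholder pod hostnames rather than the actual SIMS service names:

```shell
# Sketch: create a 3-master Redis cluster with no replicas.
# The hostnames below are placeholders, not the real SIMS pod/service names.
redis-cli --cluster create \
  redis-0.redis:6379 redis-1.redis:6379 redis-2.redis:6379 \
  --cluster-replicas 0 --cluster-yes
```

With `--cluster-replicas 0`, every node is a master holding a slice of the hash slots; losing a pod means losing slots until it recovers, which is the trade-off this AC asks us to evaluate.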
@cditcher

This appears to be a common problem. For reference: https://chat.developer.gov.bc.ca/channel/redis

@michesmith michesmith changed the title Redis cluster not starting gracefully when there is disruption in openshift Review and Document Findings for Redis cluster not starting gracefully when there is disruption in openshift Oct 18, 2023
@michesmith michesmith added this to the Sprint 53 milestone Oct 24, 2023
@dheepak-aot
Collaborator

Added an AC as per discussion in the dev call.

@guru-aot guru-aot self-assigned this Oct 25, 2023
@guru-aot
Collaborator Author

guru-aot commented Nov 1, 2023

Restarting the API and worker pods is not necessary, but restarting the queue consumers is required.

The wiki has also been updated accordingly:
https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover

@andrewsignori-aot
Collaborator

  • Also check why a restart of the workers and API is needed when Redis restarts; find out whether we can add a dependency so that a Redis restart automatically restarts the API and workers.

What is the technical reason to restart queue-consumers? Is there something that can be changed to avoid manual intervention?

guru-aot added a commit that referenced this issue Nov 2, 2023
Created a make command from the recommendation below from the Redis channel in Rocket.Chat, and it seems to recover our Redis in the environment without deleting it completely and creating it again.


![image](https://github.com/bcgov/SIMS/assets/62901416/0dc17383-c929-4224-85eb-23c586a94a1e)

Note: this command has to be run every time we take Redis down manually, or after any disruption that restarts the Redis pods. In parallel, a Sysdig alert that checks the health of the Redis pods in the environment would help us catch delays in application queue processing.

The wiki has also been updated accordingly:

https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover
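The exact command lives in the screenshot above, which did not survive extraction. As an assumption (not necessarily the screenshotted command), a common recovery along these lines uses redis-cli's cluster repair tooling from inside one of the pods:

```shell
# Sketch of a typical cluster recovery (assumption: not necessarily the
# exact command from the screenshot above).
# "redis-cluster-0" is a placeholder pod name, not the real SIMS pod name.
oc exec redis-cluster-0 -- redis-cli --cluster fix localhost:6379

# Verify the cluster reports a healthy state afterwards.
oc exec redis-cluster-0 -- redis-cli cluster info | grep cluster_state
```

Wrapping commands like these in a make target is what lets the cluster be repaired in place instead of deleting the PVC and losing the queued jobs.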
@guru-aot
Collaborator Author

guru-aot commented Nov 2, 2023

I am not sure of the technical reason; the queue consumers were not able to connect with Redis and the dashboard was not loading. Once we restarted them, we were able to recover the dashboard.

5 participants