Review and Document Findings for Redis cluster not starting gracefully when there is disruption in openshift #2382
Comments
This appears to be a common problem. For reference: https://chat.developer.gov.bc.ca/channel/redis
Added an AC as per discussion in dev call
Restarting the api and workers pods is not necessary, but restarting the queue consumers is required. The wiki has also been updated accordingly.
What is the technical reason to restart queue-consumers? Is there something that can be changed to avoid manual intervention?
Created a make command based on the recommendation below from the Redis channel in Rocket.Chat. It seems to recover our Redis without deleting it completely from the environment and creating it again. ![image](https://github.com/bcgov/SIMS/assets/62901416/0dc17383-c929-4224-85eb-23c586a94a1e) Note: this command has to be run every time we take Redis down manually or a disruption restarts the Redis pods. In parallel, a Sysdig alert that checks the health of the Redis pods in the environment would help us catch delays in the application queue processing. The wiki has also been updated accordingly: https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover
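For the health check idea above, here is a minimal sketch of what such a check could evaluate. It assumes the alert (or a small probe script) can obtain the text returned by the Redis `CLUSTER INFO` command, for example through `redis-cli`; how that text is collected and wired into Sysdig is not covered here, and the function names `parse_cluster_info` and `cluster_is_healthy` are hypothetical.

```python
def parse_cluster_info(raw: str) -> dict[str, str]:
    """Turn the `key:value` lines of a CLUSTER INFO reply into a dict."""
    info: dict[str, str] = {}
    for line in raw.splitlines():  # splitlines() also handles the \r\n Redis uses
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def cluster_is_healthy(info: dict[str, str]) -> bool:
    """Assumed definition of healthy: state is ok and no slots are failing."""
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_pfail", "0") == "0"
        and info.get("cluster_slots_fail", "0") == "0"
    )
```

A probe could call `cluster_is_healthy(parse_cluster_info(reply))` on a schedule and raise the alert when it returns `False`. The choice of fields to check is an assumption and may need tuning for our environment.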
I am not sure. The queue consumers were not able to connect to Redis and the dashboard was not loading. Once we restarted them, we were able to recover the dashboard.
The Redis cluster does not start gracefully when there is a disruption in OpenShift. The main reason we found is that the restart leaves the slave nodes in an intermediate state because they cannot find the master node.
https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover
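To make the "slave nodes cannot find the master" condition easier to spot, here is a rough sketch that reads the text returned by the Redis `CLUSTER NODES` command and lists replicas that look stranded. It assumes the standard space-separated line layout of that reply (id, address, flags, master id, ping-sent, pong-recv, config-epoch, link-state, slots). The `ClusterNode` class and the `stranded_replicas` helper are hypothetical names, and the definition of "stranded" is an assumption, not a confirmed diagnosis of our cluster.

```python
from dataclasses import dataclass


@dataclass
class ClusterNode:
    node_id: str
    address: str
    flags: set[str]
    master_id: str   # "-" when the node is itself a master
    link_state: str  # "connected" or "disconnected"


def parse_cluster_nodes(raw: str) -> list[ClusterNode]:
    """Parse each line of a CLUSTER NODES reply into a ClusterNode."""
    nodes = []
    for line in raw.splitlines():
        fields = line.split()
        if len(fields) < 8:  # skip blank or truncated lines
            continue
        nodes.append(
            ClusterNode(
                node_id=fields[0],
                address=fields[1],
                flags=set(fields[2].split(",")),
                master_id=fields[3],
                link_state=fields[7],
            )
        )
    return nodes


def stranded_replicas(nodes: list[ClusterNode]) -> list[ClusterNode]:
    """Replicas whose master is unknown or failing, or whose link is down."""
    healthy_masters = {
        n.node_id
        for n in nodes
        if "master" in n.flags and not ({"fail", "fail?"} & n.flags)
    }
    return [
        n
        for n in nodes
        if "slave" in n.flags
        and (n.master_id not in healthy_masters or n.link_state != "connected")
    ]
```

A non-empty result from `stranded_replicas` would be a hint that the recovery command needs to be run. It does not perform the recovery itself.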
Additional Context
Workaround so far: running the make delete-redis command, then re-initializing and starting a new Redis.
Acceptance Criteria