
Review and Document Findings for Redis cluster not starting gracefully when there is disruption in openshift #2382

Closed
4 tasks done
guru-aot opened this issue Oct 5, 2023 · 5 comments

@guru-aot
Collaborator

guru-aot commented Oct 5, 2023

The Redis cluster is not starting gracefully when there is a disruption in OpenShift. The main cause we found: after a restart, the replica (slave) nodes get stuck in an intermediate state because they cannot find the master node.

https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover

Additional Context

  • The only option we are following right now is deleting the existing Redis cluster using the make delete-redis command and then re-initializing and starting a new one.
  • The problem with the above approach is that it deletes the Redis PVC, so any jobs waiting in the queues are all lost.

Acceptance Criteria

  • Find a solution, either by running Redis standalone or with another configuration, to restart Redis gracefully during any disruption in the environment. (review and document)
  • Investigate in the community, as this is a common problem (Rocket.Chat).
  • Also check why a restart of the workers and API is needed when Redis restarts; find out whether we can add a dependency so that a Redis restart automatically restarts the API and workers.
  • Investigate automatic recovery of the Redis cluster if the cluster is set up with only 3 masters (3 pods), unlike the current setup with 3 masters and 3 replicas (6 pods).
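For the masters-only investigation above, redis-cli's built-in cluster tooling supports creating a cluster with zero replicas. A minimal sketch, assuming placeholder pod hostnames rather than the actual SIMS service names:

```shell
# Sketch: create a 3-master Redis cluster with no replicas.
# The hostnames below are placeholders, not the real SIMS pod/service names.
redis-cli --cluster create \
  redis-0.redis:6379 redis-1.redis:6379 redis-2.redis:6379 \
  --cluster-replicas 0 --cluster-yes
```

With `--cluster-replicas 0`, every node is a master holding a slice of the hash slots; losing a pod means losing slots until it recovers, which is the trade-off this AC asks us to evaluate.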
@cditcher

This appears to be a common problem. For reference: https://chat.developer.gov.bc.ca/channel/redis

@michesmith michesmith changed the title Redis cluster not starting gracefully when there is disruption in openshift Review and Document Findings for Redis cluster not starting gracefully when there is disruption in openshift Oct 18, 2023
@michesmith michesmith added this to the Sprint 53 milestone Oct 24, 2023
@dheepak-aot
Collaborator

Added an AC as per discussion in the dev call.

@guru-aot guru-aot self-assigned this Oct 25, 2023
@guru-aot
Collaborator Author

guru-aot commented Nov 1, 2023

Restarting the API and worker pods is not necessary, but restarting the queue consumers is required.

The wiki has also been updated accordingly:
https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover

@andrewsignori-aot
Collaborator

  • Also check why a restart of the workers and API is needed when Redis restarts; find out whether we can add a dependency so that a Redis restart automatically restarts the API and workers.

What is the technical reason to restart queue-consumers? Is there something that can be changed to avoid manual intervention?

guru-aot added a commit that referenced this issue Nov 2, 2023
Created a make command from the recommendation below from the Redis channel in Rocket.Chat, and it seems to recover our Redis in the environment without deleting it completely and creating it again.


![image](https://github.com/bcgov/SIMS/assets/62901416/0dc17383-c929-4224-85eb-23c586a94a1e)

Note: this command has to be run every time we take Redis down manually, or after any disruption that restarts the Redis pods. In parallel, a Sysdig alert that checks the health of the Redis pods in the environment would help us catch delays in application queue processing.

The wiki has also been updated accordingly:

https://github.com/bcgov/SIMS/wiki/DevOps-and-Running-the-Application#redis-cluster-failing-to-recover
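The exact command lives in the screenshot above, which did not survive extraction. As an assumption (not necessarily the screenshotted command), a common recovery along these lines uses redis-cli's cluster repair tooling from inside one of the pods:

```shell
# Sketch of a typical cluster recovery (assumption: not necessarily the
# exact command from the screenshot above).
# "redis-cluster-0" is a placeholder pod name, not the real SIMS pod name.
oc exec redis-cluster-0 -- redis-cli --cluster fix localhost:6379

# Verify the cluster reports a healthy state afterwards.
oc exec redis-cluster-0 -- redis-cli cluster info | grep cluster_state
```

Wrapping commands like these in a make target is what lets the cluster be repaired in place instead of deleting the PVC and losing the queued jobs.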
@guru-aot
Collaborator Author

guru-aot commented Nov 2, 2023

I am not sure of the technical reason; the queue consumers were not able to connect with Redis and the dashboard was not loading. Once we restarted them, we were able to recover the dashboard.

5 participants