-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1875 - Implementing resiliency measures - PodDisruptionBudget #2237
Conversation
guru-aot
commented
Aug 31, 2023
•
edited
Loading
edited
- Ensure the pod restart, bandwidth and disruption budgets are set.
- Vertical Pod Autoscaler values are changed to release resources when there is less need. (Sample sanity load testings are done as part of this in dev namespace before changing)
- As a part of next PR or next story, PodDisruptionBudget for workers and queue-consumers are set , but once the liveliness and readiness probe are set it will work automatically.
devops/openshift/web-deploy.yml
Outdated
selector: | ||
matchLabels: | ||
app: ${NAME} | ||
minAvailable: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if wrong, I see the kubernetes documentation suggesting differently for frontends. https://kubernetes.io/docs/tasks/run-application/configure-pdb/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the documentation you have shared the minAvailable should not be reduced by the serving capacity by more than 10%. In our scenario the HorizontalPodAutoscaler we have a maxReplica of 10 and minReplicas as 2 and PDB maxUnavailable as 1, this means the overall reducing capacity is still at 10% which actually follows the documentation.
devops/openshift/api-deploy.yml
Outdated
selector: | ||
matchLabels: | ||
app: ${NAME} | ||
minAvailable: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per openshift/kubernetes, maxUnavailable
and minAvailable
are mutually exclusive. Should we use both?
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was in an assumption, we can use both minAvailable and maxUnavailable in a PodDisruptionBudget, in our current scenario we set minAvailable to 2 and maxUnavailable to 1, it ensures there are always at least 2 pods available, and only 1 pod can be undergoing disruption at any given time.
But after analyzing the documents, its right to specify only maxUnavailable 1, as this will give the option for node drain and the replica will return back to 2 once the disruption is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if my understanding is right, but document says only one of the properties can be used for a single pdb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thank you for the explanation.
devops/openshift/web-deploy.yml
Outdated
- name: MEMORY_LIMIT | ||
value: "256M" | ||
value: "512M" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SIMS-Api will do way more processing than the Web POD, why we would justify more memory allocation to Web than API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initially we had only one pod API with more memory, but due to the change in replication controller to have max replicas of 10 we changed it to smaller numbers. In the case of web, even though there are more processing done by API, the web had to render the application faster, so for safer side, I had the values in the vertical pod bumped up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The POD will not render anything, it will just allow the download of static files (HTML, js, CSS, ect.). I did not follow the explanation.
devops/openshift/api-deploy.yml
Outdated
matchLabels: | ||
app: ${NAME} | ||
maxUnavailable: 1 | ||
disruptionsAllowed: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind elaborating on how the disruptionsAllowed
configuration affects the PDB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
disruptionsAllowed is a status configuration, I thought i can configure them. Removing it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure about what it was after reading some docs that why I asked 😉
selector: | ||
matchLabels: | ||
app: ${NAME} | ||
minAvailable: 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a source of recommendation for Redis PDB configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As all the databases we use are statefulsets and I followed the documentation below
https://kubernetes.io/docs/tasks/run-application/configure-pdb/#identify-an-application-to-protect
The numbers I have put in there are values I have like maxUnavailable for each pods, when the disruption happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if there is a recommendation from Gov or if we can get something from RocketChat.
@@ -182,6 +182,17 @@ objects: | |||
resources: | |||
requests: | |||
storage: ${PVC_SIZE} | |||
- apiVersion: policy/v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a source of recommendation for Patroni PDB configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -182,6 +182,17 @@ objects: | |||
resources: | |||
requests: | |||
storage: ${PVC_SIZE} | |||
- apiVersion: policy/v1 | |||
kind: PodDisruptionBudget | |||
metadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How we would apply these changes to PROD? Is there a plan for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to deploy the PDB configuration in our namespace manually, than running the yaml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, please take a look at the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes. I am approving the PR but I would like to have some extra info about the Redis/Patroni configurations, for instance, a double check on how gov in general have it configured. Can we do it?
Kudos, SonarCloud Quality Gate passed!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanations. Still have some clarification for my understanding for which I will sync up later.