#1875 - Implementing resiliency measures - PodDisruptionBudget #2237

Merged
guru-aot merged 8 commits into main from #1875-PodDisruptionBudget
Sep 5, 2023

Conversation

Collaborator

@guru-aot guru-aot commented Aug 31, 2023

  • Ensure the pod restart, bandwidth, and disruption budgets are set.
  • Vertical Pod Autoscaler values are changed to release resources when demand is low. (Sample sanity load tests were run in the dev namespace before making this change.)
  • PodDisruptionBudgets for workers and queue-consumers are set as well; they will start working automatically once the liveness and readiness probes are set, as part of the next PR or story.
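
The last bullet depends on probes because a PodDisruptionBudget counts only Ready pods as healthy. A rough sketch of what such probes might look like on a worker container follows; the container name, endpoint path, port, and timings are all hypothetical placeholders and are not taken from this PR.

```yaml
# Hypothetical probe configuration; every value below is a placeholder.
containers:
  - name: workers
    readinessProbe:            # pod is marked Ready only while this succeeds
      httpGet:
        path: /health
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 10
    livenessProbe:             # container is restarted if this keeps failing
      httpGet:
        path: /health
        port: 3000
      initialDelaySeconds: 30
      periodSeconds: 20
```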

@guru-aot guru-aot marked this pull request as ready for review September 1, 2023 14:59
@guru-aot guru-aot self-assigned this Sep 1, 2023
@guru-aot guru-aot added the Devops label Sep 1, 2023
@guru-aot guru-aot changed the title from "initial-commit" to "#1875 - Implementing resiliency measures - PodDisruptionBudget" Sep 1, 2023
    selector:
      matchLabels:
        app: ${NAME}
    minAvailable: 2
Collaborator

@dheepak-aot dheepak-aot Sep 1, 2023

Correct me if I'm wrong, but the Kubernetes documentation seems to suggest a different approach for frontends: https://kubernetes.io/docs/tasks/run-application/configure-pdb/

image

Collaborator Author

@guru-aot guru-aot Sep 1, 2023

The documentation you shared says that serving capacity should not be reduced by more than 10%. In our scenario, the HorizontalPodAutoscaler has maxReplicas of 10 and minReplicas of 2, and the PDB has maxUnavailable of 1. At the maximum of 10 replicas, the overall capacity reduction is therefore still 10%, which follows the documentation.
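
For comparison, the percentage-based form that the linked documentation mentions for stateless frontends could be sketched as below. This is only an illustration and not the configuration in this PR; the metadata name is a hypothetical placeholder, while `app: ${NAME}` comes from the template under review.

```yaml
# Illustrative sketch only; not part of this PR.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${NAME}-pdb          # hypothetical name
spec:
  minAvailable: "90%"        # percentage form the docs give as an example for frontends
  selector:
    matchLabels:
      app: ${NAME}
```

Kubernetes rounds percentage values up to a whole number of pods, so with only 2 replicas a 90% minimum works out to 2 pods and no voluntary eviction would be allowed. An absolute maxUnavailable of 1 avoids that at low replica counts.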

    selector:
      matchLabels:
        app: ${NAME}
    minAvailable: 2
Collaborator

@dheepak-aot dheepak-aot Sep 1, 2023

As per openshift/kubernetes, maxUnavailable and minAvailable are mutually exclusive. Should we use both?

https://kubernetes.io/docs/tasks/run-application/configure-pdb/
image

Collaborator Author

@guru-aot guru-aot Sep 1, 2023

I was under the assumption that we could use both minAvailable and maxUnavailable in a PodDisruptionBudget. In our current scenario we set minAvailable to 2 and maxUnavailable to 1, so that there are always at least 2 pods available and only 1 pod can be undergoing disruption at any given time.
But after going through the documentation, it is right to specify only maxUnavailable: 1, as this allows a node drain to proceed, and the replica count will return to 2 once the disruption is complete.
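
A minimal sketch of a PDB that sets only maxUnavailable, as described above. The metadata name is a hypothetical placeholder; the selector follows the template snippet in this thread.

```yaml
# Sketch under the assumptions stated above; not the exact template in this PR.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${NAME}-pdb          # hypothetical name
spec:
  maxUnavailable: 1          # at most one pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: ${NAME}
  # minAvailable is intentionally omitted: a single PDB accepts only one of the two
```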

Collaborator

Not sure if my understanding is right, but the documentation says only one of the two properties can be used in a single PDB.

Contributor

@ann-aot ann-aot left a comment

👍

Collaborator

@andrepestana-aot andrepestana-aot left a comment

Looks good. Thank you for the explanation.

- name: MEMORY_LIMIT
value: "256M"
value: "512M"
Collaborator

SIMS-Api will do way more processing than the Web pod. How would we justify allocating more memory to Web than to the API?

Collaborator Author

Initially we had only one API pod with more memory, but since the replication controller was changed to allow a maximum of 10 replicas, we reduced the values. In the case of Web, even though the API does more processing, the Web pod has to render the application faster, so to be on the safe side I bumped up the Vertical Pod Autoscaler values.

Collaborator

The pod will not render anything; it just serves static files for download (HTML, JS, CSS, etc.). I did not follow the explanation.

      matchLabels:
        app: ${NAME}
    maxUnavailable: 1
    disruptionsAllowed: 1
Collaborator

Do you mind elaborating on how the disruptionsAllowed configuration affects the PDB?

Collaborator Author

https://kubernetes.io/docs/tasks/run-application/configure-pdb/

image

disruptionsAllowed is a status field; I thought it could be configured. Removing it now.
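
For context, a sketch of where disruptionsAllowed shows up: in the status block that the controller fills in, for example in the output of `oc get pdb <name> -o yaml`. The label value and the numbers are made up for illustration, assuming 2 ready replicas and maxUnavailable: 1.

```yaml
# Illustrative excerpt of a live PDB object; values are assumptions.
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web               # hypothetical resolved value of ${NAME}
status:                      # written by the controller, not set in the template
  expectedPods: 2
  currentHealthy: 2
  desiredHealthy: 1          # expectedPods minus maxUnavailable
  disruptionsAllowed: 1      # currentHealthy minus desiredHealthy
```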

Collaborator

I was not sure what it was after reading some docs, that's why I asked 😉

    selector:
      matchLabels:
        app: ${NAME}
    minAvailable: 6
Collaborator

Do we have a source of recommendation for Redis PDB configuration?

Collaborator Author

All the databases we use are StatefulSets, so I followed the documentation below:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/#identify-an-application-to-protect

The numbers I put in there are the values I chose, similar to maxUnavailable, for each set of pods when a disruption happens.

Collaborator

Just wondering if there is a recommendation from Gov or if we can get something from RocketChat.

@@ -182,6 +182,17 @@ objects:
      resources:
        requests:
          storage: ${PVC_SIZE}
  - apiVersion: policy/v1
Collaborator

Do we have a source of recommendation for Patroni PDB configuration?

@@ -182,6 +182,17 @@ objects:
      resources:
        requests:
          storage: ${PVC_SIZE}
  - apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
Collaborator

@andrewsignori-aot andrewsignori-aot Sep 1, 2023

How would we apply these changes to PROD? Is there a plan for it?

Collaborator Author

The idea is to deploy the PDB configuration in our namespace manually, than running the yaml.

Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Nice work, please take a look at the comments.

Collaborator

@andrewsignori-aot andrewsignori-aot left a comment

Thanks for making the changes. I am approving the PR but I would like to have some extra info about the Redis/Patroni configurations, for instance, a double check on how gov in general have it configured. Can we do it?

sonarqubecloud bot commented Sep 1, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)
Coverage: no coverage information
Duplication: 0.0%

github-actions bot commented Sep 1, 2023

Backend Unit Tests Coverage Report

Totals Coverage
Statements: 17.94% ( 2177 / 12138 )
Methods: 8.16% ( 126 / 1544 )
Lines: 20.8% ( 1913 / 9195 )
Branches: 9.86% ( 138 / 1399 )

github-actions bot commented Sep 1, 2023

E2E Workflow Workers Coverage Report

Totals Coverage
Statements: 46.73% ( 300 / 642 )
Methods: 40% ( 32 / 80 )
Lines: 51.02% ( 251 / 492 )
Branches: 24.29% ( 17 / 70 )

github-actions bot commented Sep 1, 2023

E2E Queue Consumers Coverage Report

Totals Coverage
Statements: 72.5% ( 406 / 560 )
Methods: 63.38% ( 45 / 71 )
Lines: 74.53% ( 357 / 479 )
Branches: 40% ( 4 / 10 )

github-actions bot commented Sep 1, 2023

E2E SIMS API Coverage Report

Totals Coverage
Statements: 53.27% ( 3931 / 7379 )
Methods: 49.48% ( 472 / 954 )
Lines: 58.32% ( 3210 / 5504 )
Branches: 27.04% ( 249 / 921 )

Collaborator

@dheepak-aot dheepak-aot left a comment

Thanks for the explanations. I still need some clarification for my own understanding, which I will sync up on later.

@guru-aot guru-aot merged commit dcb86ec into main Sep 5, 2023
@guru-aot guru-aot deleted the #1875-PodDisruptionBudget branch September 5, 2023 15:19
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:31 — with GitHub Actions Inactive
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:34 — with GitHub Actions Inactive
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:34 — with GitHub Actions Inactive
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:34 — with GitHub Actions Inactive
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:34 — with GitHub Actions Inactive
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:37 — with GitHub Actions Inactive
@guru-aot guru-aot temporarily deployed to DEV September 5, 2023 15:37 — with GitHub Actions Inactive
Labels: Devops
Projects: None yet
5 participants