AAW Dev: Re-size workloads scheduled on system nodepools #1992

Closed
Jose-Matsuda opened this issue Nov 27, 2024 · 6 comments

@Jose-Matsuda
Contributor

Jose-Matsuda commented Nov 27, 2024

Take the pod workloads that you can see on the system nodepools and, after consulting Grafana over an extended period of time, suggest and apply new workload sizes.

It would be nice to have a table kind of like this

| Resource            | Current CPU | Suggested CPU | Current Mem | Suggested Mem |
|---------------------|-------------|---------------|-------------|---------------|
| toleration-injector | 500m        | 1m            | 128Mi       | 20Mi          |

Follow-up issue for the general nodepool: #1997
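
To fill in the "Current" columns, something like the sketch below could dump each system-node pod's requests using the Kubernetes Python client. The `agentpool=system` label selector is an assumption about how the system nodepool is labelled in aaw-dev; adjust it as needed.

```python
# Sketch: print current CPU/memory requests for every pod scheduled on
# the system nodepool, to populate the "Current" columns of the table.
# Assumes kubectl access to the cluster and an agentpool=system node label.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Find the nodes that belong to the system nodepool (label is an assumption).
system_nodes = {
    n.metadata.name
    for n in v1.list_node(label_selector="agentpool=system").items
}

print(f"{'namespace/pod':60} {'cpu req':>10} {'mem req':>10}")
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name not in system_nodes:
        continue
    for c in pod.spec.containers:
        req = (c.resources.requests or {}) if c.resources else {}
        print(f"{pod.metadata.namespace + '/' + pod.metadata.name:60} "
              f"{req.get('cpu', '-'):>10} {req.get('memory', '-'):>10}")
```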

@jacek-dudek

I created Excel tables of the CPU and memory utilization averages for all pods on the system nodes in aaw-dev.
I used the dashboard named General / Kubernetes / Compute Resources / Node (Pods).
The averages are evaluated over the last 7 days.
I'm resolving some formatting issues and will post the tables shortly.
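
For reference, 7-day averages like the ones behind that dashboard can also be pulled straight from the Prometheus HTTP API. The sketch below is one way to do it; the Prometheus URL and the node-name pattern are placeholders, and metric label names can differ depending on the scrape configuration.

```python
# Sketch: query Prometheus for 7-day average CPU and memory usage per pod
# on the system nodes. PROM and NODE_RE are assumptions; adjust for aaw-dev.
import requests

PROM = "http://localhost:9090"   # e.g. reached via `kubectl port-forward`
NODE_RE = "aks-system-.*"        # assumed system node name pattern

QUERIES = {
    "cpu (cores, 7d avg)":
        f'sum by (namespace, pod) '
        f'(rate(container_cpu_usage_seconds_total{{node=~"{NODE_RE}", container!=""}}[7d]))',
    "memory (bytes, 7d avg)":
        f'sum by (namespace, pod) '
        f'(avg_over_time(container_memory_working_set_bytes{{node=~"{NODE_RE}", container!=""}}[7d]))',
}

for label, query in QUERIES.items():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        m = sample["metric"]
        print(f'{label}: {m.get("namespace")}/{m.get("pod")} = {sample["value"][1]}')
```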

@jacek-dudek

Here are the tables for CPU and memory usage of pods running on system nodes, along with suggested requests:
resource-utilization-on-aaw-dev-system-nodes.xlsx

@jacek-dudek

Currently tracking down the parent objects and manifests corresponding to all the pods listed in the tables.

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented Dec 3, 2024

CPU-wise, the sum of all the CPU requests in the column comes to 3.48 vCPU, which would fit on two system nodes sized as D2s (2 vCPUs each). We still want to better size the requests, of course, but I will note an inaccuracy with the toleration-injector: it only has limits and no requests, which means its pods will schedule wherever and won't reserve the stated 0.5 CPU. After removing that, the actual vCPU requested comes to 2.98, and the same may be true of some others.

We also need to be careful to make sure the memory requests are honoured as well, since if we move to D2s we only get 8 GiB of memory per node.
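
A rough way to reproduce these totals from the live cluster is to sum the requests of every pod scheduled on the system nodes and flag any container whose pod spec carries limits without requests. A minimal sketch, again assuming an `agentpool=system` node label, follows.

```python
# Sketch: sum CPU/memory requests reserved on the system nodes and list
# containers that set limits but no requests in their pod spec.
from kubernetes import client, config
from kubernetes.utils import parse_quantity

config.load_kube_config()
v1 = client.CoreV1Api()

system_nodes = {
    n.metadata.name
    for n in v1.list_node(label_selector="agentpool=system").items
}

cpu_total, mem_total = 0, 0
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name not in system_nodes:
        continue
    for c in pod.spec.containers:
        res = c.resources
        reqs = (res.requests or {}) if res else {}
        lims = (res.limits or {}) if res else {}
        if lims and not reqs:
            print(f"limits but no requests: {pod.metadata.namespace}/{pod.metadata.name}/{c.name}")
        cpu_total += parse_quantity(reqs.get("cpu", "0"))
        mem_total += parse_quantity(reqs.get("memory", "0"))

print(f"total CPU requested on system nodes: {cpu_total} cores")
print(f"total memory requested on system nodes: {mem_total / 2**30:.2f} GiB")
```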

@Jose-Matsuda
Contributor Author

Took a bit of time to go over the pods listed in Jacek's Excel file and made a few notes:

Daemonsets you can easily influence

statcan-system/sysctl, azure-blob-csi-system/csi-blob-node: both of these are already well-sized, though (a patch sketch for adjusting requests follows after this list)

Daemonsets deployed via helm

I'm not too sure about these, as some may just be deployed by CNS:
kube-prometheus-stack-prometheus-node-exporter, fluentd-operator-fluentd-operator, aad-pod-identity-nmi

Daemonsets with no traceable owner (might just be CNS / AKS)

azure-ip-masq-agent, azure-npm, cloud-node-manager, csi-azuredisk-node, csi-azurefile-node, istio-cni-node, kube-proxy
These daemonsets don't need to be touched or modified


Deployments that you don't need to change (no requests)

cert-manager-anything, kube-prometheus-stack-kube-state-metrics, statcan-system/toleration-injector (only has limits)

Likely don't have the ability to change

coredns(100m,70Mi), coredns-autoscaler(20m,10Mi), konnectivity-agent(20,20), metrics-server(5m,30Mi),

Resources deployed by helm

gatekeeper-audit(100m, 1546Mi), gatekeeper-controller-manager(100m, 1546Mi),

Argo resources

statcan-system/sidecar-terminator(10m,200M)
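
For the daemonsets and deployments we can directly influence, the resize itself is just a change to `resources.requests` on the owning object. Below is a minimal sketch of a live patch via the Kubernetes Python client; the target name, container name, and values are hypothetical placeholders, and in practice the change would go through the source manifests rather than a live patch.

```python
# Hypothetical example: bump the requests on a daemonset we control.
# All names and values below are placeholders, not taken from aaw-dev.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "example-container",   # matched by name (strategic merge)
                    "resources": {
                        "requests": {"cpu": "10m", "memory": "32Mi"},
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_daemon_set(
    name="example-daemonset",
    namespace="statcan-system",
    body=patch,
)
```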

@jacek-dudek

My discovery work so far: identified four ArgoCD-managed workloads and a bunch of AKS-managed workloads. Not sure about the Helm-managed ones yet.
object-hierarchy-and-labels-and-annotations.xlsx
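
One way to automate part of this discovery is to walk each pod's ownerReferences up to its top-level controller and check common manager hints (Helm's `app.kubernetes.io/managed-by` label, Argo CD's `argocd.argoproj.io/instance` label). The sketch below is only a starting point; label conventions vary across charts and apps, so the output still needs manual review.

```python
# Sketch: report each pod's top-level owner and a best-guess "managed by".
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
apps = client.AppsV1Api()

def top_owner(namespace, kind, name):
    """Follow ownerReferences from a ReplicaSet up to its Deployment."""
    if kind == "ReplicaSet":
        rs = apps.read_namespaced_replica_set(name, namespace)
        for ref in rs.metadata.owner_references or []:
            return ref.kind, ref.name
    return kind, name

for pod in v1.list_pod_for_all_namespaces().items:
    refs = pod.metadata.owner_references or []
    if not refs:
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: no owner (static/mirror pod?)")
        continue
    kind, name = top_owner(pod.metadata.namespace, refs[0].kind, refs[0].name)
    labels = pod.metadata.labels or {}
    manager = labels.get("app.kubernetes.io/managed-by") \
        or ("Argo CD" if "argocd.argoproj.io/instance" in labels else "unknown")
    print(f"{pod.metadata.namespace}/{pod.metadata.name}: owner={kind}/{name}, managed-by={manager}")
```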
