
AAW Infra: scale down general nodepool #1965

Closed
2 of 8 tasks
Souheil-Yazji opened this issue Sep 13, 2024 · 21 comments
Labels: area/engineering (Requires attention from engineering: focus on foundational components or platform DevOps), kind/feature (New feature or request), priority/soon

@Souheil-Yazji
Contributor

Souheil-Yazji commented Sep 13, 2024

Is your feature request related to a problem? Please link issue ticket

Let's look at scaling down the general nodepool to saturate the nodes better.
Currently, from observation in Grafana and Lens, I can see maximum utilization of roughly 30% for CPU and memory in terms of node resource usage.

Another issue is that most of the daemonsets that don't necessarily need to run on the general nodes still place a pod on each of them. This adds cost with no added value.

Describe the solution you'd like

  • Investigate the resource saturation on the average general nodepool node.
  • Investigate the daemonsets which deploy to general nodes
  • Identify possible cleanup for those daemonsets; we can use taints/tolerations to prevent them from scheduling pods to the general nodes
    --- Maybe for next sprint ---
  • Apply step 3
  • Review requests and limits: this is needed because the pod scheduler will always try to honor the requests (see the query sketch after this list)
  • Collect metrics on new resource utilization
  • Adjust the size of the nodes by changing the VMSS VM type
  • Save lots of money.
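
A possible starting point for the requests-and-limits review, assuming the kube-state-metrics request series and the cAdvisor usage series are both available in our Prometheus (a sketch only, not an existing dashboard query):

    # Two separate queries to compare side by side:
    # CPU requested vs. CPU actually used, summed per namespace, to spot over-requested workloads.
    sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
    sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))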

Describe alternatives you've considered

NA

Additional context

[two screenshots: node resource utilization]

@Souheil-Yazji added the kind/feature, area/engineering, and priority/soon labels on Sep 13, 2024
@jacek-dudek

Resource saturation per node on dev and prod clusters:
cluster: aaw-dev-cc-00-aks
nodepool name: general
machine type: Standard D8s v3 (8cores, 32GiB)
autoscaling: enabled
minimum node count: 0
maximum node count: 8

metrics sourced from azure dashboard:
time period: last 7 days
cpu metric: percentage of total cpu utilized on node
mem metric: percentage of memory working set utilized on node

node: vmss000000
uptime: 100%
cpu: avg of 10%
mem: 70%

node: vmss000001
uptime: 100%
cpu: avg of 30%
mem: 80%

node: vmss000005
uptime: 100%
cpu: avg 13% with a peak of 17%
mem: 95%

node: vmss000007
uptime: 100%
cpu: avg of 13% with a peak of 16%
mem:?

node: vmss00004r
uptime: 100%
cpu: avg of 30% with a peak of 36%
mem: avg of 105%

node: vmss00004s
uptime: 100%
cpu: avg of 16%
mem: 79%

node: vmss00004t
uptime: 100%
cpu: 12% ramping up to 18% for a day
mem: 53%

cluster: aaw-prod-cc-00-aks
nodepool name: general
machine type:
Cannot drill down to the nodepool level in Azure; my permissions seem to be misconfigured.
But from Grafana metrics I'm seeing 32 cores and 126 GiB of memory per node.

metrics sourced from grafana dashboard: Kubernetes/Compute Resources/Node (Pods)
time period: last 7 days
cpu metric: total cpu utilization on node in cores
mem metric: total mem utilization on node in Gibibytes

node: vmss00001c
cpu: avg of 0.75cores with peaks of 1.5cores of 32cores total
mem: avg of 4.7GiB of 126GiB total

node: vmss00001n
cpu: avg of 0.6cores with peaks of 1core of 32cores total
mem: avg of 33GiB of 126GiB total

node: vmss00001r
cpu: avg of 2.3cores of 32cores total
mem: avg of 8GiB of 126GiB total

node: vmss00001v
cpu: avg of 0.4cores with peaks of 1core of 32cores total
mem: avg of 21GiB of 126GiB total

node: vmss00001z
cpu: avg of 0.6cores with peaks of 1.5cores of 32cores total
mem: avg of 8GiB of 126GiB total

@jacek-dudek

Inspected the daemonsets deployed on the dev and prod clusters. The same daemonsets are deployed to every general node on both clusters. These are the daemonsets:

aad-pod-identity-nmi
csi-blob-node
fluentd-operator-fluentd-operator
azure-ip-masq-agent
azure-npm
cloud-node-manager
csi-azuredisk-node
csi-azurefile-node
istio-cni-node
kube-proxy
kube-prometheus-stack-prometheus-node-exporter
sysctl

@jacek-dudek

For my analysis I would like this metric: for a given node, the sum of CPU (or memory) resource requests over all pods hosted on that node at a given time. Currently browsing the existing Grafana dashboards for something like it. I will have a look at the Prometheus API to see how involved it would be to construct a new one.
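
A rough PromQL sketch of that metric, assuming the kube-state-metrics series kube_pod_container_resource_requests carries a node label in our setup (swap resource="cpu" for resource="memory" as needed):

    # Sum of container CPU requests per node. Note: this may also count pods in
    # Succeeded or Pending phase unless joined against kube_pod_status_phase.
    sum by (node) (
      kube_pod_container_resource_requests{resource="cpu"}
    )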

@jacek-dudek

jacek-dudek commented Nov 20, 2024

Summarizing which existing dashboards hosted on grafana are applicable to node utilization analysis:

General / Kubernetes / Compute Resources / Node (Pods)
  This is the most useful one.
  It displays cumulative cpu and memory utilization on a given node.
  But doesn't display cumulative resource requests so we can't compare easily.
  It has an accompanying table that lists actual use as a percentage of resource requests,
   but it's broken down by pods and only shows one set of values per pod, so it's a snapshot or an average.

These ones are not particularly useful for studying node utilization for the issues listed:
General / Node Utilization metrics
  no data
General / Pod Utilization metrics
  doesn't let me filter data by node
  doesn't sum actual resource usage
  doesn't sum resource requests
General / Namespace utilization metrics
  no data
  not clear what's being measured
General / Kubernetes / Scheduler
  no data
General / Kubernetes / Compute Resources / Workloads
  displays actual CPU and memory utilization by individual workload but doesn't sum them
General / Kubernetes / Compute Resources / Pods
  displays actual resource use and resource requests by pod but doesn't sum them
General / Kubernetes / Compute Resources / Namespace (Workloads)
  displays actual resource use and resource requests by namespace and workload but not by node
General / Kubernetes / Compute Resources / Cluster
  displays cumulative resource usage across the whole cluster
  doesn't break it down by node
  has a table that shows resource requests and actual use as a percentage of requests, but it's broken down by namespace, not node, and only shows one set of values, so a snapshot or an average.

@jacek-dudek

jacek-dudek commented Nov 20, 2024

Here are the resource request totals for pods on specified nodes (on the dev cluster), as a percentage of node resource capacity:
Values obtained from K9s using <insert command here for reproducibility>

  CPU and memory request totals of all pods deployed, by node.
  At least, that's for the pods running at the time I took the notes (10 PM).
  cluster: aaw-dev-cc-00-aks
  machine:      cpu:  mem:
  vmss000000    65%   67%
  vmss000001    65%   67%
  vmss000005    85%   92%
  vmss000007    68%   43%
  vmss00004r    87%   79%
  vmss00004s    65%   67%
  vmss00004t    19%   21%
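
A possible PromQL cross-check for these percentages, assuming kube_pod_container_resource_requests and kube_node_status_allocatable both expose a node label (a sketch, not the exact K9s computation):

    # Per-node CPU requests as a fraction of allocatable CPU (multiply by 100 for a percentage);
    # swap resource="cpu" for resource="memory" for the memory column.
    sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
    / on (node)
    max by (node) (kube_node_status_allocatable{resource="cpu"})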

@jacek-dudek

jacek-dudek commented Nov 25, 2024

I've been looking into Prometheus and Grafana to create a more effective node utilization dashboard. Here's a mockup of the two timeseries graphs that I want to have displayed:
[mockups: pod-distribution, node-utilization-comparison]

The first would give us an idea of how pods are distributed across the nodes in a nodepool as workloads are scaled up and down. It would indicate whether node utilization is constrained by pod counts.

The second would give a side-by-side comparison of selected nodes, tracking their resource utilization along with cumulative resource requests and resource limits. It would let us see how nodepools are scaled up and how new nodes are utilized.
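
For reference, the kind of panel queries I have in mind for the two graphs (a sketch; the node-name filter aks-general-.* is an assumption about our nodepool naming):

    # Mockup 1 (pod distribution): count of pods currently placed on each general-nodepool node (all phases).
    count by (node) (kube_pod_info{node=~"aks-general-.*"})

    # Mockup 2 (utilization comparison): summed CPU requests and limits per node, to plot
    # alongside the actual-usage series the existing Node (Pods) dashboard already shows.
    sum by (node) (kube_pod_container_resource_requests{resource="cpu", node=~"aks-general-.*"})
    sum by (node) (kube_pod_container_resource_limits{resource="cpu", node=~"aks-general-.*"})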

@Souheil-Yazji
Contributor Author

@Jose-Matsuda can you please follow up with @jacek-dudek on access to Grafana (dev & prod)

@jacek-dudek

Obtained editor permissions on Grafana. Created a new dashboard named "Node resource utilization - comparative view".
It polls the metrics kube_pod_container_resource_requests and kubelet_active_pods.
It's a work in progress. Two issues:
(1) The data I'm getting back looks suspect because there's very little changing over time.
(2) I need to figure out how to create drop-down boxes and tie them to variables in the queries so that I can select specific nodes to get info on.
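
For issue (2), a sketch of the Grafana pieces I expect to need: a dashboard variable populated from Prometheus, and panel queries that reference it (the variable name node is an assumption, not the final dashboard):

    # Grafana template variable "node" (Dashboard settings -> Variables, type: Query, Prometheus data source):
    label_values(kube_node_info, node)

    # Panel query using the variable, so the drop-down controls which nodes are plotted:
    sum by (node) (kube_pod_container_resource_requests{resource="memory", node=~"$node"})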

@jacek-dudek

Summary of resource utilization on the general nodepool, based on statistics obtained from existing dashboards:
CPU utilization average: approximately 5% of CPU capacity
memory utilization average: approximately 12% of memory capacity

Conclusion: I think we can safely scale down our virtual machine models from ones with 32 cores and 126 GiB of memory to models with 8 cores and 32 GiB of memory.

@Jose-Matsuda
Contributor

Jose-Matsuda commented Dec 3, 2024

Summary of resource utilization on the general nodepool, based on statistics obtained from existing dashboards: CPU utilization average: approximately 5% of CPU capacity; memory utilization average: approximately 12% of memory capacity

Conclusion: I think we can safely scale down our virtual machine models from ones with 32cores and 126GiB of memory to models with 8cores and 32GiB of memory.

We are already on machines with 8 cores and 32 GiB of memory as defined here (I also double-checked on the portal itself and yes, it's Standard_D8s_v3).

Do you think we can move to a Standard_D4ds_v5 or even Standard_D2ds_v5? I'm thinking for AAW dev we might be able to squeeze onto the 2-CPU, 8 GB machines, but possibly only after re-sizing the workloads.

I think on Zone dev we are already on 4-CPU machines, so getting onto those would be easy.

@EveningStarlight removed their assignment Dec 11, 2024
@Jose-Matsuda
Contributor

Summary of resource utilization on the general nodepool, based on statistics obtained from existing dashboards: CPU utilization average: approximately 5% of CPU capacity; memory utilization average: approximately 12% of memory capacity
Conclusion: I think we can safely scale down our virtual machine models from ones with 32 cores and 126 GiB of memory to models with 8 cores and 32 GiB of memory.

We are already on machines with 8 cores and 32 GiB of memory as defined here (I also double-checked on the portal itself and yes, it's Standard_D8s_v3).

Do you think we can move to a Standard_D4ds_v5 or even Standard_D2ds_v5? I'm thinking for AAW dev we might be able to squeeze onto the 2-CPU, 8 GB machines, but possibly only after re-sizing the workloads.

I think on Zone dev we are already on 4-CPU machines, so getting onto those would be easy.

@jacek-dudek what do you think about this comment? Since #1997 has been completed, we can make the Terraform change for AAW-dev and observe, similar to what's in #2002, except we watch how many machines are spun up, since at the end of the day that's what costs the $'s.

@jacek-dudek

I cannot seem to access that repo at the moment, Jose. I will go through the MS documentation for now.
On the Azure portal I still don't have access to the nodepool information for aaw-prod, and the same goes for accessing node info through k9s. That might be why that issue was still in review.

The way I was getting the CPU and memory stats for the nodes was by looking at the Kubernetes / Compute Resources / Node (Pods) dashboard in Grafana, selecting a single node, and selecting the max capacity dataset. That was displaying a line going across at roughly 32 cores. Similarly for memory, it looked like approximately 126 GiB. Any idea why it would display that?

@Jose-Matsuda
Contributor

@jacek-dudek good question, I took a look at the Grafana queries.
Taking cloudmainsys as an example:
[screenshot: dashboard capacity panel]

which I know is a D16:
[screenshot: VM size in the Azure portal]

It seems as if that sum doubles it.

Running just the base query of
(kube_node_status_capacity{cluster="", node=~"aks-cloudmainsys-14230113-vmss00001a", resource="cpu"})
[screenshot: query result]
will return the correct number.
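
If the doubling comes from duplicate series being added together (for example, the same metric scraped from more than one endpoint), one workaround for the dashboard would be to take the per-node max instead of the sum — a guess at the cause, not a confirmed fix:

    # Collapses duplicate kube_node_status_capacity series instead of summing them.
    max by (node) (
      kube_node_status_capacity{node=~"aks-cloudmainsys-14230113-vmss00001a", resource="cpu"}
    )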

@jacek-dudek

So, just eyeballing the resource utilization metrics in the Azure dashboard for the general nodepool, it seems the main issue now is that CPUs are underutilized on each node relative to memory consumption. So maybe a better choice now would be a memory-optimized machine, for example Standard_E2ps_v6. Specs and pricing for this one are:
[screenshot: Standard_E2ps_v6 specs and pricing]

@jacek-dudek

jacek-dudek commented Jan 7, 2025

I still don't understand why there are multiple nodes running per availability zone: all of them are still accepting pods, none have memory or disk pressure, and none are utilizing more than 50% of their CPU. Doesn't that mean the nodepool was scaled up while the existing nodes were sufficient to run the existing workloads? Here's the status of the two nodes running in availability zone 3:
[screenshots: general-nodepool-zone3-vm1, general-nodepool-zone3-vm2]

@Souheil-Yazji
Contributor Author

@jacek-dudek include the details on how you arrived at the conclusion that Standard_E2ps_v6 is better than the Standard_E2ds_v5 machine, including the tool used to perform this analysis.

@Souheil-Yazji
Contributor Author

I still don't understand why there are multiple nodes running per availability zone: all of them are still accepting pods, none have memory or disk pressure, and none are utilizing more than 50% of their CPU. Doesn't that mean the nodepool was scaled up while the existing nodes were sufficient to run the existing workloads? Here's the status of the two nodes running in availability zone 3

Sounds like this might be georedundancy, but this should be the case for dev clusters.

@jacek-dudek

Elaborating on how I chose the VM model:

I examined the resource utilization summary in the Azure portal for the general nodepool. I noticed that utilization is skewed toward memory, at around 65%, while CPUs are utilized at around 10%.
[screenshot: resource-utilization-snapshot]

I concluded that regardless of the VM size we ultimately choose, we would be better served by a memory-optimized model, as our workloads on this nodepool seem to be more memory intensive.

I used this vm selection tool to narrow down the VM models:
https://azure.microsoft.com/en-us/pricing/vm-selector/

The existing machines we're using are model Standard_D8s_v3, which has 8 cores and 32 GiB of memory. In that configuration, with a 1:4 ratio of cores to memory, we're observing the resource utilization mentioned above. The machines suggested further up in the issue comments, Standard_D4ds_v5 and Standard_D2ds_v5, offer the same 1:4 ratio of cores to memory.

The memory-optimized ones, including Standard_E2ds_v5 and Standard_E4ds_v5, offer a 1:8 ratio of cores to memory. I would expect the memory-optimized models to balance out the utilization of cores relative to memory while causing no other issues. All other features in the two series are the same, see below:
[screenshots: Ddsv5-VM-series, Edsv5-VM-series spec tables]

@jacek-dudek

jacek-dudek commented Jan 9, 2025

In conclusion, if we chose the Standard_D4ds_v5 model, we would have a budget of 4 cores and 16 GiB of memory per node. That would leave us under-budgeted on memory, and we would likely have to scale up another node or more. Whereas if we chose the Standard_E4ds_v5 model, we would have a budget of 4 cores and 32 GiB of memory per node. The memory utilization would hopefully remain about the same (I know memory can be used for paging when CPU is overutilized, but we still wouldn't be close to overutilizing the CPUs), while CPU utilization would be twice as good as it is now, and we wouldn't be scaling up additional nodes.
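
A rough sanity check of those budgets, using the ~65% memory / ~10% CPU figures from the Azure portal snapshot above and the current Standard_D8s_v3 size (8 cores, 32 GiB); back-of-the-envelope only:

    memory in use per node ≈ 0.65 × 32 GiB ≈ 21 GiB
    CPU in use per node ≈ 0.10 × 8 cores ≈ 0.8 cores

So ~21 GiB of working set would not fit in the 16 GiB of a D4ds_v5 without spilling pods onto additional nodes, but fits comfortably in the 32 GiB of an E4ds_v5, while 0.8 cores is comfortable on 4 cores either way. Scheduling follows requests rather than usage, so the requests-and-limits review mentioned earlier in this issue still matters.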

@Souheil-Yazji
Contributor Author

@jacek-dudek
Let's summarize this for the Elab and bring it up for the team to review, then we can proceed with making these changes on Monday.
