KEP-4800: Promote prefer-align-cpus-by-uncorecache CPUManager feature to beta (#5390)
Conversation
wongchar commented on Jun 9, 2025
- One-line PR description: Promoting CPUManager feature prefer-align-cpus-by-uncorecache to beta
- Issue link: Split L3 Cache Topology Awareness in CPU Manager #5109
- Other comments:
Welcome @wongchar!

Hi @wongchar. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
(force-pushed from c53c16d to 02d43d2)
/ok-to-test
(force-pushed from 723bade to f67d977)
ffromani left a comment
We need to fill in the beta graduation criteria. I can think of:
- tests to ensure compatibility with the other relevant cpumanager options
- tests to ensure and report the incompatibility with other relevant cpumanager options. How should this incompatibility surface?
- review whether we have missing features: is kubernetes/kubernetes#131850 a beta requirement?
(force-pushed from 7375c52 to 4af3b23)
Agreed, updated the graduation criteria.
(force-pushed from 71c843c to f68fee3)
soltysh left a comment
There are several comments that need to be addressed.
> - E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
> - Providing a metric to verify uncore cache alignment will be required to graduate to beta.
>
> #### Beta
Earlier in the document, in the unit tests section, you've listed new tests to be added; do we need to update that section with appropriate links?
Similarly, were e2e tests added for this functionality? It seems this PR added some; can you update that section accordingly in that case?
Regarding e2e tests, this work started pre-#5242. Let's add them.
We do have some e2e tests, but these only cover the metrics reporting.
This comment of mine still holds. I don't see the test section filled in according to the template.
Added e2e tests for metrics. Additional e2e tests are needed and have been added to the beta graduation scope.
> - [x] Feature gate (also fill in values in `kep.yaml`)
>   - Feature gate name: `CPUManagerAlphaPolicyOptions`
>   - Feature gate name: `CPUManagerBetaPolicyOptions`
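As context for the gate discussion: both the gate and the policy option are kubelet configuration knobs. A sketch of what enabling the option under a beta gate might look like; the gate spelling here follows the `CPUManagerPolicyBetaOptions` pattern from the KEP's kep.yaml, and exact field values may differ per release:

```yaml
# KubeletConfiguration fragment (sketch, not authoritative)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Gate guarding beta CPUManager policy options (assumed spelling)
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  # The option this KEP promotes to beta
  prefer-align-cpus-by-uncorecache: "true"
```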
Below, in the question ###### Are there any tests for feature enablement/disablement?, can you link those tests?
In ###### What specific metrics should inform a rollback?, I'm missing explicit metric(s) being called out.
###### What steps should be taken if SLOs are not being met to determine the problem? is missing an answer.
In ###### How can a rollout or rollback fail? Can it impact already running workloads?: is there a possibility that a kubelet restart will fail after enabling this feature? If so, what would cause it, and how should one react to it?
In ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?, I suggest checking out https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md and answering that question using data from that file.
Updated the PRR.
I might still not have a clear understanding of the SLO you are looking for. I originally mentioned latency, which seems to be the objective in the link provided, but other CPUManager policy option KEPs seem to track the provided metric for feature enablement.
Let me know if this is not what you were looking for and whether you can provide more context.
Thanks
In ###### What specific metrics should inform a rollback?, I'm missing explicit metric(s) being called out.
We can use `kubelet_container_aligned_compute_resources_count`.
###### What steps should be taken if SLOs are not being met to determine the problem?
My take on this: resource allocation in the kubelet is done during the admission stage. This feature plugs into the resource allocation flow, and it is best-effort, so it can't cause failed admission. It can, however, add admission delay. The SLI here is the pod admission time, measured from the moment the kubelet begins admission to the end of the admission stage (captured by `topology_manager_admission_duration_ms`).
So the SLO can be "In a default Kubernetes installation, 99th percentile per cluster-day <= X".
Meaning, the slowdown in the admission phase this feature causes, which contributes to pod startup latency, should have an upper bound and should be a fraction of the admission time without the feature enabled. E.g., causing the admission time to double would be bounded, but not acceptable.
We can refine this further, but it should be a good starting point.
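To make the proposed SLO check concrete: a Prometheus histogram such as `topology_manager_admission_duration_ms` exposes cumulative bucket counts, and the p99 check reduces to finding the smallest bucket upper bound that covers 99% of observations. A minimal sketch with made-up bucket data (not the kubelet's real bucket layout):

```python
# Hypothetical cumulative bucket counts as a Prometheus histogram would
# expose them: (upper bound in ms, cumulative observation count).
buckets = [(10, 940), (50, 985), (100, 997), (500, 1000)]

def percentile_upper_bound(buckets, q):
    """Smallest bucket upper bound covering the q-th quantile."""
    total = buckets[-1][1]          # total observations
    threshold = q * total           # observations that must fall at or below the bound
    for le, count in buckets:
        if count >= threshold:
            return le
    return float("inf")

p99 = percentile_upper_bound(buckets, 0.99)  # smallest bound covering 99%
```

An SLO check then compares this p99 against the agreed bound X, e.g. alerting when the p99 with the feature enabled exceeds some agreed fraction above the baseline.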
(force-pushed from cdf199d to 388cb73)
> #### Beta
>
> - Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
> - Add missing feature: sort uncore caches by largest quantity of available CPUs instead of numerical order
This was raised by me during the alpha stage, but as a discussion point. While this seemed (and seems) logical to me, I don't have any hard data to back the claim that it improves over the current algorithm.
I've read the KEP and the past conversations again. In the above comment I was partially wrong because I didn't remember the nature of the change; my bad. The main thing is that the KEP implies the aforementioned sorting, which the implementation doesn't do (see #5110 (comment)); incidentally, this seems to leave a potential optimization on the table. But the main point is the divergence between the implementation and the design, which we agreed to rectify (and this is what I was not remembering). I thus don't think this is a new feature or a change, but rather something between an optimization and a bugfix, so it totally makes sense in the context of the beta process.
> - Add test cases to ensure functional compatibility with existing CPUManager options
> - Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
> - Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
> - Add metric for uncore cache alignment and incorporate to E2E tests
Metrics and tests should be covered in kubernetes/kubernetes#130133. Of course it is possible there are gaps; if so, feel free to point them out.
> - Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
If we are up for benchmarking, this can be a testing ground to justify the sorting change. I'm not pushing for this, though; I'm just open to the option.
Helpful and welcome, but no longer necessary; please see the above comment.
> Rollout/rollback should not fail since the feature is hidden behind feature gates and will not be enabled by default.
> Enabling the feature will require the Kubelet to restart, introducing potential for kubelet to fail to start or crash, which can affect existing workloads.
> In response, drain the node and restart the kubelet.
Thanks for elaborating. I think this is an overly cautious stance. In our case the feature is best-effort, does not require changes to the cpumanager state file, has no dependency on extra fields (cadvisor has long provided the data we need), and a kubelet restart must not affect running workloads anyway. So the chances for a rollout or rollback to fail are admittedly non-zero, but reasonably tiny. This is especially true for rollback.
A rollout may "fail" in the sense that, the feature being best-effort, it can cause a workload to run without proper LLC alignment, because of internal resource fragmentation or the workload characteristics. This is nuanced: the workload will actually run (alignment is best-effort), but it won't enjoy the LLC alignment and thus the promised performance gain.
The metrics `container_aligned_compute_resources_count` and `container_aligned_compute_resources_failure_count` can notify the user about these "failures" so they can take corrective measures, but rollback will not help because of the best-effort nature of the change.
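As a concrete way to watch those counters: both appear in the kubelet's `/metrics` text exposition, and the failure ratio is what would drive corrective action. A sketch over a canned scrape; the label names here are illustrative assumptions, not necessarily the kubelet's exact ones:

```python
# Canned /metrics excerpt (labels are illustrative assumptions).
sample = """
container_aligned_compute_resources_count{boundary="uncore_cache"} 42
container_aligned_compute_resources_failure_count{boundary="uncore_cache"} 3
"""

def metric_value(text, name):
    """Return the first sample value for the exactly-named metric."""
    for line in text.splitlines():
        if line.startswith(name + "{") or line.startswith(name + " "):
            return float(line.rsplit(" ", 1)[1])
    return 0.0

aligned = metric_value(sample, "container_aligned_compute_resources_count")
failed = metric_value(sample, "container_aligned_compute_resources_failure_count")
failure_ratio = failed / (aligned + failed)
```

In practice one would scrape the kubelet endpoint (or query Prometheus) rather than a canned string; the point is only that a rising failure count signals workloads running without LLC alignment.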
> ###### What specific metrics should inform a rollback?
>
> Increased pod startup time/latency
> `AlignedUncoreCache` metric can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.
Generally correct, but please see above about the names of the metrics and the nature of the "rollback".
The metric name is not correct; it's a variable name. We need the externally visible name.
> ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
>
> Utilized proposed 'container_aligned_compute_resources_count' in PR#127155 to be extended for uncore cache alignment count.
> No.
Nothing I can think of either.
> ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
>
> Measure the time to deploy pods under default settings and compare to the time to deploy pods with align-by-uncorecache enabled. Time difference should be negligible.
> CPUset allocation should be on the fewest amount of uncore caches as possible on the node.
I think we can reuse some of the content we discussed in https://github.com/kubernetes/enhancements/pull/5390/files#r2144953827.
Updated the SLO, thanks for the explanation!
ffromani left a comment
sig-node review.
From the node perspective, the beta content matches the previous conversation and agreements (my memory needed a little boost):
- we will address the gap identified during the second alpha cycle
- we will address the gap identified from feedback (kubernetes/kubernetes#131850)
- we will add the necessary e2e test coverage
Hence, LGTM from the node perspective for beta1.
> - Feature implemented behind a feature gate flag option
> - E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
> - Providing a metric to verify uncore cache alignment will be required to graduate to beta.
> - Test cases created for feature
Let's clarify that we did unit tests, added relevant metrics, and added e2e tests for the metrics. It could read like:
- Added unit test coverage
- Added metrics to cover observability needs
- Added e2e tests for the metrics only
soltysh left a comment
Several comments still hold:
- missing test section filled in
- metric names and SLO
and a few minor, non-blocking nits. The above two will need to be addressed before this merges.
> # The following PRR answers are required at alpha release
> # List the feature gate name and the components for which it must be enabled
> feature-gates:
>   - name: "CPUManagerPolicyAlphaOptions"
At the bottom of this document there is a metrics section that needs to be filled in.
> Unit tests will be implemented to test if the feature is enabled/disabled.
> E2e node serial suite can be used to test the enablement/disablement of the feature since it allows the kubelet to be restarted.
>
> E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
Do you have those tests already? Those are a requirement for beta promotion.
Tests will need to be added. Added e2e test coverage to the beta scope.
> E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
> Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
> See PR#130133 (https://github.com/kubernetes/kubernetes/pull/130133)
Nit: can we just link to a place in the code, not the PR?
> Reference podresources API to determine CPU assignment and CacheID assignment per container.
> Use proposed 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
> Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
Same comment as above; please change that to a pointer to the code, not a PR.
soltysh left a comment
/approve
the PRR section
ffromani left a comment
/approve
/lgtm
My comments were addressed, so I think we can move forward.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, soltysh, wongchar

The full list of commands accepted by this bot can be found here. The pull request process is described here.