Merged
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/4800.yaml
@@ -1,3 +1,5 @@
kep-number: 4800
alpha:
approver: "@soltysh"
beta:
approver: "@soltysh"
51 changes: 30 additions & 21 deletions keps/sig-node/4800-cpumanager-split-uncorecache/README.md
@@ -20,6 +20,7 @@
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -261,7 +262,7 @@ The `prefer-align-cpus-by-uncorecache` feature will be enabled and tested indivi
- `full-pcpus-only`
- Topology Manager NUMA Affinity

The following CPU Topologies are representative of various uncore cache architectures and will be added to policy_test.go and represented in the unit testing.
The following CPU Topologies are representative of various uncore cache architectures and will be added to [policy_test.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_test.go) and represented in the unit testing.

- 1P AMD EPYC 7702P 64C (smt-on/off) NPS=1, 16 uncore cache instances/socket
- 2P AMD EPYC 7303 32C (smt-on/off) NPS=1, 4 uncore cache instances/socket
@@ -278,19 +279,25 @@ N/A. This feature requires a e2e test for testing.

##### e2e tests

- For e2e testing, checks will be added to determine if the node has a split uncore cache topology. If the node does not meet the requirement to have multiple uncore caches, the added tests will be skipped.
- e2e testing should cover the deployment of a pod that is following uncore cache alignment. CPU assignment can be determined by the podresources API and programmatically cross-referenced to sysfs topology information to determine proper uncore cache alignment.
- For e2e testing, guaranteed pods will be deployed with various CPU size requirements on our own baremetal instances across different vendor architectures, and the CPU assignments will be confirmed against the uncore cache core groupings. This feature is intended for baremetal only and not cloud instances.
- Update CI to test GCP instances of different architectures utilizing uncore cache alignment feature.

- [should update alignment counters when pod successfully run taking less than uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
- [should update alignment counters when pod successfully run taking a full uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
- [should not update alignment counters when pod successfully run taking more than a uncore cache group](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go):[SIG-node](https://testgrid.k8s.io/sig-node):[SIG-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
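The alignment check these tests rely on (CPU assignments cross-referenced against sysfs topology) can be sketched as follows. This is a minimal illustration, not the code used by the e2e tests: the function names are hypothetical, and it assumes the uncore cache is exposed as cache `index3` under `/sys/devices/system/cpu`, which is typical for L3 on x86 but should be confirmed against the `level` file on a given node.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8-11" into a set of CPU IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_uncore_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct cache groups as frozensets of CPU IDs.

    Assumption: index3 is the last-level (uncore) cache on this node.
    """
    groups = set()
    for path in Path(sysfs_root).glob("cpu[0-9]*/cache/index3/shared_cpu_list"):
        groups.add(frozenset(parse_cpu_list(path.read_text())))
    return groups


def is_uncore_aligned(assigned, groups):
    """True if every assigned CPU falls inside one single uncore cache group."""
    return bool(assigned) and any(assigned <= group for group in groups)
```

On a node, `assigned` would come from the CPU IDs reported for a container, and `groups` from `read_uncore_groups()`. A container whose CPU request exceeds one cache group can never satisfy this check, which matches the third test case above.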

### Graduation Criteria

#### Alpha

- Feature implemented behind a feature gate flag option
- E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
- Providing a metric to verify uncore cache alignment will be required to graduate to beta.
- Add unit test coverage
- Added metrics to cover observability needs
- Added e2e tests for metrics

#### Beta
Contributor

Earlier in the document in unit tests section you've listed new tests to be added, do we need to update that section with appropriate links?

Contributor

Similarly, were e2e tests added for this functionality? It seems this PR added some, can you update that section accordingly in that case?

Contributor

Regarding e2e tests, this work started pre-#5242. Let's add them.
We do have some e2e tests, but these only cover the metrics reporting.

Contributor

This comment of mine still holds. I don't see the test section filled in according to the template.

Contributor Author

Added e2e tests for metrics.
Additional e2e tests are needed and have been added to the beta graduation scope.


- Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
- Add test cases to ensure functional compatibility with existing CPUManager options
- Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with `prefer-align-cpus-by-uncorecache`
- Add E2E test coverage for feature

### Upgrade / Downgrade Strategy

@@ -330,13 +337,12 @@ you need any help or guidance.

Enabling this feature requires setting the `CPUManager` static policy in the kubelet configuration file, enabling the corresponding policy-options feature gate, and adding the policy option for uncore cache alignment.


###### How can this feature be enabled / disabled in a live cluster?

For `CPUManager`, switching from the `none` policy to the `static` policy cannot be done dynamically because of the `cpu_manager_state` file. The node needs to be drained and the policy checkpoint file (`cpu_manager_state`) needs to be removed before restarting the kubelet. This feature specifically relies on the `static` policy being enabled.

- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: `CPUManagerPolicyAlphaOptions`
- Feature gate name: `CPUManagerPolicyBetaOptions`
Contributor

Below in the question ###### Are there any tests for feature enablement/disablement? can you link those tests?

Contributor

In ###### What specific metrics should inform a rollback? I'm missing explicit metric(s) being called out.

Contributor

###### What steps should be taken if SLOs are not being met to determine the problem? is missing answer

Contributor

In ###### How can a rollout or rollback fail? Can it impact already running workloads? is there a possibility that a kubelet restart will fail after enabling this feature, if so what, and how to react to it?

Contributor

In ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? I suggest checking out https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md and answering that question using data from that file.

Contributor Author

Updated the PRR.
I might still not have a clear understanding of the SLO you are looking for. I originally mentioned latency, which seems to be the objective in the link provided, but other CPUManager policy option KEPs seem to mention tracking the provided metric for feature enablement.
Let me know if this is not what you were looking for and if you can provide more context.
Thanks

Contributor

In ###### What specific metrics should inform a rollback? I'm missing explicit metric(s) being called out.

we can use kubelet_container_aligned_compute_resources_count

Contributor

What steps should be taken if SLOs are not being met to determine the problem?

My take on this: resource allocation in the kubelet is done during the admission stage. This feature plugs into the resource allocation flow. And this feature is best-effort, so it can't cause failed admission. It can, however, cause more admission delay. The SLI here is the pod admission time, measured from the moment the kubelet begins admission to the end of the admission stage (captured by topology_manager_admission_duration_ms).

So the SLO can be "In default Kubernetes installation, 99th percentile per cluster-day <= X".
Meaning, the slowdown in the admission phase this feature causes, which contributes to the pod startup latency, should have an upper bound and should be a fraction of the admission time without this feature enabled. E.g., causing the admission time to take up to double the time would be bounded, but not acceptable.

We can refine further but this should be a good starting point.

- Components depending on the feature gate: `kubelet`
- [x] Other
- Describe the mechanism: Change the `kubelet` configuration to set a `CPUManager` policy of static then setting the policy option of `prefer-align-cpus-by-uncorecache`
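
As an illustration of the mechanism described above, the following sketch checks a kubelet configuration (already loaded into a dict) for the settings this option depends on. The function name is hypothetical; the keys `cpuManagerPolicy`, `cpuManagerPolicyOptions`, and `featureGates` are the KubeletConfiguration field names, and the check only flags the feature gate when it is explicitly turned off, since this sketch makes no assumption about its default.

```python
OPTION = "prefer-align-cpus-by-uncorecache"
GATE = "CPUManagerPolicyBetaOptions"


def uncore_alignment_problems(cfg):
    """Return reasons why this kubelet config would not enable the option."""
    problems = []
    # The option is only honored by the static CPUManager policy.
    if cfg.get("cpuManagerPolicy") != "static":
        problems.append("cpuManagerPolicy must be 'static'")
    # Policy options are string-valued, so the expected value is "true".
    options = cfg.get("cpuManagerPolicyOptions", {})
    if options.get(OPTION) != "true":
        problems.append(f"policy option {OPTION} is not set to 'true'")
    # Only an explicit 'false' is treated as a problem here.
    gates = cfg.get("featureGates", {})
    if gates.get(GATE) is False:
        problems.append(f"feature gate {GATE} is explicitly disabled")
    return problems


example = {
    "cpuManagerPolicy": "static",
    "cpuManagerPolicyOptions": {OPTION: "true"},
    "featureGates": {GATE: True},
}
print(uncore_alignment_problems(example))
```

An empty list means the three settings are consistent. It does not replace the drain and `cpu_manager_state` removal steps required when changing the policy.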
@@ -360,10 +366,9 @@ Feature will be enabled. Proper drain of node and restart of kubelet required. F

###### Are there any tests for feature enablement/disablement?

Option is not enabled dynamically. To enable/disable option, cpu_manager_state must be removed and kubelet must be restarted.
Unit tests will be implemented to test if the feature is enabled/disabled.
The e2e node serial suite can be used to test the enablement/disablement of the feature since it allows the kubelet to be restarted.

E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
Contributor

Do you have those tests already? Those are a requirement for beta promotion.

Contributor Author

Tests will need to be added. Added e2e test coverage to the beta scope.

Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
See [cpu_manager_metrics_test.go](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/cpu_manager_metrics_test.go)

### Rollout, Upgrade and Rollback Planning

@@ -373,12 +378,13 @@ This section must be completed when targeting beta to a release.

###### How can a rollout or rollback fail? Can it impact already running workloads?

Kubelet restarts are not expected to impact existing CPU assignments to already running workloads

This feature is a best-effort alignment of CPUs to uncore caches that requires a kubelet restart that must not affect running workloads. No changes needed to cpu_manager_state file.
A rollout may fail based upon existing workloads that create fragmented uncore caches on the node, potentially resulting in CPUset distribution across multiple caches based upon the CPU quantity requirements and the best-effort policy.
Metrics below can help the user track alignment, but a rollback will not help because the feature is not a strict alignment to uncore caches, but a best-effort to reduce shared uncore caches.

###### What specific metrics should inform a rollback?

Increased pod startup time/latency
The `kubelet_container_aligned_compute_resources_count` and `container_aligned_compute_resources_failure_count` metrics can be tracked to detect issues in cpuset allocation and to decide whether a rollback is necessary.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -397,7 +403,7 @@ Reference CPUID info in podresources API to be able to verify assignment.
###### How can an operator determine if the feature is in use by workloads?

Reference podresources API to determine CPU assignment and CacheID assignment per container.
Use proposed 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See [kubelet/metrics/metrics.go](https://github.com/kubernetes/kubernetes/blob/8f1f17a04f62ab64ebe4f0b9d7f5f799bf56a0d9/pkg/kubelet/metrics/metrics.go#L135).

###### How can someone using this feature know that it is working for their instance?

@@ -409,16 +415,17 @@ Reference podresources API to determine CPU assignment.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Measure the time to deploy pods under default settings and compare to the time to deploy pods with align-by-uncorecache enabled. Time difference should be negligible.
In default Kubernetes installation, 99th percentile per cluster-day <= X
This feature is best-effort and will not cause failed admission, but can introduce admission delay.
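
One way to evaluate an SLO of this shape is to compare the 99th-percentile admission duration with the option enabled against a baseline without it. The sketch below is illustrative only: the function names are hypothetical, the samples stand in for values derived from `topology_manager_admission_duration_ms`, and the `max_ratio` of 1.5 is a placeholder for the unspecified bound X, not a value proposed by this KEP.

```python
def percentile_99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed in integer arithmetic to avoid float rounding.
    rank = (len(ordered) * 99 + 99) // 100
    return ordered[rank - 1]


def admission_slo_met(baseline_ms, enabled_ms, max_ratio=1.5):
    """True if p99 admission time with the option on stays within
    max_ratio times the p99 measured without it (max_ratio is hypothetical)."""
    return percentile_99(enabled_ms) <= max_ratio * percentile_99(baseline_ms)
```

Expressing the bound as a ratio to the baseline keeps the check independent of node size and hardware, which vary widely across the topologies this feature targets.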

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- Metrics
- `topology_manager_admission_duration_ms`: measures the duration of the admission process performed by Topology Manager.
- `topology_manager_admission_duration_ms` can be used to determine pod admission time

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Utilized proposed 'container_aligned_compute_resources_count' in PR#127155 to be extended for uncore cache alignment count.
No.
Contributor

Nothing I can think of either


<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
@@ -526,6 +533,8 @@ For each of them, fill in the following information by copying the below templat

- 2024-08-27: The outlined sections were filled out.

- 2025-06-09: Submitted PR to promote feature to beta

## Drawbacks

N/A
6 changes: 3 additions & 3 deletions keps/sig-node/4800-cpumanager-split-uncorecache/kep.yaml
@@ -22,12 +22,12 @@ see-also:
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.33"
latest-milestone: "v1.34"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
@@ -38,7 +38,7 @@ milestone:
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: "CPUManagerPolicyAlphaOptions"
Contributor

At the bottom of this document there is metrics section that needs to be filled in.

- name: "CPUManagerPolicyBetaOptions"
components:
- kubelet
disable-supported: true