Merged
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/2902.yaml
@@ -1,3 +1,5 @@
kep-number: 2902
alpha:
approver: "@johnbelamaric"
beta:
approver: "@johnbelamaric"
118 changes: 114 additions & 4 deletions keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/README.md
@@ -9,7 +9,12 @@
- [Proposal](#proposal)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Compatibility with <code>full-pcpus-only</code> policy options](#compatibility-with-full-pcpus-only-policy-options)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
@@ -18,8 +23,11 @@
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->

@@ -75,7 +83,7 @@ When enabled, this will trigger the `CPUManager` to evenly distribute CPUs acros
### Risks and Mitigations

The risks of adding this new feature are quite low.
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyBetaOptions` feature gate (which is disabled by default).

| Risk | Impact | Mitigation |
| -------------------------------------------------| -------| ---------- |
@@ -112,10 +120,39 @@ If none of the above conditions can be met, resort back to a best-effort fit of

NOTE: The striping operation after all CPUs have been evenly distributed will be performed such that the overall distribution of CPUs across those NUMA nodes remains as balanced as possible.
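The balanced-split-with-remainder-striping idea described above can be sketched as follows. This is an illustrative simplification, not the kubelet's actual allocator (the real code also considers subsets of NUMA nodes and physical-core boundaries); `distributeEvenly` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"sort"
)

// distributeEvenly splits numCPUs across NUMA nodes as evenly as possible,
// striping any remainder one CPU at a time over the lowest-numbered nodes so
// the overall spread stays balanced. It reports false when an even split is
// not possible, in which case the allocator falls back to a best-effort
// packed fit.
func distributeEvenly(freeCPUsPerNode map[int]int, numCPUs int) (map[int]int, bool) {
	if len(freeCPUsPerNode) == 0 {
		return nil, false
	}
	nodes := make([]int, 0, len(freeCPUsPerNode))
	for n := range freeCPUsPerNode {
		nodes = append(nodes, n)
	}
	sort.Ints(nodes)

	base := numCPUs / len(nodes)
	remainder := numCPUs % len(nodes)

	alloc := make(map[int]int, len(nodes))
	for _, n := range nodes {
		want := base
		if remainder > 0 { // stripe one leftover CPU onto this node
			want++
			remainder--
		}
		if freeCPUsPerNode[n] < want {
			return nil, false // not enough free CPUs here for an even spread
		}
		alloc[n] = want
	}
	return alloc, true
}

func main() {
	// 6 CPUs over two NUMA nodes with 4 free CPUs each -> 3 + 3.
	fmt.Println(distributeEvenly(map[int]int{0: 4, 1: 4}, 6)) // map[0:3 1:3] true
}
```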

### Compatibility with `full-pcpus-only` policy options

| Compatibility | alpha | beta | GA |
| --- | --- | --- | --- |
| full-pcpus-only | x | x | x |

### Test Plan

We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20250205` - 85.5% of statements

##### Integration tests

Not Applicable as Kubelet features don't have integration tests. We use a mix of `e2e_node` and `e2e` tests.

##### e2e tests

Currently no e2e tests are present for this particular policy option. E2E tests will be added as part of Beta graduation.

The plan is to add e2e tests to cover the basic flows for cases below:
1. `distribute-cpus-across-numa` option is enabled: The test will ensure that the allocated CPUs are distributed across NUMA nodes according to the policy.
1. `distribute-cpus-across-numa` option is disabled: The test will verify that the allocated CPUs are packed according to the default behavior.
1. Test how this option interacts with `full-pcpus-only` policy option (and test for it enabled and disabled).

### Graduation Criteria

#### Alpha
@@ -149,7 +186,9 @@ No changes needed
###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: `CPUManagerPolicyAlphaOptions`
- Feature gate name: `CPUManagerPolicyBetaOptions`
- Components depending on the feature gate: `kubelet`
- [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `distribute-cpus-across-numa`
- Will enabling / disabling the feature require downtime of the control
@@ -161,14 +200,14 @@ No changes needed
###### Does enabling the feature change any default behavior?

No. In order to trigger any of the new logic, three things have to be true:
1. The `CPUManagerPolicyBetaOptions` feature gate must be enabled
1. The `static` `CPUManager` policy must be selected
1. The new `distribute-cpus-across-numa` policy option must be selected
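For illustration, the three conditions above correspond to a kubelet configuration along these lines (a minimal sketch of the relevant `KubeletConfiguration` fields, not a complete config):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
```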

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled by either:
1. Disabling the `CPUManagerPolicyBetaOptions` feature gate
1. Switching the `CPUManager` policy to `none`
1. Removing `distribute-cpus-across-numa` from the list of `CPUManager` policy options

@@ -182,12 +221,34 @@ No changes. Existing containers will not see their allocation changed. New contai

- A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

- A rollout or rollback can fail if the feature gate and the policy option are not configured properly, causing the kubelet to fail to start.

###### What specific metrics should inform a rollback?

As part of graduating this feature, we plan to add the `cpu_manager_numa_allocation_spread` metric to observe how CPUs are distributed across NUMA nodes.
An unexpected CPU distribution across NUMA nodes can indicate that a rollback is needed.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not Applicable. This policy option only affects pods that meet certain conditions and are scheduled after the upgrade. Running pods will be unaffected
by any change.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.

In addition to that, we can check the `cpu_manager_numa_allocation_spread` metric to determine how allocated CPUs are spread across NUMA nodes.

###### How can someone using this feature know that it is working for their instance?

In order to verify this feature is working, one should:
@@ -201,6 +262,31 @@ To verify the list of CPUs allocated to the container, one can either:
- `exec` into the container and run `taskset -cp 1` (assuming this command is available in the container).
- Call the `GetCPUS()` method of the `CPUProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider).

Also, we can check the `cpu_manager_numa_allocation_spread` metric. We plan to add this metric to track how CPUs are distributed across NUMA nodes,
with labels/buckets representing NUMA nodes (`numa_node=0`, `numa_node=1`, ..., `numa_node=N`).

With packed allocation (default, option off), the distribution should be concentrated mostly on `numa_node=0`, with a small tail onto `numa_node=1` (and possibly higher)
in cases of severe fragmentation. Users can compare this spread metric with the `container_aligned_compute_resources_count` metric to determine
if they are getting aligned packed allocation or just packed allocation due to implementation details.

For example, if a node has 2 NUMA nodes and a pod requests 8 CPUs (with no other pods requesting exclusive CPUs on the node), the metric would look like this:

```
cpu_manager_numa_allocation_spread{numa_node="0"} = 8
cpu_manager_numa_allocation_spread{numa_node="1"} = 0
```
When the option is enabled, we would expect a more even distribution of CPUs across NUMA nodes, with no sharp peaks as seen with packed allocation.
Users can also check the `container_aligned_compute_resources_count` metric to assess resource alignment and system behavior.

In this case, the metric would show:

```
cpu_manager_numa_allocation_spread{numa_node="0"} = 4
cpu_manager_numa_allocation_spread{numa_node="1"} = 4
```

Note: This example is simplified to clearly highlight the difference between the two cases. Existing pods may slightly skew the counts, but the general
trend of peaks and troughs will still provide a good indication of CPU distribution across NUMA nodes, allowing users to determine if the policy option
is enabled or not.
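The "peaks vs. even spread" reading of the metric can be reduced to a single number. A hypothetical helper (`spreadImbalance` is not an existing kubelet function) that takes per-NUMA-node allocated-CPU counts and returns the gap between the most- and least-loaded nodes:

```go
package main

import "fmt"

// spreadImbalance returns the difference between the most- and least-loaded
// NUMA nodes, given per-node allocated-CPU counts such as those exported by
// the planned cpu_manager_numa_allocation_spread metric. A large value
// suggests packed allocation; a value near zero suggests distributed
// allocation.
func spreadImbalance(counts map[string]int) int {
	first := true
	min, max := 0, 0
	for _, c := range counts {
		if first {
			min, max = c, c
			first = false
			continue
		}
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max - min
}

func main() {
	packed := map[string]int{"0": 8, "1": 0}
	distributed := map[string]int{"0": 4, "1": 4}
	fmt.Println(spreadImbalance(packed), spreadImbalance(distributed)) // 8 0
}
```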

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

There are no specific SLOs for this feature.
@@ -212,13 +298,20 @@ None

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes. As part of graduating this feature to Beta, we plan to add the `cpu_manager_numa_allocation_spread` metric
to provide data on how CPUs are distributed across NUMA nodes.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No. This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field,
which has been available since `v1alpha2`.

### Scalability

###### Will enabling / using this feature result in any new API calls?
@@ -249,9 +342,26 @@ This delay should be minimal.

No, the algorithm will run on a single `goroutine` with minimal memory requirements.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

No impact. The behavior of the feature does not change when API Server and/or etcd is unavailable since the feature is node local.

###### What are other known failure modes?

Because of the existing distribution of CPU resources across NUMA nodes, a distributed allocation might not be possible, e.g. if all available CPUs are
present on the same NUMA node.

In that case, we fall back to a best-effort fit, packing CPUs into NUMA nodes wherever they can fit.

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

- 2021-08-26: Initial KEP created
- 2021-08-30: Updates to fill out more sections, answer PRR questions
- 2021-09-08: Change feature gate from `CPUManagerPolicyOptions` to `CPUManagerPolicyExperimentalOptions`
- 2021-10-11: Change feature gate from `CPUManagerPolicyExperimentalOptions` to `CPUManagerPolicyAlphaOptions`
- 2025-01-30: KEP update for Beta graduation of the policy option
- 2025-02-05: KEP update to the latest template
@@ -2,39 +2,49 @@ title: CPUManager Policy Option to Distribute CPUs Across NUMA Nodes Instead of
kep-number: 2902
authors:
- "@klueska"
- "@swatisehgal" # For Beta graduation
owning-sig: sig-node
participating-sigs: []
status: implementable
creation-date: "2021-08-26"
last-updated: "2025-01-31"
reviewers:
- "@ffromani"
approvers:
- "@sig-node-tech-leads"
see-also:
- "keps/sig-node/2625-cpumanager-policies-thread-placement"
- "keps/sig-node/3545-improved-multi-numa-alignment/"
- "keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/"
- "keps/sig-node/4540-strict-cpu-reservation"
- "keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/"
- "keps/sig-node/4800-cpumanager-split-uncorecache/"
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.33"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.23"
beta: "v1.33"
stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: "CPUManagerPolicyAlphaOptions"
- name: "CPUManagerPolicyBetaOptions"
components:
- kubelet
disable-supported: true

# The following PRR answers are required at beta release
metrics:
- cpu_manager_numa_allocation_spread