
KEP-5526: Pod Level Resource Managers 1.36 #5775

Merged

k8s-ci-robot merged 1 commit into kubernetes:master from KevinTMtz:pod-level-resource-managers-1-36 on Feb 10, 2026

Conversation

@KevinTMtz
Member

@KevinTMtz KevinTMtz commented Jan 8, 2026

This KEP change does not modify the originally proposed behavior; it only adds specifics that surfaced during implementation of the feature.

The most significant additions are a new property in the CPU and Memory Manager state to track the pod-level exclusive resource allocation (which requires a PRR review perspective), and the introduction of a new term, ResourceIsolationLevel, which describes whether a container's resources belong to the container, the pod, or the node (currently used only by the CPU Manager to handle CPU quota enablement/disablement).
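
A minimal sketch of what such an internal classification could look like; the type and constant names here are illustrative assumptions, not the merged implementation:

```go
package cpumanager

// ResourceIsolationLevel is an internal kubelet concept (not part of any
// user-facing API) describing which scope a container's resources are drawn
// from. Names are illustrative assumptions only.
type ResourceIsolationLevel int

const (
	// The container has its own exclusive allocation.
	ContainerIsolated ResourceIsolationLevel = iota
	// The container runs within the pod-level allocation (the pod shared pool).
	PodIsolated
	// The container runs in the node's shared pool.
	NodeShared
)

// cpuQuotaEnabled sketches how the CPU Manager could consult the level when
// deciding CPU quota handling; the policy below is a placeholder, the actual
// rules are defined in the KEP.
func cpuQuotaEnabled(level ResourceIsolationLevel) bool {
	return level == NodeShared
}
```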

  • One-line PR description: Pod Level Resource Managers update to mainly add implementation specifics

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 8, 2026
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 8, 2026
@KevinTMtz
Member Author

cc @pravk03 @swatisehgal

Added a glossary with the terms that may cause confusion; refer to the previous comments in the original PR.

@KevinTMtz KevinTMtz changed the title KEP-5526: Pod Level Resource Managers KEP-5526: Pod Level Resource Managers update for 1.36 Jan 8, 2026
@KevinTMtz KevinTMtz force-pushed the pod-level-resource-managers-1-36 branch from 947d142 to 729da77 Compare January 9, 2026 00:50
@KevinTMtz
Member Author

/assign @ndixita

@KevinTMtz
Member Author

/assign @ffromani

@KevinTMtz
Member Author

/assign @pravk03

@KevinTMtz
Member Author

KevinTMtz commented Jan 12, 2026

/assign @swatisehgal

to the container's cgroup.

6. **[New Logic] Container and Pod Removal:** When a container is removed, its
individual assignment is deleted from the state. The pod-level resource

Contributor

Are the container resources also released back to the pod's shared pool?

Member Author

No; because of the static behavior of the CPU and Memory managers, the podSharedPool remains the same until pod termination. I rewrote that section to make this clearer.

Contributor

This behavior as written looks reasonable for alpha. If it ever becomes undesirable, we would need a clear use case to think about further changes.

Member

+1 to keep the simplicity for alpha.

@KevinTMtz KevinTMtz force-pushed the pod-level-resource-managers-1-36 branch from 729da77 to bc51a40 Compare January 12, 2026 22:13
@KevinTMtz KevinTMtz changed the title KEP-5526: Pod Level Resource Managers update for 1.36 KEP-5526: Pod Level Resource Managers 1.36 Jan 12, 2026
@KevinTMtz KevinTMtz mentioned this pull request Jan 13, 2026
@jpbetz
Contributor

jpbetz commented Jan 15, 2026

PRR looks good for alpha. The PRR approval is already codified by the approval last release (https://github.com/KevinTMtz/enhancements/blob/pod-level-resource-managers-1-36/keps/prod-readiness/sig-node/5526.yaml), so no other approval steps are required for PRR.

@KevinTMtz KevinTMtz force-pushed the pod-level-resource-managers-1-36 branch from bc51a40 to c43a33d Compare January 20, 2026 21:46

Contributor

@ffromani ffromani left a comment

The KEP changes look reasonable and match our conversation in the PR.
The only point worth double-clicking on is the state file changes and the backward/forward upgrade and rollback flows.
While what we have is a good start, I'd like to make sure we are on par with the existing quality standards, and to see if there are cheap and reasonable ways to improve. For example: is it time (and is it worth it?) to finally begin adding an explicit version to the checkpoint file, using a field, encoding it in the filename, or both?

I'll think about it keeping practicality in mind - it would be very unfair to charge these improvements to this KEP, unless they are very cheap, like wrapping new fields in a struct as we already did. I'll pause reviews momentarily to let other reviewers chime in.

state information associated with a pod-level resource allocation. While its
initial implementation only encapsulates the overall `CPUSet` assigned to the
pod (`CPUSet cpuset.CPUSet`), using a struct instead of a raw type is a
strategic choice for future-proofing. It allows for the seamless addition of new

Contributor

Yes, it is meant to be forward compatible. The deserialization code in older kubelets will just ignore unknown fields in the struct.
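
To make the discussion concrete, here is a sketch of the kind of wrapper struct being described; only the `CPUSet` field comes from the quoted KEP text, the commented-out field is a hypothetical future addition, and serialization details are omitted:

```go
package state

import "k8s.io/utils/cpuset"

// PodEntry wraps the pod-level CPU assignment in the CPU Manager state.
// Wrapping the raw cpuset in a struct lets new fields be added later without
// reshaping the state file schema.
type PodEntry struct {
	CPUSet cpuset.CPUSet
	// Hypothetical future additions could live alongside it, e.g.:
	// TopologyHint string
}
```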

Comment on lines 1148 to 1151
To achieve this, the CPU Manager performs a check to determine the
`ResourceIsolationLevel` for each container. This logic evaluates the
container's resource assignment and its eligibility for exclusivity to
distinguish between:

Contributor

IIRC we decided to keep this concept internal to the kubelet (which is fine). Just add a note that this is an internal concept and not part of the user-facing API.

Member Author

Added note, thank you.

Comment on lines +1236 to +1241
The `PodEntry` struct encapsulates the pod-level memory assignment, currently
containing the overall `MemoryBlocks` slice assigned to the pod. As with the CPU
Manager, using a struct here is a forward-looking design choice. It allows for
the future inclusion of additional metadata—such as the original topology hint
or request details—without disrupting the state file schema or requiring complex
migrations.

Contributor

LGTM. I need another pass to make sure I'm not neglecting anything, because the memory manager is the less actively worked-on resource manager, so details fade quickly.
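
For illustration, a comparable sketch for the Memory Manager side; the shapes below are assumptions based on the quoted text, not the merged code:

```go
package state

import v1 "k8s.io/api/core/v1"

// Block mirrors the kind of per-NUMA memory assignment the Memory Manager
// tracks (fields are illustrative).
type Block struct {
	NUMAAffinity []int
	Type         v1.ResourceName // e.g. "memory" or "hugepages-2Mi"
	Size         uint64
}

// PodEntry wraps the pod-level memory assignment. Starting with a struct
// rather than a bare []Block leaves room for future metadata, such as the
// original topology hint, without breaking the state file schema.
type PodEntry struct {
	MemoryBlocks []Block
}
```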

Comment on lines +1676 to +1680
* **Mitigation (Forward Compatibility):** The changes to the state file
are additive (new fields only), preserving the overall format. However,
because older Kubelets cannot gracefully handle these new fields, the
only safe rollback path involves draining the node and deleting the
state file before restarting the older Kubelet version.

Contributor

We discussed a more compatible approach with multiple checkpoints, but that made the code much more complex without obvious compatibility gains. Looking at the big picture, this seems to be the sweet spot between simplicity and robustness.

Comment on lines +1664 to 1667
* **Mitigation (Backward Compatibility):** The new Kubelet version will be
implemented to transparently read the old state file format. When it
admits a new pod using this feature, it will write the new pod-aware
entries, ensuring a seamless upgrade path without manual intervention.

Contributor

This is easy and cheap; it should be no bother.
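
As a small illustration of why reading the old format is cheap (field names here are assumptions; the real checkpoint types differ):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint sketches a state file where pod-level entries were added later.
// Old files simply omit the new field, so it decodes to a nil map and the
// kubelet can treat it as "no pod-level allocations recorded".
type checkpoint struct {
	Entries    map[string]string `json:"entries"`
	PodEntries map[string]string `json:"podEntries,omitempty"` // new, additive
}

func main() {
	oldFile := []byte(`{"entries":{"container-a":"1-2"}}`)
	var c checkpoint
	if err := json.Unmarshal(oldFile, &c); err != nil {
		panic(err)
	}
	fmt.Println(c.Entries, c.PodEntries == nil) // map[container-a:1-2] true
}
```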

compatible with the changes presented in this KEP. Standard init containers will
be allocated exclusive slices from the pod's managed resource pool. Because they
run sequentially, their resources are made reusable for subsequent app
containers. Restartable init containers (sidecars) that are `Guaranteed` will

Contributor

nit: We could add that sidecar containers won't yield their resources, as they are intended to run for the entire duration of the pod's life.

Member Author

Done, reworded to be clearer about it.

@ndixita
Contributor

ndixita commented Jan 26, 2026

lgtm for the partitioning logic and overall design.

Pending items to review from my end: the CPU Quota management section and the PRR questionnaire, to be done by EOD.

@KevinTMtz KevinTMtz force-pushed the pod-level-resource-managers-1-36 branch from b909fd0 to 44e13e4 Compare January 27, 2026 00:18
@ndixita
Contributor

ndixita commented Jan 28, 2026

Took another pass and lgtm!

In a follow up PR, let's add:

  1. A small note about topology manager in-memory state being cleared when a container is removed.
  2. Although this KEP isn't about supporting PLR + IPPR, for Beta we will need an e2e test that covers pod resizes on a node with resource managers enabled. Let's please add this as a todo for Beta.

@KevinTMtz KevinTMtz force-pushed the pod-level-resource-managers-1-36 branch from 44e13e4 to b30db39 Compare January 28, 2026 21:22

Contributor

@ffromani ffromani left a comment

/lgtm

My comments have been addressed; the KEP is in good shape and I don't have any low-hanging fruit for improvements.

[EDIT] I still want to deep-dive into the memory manager fine details, but there are no red (or yellow) flags in the changes here, so I prefer to move forward.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2026
@ffromani
Contributor

1. A small note about topology manager in-memory state being cleared when a container is removed.

Regarding this point, research in the GitHub history and in the KEP has been inconclusive so far. The most credible theory, stemming from
kubernetes/kubernetes#74357, is that the lifecycle hooks were wired following the template of the cpumanager. Note that the cpumanager was later removed from that hook to prevent exclusive CPU loss on container restart (issue link pending). Therefore, there is a fair chance we keep the affinity setting in the topology manager in-memory state longer than necessary.

I'd welcome further investigation in this area but we need some more hard data.

@KevinTMtz
Member Author

Took another pass and lgtm!

In a follow up PR, let's add:

  1. A small note about topology manager in-memory state being cleared when a container is removed.
  2. Although this KEP isn't about supporting PLR + IPPR, for Beta we will need an e2e test that covers pod resizes on a node with resource managers enabled. Let's please add this as a todo for Beta.

For context, item 1 ("A small note about topology manager in-memory state being cleared when a container is removed") is relevant because of this discussion in kubernetes/kubernetes#134345 (comment), regarding RestartAllContainers.

@KevinTMtz KevinTMtz force-pushed the pod-level-resource-managers-1-36 branch from b30db39 to 42f5e85 Compare February 2, 2026 20:17
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2026
@KevinTMtz
Member Author

Amended the commit with the required updates to kep.yaml; no new changes to the KEP README.md file.

@ffromani
Contributor

ffromani commented Feb 3, 2026

/lgtm

Amended the commit with the required updates to kep.yaml; no new changes to the KEP README.md file.

Confirmed by inter-diff

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2026
exclusive allocations (for both `Guaranteed` sidecars and `Guaranteed`
application containers) have been reserved. This ensures that shared sidecars do
not interfere with the exclusive resources of application containers once they
start.

Member

+1 on podSharedPool. This allows us to treat a Pod as a single NUMA unit while allowing internal flexibility, and we still maintain backward compatibility at the container level.

@pravk03 for vis.

Unset | Set | Current behavior (alignment and exclusive allocation based on individual container resources).
Set | Unset | No alignment or exclusive allocation. Containers run in the node's shared pool.
Set | Set | Alignment and exclusive allocation for containers that specify `Guaranteed` resources; pod-level resources are ignored for allocation decisions.

Member

Can you help me understand the difference between this category (pod is set and container is set) and the one above (pod is unset and container is set) when topology-manager-scope is set to container?

Member Author

When only container-level resources are used (pod is unset and container is set) and exclusive resources are wanted, all the containers must have Guaranteed QoS, which requires each of them to have equal requests and limits. In summary, it is an all-or-nothing situation: either all containers receive exclusive resources, or none of them do.

When pod-level resources are used (pod is set and container is set), Guaranteed QoS is no longer required for each container, only for the overall pod, but the containers that need exclusive resources must still individually be Guaranteed. As an example, this enables a hybrid model: a pod with Guaranteed QoS determined by its pod-level resources and three containers, only one of which is Guaranteed. Only the Guaranteed container obtains exclusive resources, while the non-Guaranteed containers access the shared node resources constrained by the pod-level limit.
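
For illustration, a hypothetical pod matching the hybrid model just described; names and values are made up, and this assumes the pod-level `spec.resources` field from the Pod Level Resources feature:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// guaranteed builds equal requests and limits, i.e. a Guaranteed-style spec.
func guaranteed(cpu, mem string) corev1.ResourceRequirements {
	rl := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse(cpu),
		corev1.ResourceMemory: resource.MustParse(mem),
	}
	return corev1.ResourceRequirements{Requests: rl, Limits: rl}
}

func main() {
	podResources := guaranteed("4", "4Gi") // pod-level requests == limits => pod is Guaranteed
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			Resources: &podResources, // pod-level resources (PLR)
			Containers: []corev1.Container{
				{
					Name: "exclusive-worker",
					// Integer CPU request == limit => eligible for exclusive CPUs.
					Resources: guaranteed("2", "2Gi"),
				},
				// No container-level resources: these share whatever is left of
				// the pod budget (the pod shared pool).
				{Name: "helper-1"},
				{Name: "helper-2"},
			},
		},
	}
	fmt.Println(pod.Spec.Resources.Limits.Cpu().String()) // "4"
}
```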

Member

Thanks for the detailed explanation. It makes sense to me, but can you add an example below to reflect this difference?


###### Special configurations

**Underutilization of Pod-Level Resources**

Member

This is by design, but please ensure there is a warning log message for users.

Contributor

@ndixita ndixita Feb 10, 2026

This log message change would then be part of the PLR implementation, @KevinTMtz. We can submit it as a follow-up PR, as it wouldn't affect this KEP in particular.

sidecar behavior, if a restartable init container is individually
`Guaranteed`, it is allocated an exclusive slice that is reserved for
the entire pod lifecycle and is *not* added to the reusable set.
- **Standard Init Containers:** If individually `Guaranteed`, they are

Member

In this case, shouldn't we have a pod resource underutilization issue? Since standard init containers run sequentially, and before any restartable sidecar container, can we change the logic to allow them to burst to the entire pod's resources?

Member Author

Guaranteed init containers get exclusive resources corresponding to their request and limit.

Non-Guaranteed init containers are granted access to the entire pod-level resource budget minus any resources exclusively allocated to Guaranteed sidecars running at that time. So yes, they are able to burst to whatever is not exclusively allocated in the pod.
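
A back-of-the-envelope illustration of that budget; the function and the millicpu units are purely illustrative, not kubelet code:

```go
package main

import "fmt"

// burstBudget sketches how much CPU a non-Guaranteed init container could use
// under the behavior described above: the pod-level limit minus any CPUs
// exclusively allocated to Guaranteed sidecars running at that time.
func burstBudget(podLimitMilli, exclusiveSidecarMilli int64) int64 {
	return podLimitMilli - exclusiveSidecarMilli
}

func main() {
	// Example: pod-level limit of 4 CPUs, one Guaranteed sidecar holding 1 CPU
	// exclusively => a non-Guaranteed init container may burst up to 3 CPUs.
	fmt.Println(burstBudget(4000, 1000)) // 3000
}
```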

Member

My point is: once pod-level resources are set, can we allow standard init containers to burst to the entire pod resource budget and ignore their own resource requests and limits, since they run sequentially anyway?

Member Author

That behavior can be obtained by not specifying a request and limit for the init containers. I think it is better to keep consistency and respect the expected behavior of any container-level limit the user has set; if the user wants a container to burst into the available pod resources, the pod shared pool takes care of that when no container-level limit is specified.

Member

sgtm. Let's document the best practice for this.

1. The `topology-manager-scope` is set to `pod`.
2. The sum of resource requests from all containers that are `Guaranteed`
exactly equals the total resource budget defined in `pod.spec.resources`.
3. There is at least one other container in the pod that does not qualify for

Member

What about ephemeral container(s)? Today ephemeral containers borrow node/pod resources; can we explicitly document this? With topology-manager-scope in the picture, should we allow them to use the pod shared resources, or ...?

Member Author

Ephemeral containers are not offered any alignment or exclusive allocation by the resource managers, since they lack any guarantees for resources or execution. I think it is better to keep them separate as they currently are and not give them any special handling. Do you have any specific concerns about ephemeral containers and resource managers?

Member

No, I don't have many concerns. I just want us to document it clearly in the design and the user docs.

Contributor

@ndixita ndixita Feb 10, 2026

@dchen1107 thanks for bringing this up. I have a few questions now that I missed earlier.

Consider a case: a pod with PLR set has 1 Guaranteed container and 2 non-Guaranteed containers.

Pod-level limit = 4 CPUs
Container 1 limit = 2 CPUs
Containers 2 and 3 can share the other 2 CPUs from the pod shared pool and will have the corresponding cpuset set. If the pod is using all 4 CPUs, there is no way to inject an ephemeral container into such a pod, and there is no way to add enough buffer to the pod-level limit to accommodate the ephemeral container under the pod-level limits.

Member

Yes, I asked Kevin for a follow-up PR for ephemeral containers and other small changes. We just need to decide where to borrow resources for the ephemeral container: the pod shared pool? If there is no shared pool, can it borrow from the first container, or from the container being debugged?

them. However, to completely eliminate any risk of parsing errors, the
safest and recommended rollback procedure involves draining the node and
deleting the state file.
* **Failure Mode:** Forward compatibility is not supported. An older

Member

AFAIK today's K8s rollback best practice still requires a node drain. In this case I don't have much concern for alpha. Let's reconsider this in beta.

@dchen1107
Member

/lgtm
/approve

Please send me the follow-up PR based on the comments later.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, KevinTMtz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2026
@k8s-ci-robot k8s-ci-robot merged commit 5eb0625 into kubernetes:master Feb 10, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 10, 2026
@esotsal
Contributor

esotsal commented Feb 11, 2026

/cc

Interesting KEP with good discussion; looking forward to the implementation.

@k8s-ci-robot k8s-ci-robot requested a review from esotsal February 11, 2026 04:38