KEP-5526: Pod Level Resource Managers 1.36 #5775
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
Added a glossary with the terms that may cause some confusion; refer to the previous comments in the original PR.
Force-pushed from 947d142 to 729da77.
/assign @ndixita
/assign @ffromani
/assign @pravk03
/assign @swatisehgal
> to the container's cgroup.
>
> 6. **[New Logic] Container and Pod Removal:** When a container is removed, its
>    individual assignment is deleted from the state. The pod-level resource
Are the container resources also released back to pod's shared pool?
No, because of the static behavior of the CPU and Memory managers, the podSharedPool remains the same until pod termination. Rewrote that section to be clearer about it.
This behavior as written looks reasonable for alpha. If it ever becomes undesirable, we would need a clear use case to think about further changes.
+1 to keep the simplicity for alpha.
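To make the removal semantics above concrete, here is a minimal Go sketch; the type and method names are hypothetical and do not reflect the actual kubelet CPU manager API:

```go
package sketch

import "k8s.io/utils/cpuset"

// Hypothetical state shape; names are illustrative, not the real kubelet types.
type podCPUState struct {
	podSharedPool cpuset.CPUSet            // pod-level pool, unchanged until pod termination
	containers    map[string]cpuset.CPUSet // exclusive per-container assignments
}

// RemoveContainer deletes only the container's individual assignment; the
// pod-level pool is intentionally left untouched (static behavior for alpha).
func (s *podCPUState) RemoveContainer(containerName string) {
	delete(s.containers, containerName)
}

// RemovePod releases everything, including the pod-level pool.
func RemovePod(pods map[string]*podCPUState, podUID string) {
	delete(pods, podUID)
}
```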
Force-pushed from 729da77 to bc51a40.
PRR looks good for alpha. The PRR approval is already codified by the approval from last release (https://github.com/KevinTMtz/enhancements/blob/pod-level-resource-managers-1-36/keps/prod-readiness/sig-node/5526.yaml), so no other approval steps are required for PRR.
Force-pushed from bc51a40 to c43a33d.
ffromani left a comment:
The KEP changes look reasonable and match our conversation in the PR.
The only point worth double-clicking on is the state file changes and the backward/forward upgrade and rollback flows.
While what we have is a good start, I'd like to make sure we are on par with the existing quality standards, and to see if there are cheap and reasonable ways to improve. For example: is it time (and is it worth it?) to finally begin adding an explicit version to the checkpoint file, using a field, encoded in the filename, or both?
I'll think about it keeping practicality in mind; it would be very unfair to charge these improvements to this KEP unless they are very cheap, like wrapping new fields in a struct as we already did. I'll pause reviews momentarily to let other reviewers chime in.
> state information associated with a pod-level resource allocation. While its
> initial implementation only encapsulates the overall `CPUSet` assigned to the
> pod (`CPUSet cpuset.CPUSet`), using a struct instead of a raw type is a
> strategic choice for future-proofing. It allows for the seamless addition of new
Yes, it is meant to be forward compatible. The deserialization code in older kubelets will just ignore unknown fields in the struct.
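A minimal sketch of that forward-compatibility property, assuming the pod-level entry is serialized as JSON with the CPUSet stored as a string; the struct and field names are illustrative, not the actual checkpoint schema:

```go
package sketch

import "encoding/json"

// Hypothetical pod-level entry as an older kubelet might define it.
type podCPUEntry struct {
	CPUSet string `json:"cpuSet"` // e.g. "0-3"
}

// A newer kubelet wrote an extra field this struct does not know about.
var newerCheckpointFragment = []byte(`{"cpuSet":"0-3","futureField":"x"}`)

// decodeOldReader shows that encoding/json simply drops unknown fields, so the
// older struct definition still decodes the newer entry without error.
func decodeOldReader() (podCPUEntry, error) {
	var e podCPUEntry
	err := json.Unmarshal(newerCheckpointFragment, &e)
	return e, err // e.CPUSet == "0-3"
}
```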
> To achieve this, the CPU Manager performs a check to determine the
> `ResourceIsolationLevel` for each container. This logic evaluates the
> container's resource assignment and its eligibility for exclusivity to
> distinguish between:
IIRC we decided to keep this concept internal to the kubelet (which is fine). Just add a note that this is an internal concept and not part of the user-facing API.
Added note, thank you.
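For illustration, the internal concept could look roughly like the sketch below; the constant and function names are hypothetical, only the `ResourceIsolationLevel` term itself comes from the KEP:

```go
package sketch

// ResourceIsolationLevel is an internal kubelet concept (not user-facing API)
// describing where a container's resources come from.
type ResourceIsolationLevel int

const (
	// IsolatedAtContainer: the container owns exclusive resources carved out of the pod budget.
	IsolatedAtContainer ResourceIsolationLevel = iota
	// IsolatedAtPod: the container runs in the pod's shared pool (pod-level resources set).
	IsolatedAtPod
	// IsolatedAtNode: the container runs in the node's shared pool.
	IsolatedAtNode
)

// isolationLevelFor is an illustrative classification: a container eligible for
// exclusive allocation (e.g. integer CPU request == limit under the static
// policy) is container-isolated; otherwise it falls back to the pod pool when
// pod-level resources are set, or to the node pool when they are not.
func isolationLevelFor(podLevelResourcesSet, exclusiveEligible bool) ResourceIsolationLevel {
	switch {
	case exclusiveEligible:
		return IsolatedAtContainer
	case podLevelResourcesSet:
		return IsolatedAtPod
	default:
		return IsolatedAtNode
	}
}
```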
> The `PodEntry` struct encapsulates the pod-level memory assignment, currently
> containing the overall `MemoryBlocks` slice assigned to the pod. As with the CPU
> Manager, using a struct here is a forward-looking design choice. It allows for
> the future inclusion of additional metadata, such as the original topology hint
> or request details, without disrupting the state file schema or requiring complex
> migrations.
LGTM. I need another pass to make sure I'm not neglecting anything, because the memory manager is the less active resource manager, so details fade quickly.
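A rough sketch of the shape being discussed, with a locally defined `Block` so it stays self-contained; the field names mirror the memory manager's state but should be treated as illustrative:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Block mirrors the kind of memory-block record the memory manager keeps
// (NUMA affinity, resource type, size); defined locally for illustration.
type Block struct {
	NUMAAffinity []int           `json:"numaAffinity"`
	Type         v1.ResourceName `json:"type"`
	Size         uint64          `json:"size"`
}

// PodEntry wraps the pod-level assignment in a struct so that future metadata
// (e.g. the original topology hint or request details) can be added without
// changing the state file schema.
type PodEntry struct {
	MemoryBlocks []Block `json:"memoryBlocks"`
	// Future fields go here.
}
```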
> * **Mitigation (Forward Compatibility):** The changes to the state file
>   are additive (new fields only), preserving the overall format. However,
>   because older Kubelets cannot gracefully handle these new fields, the
>   only safe rollback path involves draining the node and deleting the
>   state file before restarting the older Kubelet version.
We discussed a more compatible approach with multiple checkpoints, but that made the code much more complex without obvious compatibility gains. Looking at the big picture, this seems to be the sweet spot between simplicity and robustness.
> * **Mitigation (Backward Compatibility):** The new Kubelet version will be
>   implemented to transparently read the old state file format. When it
>   admits a new pod using this feature, it will write the new pod-aware
>   entries, ensuring a seamless upgrade path without manual intervention.
this is easy and cheap, should be no bother.
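A sketch of that upgrade read path, assuming the new pod-level entries are an additive map in the same JSON checkpoint; the field names are simplified, not the real checkpoint layout:

```go
package sketch

import "encoding/json"

// cpuCheckpoint is a simplified, illustrative view of the state file: the
// existing per-container entries plus a new, additive pod-level map.
type cpuCheckpoint struct {
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries"`              // existing container assignments
	PodEntries    map[string]string `json:"podEntries,omitempty"` // written only by the newer kubelet
}

// loadCheckpoint reads either format: an old file simply has no "podEntries"
// key, so PodEntries decodes as nil and is treated as "no pod-level allocations".
func loadCheckpoint(data []byte) (*cpuCheckpoint, error) {
	var cp cpuCheckpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		return nil, err
	}
	return &cp, nil
}
```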
Force-pushed from c43a33d to b909fd0.
> compatible with the changes presented in this KEP. Standard init containers will
> be allocated exclusive slices from the pod's managed resource pool. Because they
> run sequentially, their resources are made reusable for subsequent app
> containers. Restartable init containers (sidecars) that are `Guaranteed` will
nit: We could add that sidecar containers won't yield their resources, as they are intended to run for the entire duration of the pod's life.
Done, reworded to be clearer about it.
lgtm for the partitioning logic and overall design. Pending items to review from my end: the CPU quota management section and the PRR questionnaire, to be done by EOD.
Force-pushed from b909fd0 to 44e13e4.
Took another pass and lgtm! In a follow-up PR, let's add:
Force-pushed from 44e13e4 to b30db39.
/lgtm
My comments have been addressed, the KEP is in good shape, and I don't have any low-hanging fruit for improvements.
[EDIT] I still want to deep-dive into the memory manager's fine details, but there are no red (or yellow) flags in the changes here, so I prefer to move forward.
Regarding this point, research in the GitHub history and in the KEP has been inconclusive so far, and the most credible theory remains unconfirmed. I'd welcome further investigation in this area, but we need some more hard data.
For context, item 1 ("A small note about topology manager in-memory state being cleared when a container is removed.") is relevant because of this discussion in kubernetes/kubernetes#134345 (comment) from RestartAllContainers.
Force-pushed from b30db39 to 42f5e85.
Amended the commit with the required updates to
/lgtm
Confirmed by inter-diff.
> exclusive allocations (for both `Guaranteed` sidecars and `Guaranteed`
> application containers) have been reserved. This ensures that shared sidecars do
> not interfere with the exclusive resources of application containers once they
> start.
+1 on podSharedPool. This allows us to treat a Pod as a single NUMA unit while allowing internal flexibility, and we still maintain backward compatibility at the container level.
@pravk03 for vis.
> | Pod-level resources | Container-level resources | Behavior |
> | --- | --- | --- |
> | Unset | Set | Current behavior (alignment and exclusive allocation based on individual container resources). |
> | Set | Unset | No alignment or exclusive allocation. Containers run in the node's shared pool. |
> | Set | Set | Alignment and exclusive allocation for containers that specify `Guaranteed` resources; pod-level resources are ignored for allocation decisions. |
Can you help me understand the difference between this category (pod is set, and container is set) and the one above (pod is unset, and container is set) when `topology-manager-scope` is set to `container`?
When only container-level resources are used (pod unset, container set) and exclusive resources are desired, all containers must have Guaranteed QoS, which requires each of them to have equal requests and limits. In summary, it is an all-or-nothing situation: either all containers receive exclusive resources, or none of them do.
When pod-level resources are used (pod set, container set), Guaranteed QoS is no longer required for each container, only for the overall pod, but the containers that need exclusive resources must still be Guaranteed. For example, this enables a hybrid model: a pod with Guaranteed QoS determined by its pod-level resources and three containers, only one of which is Guaranteed, so only the Guaranteed container obtains exclusive resources, while the non-Guaranteed containers access the shared node resources, constrained by the pod-level limit.
Thanks for the detailed explanation. It makes sense to me, but can you add an example below to reflect this difference?
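As an illustration of the hybrid model described above, here is a sketch of such a pod, assuming the pod-level `spec.resources` field from the pod-level resources feature; the names and quantities are made up, and memory is omitted for brevity (a real Guaranteed pod would also need memory requests equal to limits):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func cpu(v string) corev1.ResourceList {
	return corev1.ResourceList{corev1.ResourceCPU: resource.MustParse(v)}
}

// hybridPod: pod-level requests == limits (4 CPUs), so the pod is Guaranteed,
// but only "worker" is individually Guaranteed and gets exclusive CPUs; the
// other two containers share the remaining pod budget.
func hybridPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "hybrid-isolation-example"},
		Spec: corev1.PodSpec{
			Resources: &corev1.ResourceRequirements{Requests: cpu("4"), Limits: cpu("4")},
			Containers: []corev1.Container{
				{
					Name:      "worker",
					Resources: corev1.ResourceRequirements{Requests: cpu("2"), Limits: cpu("2")}, // Guaranteed -> exclusive CPUs
				},
				{Name: "helper-a"}, // no requests/limits -> pod shared pool
				{Name: "helper-b"}, // no requests/limits -> pod shared pool
			},
		},
	}
}
```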
> ###### Special configurations
>
> **Underutilization of Pod-Level Resources**
This is by design, but please ensure there is a log message warning users.
This log message change would then be part of the PLR implementation, @KevinTMtz. We can submit it as a follow-up PR, as it wouldn't affect this KEP in particular.
> sidecar behavior, if a restartable init container is individually
> `Guaranteed`, it is allocated an exclusive slice that is reserved for
> the entire pod lifecycle and is *not* added to the reusable set.
> - **Standard Init Containers:** If individually `Guaranteed`, they are
In this case, shouldn't we have a pod resource underutilization issue? Since standard init containers run sequentially, and before any restartable sidecar container, can we change the logic by allowing them to burst to the entire pod?
Guaranteed init containers get exclusive resources corresponding to their request and limit.
Non-Guaranteed init containers are granted access to the entire pod-level resource budget minus any resources exclusively allocated to Guaranteed sidecars running at that time. So yes, they are able to burst to whatever is not exclusively allocated in the pod.
My point is: once pod-level resources are set, can we allow standard init containers to burst to the entire pod resource budget and ignore their own resource requests and limits, since they run sequentially anyway?
That behavior can be obtained if no request and limit are specified for the init containers. I think it is better to stay consistent and respect the expected behavior of the container-level limit the user set; if the user wants a container to burst into the available pod resources, the pod shared pool will take care of that when no container-level limit is specified.
sgtm. Let's document the best practice for this.
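To make the shared-pool behavior above concrete, here is a minimal sketch of the computation; the function and parameter names are hypothetical:

```go
package sketch

import "k8s.io/utils/cpuset"

// sharedPoolFor returns the CPUs a non-Guaranteed (init) container can burst
// into: the pod-level allocation minus everything exclusively assigned to
// Guaranteed sidecars running at that point.
func sharedPoolFor(podCPUs cpuset.CPUSet, exclusiveSidecarCPUs []cpuset.CPUSet) cpuset.CPUSet {
	pool := podCPUs
	for _, c := range exclusiveSidecarCPUs {
		pool = pool.Difference(c)
	}
	return pool
}
```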
> 1. The `topology-manager-scope` is set to `pod`.
> 2. The sum of resource requests from all containers that are `Guaranteed`
>    exactly equals the total resource budget defined in `pod.spec.resources`.
> 3. There is at least one other container in the pod that does not qualify for
What about ephemeral container(s)? Today ephemeral containers borrow node/pod resources; can we explicitly document this? With the topology-manager-scope being introduced, should we allow them to use the pod's shared resources, or ...?
Ephemeral containers are not offered any alignment or exclusive allocation by the resource managers, since they lack any guarantees for resources or execution. I think it is better to keep them separate, as they currently are, without any special handling. Do you have any specific concerns about ephemeral containers and resource managers?
No, I don't have many concerns. I just want us to document it clearly in the design and the user docs.
@dchen1107 thanks for bringing this up. I have a few questions now that I missed earlier.
Consider a case: a pod with PLR set has 1 Guaranteed container and 2 non-Guaranteed containers.
Pod-level limit = 4 CPUs
Container 1 limit = 2 CPUs
Containers 2 and 3 can share the other 2 CPUs from the pod shared pool and will have the corresponding cpuset set. If the pod is using all 4 CPUs, there is no way to inject an ephemeral container into such a pod, and there is no way to add enough buffer to the pod-level limit to accommodate the ephemeral container under the pod-level limits.
Yes, I asked Kevin for a follow-up PR for ephemeral containers and other small changes. We just need to make a decision on where to borrow resources for the ephemeral container: the pod shared pool? If there is no shared pool, can it borrow from the first container? Or from the container being debugged?
> them. However, to completely eliminate any risk of parsing errors, the
> safest and recommended rollback procedure involves draining the node and
> deleting the state file.
> * **Failure Mode:** Forward compatibility is not supported. An older
AFAIK, today's K8s rollback flow still requires a node drain as the best practice, so in this case I don't have much concern for alpha. Let's reconsider this in beta.
/lgtm
Please send me the follow-up PR based on the comments later.
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: dchen1107, KevinTMtz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/cc
Interesting KEP with good discussion, looking forward to the implementation.
This KEP change does not modify the originally proposed behavior; it only adds some specifics that surfaced during the implementation of the feature.
The most significant additions are a new property in the CPU and Memory managers' state to keep track of the pod-level exclusive resource allocation (which requires a PRR review perspective) and the introduction of a new term, `ResourceIsolationLevel`, that describes whether a container's resources belong to the container, the pod, or the node (currently only used by the CPU manager to handle CPU quota enablement/disablement).