From a7d1a569745016904d017dad992680ff569ea936 Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Mon, 22 Sep 2025 17:24:04 +0000 Subject: [PATCH 01/11] revert PreferNoRestart resize policy back to NotRequired --- .../README.md | 25 +++++++++---------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index 7491c233cf9..9463b43c2d2 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -278,14 +278,13 @@ the `/resize` subresource: To provide fine-grained user control, PodSpec.Containers is extended with ResizeRestartPolicy - a list of named subobjects (new object) that supports 'cpu' and 'memory' as names. It supports the following restart policy values: -* `PreferNoRestart` - default value; resize the Container without restart, if possible. - * `NotRequired` - Equivalent to `PreferNoRestart`, deprecated with v1.33. +* `NotRequired` - default value; resize the Container without restart, if possible. * `RestartContainer` - the container requires a restart to apply new resource values. (e.g. Java process needs to change its Xmx flag) By using ResizePolicy, user can mark Containers as safe (or unsafe) for in-place resource update. Kubelet uses it to determine the required action. -Note: `PreferNoRestart` restart policy for resize does not *guarantee* that a container won't be +Note: `NotRequired` restart policy for resize does not *guarantee* that a container won't be restarted. If the runtime knows a resize will trigger a restart, it should return an error instead, and the Kubelet will retry the resize on the next pod sync. The restart behavior when shrinking memory limits is not yet defined. @@ -295,10 +294,10 @@ that usually CPU can be added/removed without much problem whereas changes to available memory are more probable to require restarts. 
If more than one resource type with different policies are updated at the same -time, then `RestartContainer` policy takes precedence over `PreferNoRestart` policy. +time, then `RestartContainer` policy takes precedence over `NotRequired` policy. If a pod's RestartPolicy is `Never`, the ResizePolicy fields must be set to -`PreferNoRestart` to pass validation. That said, any in-place resize may result +`NotRequired` to pass validation. That said, any in-place resize may result in the container being stopped *and not restarted*, if the system can not perform the resize in place. @@ -528,12 +527,12 @@ The scheduler will use the maximum of: ### Flow Control The following steps denote the flow of a series of in-place resize operations -for a Pod with ResizePolicy set to PreferNoRestart for all its Containers. +for a Pod with ResizePolicy set to NotRequired for all its Containers. This is intentionally hitting various edge-cases for demonstration. 1. A new pod is created - `spec.containers[0].resources.requests[cpu]` = 1 - - `spec.containers[0].resizePolicy[cpu].restartPolicy` = `"PreferNoRestart"` + - `spec.containers[0].resizePolicy[cpu].restartPolicy` = `"NotRequired"` - all status is unset 1. Pod is scheduled @@ -1258,15 +1257,15 @@ Setup a namespace with min and max LimitRange and create a single, valid Pod. #### Resize Policy Tests Setup a guaranteed class Pod with two containers (c1 & c2). -1. No resize policy specified, defaults to PreferNoRestart. Verify that CPU and +1. No resize policy specified, defaults to NotRequired. Verify that CPU and memory are resized without restarting containers. -1. PreferNoRestart (cpu, memory) policy for c1, RestartContainer (cpu, memory) for c2. +1. NotRequired (cpu, memory) policy for c1, RestartContainer (cpu, memory) for c2. Verify that c1 is resized without restart, c2 is restarted on resize. -1. PreferNoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU only, +1. 
NotRequired cpu, RestartContainer memory policy for c1. Resize c1 CPU only, verify container is resized without restart. -1. PreferNoRestart cpu, RestartContainer memory policy for c1. Resize c1 memory only, +1. NotRequired cpu, RestartContainer memory policy for c1. Resize c1 memory only, verify container is resized with restart. -1. PreferNoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU & memory, +1. NotRequired cpu, RestartContainer memory policy for c1. Resize c1 CPU & memory, verify container is resized with restart. #### Backward Compatibility and Negative Tests @@ -1650,7 +1649,7 @@ _This section must be completed when targeting beta graduation to a release._ - 2025-01-24 - v1.33 updates for planned beta - Replace ResizeStatus with conditions - Improve memory limit downsize handling - - Rename ResizeRestartPolicy `NotRequired` to `PreferNoRestart`, + - Rename ResizeRestartPolicy `NotRequired` to `NotRequired`, and update CRI `UpdateContainerResources` contract - Add back `AllocatedResources` field to resolve a scheduler corner case - Introduce Actuated resources for actuation From acfd63ab155515a7ad9a853b3b8393e9051475af Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Mon, 22 Sep 2025 17:34:54 +0000 Subject: [PATCH 02/11] add more details about the resize status --- .../README.md | 22 ++++++++++++++----- 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index 9463b43c2d2..26e54cec795 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -308,24 +308,34 @@ The `ResizePolicy` field is immutable. Resize status will be tracked via 2 new pod conditions: `PodResizePending` and `PodResizeInProgress`. **PodResizePending** will track states where the spec has been resized, but the Kubelet has not yet -allocated the resources. 
There are two reasons associated with this condition: +allocated the resources (desired resources != actuated resources). There are two reasons associated +with this condition: * `Deferred` - the proposed resize is feasible in theory (it fits on this node) - but is not possible right now; it will be regularly reevaluated. -* `Infeasible` - the proposed resize is not feasible and is rejected; it may not - be re-evaluated. + but is not possible right now; it will be regularly reevaluated. This can happen + if the node does not have enough free resources at the moment, but might in the + future when other pods are removed or scaled down. +* `Infeasible` - the proposed resize is not feasible and is rejected; it will never + be re-evaluated. Today, the possible reasons for infeasible include: + * The requested resources exceed the node's total capacity. + * The pod is a static pod. + * In-place resize is not yet supported for containers with swap enabled. + * In-place resize is not yet supported for guaranteed pods alongside memory manager static policy. + * In-place resize is not yet supported for guaranteed pods alongside CPU manager static policy. In either case, the condition's `message` will include details of why the resize has not been admitted. `lastTransitionTime` will be populated with the time the condition was added. `status` will always be `True` when the condition is present - if there is no longer a pending resized -(either the resize was allocated or reverted), the condition will be removed. +(either the resize was allocated or reverted), the condition will be removed. `observedGeneration` will +reflect the `metadata.generation` of the pod when the resize was last attempted. **PodResizeInProgress** will track in-progress resizes, and should be present whenever allocated resources != actuated resources (see [Resource States](#resource-states)). 
For successful synchronous resizes, this condition should be short lived, and `reason` and `message` will be left blank. If an error occurs while actuating the resize, the `reason` will be set to `Error`, and `message` will be populated with the error message. In the future, this condition will also be used for long-running -resizing behaviors (see [Memory Limit Decreases](#memory-limit-decreases)). +resizing behaviors (see [Memory Limit Decreases](#memory-limit-decreases)). `observedGeneration` will +reflect the `metadata.generation` of the pod when the resize was initially requested. Note that it is possible for both conditions to be present at the same time, for example if an error is encountered while actuating a resize and a new resize comes in that gets deferred. From 74511f8d0a209a1386837143be4446b5fc635a37 Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Mon, 22 Sep 2025 17:39:35 +0000 Subject: [PATCH 03/11] document kubelet-triggered eviction for critical pods --- keps/sig-node/1287-in-place-update-pod-resources/README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index 26e54cec795..e7871778b8b 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -476,6 +476,12 @@ Allocation will be attempted on the pods in the queue: A successful allocation will trigger a pod sync, which will actuate the allocated resize and update the pod status accordingly. +### Kubelet-triggered eviction + +A pod can be marked as critical with the `priorityClassName` of `system-node-critical` or `system-cluster-critical` as +described in [Guaranteed Scheduling For Critical Add-On Pods](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
If the kubelet receives a resize request for a +critical pod and there is not enough space for the resize, it will evict a non-critical pod to make room. + ### Kubelet and API Server Interaction When a new Pod is created, Scheduler is responsible for selecting a suitable From 24256c9e1dfec637e324d86e0aa7018276c5e12c Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Mon, 22 Sep 2025 17:47:09 +0000 Subject: [PATCH 04/11] update outdated notes regarding static CPU --- .../1287-in-place-update-pod-resources/README.md | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index e7871778b8b..0ec1b18a2f4 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -783,10 +783,6 @@ Impacts of a restart outside of resource configuration are out of scope. #### Notes -* If CPU Manager policy for a Node is set to 'static', then only integral - values of CPU resize are allowed. If non-integral CPU resize is requested - for a Node with 'static' CPU Manager policy, that resize is rejected, and - an error message is logged to the event stream. * To avoid races and possible gamification, all components will use Pod's Status.ContainerStatuses[i].Resources when computing resources used by Pods. @@ -1006,11 +1002,9 @@ This metric is recorded as a counter. ### Static CPU & Memory Policy -Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of -in-place resize. If a pod is a guaranteed QOS on a node with a static CPU or memory policy -configured, then the resize will be marked as infeasible. +Resizing pods with static CPU & memory policy configured is out-of-scope for this KEP. If a pod is a guaranteed QOS on a node with a static CPU or memory policy configured, then the resize will be marked as infeasible. 
-This will be reconsidered post-beta as a future enhancement. +This support will be added post-GA as a separate enhancement in its own KEP. ### Future Enhancements From d8d9469936879e75aa21010dc4ab0d4dc188b3f1 Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Mon, 22 Sep 2025 17:55:13 +0000 Subject: [PATCH 05/11] correct details about instrumentation --- .../README.md | 29 +++++++++---------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index 0ec1b18a2f4..ede59095b2f 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -945,7 +945,6 @@ Labels: - `resource` - what resource. Possible values: `cpu`, or `memory`. If more than one of these is changing in the resize request, we increment the counter multiple times, once for each. - `requirement` - Possible values: `limits`, or `requests`. If more than one of these is changing in the resize request, we increment the counter multiple times, once for each. - `operation` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`. -- `namespace` - the namespace of the pod. This metric is recorded as a counter. @@ -953,18 +952,13 @@ This metric is recorded as a counter. This metric tracks the duration of [doPodResizeAction](https://github.com/kubernetes/kubernetes/blob/92de70895830ea1a9c2c6554bdab4cbee7ce867d/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L699), which is responsible for actuating the resize. -Labels: -- `namespace` - the namespace of the pod. - This metric is recorded as a histogram. -#### `kubelet_pod_pending_resizes` +#### `kubelet_pod_infeasible_resizes_total` -This metric tracks the current count of pods that the kubelet marks as pending.
This will make it -easier for us to see which of the current limitations users are running into the most. +This metric tracks the total number of resizes that were rejected by the kubelet as infeasible. -Labels: -- `reason` - why the resize is pending. Possible values: `infeasible` or `deferred`. +Labels: - `reason_detail` - more details about why the resize is pending. Although a more detailed "message" will be provided in the `PodResizePending` condition in the pod, we limit this label to only the following possible values to keep cardinality low: - `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside CPU Manager static policy. @@ -972,10 +966,19 @@ condition in the pod, we limit this label to only the following possible values - `static_pod` - In-place resize is not supported for static pods. - `swap_limitation` - In-place resize is not supported for containers with swap. - `insufficient_node_allocatable` - The node doesn't have enough capacity for this resize request. -- `namespace` - the namespace of the pod. This list of possible reasons may shrink or grow depending on limitations that are added or removed in the future. +This metric is recorded as a counter. + +#### `kubelet_pod_pending_resizes` + +This metric tracks the current count of pods that the kubelet marks as pending. This will make it +easier for us to see which of the current limitations users are running into the most. + +Labels: +- `reason` - why the resize is pending. Possible values: `infeasible` or `deferred`. + This metric is recorded as a gauge. #### `kubelet_pod_in_progress_resizes` @@ -983,9 +986,6 @@ This metric is recorded as a gauge. This metric tracks the total count of resize requests that the kubelet marks as in progress, meaning that the resources have been allocated but not yet actuated. -Labels: -- `namespace` - the namespace of the pod. - This metric is recorded as a gauge. 
#### `kubelet_pod_deferred_resize_accepted_total` @@ -995,8 +995,7 @@ later accepted. This metric primarily exists because if a deferred resize is acc opposed to being triggered by an event such as another pod being deleted or sized down), it indicates an issue in the Kubelet's logic for handling deferred resizes that we should fix. Labels: - - `accepted_reason` - whether the resize was accepted through the timed retry or due to another pod event. Possible values: `periodic_retry`, `event_based`. - - `namespace` - the namespace of the pod. + - `retry_trigger` - whether the resize was accepted through the timed retry or due to another pod event. Possible values: `periodic_retry`, `pod_resized`, `pod_updated`, `pods_added`, `pods_removed`. This metric is recorded as a counter. From 8618e3bc8a523f59afd2c8687cadf9003c282c58 Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Tue, 7 Oct 2025 20:15:35 +0000 Subject: [PATCH 06/11] correct small detail about shrinking memory limits --- keps/sig-node/1287-in-place-update-pod-resources/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index ede59095b2f..9c8f90522af 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -286,8 +286,8 @@ ResizeRestartPolicy - a list of named subobjects (new object) that supports Note: `NotRequired` restart policy for resize does not *guarantee* that a container won't be restarted. If the runtime knows a resize will trigger a restart, it should return an error instead, -and the Kubelet will retry the resize on the next pod sync. The restart behavior when shrinking -memory limits is not yet defined. +and the Kubelet will retry the resize on the next pod sync. 
The behavior when shrinking +memory limits is defined under [Memory Limit Decreases](#memory-limit-decreases) below. Setting the flag to separately control CPU & memory is due to an observation that usually CPU can be added/removed without much problem whereas changes to From 6e1c6afec9695ea48a0f9ea799a3874bf631c48e Mon Sep 17 00:00:00 2001 From: Natasha Sarkar Date: Mon, 22 Sep 2025 17:56:13 +0000 Subject: [PATCH 07/11] Update in-place pod resize for GA --- keps/prod-readiness/sig-node/1287.yaml | 2 ++ .../README.md | 34 +++++++++++-------- .../kep.yaml | 6 ++-- 3 files changed, 24 insertions(+), 18 deletions(-) diff --git a/keps/prod-readiness/sig-node/1287.yaml b/keps/prod-readiness/sig-node/1287.yaml index 94c6ddcc625..fceb140fd03 100644 --- a/keps/prod-readiness/sig-node/1287.yaml +++ b/keps/prod-readiness/sig-node/1287.yaml @@ -3,3 +3,5 @@ alpha: approver: "@ehashman" beta: approver: "@jpbetz" +stable: + approver: "@jpbetz" diff --git a/keps/sig-node/1287-in-place-update-pod-resources/README.md b/keps/sig-node/1287-in-place-update-pod-resources/README.md index 9c8f90522af..c3d9fd6bcc2 100644 --- a/keps/sig-node/1287-in-place-update-pod-resources/README.md +++ b/keps/sig-node/1287-in-place-update-pod-resources/README.md @@ -20,6 +20,7 @@ - [Design Details](#design-details) - [Resource States](#resource-states) - [Priority of Resize Requests](#priority-of-resize-requests) + - [Kubelet-triggered eviction](#kubelet-triggered-eviction) - [Kubelet and API Server Interaction](#kubelet-and-api-server-interaction) - [Kubelet Restart Tolerance](#kubelet-restart-tolerance) - [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction) @@ -41,6 +42,7 @@ - [Instrumentation](#instrumentation) - [kubelet_container_requested_resizes_total](#kubelet_container_requested_resizes_total) - [kubelet_pod_resize_duration_seconds](#kubelet_pod_resize_duration_seconds) + - [kubelet_pod_infeasible_resizes_total](#kubelet_pod_infeasible_resizes_total) - 
[kubelet_pod_pending_resizes](#kubelet_pod_pending_resizes) - [kubelet_pod_in_progress_resizes](#kubelet_pod_in_progress_resizes) - [kubelet_pod_deferred_resize_accepted_total](#kubelet_pod_deferred_resize_accepted_total) @@ -96,20 +98,20 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. -- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - - [ ] e2e Tests for all Beta API Operations (endpoints) - - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place - - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [x] e2e Tests for all Beta API Operations (endpoints) + - [x] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [x] 
(R) Minimum Two Week Window for GA e2e tests to prove flake free +- [x] (R) Graduation criteria is in place + - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Production readiness review completed - [ ] (R) Production readiness review approved -- [ ] "Implementation History" section is up-to-date for milestone -- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] -- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes +- [x] "Implementation History" section is up-to-date for milestone +- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes