KEP 1287: Graduate InPlacePodVerticalScaling to GA #5562

natasha41575 · 2025-09-22T20:25:35Z

One-line PR description: Graduate InPlacePodVerticalScaling to GA.

Issue link: In-Place Update of Pod Resources #1287

Other comments: We believe we are ready to graduate InPlacePodVerticalScaling to GA in Kubernetes 1.35. See the comments below for how we have met the graduation criteria, and the KEP itself for details on other updates.

/sig node autoscaling scheduling

/cc @tallclair @dchen1107

/cc @dom4ha
for sig-scheduling

/cc @jackfrancis
for sig-autoscaling

natasha41575 · 2025-09-22T20:26:17Z

keps/sig-node/1287-in-place-update-pod-resources/README.md

  - Resize atomicity
  - Exposing allocated resources in the pod status
  - QOS class changes
+- The subset of pod resize tests [here](https://github.com/kubernetes/kubernetes/blob/1aec2eb0030d2f121b4cf78998e9391d9389f1a0/test/e2e/common/node/pod_resize.go) under `doPodResizeTests` and `doPodResizeErrorTests` that meet the Conformance test requirements are promoted to Conformance.


We have a tracking bug for the Conformance endpoints: kubernetes/kubernetes#133607

dom4ha · 2025-09-23T21:40:27Z

/label lead-opted-in
/milestone v1.35

natasha41575 · 2025-09-25T15:27:57Z

/cc @wojtek-t

keps/sig-node/1287-in-place-update-pod-resources/README.md

jpbetz · 2025-10-01T01:13:27Z

/approve
For PRR (Everything appears to have been filled out by Beta, which is what I like to see!)

SergeyKanzhelev · 2025-10-08T18:51:24Z

keps/sig-node/1287-in-place-update-pod-resources/README.md

+  lack of support for resize is now a significant missing piece of that functionality; however
+  we don't believe this is a strong enough reason to block IPPR GA. We can, however, consider
+  whether this should block GA of pod level resources.
+- `UpdatePodSandboxResources` is implemented by containerd & CRI-O. This is implemented by


Can you add a note here that we will at least test this in e2e for stable? We need to validate that NRI plugins will in future be able to "decline" the resize by returning an error from this method call.

Synced offline, a summary of the open question:

All errors returned by this call are currently ignored: https://github.com/kubernetes/kubernetes/blob/0a4651c9910533f4649b8a11c334cf23237b1ccc/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L792

I think what we need is a decision about whether we should continue ignoring this error in all cases, or ignore it only in the "unimplemented" case. @tallclair do you have context on why we ignore the error entirely? Can we change it to ignore the error only when the call is not implemented by the CRI?

If we make that change, that will allow us to support and use CRIProxy to test the behavior that Sergey is talking about (that NRI plugins can intercept and block the resize call via UpdatePodSandboxResources).

@SergeyKanzhelev After thinking about this more, I realized that the behavior you want (an NRI plugin blocking a resize) can be achieved by intercepting the UpdateContainerResources CRI call here: https://github.com/kubernetes/kubernetes/blob/0a4651c9910533f4649b8a11c334cf23237b1ccc/pkg/kubelet/kuberuntime/kuberuntime_container.go#L409.

If I recall correctly, UpdatePodSandboxResources was intended to be purely informative. My guess is that this is why it was kept as a best-effort call, with the error logged rather than bubbled up.

I will look into this more tomorrow, but I think what we actually want here is coverage showing that an NRI plugin can decline the resize by returning an error from UpdateContainerResources, not from UpdatePodSandboxResources.

> @tallclair do you have context on why we ignore the error entirely? Can we change it to ignore only when call is not implemented by the CRI?

If errors are completely ignored, how can NRI plugins actually block resizes?

> If errors are completely ignored, how can NRI plugins actually block resizes?

Errors from UpdatePodSandboxResources are ignored, but errors from UpdateContainerResources are not, so an NRI plugin can block resizes by intercepting UpdateContainerResources.

> Can you add a note here that we will at least test this in e2e for stable. We need to validate that NRI plugins in future will be able to "decline" the resize by returning error from this method call.

I added a note under e2e tests that we will use CRI Proxy to test that NRI plugins can intercept UpdateContainerResources to block a resize.

The big difference between UpdatePodSandboxResources and UpdateContainerResources is that container cgroups are managed by the runtime, while Pod cgroups are managed by the Kubelet. Since the Kubelet is managing the cgroup and doesn't have direct feedback from NRI, we decided to make this an informational best-effort call.

If we wanted to allow errors to prevent the Kubelet from modifying the cgroup, I think this would be technically feasible, but we would need to change the order of calls. Currently, Kubelet modifies the cgroup before calling the CRI. I'm not sure if one is more technically correct than the other. An argument for calling CRI (and NRI, by extension) first is that it's more consistent with UpdateContainerResources NRI, which is called before modifying the cgroups.

> Errors from UpdatePodSandboxResources are ignored, but errors from UpdateContainerResources are not, so an NRI plugin can block resizes by intercepting UpdateContainerResources.

I'm not sure, but with Pod Level Resources there might not be an UpdateContainerResources call for all resizes.
/cc @ndixita

Summary of offline discussion:

We'll remove UpdatePodSandboxResources from the graduation criteria of this KEP and move it to #5419. We can also stop ignoring the error from UpdatePodSandboxResources (when the containerd implementation is available).

keps/sig-node/1287-in-place-update-pod-resources/README.md

dom4ha

/lgtm
/approve for sig-scheduling

We agreed that races with the scheduler should be out of scope for this feature. Races when handling individual pods are not very harmful, as they usually just require another round of scheduling.

Things will change when we consider the Workload-Aware Scheduling effort, in which the scheduler needs to schedule a group of pods and maintain its integrity.

At that point it will become critical to ensure that startup can be performed without races, and that the scheduler is in the loop for any preemption decisions that might affect a workload.

This work does not block this GA promotion though.

SergeyKanzhelev

/lgtm
/approve

I am happy with the downscoping of this KEP and with moving it to GA.

Among other things, what was originally in the graduation criteria is still valuable to build:

- VPA needs to start taking advantage of this KEP. The IPPR KEP made changes to prevent known disruptions when a resize is requested; however, VPA does not use them and instead forces the disruption when the kubelet reports that it cannot perform the resize in a non-disruptive way.
- Clarify the exact semantics of the interaction with CRI. Container runtimes and NRI plugins MUST have a say in whether a resize is allowed. This needs to be improved in the Pod-level resources resize KEP, and more changes may be needed long term.

tallclair

/lgtm
/approve

I think the feature is in good shape and useful as-is. We have additional enhancements we want to make, but these can progress as separate features. I am fully supportive of graduating to GA.

dchen1107 · 2025-10-16T00:36:27Z

/lgtm
/approve

k8s-ci-robot · 2025-10-16T00:36:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, dom4ha, jackfrancis, jpbetz, natasha41575, SergeyKanzhelev, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [jpbetz]
~~keps/sig-node/OWNERS~~ [dchen1107]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SergeyKanzhelev · 2025-10-16T05:37:23Z

/unhold

natasha41575 added 5 commits September 22, 2025 17:24

revert PreferNoRestart resize policy back to NotRequired

a7d1a56

add more details about the resize status

acfd63a

document kubelet-trigered eviction for critical pods

74511f8

update outdated notes regarding static CPU

24256c9

correct details about instrumentation

d8d9469

k8s-ci-robot requested review from dchen1107, dom4ha, jackfrancis and tallclair September 22, 2025 20:25

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 22, 2025

github-project-automation bot added this to SIG Scheduling Sep 22, 2025

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 22, 2025

natasha41575 commented Sep 22, 2025

natasha41575 force-pushed the ippr-ga branch from 65f1319 to 0b6845c Compare September 22, 2025 20:32

k8s-ci-robot added this to the v1.35 milestone Sep 23, 2025

k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Sep 23, 2025

jackfrancis mentioned this pull request Sep 23, 2025

[core][autoscaler][IPPR] Initial implementation for resizing pods in-place to the maximum configured by the user ray-project/ray#55961

Open

8 tasks

natasha41575 mentioned this pull request Sep 24, 2025

To prevent kubelet from evicting pods due to node affinity after a restart kubernetes/kubernetes#133803

Open

k8s-ci-robot requested a review from wojtek-t September 25, 2025 15:27

jackfrancis reviewed Sep 30, 2025

keps/sig-node/1287-in-place-update-pod-resources/README.md

jackfrancis reviewed Oct 1, 2025

keps/sig-node/1287-in-place-update-pod-resources/README.md

jackfrancis reviewed Oct 1, 2025

keps/sig-node/1287-in-place-update-pod-resources/README.md

jpbetz mentioned this pull request Oct 1, 2025

In-Place Update of Pod Resources #1287

Open

95 tasks

SergeyKanzhelev reviewed Oct 8, 2025

helayoty reviewed Oct 10, 2025

keps/sig-node/1287-in-place-update-pod-resources/README.md Outdated

helayoty reviewed Oct 10, 2025

keps/sig-node/1287-in-place-update-pod-resources/README.md

dom4ha reviewed Oct 13, 2025

k8s-ci-robot assigned dom4ha Oct 13, 2025

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 13, 2025

natasha41575 force-pushed the ippr-ga branch from 711f81f to c88f707 Compare October 13, 2025 15:26

k8s-ci-robot requested a review from ndixita October 14, 2025 00:05

natasha41575 force-pushed the ippr-ga branch from c88f707 to 5f64a83 Compare October 15, 2025 14:44

refine graduation critera for GA

1608bf1

natasha41575 force-pushed the ippr-ga branch from 5f64a83 to 1608bf1 Compare October 15, 2025 14:45

natasha41575 mentioned this pull request Oct 15, 2025

KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation #5555

Open

Remove UpdatePodSandboxResources from graduation criteria

1d8a21b

natasha41575 force-pushed the ippr-ga branch from 241b5ed to 1d8a21b Compare October 15, 2025 21:47

natasha41575 requested review from SergeyKanzhelev and tallclair October 15, 2025 21:48

SergeyKanzhelev approved these changes Oct 15, 2025

k8s-ci-robot assigned SergeyKanzhelev Oct 15, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2025

tallclair reviewed Oct 15, 2025

k8s-ci-robot assigned dchen1107 Oct 16, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2025

k8s-ci-robot merged commit 0653f14 into kubernetes:master Oct 16, 2025
3 of 4 checks passed

github-project-automation bot moved this from In Progress to Done in SIG Scheduling Oct 16, 2025

natasha41575 deleted the ippr-ga branch October 16, 2025 15:40

15 participants