
Conversation

tallclair
Member

@tallclair tallclair commented Jan 24, 2025

InPlacePodVerticalScaling changes for v1.33, including:

/sig node
/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Jan 24, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 24, 2025
@tallclair
Member Author

/assign @dchen1107 @thockin

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 24, 2025
@tallclair
Member Author

/assign @vinaykul

@esotsal

esotsal commented Jan 25, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from esotsal January 25, 2025 00:12
@tallclair
Member Author

FYI - I just changed ResizePolicy to be immutable. We no longer require it to be mutable, since we no longer default the resize policy field, and a mutable policy prevents us from being able to allow memory limit decreases for containers with a RestartContainer policy (otherwise you could change the resize policy & decrease the limit, and then immediately change the resize policy back).
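A minimal sketch of what that immutability check could look like, using simplified stand-in types rather than the real core/v1 structs (hypothetical helper; the actual check would live in pod-update validation):

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the core/v1 container fields discussed above.
type ContainerResizePolicy struct {
	ResourceName  string // "cpu" or "memory"
	RestartPolicy string // "NotRequired" or "RestartContainer"
}

type Container struct {
	Name         string
	ResizePolicy []ContainerResizePolicy
}

// validateResizePolicyImmutable rejects any change to resizePolicy on pod
// update. With the field immutable, "flip the policy, decrease the memory
// limit, flip it back" is no longer possible.
func validateResizePolicyImmutable(oldContainers, newContainers []Container) []error {
	oldByName := map[string][]ContainerResizePolicy{}
	for _, c := range oldContainers {
		oldByName[c.Name] = c.ResizePolicy
	}
	var errs []error
	for _, c := range newContainers {
		if prev, ok := oldByName[c.Name]; ok && !reflect.DeepEqual(prev, c.ResizePolicy) {
			errs = append(errs, fmt.Errorf("container %q: resizePolicy is immutable", c.Name))
		}
	}
	return errs
}

func main() {
	before := []Container{{Name: "app", ResizePolicy: []ContainerResizePolicy{{"memory", "RestartContainer"}}}}
	after := []Container{{Name: "app", ResizePolicy: []ContainerResizePolicy{{"memory", "NotRequired"}}}}
	fmt.Println(validateResizePolicyImmutable(before, after))
}
```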

See [`Alternatives: Allocated Resources`](#allocated-resources-1) for alternative APIs considered.

The allocated resources API should be reevaluated prior to GA.
The scheduler uses `max(spec...resources, status...allocatedResources, status...resources)` for fit
Member

A note to myself: This is a joint decision we made together, but in the worst case it could leave resources underutilized.
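For reference, a small sketch of the `max()` fit logic quoted above, with millicore integers standing in for the scheduler's real resource.Quantity handling:

```go
package main

import "fmt"

// podRequests mirrors the three places the scheduler can read requests from,
// per the KEP text: the desired spec, the kubelet-allocated resources, and
// the actually applied resources.
type podRequests struct {
	Spec      map[string]int64 // spec.containers[i].resources.requests
	Allocated map[string]int64 // status.containerStatuses[i].allocatedResources
	Actual    map[string]int64 // status.containerStatuses[i].resources.requests
}

// fitRequest returns max(spec, allocated, actual) for one resource, the value
// used for fit while a resize is in flight. Taking the max is deliberately
// conservative: it can leave a node underutilized (the concern noted above),
// but it never over-commits.
func fitRequest(r podRequests, resource string) int64 {
	m := r.Spec[resource]
	if v := r.Allocated[resource]; v > m {
		m = v
	}
	if v := r.Actual[resource]; v > m {
		m = v
	}
	return m
}

func main() {
	r := podRequests{
		Spec:      map[string]int64{"cpu": 1000}, // resize down to 1 CPU requested...
		Allocated: map[string]int64{"cpu": 2000}, // ...but 2 CPUs still allocated
		Actual:    map[string]int64{"cpu": 2000},
	}
	fmt.Println(fitRequest(r, "cpu")) // 2000 millicores
}
```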

- Restart after checkpointing: pod goes through admission using the allocated resources
1. Kubelet creates a container
- Resources acknowledged after CreateContainer call succeeds
- Restart before acknowledgement: Kubelet issues a superfluous UpdatePodResources request
Member

Just a note: cc/ @samuelkarp and @mikebrow to make sure the UpdatePodResources call is idempotent, since I vaguely remember there was a bug at the CRI layer around this recently.

Member

I'm going to continue to argue that kubelet should be periodically re-asserting the resources it wants, so this absolutely needs to be idempotent :)

Also, I think we should just re-assert at startup for all pods and containers.

Member

+1 to the above. We should address this, but it's not a blocker for this KEP.

Member Author

Idempotence is mentioned under the CRI changes (and actuating resources). Resync is under future enhancements.
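A sketch of what runtime-side idempotence means here, using hypothetical types and an applyCgroupLimits stub: re-sending the same resize (for example after a poorly timed kubelet restart) must be a harmless no-op:

```go
package main

import "fmt"

// Hypothetical runtime-side state; not the real CRI implementation.
type linuxResources struct {
	CPUShares   int64
	MemoryLimit int64
}

type containerState struct {
	applied linuxResources
}

// applyCgroupLimits is a stub; a real runtime would write cpu.weight /
// memory.max (cgroup v2) here.
func applyCgroupLimits(r linuxResources) error { return nil }

// UpdateContainerResources is idempotent: calling it repeatedly with the
// same desired values changes nothing after the first successful call.
func (c *containerState) UpdateContainerResources(desired linuxResources) error {
	if c.applied == desired {
		return nil // already at the desired values; a replayed request is a no-op
	}
	if err := applyCgroupLimits(desired); err != nil {
		return err
	}
	c.applied = desired
	return nil
}

func main() {
	c := &containerState{}
	want := linuxResources{CPUShares: 1024, MemoryLimit: 512 << 20}
	_ = c.UpdateContainerResources(want) // initial resize
	_ = c.UpdateContainerResources(want) // superfluous replay after restart: no-op
	fmt.Printf("%+v\n", c.applied)
}
```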

If a resize request does not succeed, the Kubelet will retry the resize on every subsequent pod
sync, until it succeeds or the container is terminated.

### Memory Limit Decreases
Member

A note to us: cc/ @haircommander @mrunalp, since you two are leading the conversation with the kernel community on the cgroup v2 limitation.

@dchen1107
Member

/lgtm overall with some minor comments.

Will leave it to the other reviewers and @thockin for a final review before approving the KEP.

Member

@thockin thockin left a comment

/approve for API

- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
```
- `status.conditions[type==PodResizePending].type` = `"Infeasible"`
- actual CPU shares = 1638
Member

This timeline is great. So great that I think you should do more (non-blocking).

I'd love to see the situation where an NRI plugin mutates the request to round to 10 shares, just to demonstrate how actual and allocated are related (e.g. I assume the actual would show shares/1024, so a 1.5 request would translate to 1536 shares, which gets rounded to 1540, which comes back to actual as 1.504). At that point, the scheduler, with its max() logic, would consider 1.504, not 1.500. Right?

Member Author

Will address in a separate PR.
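As a quick sketch of the shares arithmetic walked through above, assuming the standard 1024-shares-per-CPU conversion and a hypothetical NRI plugin that rounds shares up to a multiple of 10:

```go
package main

import (
	"fmt"
	"math"
)

const sharesPerCPU = 1024 // standard cpu-to-shares conversion

func cpuToShares(cores float64) int64 { return int64(math.Round(cores * sharesPerCPU)) }

func sharesToCPU(shares int64) float64 { return float64(shares) / sharesPerCPU }

// roundUpToTen models the hypothetical NRI plugin from the comment above,
// rounding shares up to the next multiple of 10.
func roundUpToTen(shares int64) int64 { return ((shares + 9) / 10) * 10 }

func main() {
	requested := 1.5
	shares := cpuToShares(requested) // 1536
	mutated := roundUpToTen(shares)  // 1540
	actual := sharesToCPU(mutated)   // ~1.504
	fmt.Printf("%v CPU -> %d shares -> %d shares -> %.3f CPU\n", requested, shares, mutated, actual)
	// With the scheduler's max() logic, fit would then use 1.504, not 1.500.
}
```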


restarts. There is the possibility that a poorly timed restart could lead to a resize request being
repeated, so `UpdateContainerResources` must be idempotent.

When a resize CRI request succeeds, the pod will be marked for resync to read the latest resources. If
Member

Worth adding a section on exploring whether to actively re-assert periodically and at startup? I think we should demand that re-asserting be a low-impact operation in the already-correct case (make that the CRI's problem) and just not worry about re-asserting too often.

Contributor

I think this is worth exploring 👍

Member Author

Added it to the future enhancements section.
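A sketch of what that future enhancement could look like: at startup (or on a timer), the kubelet replays the desired resources for every container, relying on the runtime treating the already-correct case as a cheap no-op. Names and types here are hypothetical:

```go
package main

import "fmt"

// Hypothetical desired/applied state per container; not the real kubelet types.
type container struct {
	name    string
	desired string // desired resource configuration, e.g. "cpu=2,mem=1Gi"
	applied string // what the runtime currently has applied
}

type fakeRuntime struct{ updates int }

// UpdateContainerResources only acts when state differs, so re-asserting
// for every container is cheap when everything is already correct.
func (r *fakeRuntime) UpdateContainerResources(c *container) {
	if c.applied == c.desired {
		return
	}
	r.updates++
	c.applied = c.desired
}

// reassertAll replays desired resources for all containers, e.g. at kubelet
// startup or periodically.
func reassertAll(rt *fakeRuntime, containers []*container) {
	for _, c := range containers {
		rt.UpdateContainerResources(c)
	}
}

func main() {
	rt := &fakeRuntime{}
	containers := []*container{
		{name: "a", desired: "cpu=500m", applied: "cpu=500m"}, // already correct
		{name: "b", desired: "cpu=2", applied: "cpu=1"},       // stale after a restart
	}
	reassertAll(rt, containers)
	fmt.Printf("updates issued: %d; container b now %q\n", rt.updates, containers[1].applied)
}
```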

'cpu' and 'memory' as names. It supports the following restart policy values:
* NotRequired - default value; resize the Container without restart, if possible.
* RestartContainer - the container requires a restart to apply new resource values.
* `PreferNoRestart` - default value; resize the Container without restart, if possible.
Contributor

Can we clarify what "if possible" means here? Will it restart anyway, take no action, or be retried later indefinitely?

Member Author

This is going to depend somewhat on what we decide to do with memory limit decreases. I'll add a bit more detail though.
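For context, this is how the per-resource policy is attached to a container with the existing core/v1 Go types; the code below uses the released `NotRequired` constant, while `PreferNoRestart` in the diff above is the rename discussed in this KEP:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// CPU resizes apply in place; memory changes restart the container.
	c := v1.Container{
		Name: "app",
		ResizePolicy: []v1.ContainerResizePolicy{
			{ResourceName: v1.ResourceCPU, RestartPolicy: v1.NotRequired},
			{ResourceName: v1.ResourceMemory, RestartPolicy: v1.RestartContainer},
		},
	}
	fmt.Printf("%+v\n", c.ResizePolicy)
}
```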

@mrunalp
Contributor

mrunalp commented Feb 13, 2025

/approve

for sig-node. Thanks!
Will leave the final LGTM to @thockin.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2025
Member

@vinaykul vinaykul left a comment

A few comments/suggestions.
Overall LGTM.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 13, 2025
@dchen1107
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, mrunalp, tallclair, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
