KEP-1287: InPlacePodVerticalScaling changes for v1.33 #5089
Conversation
/assign @dchen1107 @thockin
/assign @vinaykul
/cc
FYI - I just changed ResizePolicy to be immutable. We no longer require it to be mutable, since we no longer default the resize policy field, and a mutable policy prevents us from being able to allow memory limit decreases for containers with a
See [`Alternatives: Allocated Resources`](#allocated-resources-1) for alternative APIs considered.

The allocated resources API should be reevaluated prior to GA.
The scheduler uses `max(spec...resources, status...allocatedResources, status...resources)` for fit
A note to myself: This is a joint decision we made together, but it could lead to resources being underutilized in the worst-case scenario.
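For illustration, a minimal sketch (not the scheduler's actual code; the helper name and values are hypothetical) of how the `max()` fit rule above can leave node capacity reserved but unused while a downward resize is still pending:

```go
// Sketch only: illustrates the max(spec, allocated, actual) fit rule described
// above, not the real scheduler implementation.
package main

import "fmt"

// cpuForFit returns the CPU (in millicores) the scheduler would reserve for a
// container: the maximum of the desired (spec), allocated, and actual values.
func cpuForFit(specMilli, allocatedMilli, actualMilli int64) int64 {
	m := specMilli
	if allocatedMilli > m {
		m = allocatedMilli
	}
	if actualMilli > m {
		m = actualMilli
	}
	return m
}

func main() {
	// A container resized down from 2000m to 500m, but the kubelet has not yet
	// actuated the change: the scheduler still reserves 2000m, so up to 1500m
	// of node capacity sits idle until the resize completes.
	fmt.Println(cpuForFit(500, 2000, 2000)) // 2000
}
```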
- Restart after checkpointing: pod goes through admission using the allocated resources
1. Kubelet creates a container
- Resources acknowledged after CreateContainer call succeeds
- Restart before acknowledgement: Kubelet issues a superfluous UpdatePodResources request
Just a note: cc/ @samuelkarp and @mikebrow to ensure the UpdatePodResources call is idempotent, since I vaguely remember there was a bug at the CRI layer on this recently.
I'm going to continue to argue that kubelet should be periodically re-asserting the resources it wants, so this absolutely needs to be idempotent :)
Also, I think we should just re-assert at startup for all pods and containers.
+1 to the above. We should address this, but it's not a blocker for this KEP.
Idempotence is mentioned under the CRI changes (and actuating resources). Resync is under future enhancements.
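For illustration, a minimal sketch (hypothetical types and helper names, not the kubelet or any real CRI runtime code) of the idempotence property being asked for: re-applying the same desired resources, whether from a retried pod sync or a superfluous call after a kubelet restart, is a no-op rather than an error.

```go
// Sketch only: a hypothetical runtime-side handler showing the idempotence
// the KEP requires of UpdateContainerResources.
package main

import "fmt"

type resources struct {
	CPUShares   int64
	MemoryLimit int64
}

type container struct {
	current resources
}

// updateContainerResources applies the desired resources; if they already
// match the current values, it returns immediately, so repeated or replayed
// calls are harmless.
func (c *container) updateContainerResources(desired resources) error {
	if c.current == desired {
		return nil // already applied; the superfluous call is safe
	}
	// ... write cgroup values here ...
	c.current = desired
	return nil
}

func main() {
	c := &container{current: resources{CPUShares: 1024, MemoryLimit: 1 << 30}}
	want := resources{CPUShares: 2048, MemoryLimit: 2 << 30}
	fmt.Println(c.updateContainerResources(want)) // applies the change
	fmt.Println(c.updateContainerResources(want)) // no-op on retry
}
```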
If a resize request does not succeed, the Kubelet will retry the resize on every subsequent pod
sync, until it succeeds or the container is terminated.

### Memory Limit Decreases
A note to us: cc/ @haircommander @mrunalp since you two are leading the conversation with the kernel community on the cgroup v2 limitation.
/lgtm overall with some minor comments. Will leave other reviews and @thockin for a final review before approving the KEP.
/approve for API
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
- `status.conditions[type==PodResizePending].type` = `"Infeasible"`
- actual CPU shares = 1638
This timeline is great. So great I think you should do more (non-blocking).
I'd love to see the situation where an NRI plugin mutates the request to round to 10 shares, just to demonstrate how actual and allocated are related. E.g. I assume the actual would show shares/1024, so a 1.5 request would translate to 1536 shares, which gets rounded to 1540, which comes back to actual as 1.504. At that point the scheduler, with its max() logic, would consider 1.504, not 1.500. Right?
Will address in a separate PR.
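For reference, a quick sketch of the shares arithmetic in the question above (using the standard request × 1024 conversion; the round-to-a-multiple-of-10 step stands in for the hypothetical NRI plugin):

```go
// Sketch only: walks through the rounding example discussed above.
// Not real scheduler or NRI code.
package main

import (
	"fmt"
	"math"
)

func cpuToShares(cpu float64) int64    { return int64(math.Round(cpu * 1024)) }
func sharesToCPU(shares int64) float64 { return float64(shares) / 1024 }

func main() {
	requested := 1.5
	shares := cpuToShares(requested)    // 1536
	rounded := ((shares + 9) / 10) * 10 // hypothetical NRI plugin rounds up to 1540
	actual := sharesToCPU(rounded)      // ≈ 1.504

	// With the max() fit rule, the scheduler would account for the larger
	// actual value (≈1.504 CPU), not the 1.5 CPU that was requested.
	fmt.Printf("shares=%d rounded=%d actual=%.3f\n", shares, rounded, actual)
}
```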
restarts. There is the possibility that a poorly timed restart could lead to a resize request being
repeated, so `UpdateContainerResources` must be idempotent.

When a resize CRI request succeeds, the pod will be marked for resync to read the latest resources. If
Worth adding a section on exploring whether to actively re-assert periodically and at start? I think we should demand that re-asserting be a low-impact operation in the already-correct case (make that the CRI's problem) and just not be worried about re-asserting too often.
I think this is worth exploring 👍
Added it to the future enhancements section.
'cpu' and 'memory' as names. It supports the following restart policy values:
* NotRequired - default value; resize the Container without restart, if possible.
* RestartContainer - the container requires a restart to apply new resource values.
* `PreferNoRestart` - default value; resize the Container without restart, if possible.
Can we clarify what "if possible" means here? Will it restart anyway, not take action, or be retried later indefinitely?
This is going to depend somewhat on what we decide to do with memory limit decreases. I'll add a bit more detail though.
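For context, a minimal sketch of how the per-resource policy discussed above is declared with today's `k8s.io/api/core/v1` types (this uses the existing `NotRequired` constant; the `PreferNoRestart` rename proposed in this PR is not assumed to exist yet, and the container name and image are placeholders):

```go
// Sketch only: a container declaring per-resource resize policies using the
// current k8s.io/api/core/v1 types.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	c := corev1.Container{
		Name:  "app",
		Image: "registry.example/app:latest", // placeholder image
		ResizePolicy: []corev1.ContainerResizePolicy{
			// CPU: resize in place without restarting the container, if possible.
			{ResourceName: corev1.ResourceCPU, RestartPolicy: corev1.NotRequired},
			// Memory: a container restart is required to apply new values.
			{ResourceName: corev1.ResourceMemory, RestartPolicy: corev1.RestartContainer},
		},
	}
	fmt.Printf("%+v\n", c.ResizePolicy)
}
```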
/approve for sig-node. Thanks!
A few comments/suggestions.
Overall LGTM.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, mrunalp, tallclair, thockin

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
InPlacePodVerticalScaling changes for v1.33, including:

- Replace the `Resize` status with pod conditions (`PodResizePending` and `PodResizing`)
- Rename `NotRequired` to `PreferNoRestart`, and update the CRI `UpdateContainerResources` contract
- `AllocatedResources` ([FG:InPlacePodVerticalScaling] Inconsistency between scheduler & kubelet admission logic kubernetes#129532)

/sig node
/milestone v1.33