keps: sig-node: initial pod overhead proposal #887
Conversation
```
  runtimeHandler:
    type: string
    Pattern: '^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)?$'
+ runtimeCpuReqOverhead:
```
For the larger audience: Do we have no way to get the quantity behavior into CRDs?
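To make the question concrete: a CRD string field can only validate the *shape* of a quantity via a pattern, whereas `resource.Quantity` also canonicalizes values (e.g. `0.25` and `250m` are the same quantity). A minimal sketch, assuming a simplified quantity grammar (the pattern and function names here are illustrative, not the real apimachinery grammar):

```go
package main

import (
	"fmt"
	"regexp"
)

// quantityPattern is a hypothetical, simplified approximation of the grammar
// accepted by resource.Quantity: a decimal number plus an optional SI or
// binary suffix. A CRD string pattern can get this far, but it cannot
// normalize "0.25" and "250m" to the same value the way Quantity does.
var quantityPattern = regexp.MustCompile(`^[0-9]+(\.[0-9]+)?(m|k|M|G|T|Ki|Mi|Gi|Ti)?$`)

func isQuantityLike(s string) bool {
	return quantityPattern.MatchString(s)
}

func main() {
	for _, s := range []string{"250m", "120Mi", "0.25", "abc"} {
		fmt.Printf("%q -> %v\n", s, isQuantityLike(s))
	}
}
```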
i have questions about eviction, vertical pod autoscaling, and if this inhibits our ability to do pod level resource requirements in the future.
> for a given runtimeClass. A mutating admission controller is introduced which will update the `Overhead`
> field in the workload's `PodSpec` to match what is provided for the selected RuntimeClass, if one is specified.

> Kubelet's creation of the pod cgroup will be calculated as the sum of container `ResourceRequirements` fields,
do we need to account for overhead in how we handle eviction decisions? right now, we evict based on usage relative to request. this may prove problematic as you will potentially want to subtract overhead from the observed usage to make a more accurate decision.
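The adjustment being asked for can be sketched as follows. This is an illustrative example, not kubelet code — all type and function names are hypothetical — showing why observed usage should arguably have the declared overhead discounted before being compared to requests:

```go
package main

import "fmt"

// podStats is a stand-in for the per-pod figures an eviction ranking
// would look at.
type podStats struct {
	usageBytes    int64 // observed pod cgroup usage
	requestBytes  int64 // sum of container memory requests
	overheadBytes int64 // declared sandbox overhead
}

// usageOverRequest reports how far the workload itself is over its
// request after subtracting the declared overhead, so a pod is not
// penalized for memory its runtime (rather than its containers) uses.
func usageOverRequest(s podStats) int64 {
	adjusted := s.usageBytes - s.overheadBytes
	if adjusted < 0 {
		adjusted = 0
	}
	return adjusted - s.requestBytes
}

func main() {
	// 300Mi observed, 200Mi requested, 120Mi of that usage is sandbox
	// overhead: after adjustment the workload is within its request.
	s := podStats{usageBytes: 300 << 20, requestBytes: 200 << 20, overheadBytes: 120 << 20}
	fmt.Println(usageOverRequest(s) < 0)
}
```

Without the subtraction, the same pod would appear 100Mi over its request and become an eviction candidate.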
I made this more explicit in the proposal.
> For scheduling, the pod resource requests are added to the container resource requests.
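The scheduling change described above amounts to adding the overhead into the fit check. A minimal sketch, with illustrative names and plain int64 millicores standing in for real scheduler types:

```go
package main

import "fmt"

// podRequest sums the per-container requests and adds the pod's
// declared sandbox overhead, as the KEP proposes for scheduling.
func podRequest(containerRequests []int64, overhead int64) int64 {
	var total int64
	for _, r := range containerRequests {
		total += r
	}
	return total + overhead
}

// fits is a stand-in for the scheduler's node fit check.
func fits(containerRequests []int64, overhead, allocatable int64) bool {
	return podRequest(containerRequests, overhead) <= allocatable
}

func main() {
	// Two containers requesting 500m CPU each fit on a node with 1000m
	// allocatable — until a 250m sandbox overhead is accounted for.
	fmt.Println(fits([]int64{500, 500}, 0, 1000))   // true
	fmt.Println(fits([]int64{500, 500}, 250, 1000)) // false
}
```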
> We don't currently enforce resource limits on the pod cgroup, but this becomes feasible once
i thought we had a flag to toggle this. pod level cgroup takes charges for empty dir memory backed volumes across container restart boundaries, so there are many potential sources of things that can get charged to the pod cgroup.
is there any concern with how a vertical pod autoscaler would work with this function? i thought the vpa would set container requests, but external tools like kubectl top will see the overhead charged to the container, and so vpa sizing may be skewed. is vpa going to be runtime class aware? it cannot use the same estimation for an image run in a normal container versus a vm-container.
@derekwaynecarr - for the pod, the overhead would be observed.
Container level statistics are gathered directly from the cgroups in the guest, and do not include any sandbox overheads.
updated to clarify scaling scenario
i was not sure if this was the case with gvisor, but i had followed up with @dchen1107 where she confirmed separately.
> ## Alternatives [optional]
>
> In order to achieve proper handling of sandbox runtimes, the scheduler/resourceQuota handling needs to take
i think other alternatives can be discussed.
alternative options would be to support pod level resource requirements rather than just overhead. i think pod level resource requirements are very useful for shared resources (hugepages, memory when doing emptyDir volumes), and it's possible doing overhead now may complicate our ability to do that later.
the benefit of overhead is a kubernetes service provider can subsidize the charge-back model potentially and eat the cost of the runtime choice, but charge the user for the cpu/memory consumed independent of runtime choice. there are pros/cons to either approach.
Pod level resource requirements makes sense to me. I was originally trying to keep this out of the scope of this KEP, but it is indeed hard to do so. If pod level resources existed, then adding pod overhead would be a small addition imo (probably just a mutating admission controller to augment based on overhead for the registered runtimeClass).
added this as a suggested alternative. TBH, though, I see this as an augmentation rather than an alternative. I think both could exist.
Force-pushed 741ecd6 to 72b5dbf
@tallclair, @derekwaynecarr PTAL? Since my proposal will also impact scheduling, would it make sense to include someone from the scheduling sig?
Signed-off-by: Eric Ernst <[email protected]>
to-be-squashed - including in history for review period only Signed-off-by: Eric Ernst <[email protected]>
...updated to align with RuntimeClass admission controller naming, as proposed in #909
```
Pod {
    Spec PodSpec {
        // Overhead is the resource overhead consumed by the Pod, not including
```
rather than the "Pod" can we say "the overhead incurred from the container runtime" ?
or just runtime
> pod specify a limit, then the sum of those limits becomes a pod-level limit, enforced through the
> pod cgroup.
>
> Users are not expected to manually set the pod resources; if a runtimeClass is being utilized,
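The pod-level limit computation quoted above can be sketched briefly. This is a hypothetical illustration (plain int64 bytes, made-up function name), assuming the rule that a pod-level limit is only enforceable when every container declares a limit, with the sandbox overhead added on top:

```go
package main

import "fmt"

// podCgroupLimit returns the enforceable pod cgroup limit: the sum of
// container limits plus the sandbox overhead. If any container leaves
// its limit unset (0 here), no pod-level limit can be enforced.
func podCgroupLimit(containerLimits []int64, overhead int64) (int64, bool) {
	var total int64
	for _, l := range containerLimits {
		if l == 0 { // an unset limit makes the pod-level limit unbounded
			return 0, false
		}
		total += l
	}
	return total + overhead, true
}

func main() {
	// Two containers limited to 256Mi each, plus a 120Mi sandbox overhead.
	if limit, ok := podCgroupLimit([]int64{256 << 20, 256 << 20}, 120<<20); ok {
		fmt.Printf("pod memory limit: %d Mi\n", limit>>20) // 632 Mi
	}
}
```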
is there a reason we would not always discard if no RuntimeClass is configured?
> In the scope of this KEP, the RuntimeClass controller will have a single job: set the pod overhead field in the
> workload's PodSpec according to the runtimeClass specified.
>
> It is expected that only the RuntimeClass controller will set Pod.Spec.Overhead. If a value is provided
i think we should strip the overhead declaration from any pod that has no associated runtime class.
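Put together, the admission behavior being discussed — set the overhead from the selected RuntimeClass, and strip any user-supplied overhead when no RuntimeClass is associated — could look roughly like this. All types, the overhead map, and its values are stand-ins for illustration, not real API objects:

```go
package main

import "fmt"

// podSpec is a stand-in for the relevant slice of the Pod API.
type podSpec struct {
	RuntimeClassName string
	OverheadMemory   int64
}

// runtimeClassOverheads stands in for overheads registered on
// RuntimeClass objects; the value is an assumed example.
var runtimeClassOverheads = map[string]int64{
	"kata": 120 << 20,
}

// admit mirrors the proposed mutating admission behavior: the controller
// is the sole writer of Overhead, and pods without a RuntimeClass have
// any user-populated overhead stripped.
func admit(p *podSpec) {
	if p.RuntimeClassName == "" {
		p.OverheadMemory = 0 // strip: no runtime class, no overhead
		return
	}
	if o, ok := runtimeClassOverheads[p.RuntimeClassName]; ok {
		p.OverheadMemory = o // overwrite whatever the user supplied
	}
}

func main() {
	p := podSpec{RuntimeClassName: "", OverheadMemory: 64 << 20}
	admit(&p)
	fmt.Println(p.OverheadMemory) // 0: user-supplied overhead was stripped
}
```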
before this moves to implementable, i would like to see graduation criteria and understand if this is a separate feature gate from the existing RuntimeClass feature gate. I would also like to prevent users from populating overhead if RuntimeClass is not associated with a pod.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: derekwaynecarr, egernst.
Sorry for not getting these comments in before this merged. Would you mind opening a follow-up PR to address these (and some of @derekwaynecarr's open concerns)?
```
@@ -0,0 +1,310 @@
---
title: KEP Template
```
please fix
```
authors:
  - "@egernst"
owning-sig: sig-node
participating-sigs:
```
sig-scheduling?
sig-autoscaling?
wg-resource-management
> status: provisional
> ---
>
> # pod overhead
nit: capitalize
> ## Table of Contents
>
> Tools for generating: https://github.com/ekalinin/github-markdown-toc
Please fix (run the tool, paste the output here - I usually drop the bullets for the title & TOC sections)
> ### Non-Goals
>
> * Making runtimeClass selections
A few more that come to mind:
- auto detecting overhead
- per-container overhead
- enforcement of overhead (pod cgroup limits) - Actually, maybe this is covered?
> ## Drawbacks [optional]
>
> This KEP introduceds further complexity, and adds a field the PodSpec which users aren't expected to modify.
typo
Suggested change:
```diff
- This KEP introduceds further complexity, and adds a field the PodSpec which users aren't expected to modify.
+ This KEP introduces further complexity, and adds a field the PodSpec which users aren't expected to modify.
```
> Even if this were to be introduced, there is a benefit in keeping the overhead separate.
> - post-pod creation handling of pod events: if runtimeClass definition is removed after a pod is created,
>   it will be very complicated to calculate which part of the pod resource requirements were associated with
>   the workloads versus the sandbox overhead.
Does that matter?
> to add a sandbox overhead when applicable.
>
> Pros:
> * no changes to the pod spec
And the user doesn't have the option of setting the field.
> Cons:
> * handling of the pod overhead is spread out across a few components
> * Not user perceptible from a workload perspective.
> * very complicated if the runtimeClass policy changes after workloads are running
It's worth noting that we don't merge the runtime Handler into the pod spec. The RuntimeClass is immutable, although it can still be deleted and recreated. We could add special handling to prevent the RuntimeClass from being deleted as long as there are pods running with it.
> runtime choice, but charge the user for the cpu/memory consumed independent of runtime choice.
>
> ### Leaving the PodSpec unchanged
I had been assuming that we would merge the overhead into the podspec, but now I'm second-guessing that...
Also, I forgot to say that this is awesome, especially as your first contribution to Kubernetes! Thanks so much for taking this project on!
Initial push for the pod overhead KEP.