KEP-4603: tune crashloopbackoff for 1.32 #4893
k8s-ci-robot merged 38 commits into kubernetes:master from
Conversation
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Skipping CI for Draft Pull Request.

Thank you thank you @soltysh for following up so much 🚀 The PRR is ready for your review in this PR. Thanks again!
/assign |
tallclair
left a comment
Very close to LGTM, it just needs the proposed API for the KubeletConfiguration. The rest are nits and non-blocking.
> kubelet as a config file or, beta as of Kubernetes 1.30, a config directory
> ([ref](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/)).
> Since this is a per-node configuration that likely will be set on a subset of
> nodes, or potentially even differently per node, it's important that it can be
I can think of 2 use cases for a heterogeneous configuration:
- Dedicated node pool for workloads that are expected to rapidly restart
- Machine size adjusted config
In either case, I'd expect this configuration to be shared among a node pool. Upstream k8s doesn't have a node pool concept, but I think we should think of this configuration as shared across a group of nodes.
Added this (and one other case) explicitly in 58df245 to clarify the position of choosing KubeletConfiguration.
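For example, a sketch of how this could be scoped to a node pool via the kubelet drop-in config directory mentioned above. The file path and the `crashloopbackoff.max` field here mirror the proposal discussed elsewhere in this PR; treat both as illustrative, not final:

```yaml
# Hypothetical drop-in applied only to nodes in a fast-restart pool,
# e.g. /etc/kubernetes/kubelet.conf.d/50-crashloop.conf
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashloopbackoff:
  max: 4
```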
> drops fields unrecognized by the current kubelet's schema, making it a good
> choice to circumvent compatibility issues with n-3 kubelets. While there is an
> argument that this could be better manipulated with a command-line flag, so
> lifecycle tooling that configures nodes can expose it more transparently, the
I don't think this argument holds weight. If we believed it, we shouldn't have added KubeletConfiguration in the first place. I don't think the backoff override is special enough that it should get hoisted up into a flag for better visibility.
That is, I agree with the decision to put it in the KubeletConfiguration rather than a flag.
Clarified the position of choosing KubeletConfiguration in 58df245
> [`client_go.Backoff.hasExpired`](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/util/flowcontrol/backoff.go#L178),
> and configure the `client_go.Backoff` object created for use by the kube runtime
> manager for container restart backoff with a function that compares to a flat
> rate of 300 seconds.
IMO 300s is too long for the backoff recovery, but I'm happy with reducing the scope of changes for the first alpha. Maybe add a beta graduation criteria to revisit this decision?
> * <<[UNRESOLVED]>>node upgrade and downgrade path <<[/UNRESOLVED]>>
> - Fix https://github.com/kubernetes/kubernetes/issues/123602 if this blocks the
>   implementation, otherwise beta criteria
> - New `int32 crashloopbackoff.max` field in `KubeletConfiguration` API, validated
Do you propose the actual API / fieldname anywhere?
Ah yeah, it was only just hidden here. Added explicitly in 1a5f314
> benchmarking is worked up, this is gated by its own feature gate,
> `ReduceDefaultCrashLoopBackoffDecay`.
>
> ### Per node config
I think this section could benefit from a TL;DR of what exactly is being proposed. You can keep the justification and discussion, but the proposal is too buried right now. This should also include the specific field name being proposed.
> - Test proving `KubeletConfiguration` objects will silently drop unrecognized
>   fields in the `config.validation_test` package
>   ([ref](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/validation/validation_test.go)).
Is this also the expected behavior when the feature gate is disabled?
Yes. I did include this comment inline here in 1515af5
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashloopbackoff:
  max: 4
```
I think this should be maxSeconds (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#units)
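If that suggestion were adopted, the configuration above might instead read as follows (illustrative only; the final field name is for the KEP authors to decide):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashloopbackoff:
  maxSeconds: 4
```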
/lgtm

I think this addresses everything @dchen1107 listed in #4604 (comment) @lauralorenz please confirm.
To fill in the context for @tallclair's question in #4893 (comment) for @dchen1107:
The previously named
This is now introduced here in the Proposal section and further explained in this Design Details section.
Manual stress tests I did to rationalize the current default suggestions are discussed here; further benchmarking and stress testing remain a graduation criterion for alpha and are described more in the benchmarking Design Details section.
This change in design now supports faster restarts of any workloads -- not limited to those without support for
The interaction with a very specific Job API feature from KEP-3329 is described here and has not changed much since 1.31, since as of 1.32 they target different restart types.
See the much more expanded section for this here and in this Appendix.
📟 Greetings @soltysh, just a friendly update that this has sig-node reviewer lgtm, if that time sequences anything regarding the PRR. @dchen1107 has it in her queue for the sig-node hard approval.
> * Runs startup probes until container started (startup probes may be more
>   image downloads) if image pull policy specifies it
>   ([ref](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/images/image_manager.go#L135)).
> * Recreates the pod sandbox and probe workers
nit: Why does the kubelet recreate the pod sandbox to restart a container within an already-running pod? cc/ @tallclair @yujuhong
I don't think kubelet does that.
The probe worker is also set up only once when the pod is added.
I also don't think it registers/unregisters the pod with the managers, as listed below...
/approve

Thanks @lauralorenz for the detailed design. Thanks @tallclair and others for the detailed review. I have focused on the potential risks this time, which have been brought up several times by the community:

1) Increased Load on Kubelet: Faster restarts mean the kubelet has to work harder and more frequently to manage pod lifecycles. This could lead to increased CPU and memory usage, potentially impacting node stability.
2) API Server Overload: Each pod restart triggers API server requests to update pod status. More frequent restarts could strain the API server, potentially affecting the entire cluster.

The KEP itself documents the risk mitigation strategies well:

cc/ @liggitt
soltysh
left a comment
Minor nit, feel free to either fix it asap (I'll put a hold) or in a follow-up (in which case drop the hold).
/approve
the PRR
> - `kubelet/kuberuntime/kuberuntime_manager_test`: **could not find a successful
This is missing, I see some reasonable data in https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit
Thanks! I will fix it now
Fixed in another branch at lauralorenz@d367cd5, will merge in as a follow-on PR because I fear losing those hard-won lgtms and approves 😅
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, lauralorenz, soltysh

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold

/unhold merging in fix for nit at lauralorenz@d367cd5 after this

FYI follow up PR for PRR nit is in #4910