
KEP-4603: tune crashloopbackoff for 1.32#4893

Merged
k8s-ci-robot merged 38 commits into kubernetes:master from lauralorenz:kep-4603-tune-crashloopbackoff-132
Oct 8, 2024

Conversation

@lauralorenz
Contributor

@lauralorenz lauralorenz commented Oct 2, 2024

  • One-line PR description: Tunable CrashLoopBackoff proposal for 1.32 based on new defaults and node level (via KubeletConfiguration) max backoff configuration
  • Other comments:

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 2, 2024
@lauralorenz
Contributor Author

Thank you thank you @soltysh for following up so much 🚀 The PRR is ready for your review in this PR. Thanks again!

@tallclair
Member

/assign
/milestone v1.32

@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 3, 2024
Member

@tallclair tallclair left a comment

Very close to LGTM, it just needs the proposed API for the KubeletConfiguration. The rest are nits and non-blocking.

kubelet as a config file or, beta as of Kubernetes 1.30, a config directory
([ref](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/)).
Since this is a per-node configuration that likely will be set on a subset of
nodes, or potentially even differently per node, it's important that it can be
Member

I can think of 2 use cases for a heterogeneous configuration:

  1. Dedicated node pool for workloads that are expected to rapidly restart
  2. Machine size adjusted config

In either case, I'd expect this configuration to be shared among a node pool. Upstream k8s doesn't have a node pool concept, but I think we should think of this configuration as shared across a group of nodes.

Contributor Author

Added this (and one other case) explicitly to clarify the position of choosing KubeletConfiguration in 58df245

drops fields unrecognized by the current kubelet's schema, making it a good
choice to circumvent compatibility issues with n-3 kubelets. While there is an
argument that this could be better manipulated with a command-line flag, so
lifecycle tooling that configures nodes can expose it more transparently, the
Member

I don't think this argument holds weight. If we believed it, we shouldn't have added KubeletConfiguration in the first place. I don't think the backoff override is special enough that it should get hoisted up into a flag for better visibility.

That is, I agree with the decision to put it in the KubeletConfiguration rather than a flag.

Contributor Author

Clarified the position of choosing KubeletConfiguration with this in 58df245

[`client_go.Backoff.hasExpired`](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/util/flowcontrol/backoff.go#L178),
and configure the `client_go.Backoff` object created for use by the kube runtime
manager for container restart backoff with a function that compares to a flat
rate of 300 seconds.
Member

IMO 300s is too long for the backoff recovery, but I'm happy with reducing the scope of changes for the first alpha. Maybe add a beta graduation criteria to revisit this decision?

Contributor Author

Added as beta criteria in 4b3835f

* <<[UNRESOLVED]>>node upgrade and downgrade path <<[/UNRESOLVED]>>
- Fix https://github.com/kubernetes/kubernetes/issues/123602 if this blocks the
implementation, otherwise beta criteria
- New `int32 crashloopbackoff.max` field in `KubeletConfiguration` API, validated
Member

Do you propose the actual API / fieldname anywhere?

Contributor Author

Ah yeah, it was only just hidden here. Added explicitly in 1a5f314

benchmarking is worked up, this is gated by its own feature gate,
`ReduceDefaultCrashLoopBackoffDecay`.

### Per node config
Member

I think this section could benefit from a TL;DR of what exactly is being proposed. You can keep the justification and discussion, but the proposal is too buried right now. This should also include the specific field name being proposed.

Contributor Author

Added in 1a5f314

Comment on lines +1136 to +1138
- Test proving `KubeletConfiguration` objects will silently drop unrecognized
fields in the `config.validation_test` package
([ref](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/validation/validation_test.go)).
Member

Is this also the expected behavior when the feature gate is disabled?

Contributor Author

Yes. I did include this comment inline here in 1515af5

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashloopbackoff:
max: 4

Contributor Author

Fixed in 220604e

@tallclair
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2024
@tallclair
Member

I think this addresses everything @dchen1107 listed in #4604 (comment)

@lauralorenz please confirm.

@lauralorenz
Contributor Author

To fill in the context to @tallclair's question in #4893 (comment) for @dchen1107

Rapid Restart Implementation

The previously named RestartPolicy: Rapid has been moved to Alternatives Considered here. If you mean the change to the default curve, that is here in the Proposal section, and the implementation is explained in more detail in this Design Details section.

Node-Level Opt-in

This is now introduced here in the Proposal section and further explained in this Design Details section.

Benchmarking and Stress Testing

Manual stress tests I did to rationalize the current default suggestions are discussed here; further benchmarking and stress testing remain graduation criteria for alpha and are described in more detail in the benchmarking Design Details section.

Interaction with Job API

This change in design now supports faster restarts for any workload -- not just those without support for RestartPolicy: Always in the Pod API, as the prior RestartPolicy: Rapid was, as described in these design slides. Explicitly supporting Jobs was added as an explicit KEP goal in this PR at fb56335#diff-ffa07b017bd73874c1a805a8492f4e83e1d334445fcf891da0be617332650427R291.

The interaction with a very specific Job API feature from KEP-3329 is described here and has not changed much since 1.31, because as of 1.32 they target different restart types (Never in KEP-3329 vs OnFailure and Always in this KEP).

Kubelet Overhead Analysis

See the much more expanded section for this here and in this Appendix.

@lauralorenz
Contributor Author

📟 Greetings @soltysh , just a friendly update that this has sig-node reviewer lgtm, if that time sequences anything regarding the PRR. @dchen1107 has it in her queue for the sig-node hard approval.

* Runs startup probes until the container has started
* Pulls the image again (which may mean more image downloads) if the image
  pull policy specifies it
  ([ref](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/images/image_manager.go#L135)).
* Recreates the pod sandbox and probe workers
Member

nit: Why does the kubelet recreate the pod sandbox when restarting a container within an already-running pod? cc/ @tallclair @yujuhong

Contributor

I don't think the kubelet does that.
The probe worker is also set up only once, when the pod is added.

Contributor

I also don't think it registers/unregisters the pod with the managers, as listed below...

@dchen1107
Member

/approve

Thanks @lauralorenz for the detailed design. Thanks @tallclair and others for the detailed review.

I have focused this time on the potential risks that have been brought up several times by the community: 1) Increased load on the kubelet: faster restarts mean the kubelet has to work harder and more frequently to manage pod lifecycles, which could lead to increased CPU and memory usage, potentially impacting node stability. 2) API server overload: each pod restart triggers API server requests to update pod status, and more frequent restarts could strain the API server, potentially affecting the entire cluster.

The KEP itself documented well the risk mitigation strategies:

  • Conservative Defaults
  • Node-Level Opt-in
  • Alpha Stage and Feature Gates
  • Extensive Testing
  • Controlled Experimentation
  • Gradual Rollout
  • Continuous Monitoring

cc/ @liggitt

Contributor

@soltysh soltysh left a comment

Minor nit, feel free to either fix it asap (I'll put a hold) or in a follow-up (in which case drop the hold).

/approve
the PRR



- `kubelet/kuberuntime/kuberuntime_manager_test`: **could not find a successful
Contributor

This is missing, I see some reasonable data in https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit

Contributor Author

Thanks! I will fix it now

Contributor Author

Fixed in another branch at lauralorenz@d367cd5; will merge it as a follow-on PR because I fear losing those hard-won lgtms and approves 😅

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, lauralorenz, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2024
@soltysh
Contributor

soltysh commented Oct 8, 2024

/hold
to fix the nit

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2024
@lauralorenz
Contributor Author

/unhold

merging in fix for nit at lauralorenz@d367cd5 after this

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit 72e6683 into kubernetes:master Oct 8, 2024
@lauralorenz
Contributor Author

FYI follow up PR for PRR nit is in #4910


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
