CORE-8616 redpanda: configurable sleep on crash loop #24787
Force-pushed from daac090 to 2e85fd8: fix "node failed to stop" test failures
Could you describe in the cover letter how the following interact:
- the duration of the sleep
- various k8s timeouts (e.g. liveness and other probes)
- where the sleep occurs in startup (e.g. does the admin api start up)
Force-pushed from 2e85fd8 to 0960e17: address code review suggestions
@dotnwat Thanks, I've expanded the cover letter to include these. Let me know if you still have any unanswered questions.
This introduces the node config `crash_loop_sleep`. When redpanda detects that it has reached the crash loop limit, instead of terminating immediately, it sleeps for `crash_loop_sleep` seconds before terminating. This is especially useful in a Kubernetes environment, where setting this value allows customers to have ssh access into a crash looping pod for a short window of time.
It also enables debug-level logging for the main logger (which the crash loop limiter uses) for easier debugging.
Force-pushed from 0960e17 to 3192708: rename
nit: cover letter config name needs an update s/secs/sec
That's great, thank you. I appreciate it!
lgtm from docs
/backport v24.3.x
/backport v24.2.x
/backport v24.1.x
Failed to create a backport PR to v24.2.x branch. I tried:
|
Failed to create a backport PR to v24.1.x branch. I tried:
|
When redpanda detects that it has reached the crash loop limit, instead of terminating immediately, it now sleeps for `crash_loop_sleep_sec` seconds before terminating. This sleep occurs early during startup, before most components (e.g., storage, controller, admin API) are initialised.
`crash_loop_sleep_sec` is a node config. It is disabled by default in redpanda, but we plan to enable it by default in the Redpanda Kubernetes Helm Chart shortly.
This config is most useful in Kubernetes environments, where setting this value allows customers to have ssh access into a crash looping pod for a short window of time.
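
As a rough illustration of the behaviour described above (not the code in this PR), the decision can be modelled in a few lines of Python. Everything here except the name `crash_loop_sleep_sec` is hypothetical: the function name, the `crash_loop_limit` parameter, the injected `sleep` callable, and the choice of `>=` as the meaning of "reached the limit" are all assumptions made for the sketch.

```python
import time
from typing import Callable, Optional


def handle_crash_loop(
    consecutive_crashes: int,
    crash_loop_limit: Optional[int],
    crash_loop_sleep_sec: Optional[int],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if startup should be aborted because the limit was reached.

    Hypothetical model only. The `>=` comparison is an assumption about
    what "reached the crash loop limit" means.
    """
    if crash_loop_limit is None or consecutive_crashes < crash_loop_limit:
        # Limit disabled or not yet reached: continue normal startup.
        return False
    if crash_loop_sleep_sec:
        # None (or 0) models the "disabled by default" case.
        # The sleep happens before storage, controller, admin API, etc.
        # would be initialised, leaving a window to inspect the node.
        sleep(crash_loop_sleep_sec)
    # The caller is expected to terminate the process.
    return True
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without actually waiting, for example `handle_crash_loop(5, 5, 60, sleep=print)`.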
Note that it may be pointless (though not harmful) to set `crash_loop_sleep_sec` to a value larger than the timeout specified in the Kubernetes `startupProbe` or `livenessProbe`. While the redpanda process is sleeping for `crash_loop_sleep_sec`, Kubernetes thinks that the pod is still starting up slowly. During this time Kubernetes assumes that the pod has not failed but is also not healthy (the admin API is not up, it hasn't joined the cluster, etc.). Therefore, if `crash_loop_sleep_sec` is larger than the configured `startupProbe` or `livenessProbe` timeout, then Kubernetes will kill the pod after the configured `startupProbe`/`livenessProbe` expires, before the full `crash_loop_sleep_sec` could elapse.
The Redpanda Helm Chart currently specifies a `startupProbe` with a timeout of `120s` by default, therefore it is recommended to set `crash_loop_sleep_sec` to a value below that.
Fixes https://redpandadata.atlassian.net/browse/CORE-8616
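
The probe interaction described in the cover letter comes down to simple arithmetic, which the following Python sketch tries to make concrete. The function names are hypothetical, the `120` default is the Helm Chart `startupProbe` timeout quoted in the cover letter, and the model deliberately ignores any time the process spends before the sleep begins, so treat the numbers as an upper bound rather than an exact prediction.

```python
def effective_sleep_window(crash_loop_sleep_sec: int,
                           probe_timeout_sec: int = 120) -> int:
    """Approximate seconds the pod stays up while sleeping.

    Assumes Kubernetes kills the pod once the probe timeout expires,
    so the usable window is capped by the smaller of the two values.
    """
    return min(crash_loop_sleep_sec, probe_timeout_sec)


def sleep_is_truncated(crash_loop_sleep_sec: int,
                       probe_timeout_sec: int = 120) -> bool:
    """True when the configured sleep cannot fully elapse before the kill."""
    return crash_loop_sleep_sec > probe_timeout_sec


# A 60 s sleep fits inside the assumed 120 s probe timeout; a 300 s sleep does not.
print(effective_sleep_window(60), sleep_is_truncated(60))
print(effective_sleep_window(300), sleep_is_truncated(300))
```

Under these assumptions, setting the sleep above the probe timeout buys no extra debugging time, which is why a value below the probe timeout is recommended.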
cc @mmaslankaprv @dotnwat @travisdowns for visibility
Backports Required
Release Notes
Features
Adds the node config `crash_loop_sleep_sec`, which sets the time the broker sleeps before terminating the process when the limit on the number of consecutive times a broker can crash has been reached. This is most useful in Kubernetes environments, where setting this value allows customers to have ssh access into a crash looping pod for a short window of time.