CORE-8616 redpanda: configurable sleep on crash loop #24787

pgellert · 2025-01-13T16:12:40Z

When redpanda detects that it has reached the crash loop limit, it now sleeps for crash_loop_sleep_sec seconds before terminating, instead of terminating immediately. The sleep happens early in startup, before most components (storage, controller, admin API, and so on) are initialised.
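
As a rough illustration of the behaviour described above, the following Python sketch models the decision made at startup. It is not the code in src/v/redpanda/application.cc: the function name `check_crash_loop`, its parameters, and the exact comparison against the limit are assumptions made for this example.

```python
import time
from typing import Callable, Optional


def check_crash_loop(
    consecutive_crashes: int,
    crash_loop_limit: Optional[int],
    crash_loop_sleep_sec: Optional[int],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if startup may continue, False if the process should exit.

    Hypothetical model: the real check runs before storage, the controller
    and the admin API are started.
    """
    # No limit configured, or limit not exceeded: start up normally.
    # (Whether the real comparison is strict is an assumption here.)
    if crash_loop_limit is None or consecutive_crashes <= crash_loop_limit:
        return True

    # Limit reached. If a sleep is configured, hold the process open so an
    # operator has a window to inspect the node before it terminates.
    if crash_loop_sleep_sec:
        sleep(crash_loop_sleep_sec)

    # The outcome does not change: the process still terminates afterwards.
    return False
```

The `sleep` parameter is injectable only so the sketch can be exercised without actually waiting; leaving `crash_loop_sleep_sec` unset models the default, where the process terminates immediately.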

crash_loop_sleep_sec is a node config. It is disabled by default in redpanda, but we plan to enable it by default in the Redpanda Kubernetes Helm Chart shortly.

This config is most useful in Kubernetes environments, where setting it gives customers a short window of time in which they can ssh into a crash-looping pod.

Note that it is pointless (though not harmful) to set crash_loop_sleep_sec to a value larger than the timeout specified in the Kubernetes startupProbe or livenessProbe. While the redpanda process is sleeping for crash_loop_sleep_sec, Kubernetes treats the pod as still starting up: it has not failed, but it is not healthy either (the admin API is not up, the broker hasn't joined the cluster, etc.). Therefore, if crash_loop_sleep_sec is larger than the configured startupProbe or livenessProbe timeout, Kubernetes kills the pod when that probe expires, before the full crash_loop_sleep_sec has elapsed.

The Redpanda Helm Chart currently specifies a startupProbe with a timeout of 120s by default, so it is recommended to set crash_loop_sleep_sec to a value below that.
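
To make the interaction with the probe concrete, the sketch below computes an upper bound on how long the sleep can last before Kubernetes kills the pod. The helper `effective_sleep_sec` and its parameter names are hypothetical; the 120s default mirrors the Helm Chart startupProbe timeout mentioned above.

```python
def effective_sleep_sec(crash_loop_sleep_sec: int,
                        probe_timeout_sec: int = 120) -> int:
    """Upper bound on the sleep window before the probe kills the pod.

    Hypothetical helper: the real window is slightly shorter, because the
    probe clock starts before redpanda reaches the crash loop check.
    """
    # Kubernetes kills the pod once the probe expires, so any sleep beyond
    # the probe timeout is never observed.
    return min(crash_loop_sleep_sec, probe_timeout_sec)


# A 300s sleep is cut short by the 120s probe; a 60s sleep runs in full.
print(effective_sleep_sec(300))
print(effective_sleep_sec(60))
```

Under these assumptions, any value at or above the probe timeout yields the same window, which is why a value below 120s is the practical choice with the chart defaults.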

Fixes https://redpandadata.atlassian.net/browse/CORE-8616

cc @mmaslankaprv @dotnwat @travisdowns for visibility

Backports Required

Release Notes

Features

Introduces the node config crash_loop_sleep_sec, which sets how long the broker sleeps before terminating the process once it has reached the limit on the number of consecutive crashes. This is most useful in Kubernetes environments, where setting this value gives customers a short window of time in which they can ssh into a crash-looping pod.

pgellert · 2025-01-13T19:02:32Z

force-push: fix "node failed to stop" test failures

src/v/config/node_config.cc

src/v/redpanda/application.cc

tests/rptest/tests/crash_loop_checks_test.py

src/v/redpanda/application.cc

vbotbuildovich · 2025-01-13T22:23:36Z

Retry command for Build#60661

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_chunks

vbotbuildovich · 2025-01-13T22:58:07Z

CI test results

test results on build#60661

test_id	test_kind	job_url	test_status	passed
rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks	ducktape	https://buildkite.com/redpanda/redpanda/builds/60661#01946164-eb90-4638-8325-53b856f9d15a	FAIL	0/1
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60661#01946166-1c99-4310-9e21-1aeb60133f18	FLAKY	3/6

test results on build#60694

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60694#01946419-991e-49e3-8795-0f6f26ba2d9c	FLAKY	5/6

test results on build#60725

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60725#01946644-061f-44eb-adc8-9b738b0af548	FLAKY	5/6
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60725#01946644-061e-4821-9008-fd73aa3f5424	FLAKY	4/6

dotnwat

Could you mention a bit in the cover letter the interaction between:

the duration of the sleep
various k8s timeouts (e.g. liveness and other probes)
where the sleep occurs in startup (e.g. does the admin api start up)

src/v/config/node_config.cc

pgellert · 2025-01-14T10:39:07Z

force-push: address code review suggestions

pgellert · 2025-01-14T11:08:59Z

Could you mention a bit in the cover letter the interaction between:

the duration of the sleep

various k8s timeouts (e.g. liveness and other probes)

where the sleep occurs in startup (e.g. does the admin api start up)

@dotnwat Thanks, I've expanded the cover letter now to include these. Let me know if you still have any unanswered questions.

src/v/config/node_config.cc

src/v/config/node_config.h

src/v/redpanda/application.cc

tests/rptest/tests/crash_loop_checks_test.py

This introduces the node config `crash_loop_sleep_sec`. When redpanda detects that it has reached the crash loop limit, instead of terminating immediately, it sleeps for `crash_loop_sleep_sec` seconds before terminating. This is especially useful in a Kubernetes environment, where setting this value gives customers a short window of time in which they can ssh into a crash-looping pod.

It also enables debug-level logging for the main logger (which the crash loop limiter uses) for easier debugging.

pgellert · 2025-01-14T18:06:51Z

force-push: rename crash_loop_sleep_secs -> crash_loop_sleep_sec

bharathv

nit: cover letter config name needs an update s/secs/sec

dotnwat · 2025-01-14T20:39:34Z

Could you mention a bit in the cover letter the interaction between:

the duration of the sleep

various k8s timeouts (e.g. liveness and other probes)

where the sleep occurs in startup (e.g. does the admin api start up)

@dotnwat Thanks, I've expanded the cover letter now to include these. Let me know if you still have any unanswered questions.

That's great. Thank you, I appreciate it!

Deflaimun

lgtm from docs

vbotbuildovich · 2025-01-15T17:29:19Z

/backport v24.3.x

vbotbuildovich · 2025-01-15T17:29:20Z

/backport v24.2.x

vbotbuildovich · 2025-01-15T17:29:21Z

/backport v24.1.x

vbotbuildovich · 2025-01-15T17:30:29Z

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24787-v24.2.x-866 remotes/upstream/v24.2.x
git cherry-pick -x 31e03495f0 319270895d

Workflow run logs.

vbotbuildovich · 2025-01-15T17:30:34Z

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24787-v24.1.x-307 remotes/upstream/v24.1.x
git cherry-pick -x 31e03495f0 319270895d

Workflow run logs.

pgellert requested review from bharathv and a team January 13, 2025 16:12

pgellert self-assigned this Jan 13, 2025

pgellert requested a review from a team as a code owner January 13, 2025 16:12

pgellert requested review from michael-redpanda and removed request for a team January 13, 2025 16:12

github-actions bot added the area/redpanda label Jan 13, 2025

pgellert force-pushed the crashlog/crash-loop-sleep branch from daac090 to 2e85fd8 Compare January 13, 2025 19:00

bharathv reviewed Jan 13, 2025

dotnwat reviewed Jan 14, 2025

micheleRP reviewed Jan 14, 2025

src/v/config/node_config.cc

pgellert force-pushed the crashlog/crash-loop-sleep branch from 2e85fd8 to 0960e17 Compare January 14, 2025 08:02

pgellert requested review from micheleRP, dotnwat and bharathv January 14, 2025 11:09

michael-redpanda previously approved these changes Jan 14, 2025
