Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8616 redpanda: configurable sleep on crash loop #24787

Merged
merged 2 commits into from
Jan 15, 2025

Conversation

pgellert
Copy link
Contributor

@pgellert pgellert commented Jan 13, 2025

When redpanda detects that it has reached the crash loop limit, instead of terminating immediately, it now sleeps for crash_loop_sleep_sec seconds before terminating. This sleeping occurs early on during startup before most components (e.g., storage, controller, admin API, etc.) are initialised.

crash_loop_sleep_sec is a node config. It is disabled by default in redpanda, but we plan to enable it by default in the Redpanda Kubernetes Helm Chart shortly.

This config is most useful in Kubernetes environments where setting this value allows customers to have ssh access into a crash looping pod for a short window of time.

Note that it may be pointless (though not harmful) to set crash_loop_sleep_sec to a value larger than the timeout specified in the Kubernetes startupProbe or livenessProbe. While the redpanda process is sleeping for crash_loop_sleep_sec, Kubernetes thinks that the pod is still starting up slowly. During this time Kubernetes assumes that the pod has not failed but is also not healthy (the admin API is not up, it hasn't joined the cluster, etc.). Therefore, if crash_loop_sleep_sec is larger than the configured startupProbe or livenessProbe timeout then Kubernetes will kill the pod after the configured startupProbe/livenessProbe expires before the full crash_loop_sleep_sec could elapse.

The Redpanda Helm Chart currently specifies a startupProbe with a timeout of 120s by default, therefore it is recommended to set crash_loop_sleep_sec to a value below that.

Fixes https://redpandadata.atlassian.net/browse/CORE-8616

cc @mmaslankaprv @dotnwat @travisdowns for visibility

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Features

  • Introduces the node config crash_loop_sleep_sec, which sets the time the broker sleeps before terminating the process when the limit on the number of consecutive times a broker can crash has been reached. This is most useful in Kubernetes environments where setting this value allows customers to have ssh access into a crash looping pod for a short window of time.

@pgellert pgellert requested review from bharathv and a team January 13, 2025 16:12
@pgellert pgellert self-assigned this Jan 13, 2025
@pgellert pgellert requested a review from a team as a code owner January 13, 2025 16:12
@pgellert pgellert requested review from michael-redpanda and removed request for a team January 13, 2025 16:12
@pgellert pgellert force-pushed the crashlog/crash-loop-sleep branch from daac090 to 2e85fd8 Compare January 13, 2025 19:00
@pgellert
Copy link
Contributor Author

force-push: fix "node failed to stop" test failures

src/v/config/node_config.cc Outdated Show resolved Hide resolved
src/v/config/node_config.cc Outdated Show resolved Hide resolved
src/v/redpanda/application.cc Show resolved Hide resolved
tests/rptest/tests/crash_loop_checks_test.py Outdated Show resolved Hide resolved
src/v/redpanda/application.cc Outdated Show resolved Hide resolved
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#60661

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_chunks

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jan 13, 2025

CI test results

test results on build#60661
test_id test_kind job_url test_status passed
rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_chunks ducktape https://buildkite.com/redpanda/redpanda/builds/60661#01946164-eb90-4638-8325-53b856f9d15a FAIL 0/1
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli ducktape https://buildkite.com/redpanda/redpanda/builds/60661#01946166-1c99-4310-9e21-1aeb60133f18 FLAKY 3/6
test results on build#60694
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60694#01946419-991e-49e3-8795-0f6f26ba2d9c FLAKY 5/6
test results on build#60725
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60725#01946644-061f-44eb-adc8-9b738b0af548 FLAKY 5/6
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli ducktape https://buildkite.com/redpanda/redpanda/builds/60725#01946644-061e-4821-9008-fd73aa3f5424 FLAKY 4/6

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you mention a bit in the coverletter the interaction between :

  1. the duration of the sleep
  2. various k8s timeouts (e.g. liveness and other probes)
  3. where the sleep occurs in startup (e.g. does the admin api start up)

@pgellert pgellert force-pushed the crashlog/crash-loop-sleep branch from 2e85fd8 to 0960e17 Compare January 14, 2025 08:02
@pgellert
Copy link
Contributor Author

force-push: address code review suggestions

@pgellert
Copy link
Contributor Author

Could you mention a bit in the coverletter the interaction between :

  1. the duration of the sleep
  2. various k8s timeouts (e.g. liveness and other probes)
  3. where the sleep occurs in startup (e.g. does the admin api start up)

@dotnwat Thanks, I've expanded the cover letter now to include these. Let me know if you still have any unanswered questions.

This introduces the node config `crash_loop_sleep`.

When redpanda detects that it reached the crash loop limit, instead of
terminating immediately, it should sleep for `crash_loop_sleep` seconds
before terminating.

This is useful especially in a Kubernetes environment where setting this
value allows customers to have ssh access into a crashlooping pod for a
short window of time.
It also enabled debug-level logging for the main logger (which the crash
loop limiter uses) for easier debugging.
@pgellert pgellert force-pushed the crashlog/crash-loop-sleep branch from 0960e17 to 3192708 Compare January 14, 2025 18:06
@pgellert
Copy link
Contributor Author

force-push: rename crash_loop_sleep_secs -> crash_loop_sleep_sec

@pgellert pgellert requested a review from Deflaimun January 14, 2025 18:06
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cover letter config name needs an update s/secs/sec

@dotnwat
Copy link
Member

dotnwat commented Jan 14, 2025

Could you mention a bit in the coverletter the interaction between :

  1. the duration of the sleep
  2. various k8s timeouts (e.g. liveness and other probes)
  3. where the sleep occurs in startup (e.g. does the admin api start up)

@dotnwat Thanks, I've expanded the cover letter now to include these. Let me know if you still have any unanswered questions.

that's great. thank you I appreciate it!

Copy link
Contributor

@Deflaimun Deflaimun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm from docs

@pgellert pgellert merged commit 3e19fa1 into redpanda-data:dev Jan 15, 2025
19 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.3.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24787-v24.2.x-866 remotes/upstream/v24.2.x
git cherry-pick -x 31e03495f0 319270895d

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24787-v24.1.x-307 remotes/upstream/v24.1.x
git cherry-pick -x 31e03495f0 319270895d

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants