Skip to content

Conversation

@stoksc
Copy link
Contributor

@stoksc stoksc commented Sep 10, 2025

This PR adds a watchdog thread to prevent hot reload operations from hanging indefinitely. When hot reload starts, a separate watchdog thread is spawned that sleeps for a configurable timeout (default 5 minutes, configurable via Hot_Reload.Timeout). If the reload completes, the watchdog is cancelled; if not, it calls abort() to crash the process rather than hang forever.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Hot-reload watchdog to prevent hangs during configuration reloads.
    • Configurable timeout "Hot_Reload.Timeout" (default 300s; set <=0 to disable).
    • Improved logging for watchdog start, cancellation, and timeout events.
    • Optional dummy input option to simulate hangs for watchdog testing (opt-in, default off).
  • Tests

    • Internal test validating watchdog timeout behavior during hot reload.

@coderabbitai
Copy link

coderabbitai bot commented Sep 10, 2025

Warning

Rate limit exceeded

@stoksc has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 18 minutes and 11 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 1e5ca24 and 8787fc5.

📒 Files selected for processing (6)
  • include/fluent-bit/flb_config.h (2 hunks)
  • plugins/in_dummy/in_dummy.c (4 hunks)
  • plugins/in_dummy/in_dummy.h (1 hunks)
  • src/flb_config.c (2 hunks)
  • src/flb_reload.c (12 hunks)
  • tests/internal/reload.c (2 hunks)

Walkthrough

Adds a per-reload watchdog thread configurable via Hot_Reload.Timeout, integrates it into flb_reload to abort the process on timeout, exposes the timeout in flb_config (default 300s), adds a test-only hang flag to the dummy input, and introduces a unit test that expects SIGABRT when the watchdog fires.

Changes

Cohort / File(s) Summary
Config struct & mapping
include/fluent-bit/flb_config.h, src/flb_config.c
Adds int hot_reload_watchdog_timeout_seconds to struct flb_config, defines FLB_CONF_STR_HOT_RELOAD_TIMEOUT ("Hot_Reload.Timeout"), registers the key in service_configs (INT) and initializes default (300 seconds).
Reload watchdog implementation
src/flb_reload.c
Adds internal watchdog context and thread (hot_reload_watchdog_thread), start/cleanup helpers (flb_reload_watchdog_start, flb_reload_watchdog_cleanup), starts watchdog at reload begin, cancels/cleans up on all success/failure paths, logs events, and calls abort() on timeout.
Dummy plugin test hook
plugins/in_dummy/in_dummy.h, plugins/in_dummy/in_dummy.c
Adds test_hang_on_exit to struct flb_dummy (guarded by FLB_TESTS_INTERNAL), initializes it, exposes it in the plugin config_map (FLB_CONFIG_MAP_BOOL, default false), and when enabled simulates a long sleep in in_dummy_exit to test the watchdog.
Internal test
tests/internal/reload.c
Adds test_reload_watchdog_timeout which forks a child, configures hot-reload with a 2s watchdog and the hanging dummy input, triggers reload and expects the child to abort (SIGABRT); registers the test and adds required OS headers.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester as Test harness
  participant Reload as flb_reload()
  participant WD as HotReloadWatchdog
  participant Dummy as in_dummy_exit()
  participant Proc as Process

  Tester->>Reload: trigger hot reload
  Reload->>WD: start watchdog (timeout N sec)
  Note right of WD: sleeps for configured timeout\non expiry -> log + abort()

  rect rgba(230,245,255,0.6)
    Reload->>Reload: reconstruct config, load plugins, validate
    Reload->>Dummy: call exit/cleanup (may hang if test flag set)
  end

  alt reload completes before timeout
    Reload->>WD: cancel & join watchdog
    Reload->>Tester: success
  else timeout reached
    WD-->>Proc: abort() -> SIGABRT
    Tester->>Tester: detect SIGABRT (test passes)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

backport to v4.0.x

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "reload: add a timeout mechanism" is concise and accurately summarizes the primary change in the PR — adding a timeout/watchdog for reload operations. It is specific to the feature (reload timeout) and avoids extraneous detail, making it clear for a reviewer scanning history. This aligns with the stated guidelines for a short, single-sentence PR title.

Poem

I twitch my whiskers at the thread,
A tiny watchdog guards the night,
If dummy sleeps and time runs out,
I tap the bell — the process shouts,
The rabbit hops; the test is right. 🐇

✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
plugins/in_dummy/in_dummy.h (1)

52-52: Prefer a boolean type and gate test-only knob

  • Use a boolean type (flb_bool or bool) for clarity.
  • Consider gating this testing-only field behind a compile-time guard (e.g., FLB_TESTS) to avoid shipping a production-visible knob, or rename to underscore-prefixed to discourage use.

Example:

-    int test_hang_on_collect;
+    flb_bool test_hang_on_collect;

If keeping it public, ensure the config map and docs clearly mark it as test-only.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c377f6 and 78d0327.

📒 Files selected for processing (4)
  • plugins/in_dummy/in_dummy.c (3 hunks)
  • plugins/in_dummy/in_dummy.h (1 hunks)
  • src/flb_reload.c (10 hunks)
  • tests/internal/reload.c (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugins/in_dummy/in_dummy.c
🧰 Additional context used
🧬 Code graph analysis (1)
tests/internal/reload.c (4)
src/config_format/flb_config_format.c (3)
  • flb_cf_create (104-149)
  • flb_cf_section_create (602-684)
  • flb_cf_section_property_add (264-326)
src/flb_lib.c (2)
  • flb_create (138-220)
  • flb_start (914-925)
src/flb_reload.c (2)
  • flb_reload_reconstruct_cf (308-344)
  • flb_reload (425-614)
src/flb_config.c (1)
  • flb_config_load_config_format (937-1042)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (1)
src/flb_reload.c (1)

41-44: Include <time.h> and <errno.h> for pthread_cond_timedwait

pthread_cond_timedwait requires struct timespec (time.h) and error codes like ETIMEDOUT (errno.h). Confirm Windows builds still pass via CI.

 #include <pthread.h>
 #include <unistd.h>
 #include <stdlib.h>
+#include <time.h>
+#include <errno.h>

Comment on lines 14 to 19
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Wrap POSIX-only headers for Windows builds; add missing headers

Fork/wait/alarm are not available on Windows. Guard these includes and add errno/time for the timeout loop.

-#include <signal.h>
-#include <sys/wait.h>
-#include <unistd.h>
+#ifndef FLB_SYSTEM_WINDOWS
+#include <signal.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <errno.h>
+#include <time.h>
+#endif
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>
#ifndef FLB_SYSTEM_WINDOWS
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
#endif
🤖 Prompt for AI Agents
In tests/internal/reload.c around lines 14 to 17, the POSIX-only headers
<signal.h>, <sys/wait.h>, and <unistd.h> must be guarded for Windows builds and
the timeout loop needs missing headers; wrap those three includes in an #ifndef
_WIN32 / #endif so they are only included on non-Windows, and add #include
<errno.h> and #include <time.h> (unconditional or also guarded if only used on
POSIX) so the timeout/errno usage compiles and behaves correctly on Windows and
POSIX systems.

{ "reconstruct_cf" , test_reconstruct_cf},
{ "reload" , test_reload},
{ "reload_yaml" , test_reload_yaml},
{ "reload_watchdog_timeout", test_reload_watchdog_timeout},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Conditionally register the POSIX-only test

Ensure test list entry is compiled out on Windows.

-    { "reload_watchdog_timeout", test_reload_watchdog_timeout},
+#ifndef FLB_SYSTEM_WINDOWS
+    { "reload_watchdog_timeout", test_reload_watchdog_timeout},
+#endif
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{ "reload_watchdog_timeout", test_reload_watchdog_timeout},
#ifndef FLB_SYSTEM_WINDOWS
{ "reload_watchdog_timeout", test_reload_watchdog_timeout},
#endif
🤖 Prompt for AI Agents
In tests/internal/reload.c around line 381, the POSIX-only test entry "{
\"reload_watchdog_timeout\", test_reload_watchdog_timeout}," must be compiled
out on Windows; wrap this test list entry in a platform guard (e.g. #if
!defined(_WIN32) ... #endif or an equivalent POSIX-only check like #if
defined(__unix__) || defined(__APPLE__)) so the entry is omitted when building
on Windows.

Copy link
Contributor

@pwhelan pwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now I just had two comments on the PR.

src/flb_reload.c Outdated
static void flb_reload_cancel_watchdog(struct flb_config *config)
{
if (config->hot_reload_watchdog_active) {
config->hot_reload_watchdog_active = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential race condition here, where we still abort after checking and before calling pthread_cancel. I guess that's fine tho, it'll occur when the reload takes almost as much time as hot_reload_timeout_seconds, and will result in an extra restart.

The pthread lib does have a mutex type if we want to fix this https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IF we do this, we can remove the sleep and instead use pthread_cond_timedwait in hot_reload_watchdog_thread https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_timedwait.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didnt think this race was import enough to get complicated; the side effect seemed like nbd considering the chances.

but i've basically rewritten this without the active check though and depend on join to stop the process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new stuff still technically has the same issue though. i'll sleep on if it is worth fixing. want to keep it as simple as possible tbh.

src/flb_reload.c Outdated
ret = pthread_create(&old_config->hot_reload_watchdog_tid, NULL,
hot_reload_watchdog_thread, (void *)old_config);
if (ret != 0) {
flb_error("[reload] Failed to create hot reload watchdog thread: %d", ret);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could make the argument that if we cannot create the watchdog thread we should abort early. I can see the merits of trying to continue the hot reload tho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's true. close call for me. something is probably very wrong already if we can't make it.

return -1;
}

if (ctx->test_hang_on_collect) {
Copy link
Contributor

@ThomasBrady ThomasBrady Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] maybe guard this logic by a TESTONLY compiler directive. Idk what the convention in FB is for testonly but something like #define TEST_HANG_ON_COLLECT in the test file then #ifdef TEST_HANG_ON_COLLECT here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure the best approach for something like this. i'll ask for input before i change but i agree.

@edsiper
Copy link
Member

edsiper commented Sep 12, 2025

what are the thoughts around a parent process instead ?

@stoksc
Copy link
Contributor Author

stoksc commented Sep 15, 2025

I think it is an interesting idea long term. I like the solution I have because it is mostly purely additive so it is easier for me to reason about its correctness separate of the correctness of the rest of the system. I have more confidence it will fix the immediate issue I'm aware of without causing any regressions in existing functionality. wdyt @edsiper?

@stoksc
Copy link
Contributor Author

stoksc commented Sep 17, 2025

@edsiper any updates here?

src/flb_config.c Outdated
config->shutdown_by_hot_reloading = FLB_FALSE;
config->hot_reloading = FLB_FALSE;
config->hot_reload_succeeded = FLB_FALSE;
config->hot_reload_watchdog_timeout_seconds = 300;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not enable this by default, and only start the reload watchdog IF it has been set manually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

int timeout_seconds;
};

static void *hot_reload_watchdog_thread(void *arg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function should only called if the reload watchdog timer is enabled manually, the user must be aware about an explicit abort() in case Fluent Bit is being used in library mode and can generate other side effects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. I'm structured the code so that we return early without starting the thread if the feature is disabled and disabled it by default.

src/flb_reload.c Outdated

watchdog_ctx = flb_malloc(sizeof(struct flb_reload_watchdog_ctx));
if (!watchdog_ctx) {
flb_error("[reload] Failed to allocate memory for watchdog context");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use flb_errno()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks, didn't know of it.

- Add hot_reload_watchdog_timeout_seconds field to flb_config struct
- Add FLB_CONF_STR_HOT_RELOAD_TIMEOUT configuration string
- Add parsing logic in service_configs array
- Set default timeout to 300 seconds in flb_config_init

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
- Add watchdog thread that monitors hot reload duration
- Thread sleeps for configured timeout then aborts if reload hasn't completed
- Watchdog is started at beginning of reload and cancelled on completion
- Uses pthread_cancel with async cancellation for immediate response
- Properly cleans up thread resources with pthread_join

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
- Add test_hang_on_collect configuration option
- When enabled, dummy plugin hangs during collect to simulate stuck reload
- Used only for testing hot reload watchdog functionality

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
- Add test_reload_watchdog_timeout to verify watchdog functionality
- Test forks a child process that triggers reload with hanging dummy input
- Verifies that watchdog aborts the process after timeout
- Checks that child terminates with SIGABRT signal

Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
@stoksc stoksc force-pushed the stoksc/hotreload-watchdog-thread branch from 8787fc5 to 80b1579 Compare September 18, 2025 17:42
@edsiper edsiper added this to the Fluent Bit v4.1 milestone Sep 19, 2025
@edsiper edsiper merged commit 20b895f into fluent:master Sep 19, 2025
100 of 105 checks passed
@edsiper
Copy link
Member

edsiper commented Sep 19, 2025

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants