-
Notifications
You must be signed in to change notification settings - Fork 1.9k
reload: add a timeout mechanism #10869
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reload: add a timeout mechanism #10869
Conversation
|
Warning Rate limit exceeded@stoksc has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 18 minutes and 11 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (6)
WalkthroughAdds a per-reload watchdog thread configurable via Hot_Reload.Timeout, integrates it into flb_reload to abort the process on timeout, exposes the timeout in flb_config (default 300s), adds a test-only hang flag to the dummy input, and introduces a unit test that expects SIGABRT when the watchdog fires. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Tester as Test harness
participant Reload as flb_reload()
participant WD as HotReloadWatchdog
participant Dummy as in_dummy_exit()
participant Proc as Process
Tester->>Reload: trigger hot reload
Reload->>WD: start watchdog (timeout N sec)
Note right of WD: sleeps for configured timeout\non expiry -> log + abort()
rect rgba(230,245,255,0.6)
Reload->>Reload: reconstruct config, load plugins, validate
Reload->>Dummy: call exit/cleanup (may hang if test flag set)
end
alt reload completes before timeout
Reload->>WD: cancel & join watchdog
Reload->>Tester: success
else timeout reached
WD-->>Proc: abort() -> SIGABRT
Tester->>Tester: detect SIGABRT (test passes)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
Pre-merge checks (2 passed, 1 warning)❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Poem
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
4c377f6 to
68f79e1
Compare
68f79e1 to
78d0327
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Nitpick comments (1)
plugins/in_dummy/in_dummy.h (1)
52-52: Prefer a boolean type and gate test-only knob
- Use a boolean type (
flb_boolorbool) for clarity.- Consider gating this testing-only field behind a compile-time guard (e.g., FLB_TESTS) to avoid shipping a production-visible knob, or rename to underscore-prefixed to discourage use.
Example:
- int test_hang_on_collect; + flb_bool test_hang_on_collect;If keeping it public, ensure the config map and docs clearly mark it as test-only.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
plugins/in_dummy/in_dummy.c(3 hunks)plugins/in_dummy/in_dummy.h(1 hunks)src/flb_reload.c(10 hunks)tests/internal/reload.c(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- plugins/in_dummy/in_dummy.c
🧰 Additional context used
🧬 Code graph analysis (1)
tests/internal/reload.c (4)
src/config_format/flb_config_format.c (3)
flb_cf_create(104-149)flb_cf_section_create(602-684)flb_cf_section_property_add(264-326)src/flb_lib.c (2)
flb_create(138-220)flb_start(914-925)src/flb_reload.c (2)
flb_reload_reconstruct_cf(308-344)flb_reload(425-614)src/flb_config.c (1)
flb_config_load_config_format(937-1042)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: PR - fuzzing test
🔇 Additional comments (1)
src/flb_reload.c (1)
41-44: Include<time.h>and<errno.h>for pthread_cond_timedwaitpthread_cond_timedwait requires
struct timespec(time.h) and error codes likeETIMEDOUT(errno.h). Confirm Windows builds still pass via CI.#include <pthread.h> #include <unistd.h> #include <stdlib.h> +#include <time.h> +#include <errno.h>
| #include <signal.h> | ||
| #include <sys/wait.h> | ||
| #include <unistd.h> | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Wrap POSIX-only headers for Windows builds; add missing headers
Fork/wait/alarm are not available on Windows. Guard these includes and add errno/time for the timeout loop.
-#include <signal.h>
-#include <sys/wait.h>
-#include <unistd.h>
+#ifndef FLB_SYSTEM_WINDOWS
+#include <signal.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <errno.h>
+#include <time.h>
+#endif📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #include <signal.h> | |
| #include <sys/wait.h> | |
| #include <unistd.h> | |
| #ifndef FLB_SYSTEM_WINDOWS | |
| #include <signal.h> | |
| #include <sys/wait.h> | |
| #include <unistd.h> | |
| #include <errno.h> | |
| #include <time.h> | |
| #endif |
🤖 Prompt for AI Agents
In tests/internal/reload.c around lines 14 to 17, the POSIX-only headers
<signal.h>, <sys/wait.h>, and <unistd.h> must be guarded for Windows builds and
the timeout loop needs missing headers; wrap those three includes in an #ifndef
_WIN32 / #endif so they are only included on non-Windows, and add #include
<errno.h> and #include <time.h> (unconditional or also guarded if only used on
POSIX) so the timeout/errno usage compiles and behaves correctly on Windows and
POSIX systems.
| { "reconstruct_cf" , test_reconstruct_cf}, | ||
| { "reload" , test_reload}, | ||
| { "reload_yaml" , test_reload_yaml}, | ||
| { "reload_watchdog_timeout", test_reload_watchdog_timeout}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Conditionally register the POSIX-only test
Ensure test list entry is compiled out on Windows.
- { "reload_watchdog_timeout", test_reload_watchdog_timeout},
+#ifndef FLB_SYSTEM_WINDOWS
+ { "reload_watchdog_timeout", test_reload_watchdog_timeout},
+#endif📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| { "reload_watchdog_timeout", test_reload_watchdog_timeout}, | |
| #ifndef FLB_SYSTEM_WINDOWS | |
| { "reload_watchdog_timeout", test_reload_watchdog_timeout}, | |
| #endif |
🤖 Prompt for AI Agents
In tests/internal/reload.c around line 381, the POSIX-only test entry "{
\"reload_watchdog_timeout\", test_reload_watchdog_timeout}," must be compiled
out on Windows; wrap this test list entry in a platform guard (e.g. #if
!defined(_WIN32) ... #endif or an equivalent POSIX-only check like #if
defined(__unix__) || defined(__APPLE__)) so the entry is omitted when building
on Windows.
78d0327 to
612f73d
Compare
pwhelan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now I just had two comments on the PR.
src/flb_reload.c
Outdated
| static void flb_reload_cancel_watchdog(struct flb_config *config) | ||
| { | ||
| if (config->hot_reload_watchdog_active) { | ||
| config->hot_reload_watchdog_active = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential race condition here, where we still abort after checking and before calling pthread_cancel. I guess that's fine tho, it'll occur when the reload takes almost as much time as hot_reload_timeout_seconds, and will result in an extra restart.
The pthread lib does have a mutex type if we want to fix this https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IF we do this, we can remove the sleep and instead use pthread_cond_timedwait in hot_reload_watchdog_thread https://pubs.opengroup.org/onlinepubs/009604599/functions/pthread_cond_timedwait.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i didnt think this race was import enough to get complicated; the side effect seemed like nbd considering the chances.
but i've basically rewritten this without the active check though and depend on join to stop the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new stuff still technically has the same issue though. i'll sleep on if it is worth fixing. want to keep it as simple as possible tbh.
src/flb_reload.c
Outdated
| ret = pthread_create(&old_config->hot_reload_watchdog_tid, NULL, | ||
| hot_reload_watchdog_thread, (void *)old_config); | ||
| if (ret != 0) { | ||
| flb_error("[reload] Failed to create hot reload watchdog thread: %d", ret); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One could make the argument that if we cannot create the watchdog thread we should abort early. I can see the merits of trying to continue the hot reload tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's true. close call for me. something is probably very wrong already if we can't make it.
plugins/in_dummy/in_dummy.c
Outdated
| return -1; | ||
| } | ||
|
|
||
| if (ctx->test_hang_on_collect) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] maybe guard this logic by a TESTONLY compiler directive. Idk what the convention in FB is for testonly but something like #define TEST_HANG_ON_COLLECT in the test file then #ifdef TEST_HANG_ON_COLLECT here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure the best approach for something like this. i'll ask for input before i change but i agree.
612f73d to
9f99096
Compare
44c9431 to
8787fc5
Compare
|
what are the thoughts around a parent process instead ? |
|
I think it is an interesting idea long term. I like the solution I have because it is mostly purely additive so it is easier for me to reason about its correctness separate of the correctness of the rest of the system. I have more confidence it will fix the immediate issue I'm aware of without causing any regressions in existing functionality. wdyt @edsiper? |
|
@edsiper any updates here? |
src/flb_config.c
Outdated
| config->shutdown_by_hot_reloading = FLB_FALSE; | ||
| config->hot_reloading = FLB_FALSE; | ||
| config->hot_reload_succeeded = FLB_FALSE; | ||
| config->hot_reload_watchdog_timeout_seconds = 300; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not enable this by default, and only start the reload watchdog IF it has been set manually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| int timeout_seconds; | ||
| }; | ||
|
|
||
| static void *hot_reload_watchdog_thread(void *arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function should only called if the reload watchdog timer is enabled manually, the user must be aware about an explicit abort() in case Fluent Bit is being used in library mode and can generate other side effects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. I'm structured the code so that we return early without starting the thread if the feature is disabled and disabled it by default.
src/flb_reload.c
Outdated
|
|
||
| watchdog_ctx = flb_malloc(sizeof(struct flb_reload_watchdog_ctx)); | ||
| if (!watchdog_ctx) { | ||
| flb_error("[reload] Failed to allocate memory for watchdog context"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use flb_errno()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, thanks, didn't know of it.
- Add hot_reload_watchdog_timeout_seconds field to flb_config struct - Add FLB_CONF_STR_HOT_RELOAD_TIMEOUT configuration string - Add parsing logic in service_configs array - Set default timeout to 300 seconds in flb_config_init Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
- Add watchdog thread that monitors hot reload duration - Thread sleeps for configured timeout then aborts if reload hasn't completed - Watchdog is started at beginning of reload and cancelled on completion - Uses pthread_cancel with async cancellation for immediate response - Properly cleans up thread resources with pthread_join Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
- Add test_hang_on_collect configuration option - When enabled, dummy plugin hangs during collect to simulate stuck reload - Used only for testing hot reload watchdog functionality Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
- Add test_reload_watchdog_timeout to verify watchdog functionality - Test forks a child process that triggers reload with hanging dummy input - Verifies that watchdog aborts the process after timeout - Checks that child terminates with SIGABRT signal Signed-off-by: Bradley Laney <bradley.laney@chronosphere.io>
8787fc5 to
80b1579
Compare
|
thanks! |
This PR adds a watchdog thread to prevent hot reload operations from hanging indefinitely. When hot reload starts, a separate watchdog thread is spawned that sleeps for a configurable timeout (default 5 minutes, configurable via Hot_Reload.Timeout). If the reload completes, the watchdog is cancelled; if not, it calls abort() to crash the process rather than hang forever.
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
New Features
Tests