Conversation

@pwhelan
Contributor

@pwhelan pwhelan commented Oct 2, 2025

Summary

Because of how the watchdog thread works and how cancelled threads are preempted on Windows, each hot reload on Windows takes at least 5 minutes.

Description

Because of how thread cancel states and joining work, loading a new configuration can be delayed by minutes, since the main thread performs both operations.

The function flb_reload spins up a background watchdog thread (when enabled), which should abort() the process if the configuration does not load cleanly. When the watchdog is deactivated after the configuration has loaded, that same thread is joined via pthread_join in the function flb_reload_watchdog_cleanup. On Windows, at least, cancelling the thread does not preempt the sleep function, which leads to a delay of 5 minutes after each reload.

To fix this, I refactored the watchdog thread to sleep in increments of 100 ms. This should be a decent tradeoff between responsiveness and performance.
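For illustration, the sliced wait can be sketched roughly as follows. This is a minimal sketch and not the code from this patch: the names `watchdog_wait`, `sleep_ms` and `WATCHDOG_SLICE_MS` are hypothetical, the explicit stop flag is an assumption (the real watchdog may rely on thread cancellation between slices instead), and `nanosleep()` is used here only as a POSIX stand-in for whatever sleep helper the project uses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() under strict C modes */
#endif

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical slice length: 100 ms, as described in this PR. */
#define WATCHDOG_SLICE_MS 100

/* Hypothetical helper: sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Wait up to timeout_seconds in 100 ms slices instead of one long sleep.
 *
 * Returns true if the full timeout elapsed, false if *stop was set first.
 * The stop flag is an assumption made for this sketch.
 */
static bool watchdog_wait(int timeout_seconds, atomic_bool *stop)
{
    int i;
    int slices;

    assert(stop != NULL);

    /* timeout_seconds x 10 iterations of 100 ms each */
    slices = timeout_seconds * (1000 / WATCHDOG_SLICE_MS);

    for (i = 0; i < slices; i++) {
        if (atomic_load(stop)) {
            /* Reload finished: exit within roughly one slice. */
            return false;
        }
        sleep_ms(WATCHDOG_SLICE_MS);
    }

    /* Timeout reached without a stop request. */
    return true;
}
```

In a watchdog thread built on this sketch, the caller would abort() when `watchdog_wait` returns true and exit quietly otherwise, so the main thread's join should return within about one slice rather than after the full timeout.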


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features
    • Added an informational message when hot reload completes successfully, improving user feedback.
  • Refactor
    • Increased the check frequency of the hot-reload watchdog to enhance responsiveness to cancellation while preserving overall timeout behavior.
  • Chores
    • Introduced additional debug logging around watchdog cleanup during successful reloads to aid troubleshooting and visibility.

Signed-off-by: Phillip Whelan <phillip.whelan@chronosphere.io>
@coderabbitai

coderabbitai bot commented Oct 2, 2025

Walkthrough

The hot-reload watchdog in src/flb_reload.c now uses a 100 ms iterative sleep loop instead of a single long sleep, improving cancellation responsiveness. Additional debug/info logs were added around successful reload completion and watchdog cleanup. Abort-on-timeout behavior is unchanged.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Hot-reload watchdog adjustments: src/flb_reload.c | Replace single sleep(timeout_seconds) with 100 ms loop (timeout_seconds × 10 iterations), add debug log before watchdog cleanup on success, add info log for successful reload, preserve abort-on-timeout behavior. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Main as Main Thread
  participant Watchdog as Watchdog Thread
  participant Reload as Reload Logic

  Note over Main,Watchdog: Hot-reload initiated

  Main->>Watchdog: Start watchdog(timeout_seconds)
  Main->>Reload: Begin reload sequence

  rect rgba(200,230,255,0.25)
    Note over Watchdog: Granular loop: sleep 100ms × (timeout_seconds × 10)
    loop 100ms intervals until timeout or cancel
      Watchdog-->>Watchdog: Check cancel/completion flag
    end
  end

  alt Reload completes before timeout
    Reload-->>Main: Success signal
    Main->>Watchdog: Cancel/cleanup request
    Note right of Main: Debug: cleaning up watchdog
    Watchdog-->>Main: Exit
    Main-->>Main: Info: reload completed successfully
  else Timeout reached
    Watchdog-->>Main: Timeout signal
    Main-->>Main: Abort reload path
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

docs-required

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

Poem

Thump-thump goes my watchdog beat,
In hundred-millisecond feet—so neat!
I nibble logs, then hop away,
Cleanup done, success on display.
If time runs out, I won’t delay—
Ears up, abort! Another day.
(_/) ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title succinctly describes the primary change by indicating the reload watchdog has been refactored for preemptability specifically on Windows, accurately reflecting the core updates to the sleep loop and cancellation behavior in the pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pwhelan-fix-watchdog-thread-preemption

@pwhelan
Contributor Author

pwhelan commented Oct 2, 2025

Testing this change requires enabling hot reload with a timeout set:

service:
  Hot_Reload.Timeout: 5
  Hot_Reload: On
  Log_Level: debug
pipeline:
  inputs:
    - name: dummy
  outputs:
    - name: stdout
      match: "*"

Attaching a valgrind log:

valgrind.log

As well as a debug log:

debug.log

@edsiper edsiper merged commit 19ec5a3 into master Oct 3, 2025
60 checks passed
@edsiper edsiper deleted the pwhelan-fix-watchdog-thread-preemption branch October 3, 2025 00:41