Skip to content

Conversation

@alecholmes
Copy link
Contributor

@alecholmes alecholmes commented Sep 11, 2025

This fixes a fleet bug where fetching an invalid config then shutting down prevented the agent from starting up again.

NB for people who reviewed this as an earlier draft: this is equivalent to the now-closed PR in my fork repo.

Problem
The fleet plugin is buggy today in that it will immediately commit newly received config as the current config, implying the new config is valid. This can notably prevent fluent-bit from restarting correctly after receiving an invalid config.

For example:

  • Process starts with a valid config
  • Fleet plugin receives a new config. It marks it as current and kicks off a hot reload.
  • The hot reload fails. The process will use the old working config.
  • Process exits
  • Process starts
  • Process tries to use the newer invalid config instead of the older valid config.

Change

This PR changes how configs received by the fleet plugin are committed as valid.

When a new config is received:

  • Any existing new (not current or old) config is deleted, since it is being replaced.
  • The received config is saved on disk as new and a hot reload is kicked off.
  • Subsequent fleet collect callbacks will check if the hot reload of the new config was successful. If so, the new config is promoted to current.

When a process starts up:

  • It deletes new config on disk. By virtue of being marked new, that config was not successfully reloaded in the past.
  • It loads up using current or old config, if available.
  • It then will attempt to fetch any newer config and reload.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • More robust hot-reload with automatic commit/rollback of fleet configs based on reload outcome.
    • Startup now cleans up stale, uncommitted configs for a cleaner state.
    • Added clearer logging around reloading, committing, and cleanup actions.
  • Refactor

    • Streamlined fleet configuration management to unify detection and lifecycle handling, improving reliability during reloads.

Signed-off-by: Alec Holmes <alec@chronosphere.io>
@coderabbitai
Copy link

coderabbitai bot commented Sep 11, 2025

Walkthrough

Adds a hot-reload success flag to the global config, sets it during successful reloads, and integrates it into the Calyptia Fleet input plugin to manage config refs with commit/rollback logic. Refactors fleet config existence checks, adjusts add/commit/rollback flows, and cleans up uncommitted refs during init and collection.

Changes

Cohort / File(s) Summary of changes
Core hot-reload state
include/fluent-bit/flb_config.h, src/flb_config.c, src/flb_reload.c
Added int hot_reload_succeeded to struct flb_config; initialized to FLB_FALSE; set to FLB_TRUE post successful reload in flb_reload.
Calyptia Fleet plugin: ref/commit/rollback
plugins/in_calyptia_fleet/in_calyptia_fleet.c
Unified ref existence check via exists_fleet_config(ctx, ref_name); changed calyptia_config_add to take flb_sds_t; implemented commit/rollback lifecycle using cur/new/old refs; added commit_config_if_reloaded gating on config->hot_reload_succeeded; updated reload and init paths with cleanup and logging.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Fleet as Calyptia Fleet Plugin
  participant Core as Fluent Bit Core
  participant Cfg as flb_config
  participant Refs as Ref Files (new/cur/old)

  Note over Fleet,Refs: Add/update config
  Fleet->>Refs: calyptia_config_add(cfgname)\n- set "new" -> cfgname\n- reparent "cur" -> "old"\n- remove mismatched "new"
  Fleet->>Core: trigger reload (path from "cur" or "old")
  Fleet->>Cfg: set hot_reload_succeeded = FLB_FALSE

  Note over Core,Cfg: Reload processing
  Core->>Core: perform hot reload
  Core->>Cfg: set hot_reload_succeeded = FLB_TRUE

  Note over Fleet,Refs: Collection cycle
  Fleet->>Cfg: check hot_reload_succeeded
  alt success
    Fleet->>Refs: commit: "new" -> "cur", remove "old"/"new"
  else no reload or failure
    Fleet->>Refs: rollback if needed:\n- delete "new"\n- "old" -> "cur"\n- remove "old"
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "in_calyptia_fleet: Only commit fleet config when successfully reloaded" is a concise, single-sentence summary that accurately captures the PR's primary change (delay committing a new fleet config until a hot reload succeeds). It names the affected plugin and clearly states the behavioral fix without extraneous detail or vague wording.

Poem

I nibble on configs, neat and new,
Hop from cur to old, then through—
A reload blooms, the flag turns true,
Commit the change, as rabbits do.
If storms arrive, I roll back too—
Thump-thump: fleet’s steady, burrow-blue. 🐇✨

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/flb_reload.c (1)

454-459: Also clear the success flag on the old context at reload start to avoid stale state.

If a reload attempt aborts, the active (old) context may still carry hot_reload_succeeded = TRUE from a prior success. Clearing it when marking hot reload prevents ambiguous state for any consumer not setting it explicitly (your Fleet plugin does, but making reload self-contained is safer).

Diff suggestion:

-    /* Mark hot reloading for old ctx to prevent duplicated request via HTTP */
-    old_config->hot_reloading = FLB_TRUE;
+    /* Mark hot reloading for old ctx to prevent duplicated request via HTTP */
+    old_config->hot_reloading = FLB_TRUE;
+    /* Clear stale success state on the old context */
+    old_config->hot_reload_succeeded = FLB_FALSE;
...
-    new_config->hot_reloading = FLB_FALSE;
+    new_config->hot_reloading = FLB_FALSE;
+    /* Mark success */
     new_config->hot_reload_succeeded = FLB_TRUE;

Also applies to: 551-553

plugins/in_calyptia_fleet/in_calyptia_fleet.c (1)

1490-1563: Validate dereferenced ref paths before recursive delete to prevent path traversal/file clobber.

calyptia_config_delete_by_ref() dereferences a .ref file and then deletes all matches for its prefix (/path/to/file*). If an attacker tampers with the .ref file (or it gets corrupted), this can unlink arbitrary files/directories outside the Fleet cache. Before deleting:

  • Canonicalize both the base Fleet dir and the dereferenced path (e.g., realpath() on POSIX; GetFullPathNameA() on Windows),
  • Verify the dereferenced path is within the base dir, and
  • Refuse to operate if the check fails.

Concrete patch (POSIX path check; add equivalent for Windows):

 static int calyptia_config_delete_by_ref(struct flb_in_calyptia_fleet_config *ctx,
                                          const char *ref_name)
 {
-    struct cfl_array *confs;
+    struct cfl_array *confs;
     flb_sds_t config_path;
+    flb_sds_t base_dir = NULL;
+    char canon_base[PATH_MAX];
+    char canon_target[PATH_MAX];
     char *ext;
     int idx;
     struct stat entry_stat;
     const char *entry_path;

@@
-    config_path = fleet_config_deref(ctx, ref_name);
+    config_path = fleet_config_deref(ctx, ref_name);
     if (config_path == NULL) {
         return FLB_FALSE;
     }
+
+    /* Constrain deletes to the Fleet base directory */
+    base_dir = generate_base_fleet_directory(ctx);
+    if (base_dir == NULL) {
+        flb_sds_destroy(config_path);
+        return FLB_FALSE;
+    }
+#ifndef _WIN32
+    if (realpath(base_dir, canon_base) == NULL ||
+        realpath(config_path, canon_target) == NULL) {
+        flb_plg_error(ctx->ins, "failed to canonicalize paths for delete safety");
+        flb_sds_destroy(base_dir);
+        flb_sds_destroy(config_path);
+        return FLB_FALSE;
+    }
+    if (strncmp(canon_target, canon_base, strlen(canon_base)) != 0 ||
+        (canon_target[strlen(canon_base)] != '/' && canon_target[strlen(canon_base)] != '\0')) {
+        flb_plg_error(ctx->ins, "ref path %s not under base dir %s", canon_target, canon_base);
+        flb_sds_destroy(base_dir);
+        flb_sds_destroy(config_path);
+        return FLB_FALSE;
+    }
+#else
+    /* TODO: Add GetFullPathNameA() + prefix check on Windows */
+#endif
+    flb_sds_destroy(base_dir);
🧹 Nitpick comments (7)
plugins/in_calyptia_fleet/in_calyptia_fleet.c (7)

621-630: Minor: free order/readability in reload thread.

You log reload->cfg_path then flb_free(reload). Safe, since cfg_path ownership is transferred earlier, but consider storing cfg_path in a local before the free for clarity:

-    flb_info("reloading configuration from path: %s", reload->cfg_path);
-    flb_free(reload);
+    flb_info("reloading configuration from path: %s", reload->cfg_path);
+    flb_free(reload);

1566-1613: Add error handling around unlink of cur ref; preserve ENOENT, log others.

Currently unlink(cur_ref_filename); ignores failures. Recommend ignoring ENOENT but logging other errors to aid debugging.

-    unlink(cur_ref_filename);
+    if (unlink(cur_ref_filename) != 0 && errno != ENOENT) {
+        flb_plg_warn(ctx->ins, "failed to delete current ref file: %s (errno=%d)",
+                     cur_ref_filename, errno);
+    }

1621-1667: Commit flow looks correct; optionally reset success flag post-commit.

Once committed, hot_reload_succeeded remains TRUE in the active config indefinitely. Not harmful (commit re-checks absence of "new"), but you could clear it to explicitly record “no pending commit”.

     flb_plg_info(ctx->ins, "committed new config: %s", config_path);
     flb_sds_destroy(config_path);
+    /* Optional: clear success flag now that commit has been processed */
+    ctx->ins->config->hot_reload_succeeded = FLB_FALSE;

1670-1709: Rollback flow OK; consider logging when no 'new' exists to trim noise.

If exists_fleet_config(ctx, "new") == FLB_TRUE fails often, the log volume can be high. Downgrade to debug when nothing to delete. No functional change required.


1719-1754: Commit-on-success guard is solid; add one more safety: ensure timestamped path.

You already check is_new_fleet_config(ctx, config). As an extra guard against ref tampering, verify the current conf_path_file is under the Fleet base dir (same canonicalization as delete). Optional.


2037-2060: Do not hard-fail collection solely on commit failure.

FLB_INPUT_RETURN(-1) may disable the collector after a transient filesystem hiccup during commit. Consider logging and retrying on the next tick to keep the agent ingesting with the current config.

-    if (commit_config_if_reloaded(ctx) == FLB_FALSE) {
-        FLB_INPUT_RETURN(-1);
-    }
+    if (commit_config_if_reloaded(ctx) == FLB_FALSE) {
+        flb_plg_error(ctx->ins, "commit of reloaded configuration failed; will retry");
+        FLB_INPUT_RETURN(0);
+    }

1097-1158: Efficient timestamp check; avoid zero-capacity SDS.

glob_pattern = flb_sds_create_size(0); works, but using an initial capacity near base_dir + "/*.yaml" reduces reallocs slightly.

No change required; micro-optimization.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7798a84 and 4989c58.

📒 Files selected for processing (4)
  • include/fluent-bit/flb_config.h (1 hunks)
  • plugins/in_calyptia_fleet/in_calyptia_fleet.c (12 hunks)
  • src/flb_config.c (1 hunks)
  • src/flb_reload.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
plugins/in_calyptia_fleet/in_calyptia_fleet.c (2)
src/flb_sds.c (1)
  • flb_sds_destroy (389-399)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: PR - fuzzing test
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
🔇 Additional comments (6)
src/flb_reload.c (1)

551-553: Good: mark success on the new config after a clean start.

Setting new_config->hot_reload_succeeded = FLB_TRUE; is the right signal for post-reload commit logic.

src/flb_config.c (1)

310-311: LGTM: sane default.

Initializing config->hot_reload_succeeded = FLB_FALSE; avoids indeterminate reads.

plugins/in_calyptia_fleet/in_calyptia_fleet.c (4)

576-591: exists_fleet_config: OK, but avoid TOCTOU assumptions in callers.

The deref + access() check is fine as a hint, but callers should be tolerant to the file disappearing between check and use (most are). No change requested.


1197-1210: Confirm HTTP-date parsing matches server format.

You parse Last-Modified with "%a, %d %B %Y %H:%M:%S GMT". RFC 7231 uses abbreviated month (%b). If the server sends “Wed, 21 Oct 2015 07:28:00 GMT”, %B will fail. If this format was already validated in production, ignore; otherwise consider switching to %b with a fallback.


2149-2177: Startup load prefers cur then old: good failover.

Nice improvement: avoids promoting an uncommitted “new” on process start.


2419-2535: Good: cleanup uncommitted 'new' on startup; ensure delete safety shares the same base-dir validation.

This path calls calyptia_config_delete_by_ref(ctx, "new"); once the delete hardening (base-dir canonicalization) is in place, this startup cleanup becomes safe even under ref corruption.

Copy link
Contributor

@pwhelan pwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

How this interacts with #10869 might necessitate a refactoring but that can be handled in that PR.

@pwhelan pwhelan merged commit 60db310 into fluent:master Sep 12, 2025
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants