in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874

alecholmes · 2025-09-11T15:49:24Z

This fixes a fleet bug where fetching an invalid config then shutting down prevented the agent from starting up again.

NB for people who reviewed this as an earlier draft: this is equivalent to the now-closed PR in my fork repo.

Problem
The fleet plugin is buggy today in that it will immediately commit newly received config as the current config, implying the new config is valid. This can notably prevent fluent-bit from restarting correctly after receiving an invalid config.

For example:

Process starts with a valid config
Fleet plugin receives a new config. It marks it as current and kicks off a hot reload.
The hot reload fails. The process will use the old working config.
Process exits
Process starts
Process tries to use the newer invalid config instead of the older valid config.

Change

This PR changes how configs received by the fleet plugin are committed as valid.

When a new config is received:

Any existing new (not current or old) config is deleted, since it is being replaced.
The received config is saved on disk as new and a hot reload is kicked off.
Subsequent fleet collect callbacks will check if the hot reload of the new config was successful. If so, the new config is promoted to current.

When a process starts up:

It deletes new config on disk. By virtue of being marked new, that config was not successfully reloaded in the past.
It loads up using current or old config, if available.
It then will attempt to fetch any newer config and reload.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Example configuration file for the change
Debug log output from testing the change

Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Run local packaging test showing all targets (including any new ones) build.
Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

Documentation required for this feature

Backporting

Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

New Features
- More robust hot-reload with automatic commit/rollback of fleet configs based on reload outcome.
- Startup now cleans up stale, uncommitted configs for a cleaner state.
- Added clearer logging around reloading, committing, and cleanup actions.
Refactor
- Streamlined fleet configuration management to unify detection and lifecycle handling, improving reliability during reloads.

Signed-off-by: Alec Holmes <alec@chronosphere.io>

coderabbitai · 2025-09-11T15:49:33Z

Walkthrough

Adds a hot-reload success flag to the global config, sets it during successful reloads, and integrates it into the Calyptia Fleet input plugin to manage config refs with commit/rollback logic. Refactors fleet config existence checks, adjusts add/commit/rollback flows, and cleans up uncommitted refs during init and collection.

Changes

Cohort / File(s)	Summary of changes
Core hot-reload state `include/fluent-bit/flb_config.h`, `src/flb_config.c`, `src/flb_reload.c`	Added `int hot_reload_succeeded` to `struct flb_config`; initialized to `FLB_FALSE`; set to `FLB_TRUE` post successful reload in `flb_reload`.
Calyptia Fleet plugin: ref/commit/rollback `plugins/in_calyptia_fleet/in_calyptia_fleet.c`	Unified ref existence check via `exists_fleet_config(ctx, ref_name)`; changed `calyptia_config_add` to take `flb_sds_t`; implemented commit/rollback lifecycle using `cur`/`new`/`old` refs; added `commit_config_if_reloaded` gating on `config->hot_reload_succeeded`; updated reload and init paths with cleanup and logging.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Fleet as Calyptia Fleet Plugin
  participant Core as Fluent Bit Core
  participant Cfg as flb_config
  participant Refs as Ref Files (new/cur/old)

  Note over Fleet,Refs: Add/update config
  Fleet->>Refs: calyptia_config_add(cfgname)\n- set "new" -> cfgname\n- reparent "cur" -> "old"\n- remove mismatched "new"
  Fleet->>Core: trigger reload (path from "cur" or "old")
  Fleet->>Cfg: set hot_reload_succeeded = FLB_FALSE

  Note over Core,Cfg: Reload processing
  Core->>Core: perform hot reload
  Core->>Cfg: set hot_reload_succeeded = FLB_TRUE

  Note over Fleet,Refs: Collection cycle
  Fleet->>Cfg: check hot_reload_succeeded
  alt success
    Fleet->>Refs: commit: "new" -> "cur", remove "old"/"new"
  else no reload or failure
    Fleet->>Refs: rollback if needed:\n- delete "new"\n- "old" -> "cur"\n- remove "old"
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

in_calyptia_fleet: Unify fleet config file caching across platforms #10826 — Also reworks Calyptia Fleet ref-based commit/rollback handling, overlapping with this PR’s ref lifecycle and reload integration.

Suggested reviewers

edsiper
koleini
fujimotos

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "in_calyptia_fleet: Only commit fleet config when successfully reloaded" is a concise, single-sentence summary that accurately captures the PR's primary change (delay committing a new fleet config until a hot reload succeeds). It names the affected plugin and clearly states the behavioral fix without extraneous detail or vague wording.

Poem

I nibble on configs, neat and new,
Hop from cur to old, then through—
A reload blooms, the flag turns true,
Commit the change, as rabbits do.
If storms arrive, I roll back too—
Thump-thump: fleet’s steady, burrow-blue. 🐇✨

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/flb_reload.c (1)

454-459: Also clear the success flag on the old context at reload start to avoid stale state.

If a reload attempt aborts, the active (old) context may still carry hot_reload_succeeded = TRUE from a prior success. Clearing it when marking hot reload prevents ambiguous state for any consumer not setting it explicitly (your Fleet plugin does, but making reload self-contained is safer).

Diff suggestion:
-    /* Mark hot reloading for old ctx to prevent duplicated request via HTTP */
-    old_config->hot_reloading = FLB_TRUE;
+    /* Mark hot reloading for old ctx to prevent duplicated request via HTTP */
+    old_config->hot_reloading = FLB_TRUE;
+    /* Clear stale success state on the old context */
+    old_config->hot_reload_succeeded = FLB_FALSE;
...
-    new_config->hot_reloading = FLB_FALSE;
+    new_config->hot_reloading = FLB_FALSE;
+    /* Mark success */
     new_config->hot_reload_succeeded = FLB_TRUE;
Also applies to: 551-553

plugins/in_calyptia_fleet/in_calyptia_fleet.c (1)

1490-1563: Validate dereferenced ref paths before recursive delete to prevent path traversal/file clobber.

calyptia_config_delete_by_ref() dereferences a .ref file and then deletes all matches for its prefix (/path/to/file*). If an attacker tampers with the .ref file (or it gets corrupted), this can unlink arbitrary files/directories outside the Fleet cache. Before deleting:

Canonicalize both the base Fleet dir and the dereferenced path (e.g., realpath() on POSIX; GetFullPathNameA() on Windows),
Verify the dereferenced path is within the base dir, and
Refuse to operate if the check fails.

Concrete patch (POSIX path check; add equivalent for Windows):

 static int calyptia_config_delete_by_ref(struct flb_in_calyptia_fleet_config *ctx,
                                          const char *ref_name)
 {
-    struct cfl_array *confs;
+    struct cfl_array *confs;
     flb_sds_t config_path;
+    flb_sds_t base_dir = NULL;
+    char canon_base[PATH_MAX];
+    char canon_target[PATH_MAX];
     char *ext;
     int idx;
     struct stat entry_stat;
     const char *entry_path;

@@
-    config_path = fleet_config_deref(ctx, ref_name);
+    config_path = fleet_config_deref(ctx, ref_name);
     if (config_path == NULL) {
         return FLB_FALSE;
     }
+
+    /* Constrain deletes to the Fleet base directory */
+    base_dir = generate_base_fleet_directory(ctx);
+    if (base_dir == NULL) {
+        flb_sds_destroy(config_path);
+        return FLB_FALSE;
+    }
+#ifndef _WIN32
+    if (realpath(base_dir, canon_base) == NULL ||
+        realpath(config_path, canon_target) == NULL) {
+        flb_plg_error(ctx->ins, "failed to canonicalize paths for delete safety");
+        flb_sds_destroy(base_dir);
+        flb_sds_destroy(config_path);
+        return FLB_FALSE;
+    }
+    if (strncmp(canon_target, canon_base, strlen(canon_base)) != 0 ||
+        (canon_target[strlen(canon_base)] != '/' && canon_target[strlen(canon_base)] != '\0')) {
+        flb_plg_error(ctx->ins, "ref path %s not under base dir %s", canon_target, canon_base);
+        flb_sds_destroy(base_dir);
+        flb_sds_destroy(config_path);
+        return FLB_FALSE;
+    }
+#else
+    /* TODO: Add GetFullPathNameA() + prefix check on Windows */
+#endif
+    flb_sds_destroy(base_dir);

🧹 Nitpick comments (7)

plugins/in_calyptia_fleet/in_calyptia_fleet.c (7)
621-630: Minor: free order/readability in reload thread.

You log reload->cfg_path then flb_free(reload). Safe, since cfg_path ownership is transferred earlier, but consider storing cfg_path in a local before the free for clarity:
-    flb_info("reloading configuration from path: %s", reload->cfg_path);
-    flb_free(reload);
+    flb_info("reloading configuration from path: %s", reload->cfg_path);
+    flb_free(reload);
1566-1613: Add error handling around unlink of cur ref; preserve ENOENT, log others.

Currently unlink(cur_ref_filename); ignores failures. Recommend ignoring ENOENT but logging other errors to aid debugging.
-    unlink(cur_ref_filename);
+    if (unlink(cur_ref_filename) != 0 && errno != ENOENT) {
+        flb_plg_warn(ctx->ins, "failed to delete current ref file: %s (errno=%d)",
+                     cur_ref_filename, errno);
+    }
1621-1667: Commit flow looks correct; optionally reset success flag post-commit.

Once committed, hot_reload_succeeded remains TRUE in the active config indefinitely. Not harmful (commit re-checks absence of "new"), but you could clear it to explicitly record “no pending commit”.
     flb_plg_info(ctx->ins, "committed new config: %s", config_path);
     flb_sds_destroy(config_path);
+    /* Optional: clear success flag now that commit has been processed */
+    ctx->ins->config->hot_reload_succeeded = FLB_FALSE;
1670-1709: Rollback flow OK; consider logging when no 'new' exists to trim noise.

If exists_fleet_config(ctx, "new") == FLB_TRUE fails often, the log volume can be high. Downgrade to debug when nothing to delete. No functional change required.

1719-1754: Commit-on-success guard is solid; add one more safety: ensure timestamped path.

You already check is_new_fleet_config(ctx, config). As an extra guard against ref tampering, verify the current conf_path_file is under the Fleet base dir (same canonicalization as delete). Optional.

2037-2060: Do not hard-fail collection solely on commit failure.

FLB_INPUT_RETURN(-1) may disable the collector after a transient filesystem hiccup during commit. Consider logging and retrying on the next tick to keep the agent ingesting with the current config.
-    if (commit_config_if_reloaded(ctx) == FLB_FALSE) {
-        FLB_INPUT_RETURN(-1);
-    }
+    if (commit_config_if_reloaded(ctx) == FLB_FALSE) {
+        flb_plg_error(ctx->ins, "commit of reloaded configuration failed; will retry");
+        FLB_INPUT_RETURN(0);
+    }
1097-1158: Efficient timestamp check; avoid zero-capacity SDS.

glob_pattern = flb_sds_create_size(0); works, but using an initial capacity near base_dir + "/*.yaml" reduces reallocs slightly.

No change required; micro-optimization.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7798a84 and 4989c58.

📒 Files selected for processing (4)

include/fluent-bit/flb_config.h (1 hunks)
plugins/in_calyptia_fleet/in_calyptia_fleet.c (12 hunks)
src/flb_config.c (1 hunks)
src/flb_reload.c (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

plugins/in_calyptia_fleet/in_calyptia_fleet.c (2)

src/flb_sds.c (1)

flb_sds_destroy (389-399)

include/fluent-bit/flb_mem.h (1)

flb_free (126-128)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)

GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
GitHub Check: pr-compile-centos-7
GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
GitHub Check: PR - fuzzing test
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)

🔇 Additional comments (6)

src/flb_reload.c (1)

551-553: Good: mark success on the new config after a clean start.

Setting new_config->hot_reload_succeeded = FLB_TRUE; is the right signal for post-reload commit logic.

src/flb_config.c (1)

310-311: LGTM: sane default.

Initializing config->hot_reload_succeeded = FLB_FALSE; avoids indeterminate reads.

plugins/in_calyptia_fleet/in_calyptia_fleet.c (4)

576-591: exists_fleet_config: OK, but avoid TOCTOU assumptions in callers.

The deref + access() check is fine as a hint, but callers should be tolerant to the file disappearing between check and use (most are). No change requested.

1197-1210: Confirm HTTP-date parsing matches server format.

You parse Last-Modified with "%a, %d %B %Y %H:%M:%S GMT". RFC 7231 uses abbreviated month (%b). If the server sends “Wed, 21 Oct 2015 07:28:00 GMT”, %B will fail. If this format was already validated in production, ignore; otherwise consider switching to %b with a fallback.

2149-2177: Startup load prefers cur then old: good failover.

Nice improvement: avoids promoting an uncommitted “new” on process start.

2419-2535: Good: cleanup uncommitted 'new' on startup; ensure delete safety shares the same base-dir validation.

This path calls calyptia_config_delete_by_ref(ctx, "new"); once the delete hardening (base-dir canonicalization) is in place, this startup cleanup becomes safe even under ref corruption.

include/fluent-bit/flb_config.h

pwhelan

LGTM

How this interacts with #10869 might necessitate a refactoring but that can be handled in that PR.

in_calyptia_fleet: Only commit fleet config when successfully reloaded

4989c58

Signed-off-by: Alec Holmes <alec@chronosphere.io>

github-actions bot added the docs-required label Sep 11, 2025

alecholmes temporarily deployed to pr September 11, 2025 15:49 — with GitHub Actions Inactive

alecholmes marked this pull request as ready for review September 11, 2025 15:51

alecholmes requested review from edsiper, fujimotos and koleini as code owners September 11, 2025 15:51

coderabbitai bot reviewed Sep 11, 2025

View reviewed changes

include/fluent-bit/flb_config.h Show resolved Hide resolved

alecholmes temporarily deployed to pr September 11, 2025 16:07 — with GitHub Actions Inactive

alecholmes temporarily deployed to pr September 11, 2025 16:08 — with GitHub Actions Inactive

pwhelan approved these changes Sep 12, 2025

View reviewed changes

pwhelan merged commit 60db310 into fluent:master Sep 12, 2025
56 checks passed

cosmo0920 added the backport to v4.0.x label Oct 6, 2025

cosmo0920 mentioned this pull request Oct 6, 2025

reload: add a timeout mechanism and its based commits backporting [Backport to 4.0] #10985

Merged

7 tasks

coderabbitai bot mentioned this pull request Oct 7, 2025

in_calyptia_fleet: load configuration on initial startup. #10996

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874

in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874

Uh oh!

alecholmes commented Sep 11, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 11, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

pwhelan left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874

in_calyptia_fleet: Only commit fleet config when successfully reloaded #10874

Uh oh!

Conversation

alecholmes commented Sep 11, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks (2 passed, 1 warning)

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pwhelan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alecholmes commented Sep 11, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 11, 2025 •

edited

Loading