@alecholmes alecholmes commented Sep 3, 2025

This fixes a fleet bug where fetching an invalid config then shutting down prevented the agent from starting up again.

Problem
Today the fleet plugin commits a newly received config as the current config immediately, before the hot reload has confirmed that the config is valid. This can prevent fluent-bit from restarting correctly after it receives an invalid config.

For example:

  • Process starts with a valid config
  • Fleet plugin receives a new config. It marks it as current and kicks off a hot reload.
  • The hot reload fails. The process will use the old working config.
  • Process exits
  • Process starts
  • Process tries to use the newer invalid config instead of the older valid config.
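
To make the failure sequence concrete, here is a minimal in-memory sketch in C. The struct `fleet_state` and both function names are hypothetical and exist only for illustration; the real plugin tracks this state through files on disk, and whether the previous config is kept as old at this point is an assumption.

```c
#include <stddef.h>

/* Hypothetical in-memory model of the on-disk state; not the plugin's real types. */
struct fleet_state {
    const char *cur;  /* config the "current" ref points at */
    const char *old;  /* previous config, kept as a fallback (assumption) */
};

/* Old behavior: the received config becomes current before the
 * outcome of the hot reload is known. */
static void receive_config_buggy(struct fleet_state *st, const char *received)
{
    st->old = st->cur;
    st->cur = received;
}

/* On startup the agent prefers current and only falls back to old
 * when there is no current config. */
static const char *config_for_startup(const struct fleet_state *st)
{
    return st->cur != NULL ? st->cur : st->old;
}
```

If `cur` starts out as a valid config and `receive_config_buggy` is then called with an invalid one, `config_for_startup` returns the invalid config, which matches the last step in the list above.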

Change

This PR changes how configs received by the fleet plugin are committed as valid.

When a new config is received:

  • Any existing new (not current or old) config is deleted, since it is being replaced.
  • The received config is saved on disk as new and a hot reload is kicked off.
  • Subsequent fleet collect callbacks will check if the hot reload of the new config was successful. If so, the new config is promoted to current.
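
The three steps above can be sketched as a small state machine. This is an illustrative model only: `fleet_refs`, `on_config_received`, and `commit_if_reloaded` are hypothetical names, the state lives in memory rather than in ref files, and moving the previous current config to old on promotion is an assumption rather than something stated in this PR.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three config slots. */
struct fleet_refs {
    const char *new_cfg;  /* received, reload outcome not yet known */
    const char *cur;      /* last config known to have reloaded successfully */
    const char *old;      /* config that was current before cur (assumption) */
};

/* Steps 1 and 2: any pending "new" config is replaced by the received
 * one. The real plugin would kick off a hot reload at this point. */
static void on_config_received(struct fleet_refs *r, const char *received)
{
    r->new_cfg = received;
}

/* Step 3: called from a later collect callback. The pending config is
 * promoted only if its hot reload succeeded; otherwise nothing changes. */
static bool commit_if_reloaded(struct fleet_refs *r, bool reload_succeeded)
{
    if (r->new_cfg == NULL || !reload_succeeded) {
        return false;
    }
    r->old = r->cur;
    r->cur = r->new_cfg;
    r->new_cfg = NULL;
    return true;
}
```

The point of the split is that `cur` only ever names a config that has reloaded successfully at least once, so a failed reload leaves both `cur` and `old` untouched.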

When a process starts up:

  • It deletes any new config on disk. A config that is still marked new was never confirmed as successfully reloaded.
  • It loads using the current or old config, if available.
  • It then attempts to fetch any newer config and reload it.
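
A rough sketch of the startup selection, under the same caveats: `startup_refs` and `select_startup_config` are hypothetical names and the state is modeled in memory instead of on disk.

```c
#include <stddef.h>

/* Hypothetical model of the three config slots as seen at startup. */
struct startup_refs {
    const char *new_cfg;  /* leftover from a previous run, never committed */
    const char *cur;
    const char *old;
};

/* Discards the uncommitted "new" config, then prefers current over old.
 * Returns NULL when neither exists, in which case the agent has no
 * fleet config to start from and must wait for a fetch. */
static const char *select_startup_config(struct startup_refs *r)
{
    r->new_cfg = NULL;  /* never confirmed by a successful reload */
    if (r->cur != NULL) {
        return r->cur;
    }
    return r->old;
}
```

Because the leftover new config is dropped before anything is loaded, an invalid config from a previous run can no longer be picked up on restart.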

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 92d74a7 to 0687187 Compare September 3, 2025 18:54
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from a1960d3 to 5d37835 Compare September 3, 2025 18:54
@alecholmes alecholmes changed the title in_calyptia_fleet: Unify fleet config file caching across platforms in_calyptia_fleet: Fix inability to start up agent when invalid fleet config fetched Sep 3, 2025
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 0687187 to 202e8ec Compare September 3, 2025 21:15
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from 5d37835 to b355c89 Compare September 3, 2025 21:15
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 202e8ec to 1d14800 Compare September 3, 2025 21:47
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch 7 times, most recently from 3612a47 to bf72ac0 Compare September 4, 2025 21:18
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from f938e2d to 1a8e426 Compare September 4, 2025 22:13
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from bf72ac0 to 46b6764 Compare September 4, 2025 22:13
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 1a8e426 to d88caa4 Compare September 4, 2025 23:09
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch 3 times, most recently from e0dc0cb to dd75e84 Compare September 4, 2025 23:15
if (new_config_path != NULL
    && is_new_fleet_config(ctx, config) == FLB_FALSE) {
    flb_plg_warn(ctx->ins, "deleting uncommitted new config: %s", new_config_path);
    if (calyptia_config_delete_by_ref(ctx, "new") == FLB_FALSE) {

Should we also roll back old to curr?

I guess that's unnecessary, as we fall back to old in load_fleet_config() if curr doesn't exist.

reload->flb->config->conf_path_file = reload->cfg_path;

flb_free(reload);
sleep(5);

Why do we sleep?

Owner Author

I'm not sure, and this feels like a Chesterton's fence situation. @pwhelan do you have history on this one?

I call sleep just to give the actual reload time to execute. It might be totally unnecessary.


/**
 * Commits the latest received config as the valid, now-current config.
 * This updates the symlink to the current config file to point to the

Suggested change
- * This updates the symlink to the current config file to point to the
+ * This updates the ref file for the current config file to point to the

Or similar

flb_sds_t cur_ref_filename = fleet_config_ref_filename(ctx, "cur");
if (cur_ref_filename != NULL) {
unlink(cur_ref_filename);
if (exists_fleet_config(ctx, "cur") == FLB_TRUE) {

Why this change? Shouldn't this be guarded on the ref file existing rather than the referenced config existing?

Owner Author

Agreed. I've updated this function a bit as a result and also tweaked the earlier check around cleaning up uncommitted new config.
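
For readers following the thread, one way to key the cleanup on the ref file itself, rather than on the config it points to, is sketched below. This is not the code from the PR: `remove_ref_file` is a hypothetical helper, it assumes a POSIX platform, and it treats a missing ref file as success instead of checking for existence first.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: removes the ref file at ref_path.
 * Returns true if the ref file is gone afterwards, whether it was
 * removed now or was never there. The referenced config is not
 * consulted at all. */
static bool remove_ref_file(const char *ref_path)
{
    if (ref_path == NULL) {
        return false;
    }
    if (unlink(ref_path) == 0 || errno == ENOENT) {
        return true;
    }
    perror("unlink ref file");
    return false;
}
```

Calling `unlink` directly and inspecting `errno` avoids a separate existence check, so there is no window between the check and the removal in which the file could change.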

@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from d88caa4 to d5478e8 Compare September 5, 2025 23:52
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from dd75e84 to 02e2d1f Compare September 5, 2025 23:52
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from d5478e8 to cd546f3 Compare September 8, 2025 18:25
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from 02e2d1f to 7c2a445 Compare September 8, 2025 18:25
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from cd546f3 to f61fd78 Compare September 8, 2025 20:04
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from 7c2a445 to 1345945 Compare September 8, 2025 20:04
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from f61fd78 to ce07178 Compare September 8, 2025 20:13
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from 1345945 to bbb50c9 Compare September 8, 2025 20:13
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from ce07178 to 05cc743 Compare September 8, 2025 20:39
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from bbb50c9 to 97cebc4 Compare September 8, 2025 20:39
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 05cc743 to a391f55 Compare September 9, 2025 22:08
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from 97cebc4 to d9d0604 Compare September 9, 2025 22:08
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from a391f55 to 2e540e5 Compare September 10, 2025 23:02
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from d9d0604 to f8329cd Compare September 10, 2025 23:02
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 2e540e5 to 38b11b7 Compare September 10, 2025 23:17
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from f8329cd to 6f74614 Compare September 10, 2025 23:17
@alecholmes alecholmes force-pushed the alec/2025-08-27-unify-fleet-config-file-mgmt branch from 38b11b7 to 7883904 Compare September 10, 2025 23:31
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from 6f74614 to a64a7c3 Compare September 10, 2025 23:31
dependabot bot and others added 5 commits September 11, 2025 14:00
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Remove a redundant wget in yum install

Signed-off-by: ilove737 <[email protected]>
@alecholmes alecholmes force-pushed the alec/2025-08-27-fix-committing-of-fleet-config-v2 branch from a64a7c3 to 4989c58 Compare September 11, 2025 15:45
@alecholmes
Owner Author

PR below this has been merged into OSS so I'm going to close this PR and recreate it in the fluent-bit repo.
