Conversation

@pchila
Member

@pchila pchila commented Oct 6, 2025

What does this PR do?

This PR introduces a registry of elastic-agent installs, maintained as a YAML structure in the .installed installation marker.
As soon as a new version of the agent is unpacked on disk during an upgrade, the installation is added to the registry. The entry is updated when the symlinks are rotated to that installation and removed after the installation is deleted from disk.

At @cmacknz's request, the install registry has been scrapped and replaced by a more limited but simpler mechanism.

This PR introduces TTL marker files in the versioned home directories of Elastic Agent installs, to keep track of which installs are available as manual rollback targets and how long each of them should remain on disk.
When upgrading, existing installs are marked with a .ttl YAML file, for example:

version: 9.3.0-SNAPSHOT
hash: abcdef
valid_until: 2025-10-17T14:01:13+02:00

This makes it possible to determine which rollback targets are still available on disk even after the upgrade marker is removed at the end of a successful upgrade.

Why is it important?

This PR is a prerequisite for introducing manual rollbacks beyond the grace period in PR #9643.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila self-assigned this Oct 6, 2025
@pchila pchila requested a review from a team as a code owner October 6, 2025 08:13
@pchila pchila added the enhancement, Team:Elastic-Agent-Control-Plane, backport-skip and skip-changelog labels Oct 6, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pchila pchila mentioned this pull request Oct 6, 2025
5 tasks
@pchila
Member Author

pchila commented Oct 6, 2025

Linter errors are already addressed in #9643

pkoutsovasilis previously approved these changes Oct 6, 2025
Contributor

@pkoutsovasilis pkoutsovasilis left a comment

LGTM, I manually tested that:

  • .installed keeps track of installed versions, with the correct TTL, when I upgrade
  • .installed keeps track of installed versions when I manually roll back (the old version is removed)

It doesn't yet clean up entries in .installed (I manually edited the TTL and restarted elastic-agent), but I can see from the code that this is handled in #9643; @pchila keep me honest here

@pchila
Member Author

pchila commented Oct 6, 2025

> LGTM, I manually tested that:
>
> * .installed keeps track of installed versions when I upgrade
> * .installed keeps track of installed version when I manually rollback
>
> It doesn't yet clean up entries in .installed - I did manually edit the ttl and restarted elastic-agent - but this I can see from the code that this is in #9643; @pchila keep me honest here

Correct. The handling of expired installs is in the follow-up PR, which allows manual rollbacks outside the grace period.
(Manually rolling back during the grace period does not need to rely on the install registry, so the cleanup is still missing at this stage.)

Contributor

@ycombinator ycombinator left a comment

Starting to review this PR now but one general comment: could you please add godoc comments for any exported types, capturing what they represent / their purpose? Thanks.

Contributor

@ycombinator ycombinator left a comment

@pchila Thanks for breaking out this PR from #9643 — I appreciate it as it makes each PR more focussed on a single logical change and easier to review (at least for me). Left some minor comments but this is looking good!

@pchila
Member Author

pchila commented Oct 7, 2025

CI failures in build https://buildkite.com/elastic/elastic-agent/builds/28194 are due to the TestSensitiveLogsESExporter integration test failing (not touched by this PR, but recently introduced in #9341)
Edit: fixed by a rebase on latest main

@pchila pchila force-pushed the introduce-install-registry branch from cd18e71 to c2223f2 Compare October 7, 2025 14:35
@cmacknz
Member

cmacknz commented Oct 8, 2025

@pchila and I met today and came to the following conclusions:

  • We should have a way to feature flag the rollback functionality, so that we can surface errors related to new code and fail the upgrade but also have a way to bypass unexpected bugs.
    • The simplest path to do this is to use the rollback window parameter we already support and will expose in the Fleet UI.
    • We can interpret having a rollback_window of 0 as a request to disable the rollback logic.
    • In concrete terms for this PR, this would mean we treat failures to update the registry of available rollback versions as fatal errors, unless the rollback window is zero, in which case we either ignore the errors or simply don't update the registry at all.
  • We should move the installation registry out of the .installed file.
    • The largest reason to do this is that it makes it much easier for us to change or relocate the list of registry versions by deleting or ignoring the new file.
      • For example if we had a local database available, we would prefer to track this there.
    • As part of this, Paolo suggested we should only track versions available for rollback through the upgrade process, and stop updating the registry at install time, to further simplify things.

@mergify
Contributor

mergify bot commented Oct 8, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b introduce-install-registry upstream/introduce-install-registry
git merge upstream/main
git push upstream introduce-install-registry

@pchila pchila force-pushed the introduce-install-registry branch from 2b6cba6 to 07acdab Compare October 14, 2025 15:06
@pchila pchila changed the title from "Introduce install registry" to "Introduce install TTL markers" Oct 16, 2025
@pchila
Member Author

pchila commented Oct 16, 2025

> @pchila and I met today and came to the following conclusions:
>
> * We should have a way to feature flag the rollback functionality, so that we can surface errors related to new code and fail the upgrade but also have a way to bypass unexpected bugs.
>   * The simplest path to do this is to use the rollback window parameter we already support and will expose in the Fleet UI.
>   * We can interpret having a rollback_window of 0 as a request to disable the rollback logic.
>   * In concrete terms of the discussion of this PR, this would mean we treat updating the registry of available rollback versions as fatal errors unless rollback window is zero where we either ignore or simply don't update the registry at all.
> * We should move the installation registry out of the `.installed` file.
>   * The largest reason to do this is that it makes it much easier for us to change or relocate the list of registry versions by deleting or ignoring the new file.
>     * For example if we had a local database available, we would prefer to track this there.
>   * As part of this Paolo suggested we should only track versions available for rollback through the upgrade process, and remove the updating the registry at install time to further simplify things.

@cmacknz see d4cc617

Member

@cmacknz cmacknz left a comment

Thanks for simplifying, a couple of comments but direction overall LGTM

ycombinator previously approved these changes Oct 20, 2025
@elasticmachine
Contributor

💚 Build Succeeded

cc @pchila

@pchila pchila merged commit 24c4bbd into elastic:main Oct 22, 2025
21 checks passed
hayotbisonai pushed a commit to hayotbisonai/elastic-agent that referenced this pull request Nov 23, 2025
* Add available rollbacks list under /data

* make TTLMarkerRegistry.addOrReplace private

* Add logger to ttl marker source

* Add fatal upgrade error if ttl marker cannot be set and related cleanup

Labels

backport-skip, enhancement, skip-changelog, Team:Elastic-Agent-Control-Plane

5 participants