
Auto-retry upgrades on the edge #778

Closed
michel-laterman opened this issue Jul 26, 2022 · 8 comments · Fixed by #1219
Comments

@michel-laterman
Contributor

Describe the enhancement:

An agent upgrade command needs some form of retry mechanism, as discussed in #752.
This retry mechanism should work for scheduled actions (which may retry within the upgrade window) and for other actions (given a default time frame or number of attempts).

The action queue may need to be (re)used for this.

Describe a specific use case for the enhancement or feature:

An upgrade may fail due to temporary network issues. In this case we would expect the upgrade to try again later and succeed.

Currently, the upgrade process removes all non-current-version artifacts from the downloads directory when it starts, so a retry may redownload an already-verified artifact.

@michel-laterman michel-laterman added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jul 26, 2022
@joshdover joshdover changed the title upgrade retry mechanism Auto-retry upgrades on the edge Jul 27, 2022
@joshdover
Contributor

Depending on the scope of change required for this, we should consider backporting this to the 8.4 release branch.

@ph
Contributor

ph commented Jul 27, 2022

A few notes concerning this feature that would need to be considered:

  • We should consider some kind of exponential backoff strategy for the retry mechanism.
  • We need to make sure we don't introduce more of a thundering herd effect when we retry the action.

Questions:

  • Where would that retry logic go? Should it be a concern of the queue, where all events go into a local queue and get acked or failed? Or should it be something else? @michel-laterman I would love to see something written down in this issue concerning the implementation and the lifecycle of these events.
  • Where should the retry rules be defined? Should they be decided by the emitter of the action?

@jlind23
Contributor

jlind23 commented Oct 4, 2022

@michel-laterman Could you please let us know where you are at with this issue? I see a couple of open/draft/stale PRs. Anything we can do to unblock you?

@michel-laterman
Contributor Author

Still in progress; I was working on bug fixes last week. I still need to make a couple of minor updates before marking them for review.

@kpollich
Member

@michel-laterman - Can this be closed now that #1219 has landed, or is there more work to do here?

@jlind23
Contributor

jlind23 commented Oct 24, 2022

@michel-laterman Any news on this?

@michel-laterman
Contributor Author

The only thing missing is to have the Fleet UI reflect the change.

@jlind23
Contributor

jlind23 commented Nov 3, 2022

@michel-laterman @kpollich Closing this one as completed; we will instead follow up in the Kibana-related issue: elastic/kibana#140225
