Skip to content

[Fleet] Use LockManagerService for fleet setup lock#219113

Merged
nchaulet merged 8 commits intoelastic:mainfrom
nchaulet:improve-fleet-setup-lock
Apr 29, 2025
Merged

[Fleet] Use LockManagerService for fleet setup lock#219113
nchaulet merged 8 commits intoelastic:mainfrom
nchaulet:improve-fleet-setup-lock

Conversation

@nchaulet
Copy link
Member

@nchaulet nchaulet commented Apr 24, 2025

Summary

Resolve #216025

The fleetSetupCompleted is used by other plugin to know when the fleet setup as run and when they can use fleet to install packages, because an issue with the way we implemented our lock that method could return early if another instance of Kibana is performing or tried to do a setup.

That PR:

  • Change the lock we use for the setup to use the one from LockManagerService that has a better way to mange ttl and refresh itself.
  • And do not return early when encountering a lock error but retry.

@sorenlouv Do you see any issue we start using the LockManagerService in Fleet?, it works well for our usage to replace a really incomplete lock we had.

@nchaulet nchaulet self-assigned this Apr 24, 2025
@nchaulet nchaulet added release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:Fleet Team label for Observability Data Collection Fleet team labels Apr 24, 2025
@nchaulet nchaulet marked this pull request as ready for review April 24, 2025 16:59
@nchaulet nchaulet requested a review from a team as a code owner April 24, 2025 16:59
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@nchaulet
Copy link
Member Author

@elasticmachine merge upstream

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @nchaulet

@sorenlouv
Copy link
Member

@sorenlouv Do you see any issue we start using the LockManagerService in Fleet?, it works well for our usage to replace a really incomplete lock we had.

@nchaulet Great to hear you find this useful! WDYT about extracting the LockManager out to a separate package? That way we avoid fleet depending on the Assistant plugin.
I think it should be relatively easy, since LockManager doesn't have any plugin dependencies, so can be lifted out as-is.

@juliaElastic
Copy link
Contributor

Can we test on a cloud instance with multiple kibana instances setting up at the same time?

@nchaulet
Copy link
Member Author

Can we test on a cloud instance with multiple kibana instances setting up at the same time?

I tested multiple local instances, tried to kill them during setup, and it seems it's working well. We also have some integration test testing setup in HA setup here https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/fleet/server/integration_tests/ha_setup.test.ts#L94

Copy link
Contributor

@juliaElastic juliaElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

@nchaulet
Copy link
Member Author

Depends on before merging #219220

@nchaulet nchaulet merged commit a15db81 into elastic:main Apr 29, 2025
10 checks passed
akowalska622 pushed a commit to akowalska622/kibana that referenced this pull request Apr 30, 2025
Comment on lines +83 to +95
export async function _runSetupWithLock(setupFn: () => Promise<SetupStatus>) {
return await pRetry(
() => appContextService.getLockManagerService()!.withLock('fleet-setup', () => setupFn()),
{
onFailedAttempt: async (error) => {
if (!(error instanceof LockAcquisitionError)) {
throw error;
}
},
maxRetryTime: 5 * 60 * 1000, // Retry for 5 minute to get the lock
}
);
}
Copy link
Member

@sorenlouv sorenlouv May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a bug in the retry logic here. From the pRetry documentation:

If the onFailedAttempt function throws, all retries will be aborted and the original promise will reject with the thrown error.

Right now all errors but LockAcquisitionError are thrown, meaning they are aborted without retries. Only LockAcquisitionError's are retried.

Is that the intention?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that the intention, we already have some retry logic in the Fleet setup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right. Why would you want to retry setupFn, if it's already running? Then it would run multiple times - although sequentially, not in parallel.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right. Why would you want to retry setupFn, if it's already running? Then

We want to be sure the setup ran and finish at least once If it's running on a different instance Kibana, but that instance is killed for any reason, this way we will retry, if the setup was already successful that operation should be relatively quick.

akowalska622 pushed a commit to akowalska622/kibana that referenced this pull request May 29, 2025
juliaElastic pushed a commit that referenced this pull request Nov 7, 2025
## Summary

Backport #219113

We are hitting some retries issue during setup, and relying on the lock
manager service as we do in 9+ version seems to be more reliable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Fleet] fleetSetupCompleted method do not work with multiple Kibana

5 participants