
DiscoveryService: prevent EICE Node duplication#40220

Merged
marcoandredinis merged 10 commits into master from marco/discovery_service_exclusive_group
Apr 15, 2024

Conversation

@marcoandredinis
Contributor

marcoandredinis commented Apr 4, 2024

This PR does a couple of things to prevent duplicated nodes.

Run only one DiscoveryService per DiscoveryGroup

This should prevent duplicate API calls and also prevent duplicate resources on the Teleport side.

Deterministic name for EICE Nodes

Instead of using a random UUID, we now build the Node's name from the account ID and instance ID of the EC2 instance.

This reduces load because we can now use the NodeWatcher's internal map for quick lookups (instead of always listing and filtering all the Nodes).

Given the name is deterministic, it also ensures we don't get duplicate nodes.
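
As a rough illustration of the deterministic-name idea (the namespace UUID and helper below are invented for the example, not necessarily how this PR builds the name):

package main

import (
	"fmt"

	"github.com/google/uuid"
)

// eiceNamespace is an arbitrary but fixed namespace UUID; all that matters is
// that it never changes, so the same inputs always map to the same name.
var eiceNamespace = uuid.MustParse("c0fee000-0000-4000-8000-000000000000")

// eiceNodeName derives a stable Node name from the AWS account ID and EC2
// instance ID, so rediscovering the same instance upserts the same resource
// instead of creating a new one under a random UUID.
func eiceNodeName(accountID, instanceID string) string {
	return uuid.NewSHA1(eiceNamespace, []byte(accountID+"/"+instanceID)).String()
}

func main() {
	// Same inputs, same name: duplicates collapse into a single upsert.
	fmt.Println(eiceNodeName("123456789012", "i-0123456789abcdef0"))
	fmt.Println(eiceNodeName("123456789012", "i-0123456789abcdef0"))
}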

@marcoandredinis added the backport/branch/v14 and no-changelog labels Apr 4, 2024
@marcoandredinis requested a review from tigrato April 4, 2024 17:33
@github-actions (bot) requested a review from Joerger April 4, 2024 17:34
@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from f211007 to 9431698 April 4, 2024 17:39
@marcoandredinis marked this pull request as draft April 5, 2024 11:10
@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch 3 times, most recently from b56aa2e to 6cd8ccf April 5, 2024 14:56
@marcoandredinis changed the title from "DiscoveryService: exclusive worker per Discovery Group" to "DiscoveryService: prevent EICE Node duplication" Apr 5, 2024
@marcoandredinis marked this pull request as ready for review April 5, 2024 15:00
@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from 6cd8ccf to d5e98ea April 5, 2024 15:01
Comment thread api/types/server.go Outdated
Comment thread api/types/server.go Outdated
Comment thread lib/srv/discovery/discovery.go Outdated
Comment thread lib/srv/discovery/discovery.go Outdated
}
}()

s.releaseGroupLockFn = func() {
Contributor

Do we need to synchronize the release function, from a memory model perspective and from a logical one? What if we end up stopping the *Server before we managed to update the releaseGroupLockFn?

Contributor Author

I moved the releaseGroupLockFn creation much earlier.
Now it's the first thing that happens when Start is called.

This should make it very unlikely that Stop is called before releaseGroupLockFn is set.

Contributor

"Very unlikely" doesn't mean anything; we should set a flag in Stop (while holding the lock) that means "server has been stopped" and we need to check the flag (while holding the lock) as we set releaseGroupLockFn, and we should refuse to start if the flag is set (and thus nothing can ever stop the server again).

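A minimal sketch of the pattern being described, with illustrative names rather than the PR's actual fields:

package discovery

import (
	"errors"
	"sync"
)

// groupLock sketches the suggested flag: stop and setRelease share a mutex,
// so a release function can never be installed after the server has stopped.
type groupLock struct {
	mu      sync.Mutex
	stopped bool   // set by stop; once true, the server must refuse to start
	release func() // releases the DiscoveryGroup lease, if one is held
}

func (g *groupLock) setRelease(release func()) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.stopped {
		release() // Stop already ran, so nothing else will ever release this
		return errors.New("discovery server already stopped")
	}
	g.release = release
	return nil
}

func (g *groupLock) stop() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.stopped = true
	if g.release != nil {
		g.release()
		g.release = nil
	}
}
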
Contributor Author

Can you please check again?
I've simplified the flow and I'm no longer using releaseGroupLockFn.
AcquireSemaphoreLock receives the Server's context and should end as soon as we cancel that context.
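
Roughly what that looks like (the config fields are the ones quoted later in this review; a sketch of the approach, not the merged code):

// Acquiring with the server's context means cancelling s.ctx in Stop also
// ends the lease's keep-alive loop; no separate release hook is needed.
lease, err := services.AcquireSemaphoreLock(s.ctx, services.SemaphoreLockConfig{
	Service: s.Config.AccessPoint,
	Expiry:  time.Minute,
	Params: types.AcquireSemaphoreRequest{
		SemaphoreKind: types.SemaphoreKindDiscoveryServiceGroup,
		SemaphoreName: s.DiscoveryGroup,
		MaxLeases:     1,
		Holder:        s.ServerID,
	},
	Clock: s.clock,
})
if err != nil {
	return trace.Wrap(err)
}
defer lease.Wait() // wait for the lease goroutine to finish winding down
defer lease.Stop() // runs first (LIFO), telling the lease to stop renewing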

@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from d5e98ea to 0725cc1 April 8, 2024 09:41
Comment thread api/types/server.go Outdated
Comment thread lib/srv/discovery/discovery.go Outdated
Comment on lines +1345 to +1346
s.releaseGroupLockFn = func() {
	if lease != nil {
Contributor

From what I can tell there's a potential race condition on s.releaseGroupLockFn and a potential race condition on lease.

The logical race condition of (*Server).Stop getting called right before or during a call to acquireDiscoveryGroup, leaving resources that are never cleaned up, is also still there, I think?

Contributor Author

Can you please take another look?
I think I fixed both races, and go test -race ... seems to agree.

@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from 0725cc1 to e849c7b April 10, 2024 10:17
Comment thread api/types/server.go Outdated
Comment thread lib/srv/discovery/discovery.go Outdated
Comment on lines 1405 to 1321
Contributor

Suggested change

func (s *Server) Start() error {
	if err := s.acquireDiscoveryGroup(); err != nil {
		return trace.Wrap(err)
	}
	if s.ec2Watcher != nil {
		go s.handleEC2Discovery()
		go s.reconciler.run(s.ctx)

func (s *Server) Start() error {
	for {
		err := s.runWhileAcquiringDiscoveryGroup(s.start)
		switch {
		case errors.Is(...) /* lease was lost */:
			continue
		default:
			return trace.Wrap(err)
		}
	}
}

func (s *Server) start(ctx context.Context) error {
	if s.ec2Watcher != nil {
		go s.handleEC2Discovery()
		go s.reconciler.run(ctx)
		....

Contributor Author

I'm not sure the service is ready to be started multiple times.
runWhileAcquiringDiscoveryGroup would also need to wait for the service to clean up: watchers, fetchers, installers, and reconcilers would all need to stop before we could proceed.
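
A sketch of that extra coordination (hypothetical restructuring, assuming every worker exits once its context is cancelled):

// runOnce launches the workers and only returns after the context has ended
// AND every worker goroutine has returned, so a caller could safely loop,
// re-acquire the DiscoveryGroup lease, and run again.
func (s *Server) runOnce(ctx context.Context) error {
	var wg sync.WaitGroup
	if s.ec2Watcher != nil {
		wg.Add(2)
		go func() { defer wg.Done(); s.handleEC2Discovery() }()
		go func() { defer wg.Done(); s.reconciler.run(ctx) }()
	}
	<-ctx.Done() // lease lost or server shutting down
	wg.Wait()    // watchers, fetchers, installers, reconcilers fully stopped
	return ctx.Err()
}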

Comment thread lib/srv/discovery/discovery.go Outdated
Comment on lines 1328 to 1358
Contributor

Suggested change

func (s *Server) acquireDiscoveryGroup() error {
	if s.DiscoveryGroup == "" {
		return nil
	}
	retry, err := retryutils.NewRetryV2(retryutils.RetryV2Config{
		First:  0,
		Driver: retryutils.NewExponentialDriver(defaults.HighResPollingPeriod),
		Max:    defaults.LowResPollingPeriod,
		Jitter: retryutils.NewHalfJitter(),
		Clock:  s.clock,
	})
	if err != nil {
		return trace.Wrap(err)
	}
	var lease *services.SemaphoreLock
	for range retry.After() {
		retry.Inc()
		s.Log.Debugf("Discovery service is trying to acquire lock for DiscoveryGroup %q", s.DiscoveryGroup)
		lease, err = s.tryAcquireDiscoveryGroupLease()
		if err != nil {
			if !strings.Contains(err.Error(), teleport.MaxLeases) {
				return trace.Wrap(err)
			}
			s.Log.Debugf("Discovery service is waiting on DiscoveryGroup %q lock: %v", s.DiscoveryGroup, err)
			continue
		}
		break
	}
	go func() {
		for {
			select {
			case <-lease.Renewed():
				continue
			case <-lease.Done():
				s.Log.WithError(lease.Wait()).Warnf("DiscoveryGroup %q lock was lost, stopping discovery service", s.DiscoveryGroup)
				s.Stop()
				return
			}
		}
	}()
	return nil
}

// acquireDiscoveryGroup tries to acquire a lock if the Service has a DiscoveryGroup
// It will retry using an exponential backoff algorithm.
func (s *Server) runWhileAcquiringDiscoveryGroup(run func(context.Context) error) error {
	if s.DiscoveryGroup == "" {
		s.Log.Warnf("DiscoveryGroup is not set, skipping semaphore lock. It is recommended to set a DiscoveryGroup")
		return trace.Wrap(run(s.ctx))
	}
	s.Log.Debugf("Discovery service is trying to acquire lock for DiscoveryGroup %q", s.DiscoveryGroup)
	lease, err := services.AcquireSemaphoreLock(s.ctx, services.SemaphoreLockConfig{
		Service: s.Config.AccessPoint,
		Expiry:  time.Minute,
		Params: types.AcquireSemaphoreRequest{
			SemaphoreKind: types.SemaphoreKindDiscoveryServiceGroup,
			SemaphoreName: s.DiscoveryGroup,
			MaxLeases:     1,
			Holder:        s.ServerID,
		},
		Clock: s.clock,
	})
	switch {
	case err == nil:
		s.Log.Debugf("Discovery service acquired lock for DiscoveryGroup %q", s.DiscoveryGroup)
		defer lease.Wait()
		defer lease.Stop()
		ctx, cancel := context.WithCancel(s.ctx)
		defer cancel()
		go func() {
			select {
			case <-lease.Done():
				cancel()
			case <-ctx.Done():
				return
			}
		}()
		err := run(ctx)
		switch {
		case errors.Is(err, context.Canceled):
			select {
			case <-s.ctx.Done():
				return trace.Wrap(err)
			default:
				return nil
			}
		}
	case trace.IsAlreadyExists(err) ....:
		s.Log.Debugf("Discovery service is waiting on DiscoveryGroup %q lock", s.DiscoveryGroup)
		return trace.Wrap(err)
	}
	return nil
}

@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch 2 times, most recently from 323f93e to e69584f April 10, 2024 12:06
@marcoandredinis requested a review from tigrato April 10, 2024 13:17
@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from b5c7d4b to a4016d2 April 12, 2024 10:10
Comment thread lib/srv/discovery/discovery.go Outdated
Contributor

I don't think you can call s.Stop here.
There are several reasons for it:

  • Once you lose a lock you can regain it after a while. This happens during auth server restarts, where long-lived connections are dropped and the agent loses the lease even though the lock is still in its "name". Once auth is back alive, the agent can resume the connection (sketched below).
  • This logic is only valid if you have multiple discovery agents running. If you have just one, the discovery service will stop after the first auth disconnection and won't reconnect.
  • Stopping the service will cause issues for the /healthz report: with the critical discovery service no longer running, Teleport will report unhealthy. That causes Kubernetes to restart pods and load balancers to stop sending traffic.
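
A rough sketch of that alternative (illustrative only; the lock logic was ultimately dropped from this PR): keep the service alive and re-acquire the lease when it drops.

// runWithGroupLease keeps trying to hold the DiscoveryGroup lease: if the
// lease is lost (e.g. auth restarts and the connection drops), it backs off
// and re-acquires instead of stopping the whole discovery service.
func (s *Server) runWithGroupLease(ctx context.Context, run func(context.Context) error) error {
	for {
		lease, err := s.tryAcquireDiscoveryGroupLease()
		if err != nil {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(10 * time.Second): // back off, then retry
				continue
			}
		}
		leaseCtx, cancel := context.WithCancel(ctx)
		go func() { <-lease.Done(); cancel() }() // lease loss only ends this run
		err = run(leaseCtx)
		cancel()
		if ctx.Err() != nil {
			return err // the server itself is stopping
		}
		// Otherwise the lease was lost: loop and try to regain it.
	}
}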

Contributor Author

Yeah, you are correct.
I'll remove the semaphore lock logic and rely entirely on server name to ensure we don't duplicate the nodes.

I still think we should invest in preventing duplicate API calls, but I'll leave that to a future PR.

@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from a4016d2 to 5915468 April 15, 2024 10:20
@marcoandredinis
Contributor Author

@espadolini Can you please review again? I've removed the semaphore lock logic entirely.
We'll rely on the Node's name to prevent duplicate nodes.

Contributor

@tigrato left a comment

Can you create an issue to track the service lock?
It will be important for reporting the discovery group status, so it's worth doing.

@public-teleport-github-review-bot (bot) removed the request for review from Joerger April 15, 2024 10:29
@marcoandredinis
Contributor Author

Follow up on ensuring a single DiscoveryService is running per DiscoveryGroup #40546

Contributor

@espadolini left a comment

How worried are we that brand new instances will result in an upsert from both auth servers? Is the discovery loop jittered?

Comment thread lib/srv/discovery/discovery.go Outdated
Comment on lines 867 to 868
Contributor

Merge this with the subsequent conditional and do if existingNode != nil && existingNode.Expiry().After(...) && ...

Contributor Author

I actually prefer the current format, but that's fine 👍

@marcoandredinis force-pushed the marco/discovery_service_exclusive_group branch from c8af625 to 8f24391 April 15, 2024 13:08
@marcoandredinis
Contributor Author

How worried are we that brand new instances will result in an upsert from both auth servers?

It will happen most of the time on the first sync of the nodes.
After they are inserted into the cluster, their expiration is jittered, so we'll probably end up with a single node upsert from one of the servers.

Is the discovery loop jittered?

No, it's not.
It starts as soon as the service starts, so that should give us a little window during which the 2nd server already has the EC2 instances as EICE nodes.
However, after a DiscoveryConfig for that DiscoveryGroup changes, the services re-sync their Fetchers and call the APIs at the same™️ time.
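
For reference, the kind of expiration jitter being described, sketched with the standard library rather than Teleport's own helpers:

import (
	"math/rand"
	"time"
)

// jitteredExpiry returns now+ttl pulled back by up to 10%, so two services
// that upserted the same node at the same moment won't also refresh it at
// the same moment when the TTL comes around again.
func jitteredExpiry(now time.Time, ttl time.Duration) time.Time {
	jitter := time.Duration(rand.Int63n(int64(ttl / 10)))
	return now.Add(ttl - jitter)
}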

@marcoandredinis added this pull request to the merge queue Apr 15, 2024
Merged via the queue into master with commit 851b0d8 Apr 15, 2024
@marcoandredinis deleted the marco/discovery_service_exclusive_group branch April 15, 2024 13:42
@public-teleport-github-review-bot

@marcoandredinis See the table below for backport results.

Branch Result
branch/v14 Failed
branch/v15 Create PR


Labels

discovery, no-changelog, size/md

3 participants