DiscoveryService: prevent EICE Node duplication #40220
Conversation
Force-pushed from f211007 to 9431698
Force-pushed from b56aa2e to 6cd8ccf
Force-pushed from 6cd8ccf to d5e98ea
```go
	}
}()

s.releaseGroupLockFn = func() {
```
Do we need to synchronize the release function, from a memory model perspective and from a logical one? What if we end up stopping the *Server before we managed to update the releaseGroupLockFn?
I moved the releaseGroupLockFn creation much earlier.
It is now the first thing Start does.
This should make it very unlikely that Stop is called before releaseGroupLockFn is set.
"Very unlikely" doesn't mean anything; we should set a flag in Stop (while holding the lock) that means "server has been stopped" and we need to check the flag (while holding the lock) as we set releaseGroupLockFn, and we should refuse to start if the flag is set (and thus nothing can ever stop the server again).
Can you please check again?
I've simplified the flow and I'm no longer using the releaseGroupLockFn.
AcquireSemaphoreLock receives the Server's context and should end as soon as we cancel that context.
Force-pushed from d5e98ea to 0725cc1
```go
s.releaseGroupLockFn = func() {
	if lease != nil {
```
From what I can tell there's a potential race condition on s.releaseGroupLockFn and another on lease.
The logical race of (*Server).Stop being called right before or during acquireDiscoveryGroup, leaving resources that never get cleaned up, also still seems to be there.
Can you please take another look?
I think I fixed both races, and go test -race ... seems to agree.
Force-pushed from 0725cc1 to e849c7b
```go
// Current:
func (s *Server) Start() error {
	if err := s.acquireDiscoveryGroup(); err != nil {
		return trace.Wrap(err)
	}
	if s.ec2Watcher != nil {
		go s.handleEC2Discovery()
		go s.reconciler.run(s.ctx)

// Suggested:
func (s *Server) Start() error {
	for {
		err := s.runWhileAcquiringDiscoveryGroup(s.start)
		switch {
		case errors.Is(...) /* lease was lost */:
			continue
		default:
			return trace.Wrap(err)
		}
	}
}

func (s *Server) start(ctx context.Context) error {
	if s.ec2Watcher != nil {
		go s.handleEC2Discovery()
		go s.reconciler.run(ctx)
	....
```
I'm not sure the service is ready to be started multiple times.
The runWhileAcquiringDiscoveryGroup would also need to wait for the service to clean up, e.g. watchers, fetchers, installers, and reconcilers would all need to stop before we can proceed.
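To make that cleanup requirement concrete, here is a minimal sketch under the assumption of hypothetical startWorker/waitForLease stand-ins; the real watchers, fetchers, installers, and reconcilers would take their place:

```go
package discovery

import (
	"context"
	"sync"
)

// runOnce is a hypothetical helper illustrating the cleanup requirement:
// the caller may only re-acquire the group lock and start again once every
// background worker from the previous run has returned.
func runOnce(ctx context.Context, startWorker func(context.Context), waitForLease func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		startWorker(ctx) // stand-in for the real workers; must return when ctx is canceled
	}()

	err := waitForLease(ctx) // stand-in: returns when the lease is lost or ctx is canceled
	cancel()                 // signal all workers to stop
	wg.Wait()                // block until they have actually stopped
	return err
}
```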
```go
// Current:
func (s *Server) acquireDiscoveryGroup() error {
	if s.DiscoveryGroup == "" {
		return nil
	}
	retry, err := retryutils.NewRetryV2(retryutils.RetryV2Config{
		First:  0,
		Driver: retryutils.NewExponentialDriver(defaults.HighResPollingPeriod),
		Max:    defaults.LowResPollingPeriod,
		Jitter: retryutils.NewHalfJitter(),
		Clock:  s.clock,
	})
	if err != nil {
		return trace.Wrap(err)
	}
	var lease *services.SemaphoreLock
	for range retry.After() {
		retry.Inc()
		s.Log.Debugf("Discovery service is trying to acquire lock for DiscoveryGroup %q", s.DiscoveryGroup)
		lease, err = s.tryAcquireDiscoveryGroupLease()
		if err != nil {
			if !strings.Contains(err.Error(), teleport.MaxLeases) {
				return trace.Wrap(err)
			}
			s.Log.Debugf("Discovery service is waiting on DiscoveryGroup %q lock: %v", s.DiscoveryGroup, err)
			continue
		}
		break
	}
	go func() {
		for {
			select {
			case <-lease.Renewed():
				continue
			case <-lease.Done():
				s.Log.WithError(lease.Wait()).Warnf("DiscoveryGroup %q lock was lost, stopping discovery service", s.DiscoveryGroup)
				s.Stop()
				return
			}
		}
	}()
	return nil
}

// Suggested:

// runWhileAcquiringDiscoveryGroup tries to acquire a lock if the Service has a DiscoveryGroup.
// It will retry using an exponential backoff algorithm.
func (s *Server) runWhileAcquiringDiscoveryGroup(run func(context.Context) error) error {
	if s.DiscoveryGroup == "" {
		s.Log.Warnf("DiscoveryGroup is not set, skipping semaphore lock. It is recommended to set a DiscoveryGroup")
		return trace.Wrap(run(s.ctx))
	}
	s.Log.Debugf("Discovery service is trying to acquire lock for DiscoveryGroup %q", s.DiscoveryGroup)
	lease, err := services.AcquireSemaphoreLock(s.ctx, services.SemaphoreLockConfig{
		Service: s.Config.AccessPoint,
		Expiry:  time.Minute,
		Params: types.AcquireSemaphoreRequest{
			SemaphoreKind: types.SemaphoreKindDiscoveryServiceGroup,
			SemaphoreName: s.DiscoveryGroup,
			MaxLeases:     1,
			Holder:        s.ServerID,
		},
		Clock: s.clock,
	})
	switch {
	case err == nil:
		s.Log.Debugf("Discovery service acquired lock for DiscoveryGroup %q", s.DiscoveryGroup)
		defer lease.Wait()
		defer lease.Stop()
		ctx, cancel := context.WithCancel(s.ctx)
		defer cancel()
		go func() {
			select {
			case <-lease.Done():
				cancel()
			case <-ctx.Done():
				return
			}
		}()
		err := run(ctx)
		switch {
		case errors.Is(err, context.Canceled):
			select {
			case <-s.ctx.Done():
				return trace.Wrap(err)
			default:
				return nil
			}
		}
	case trace.IsAlreadyExists(err) ....:
		s.Log.Debugf("Discovery service is waiting on DiscoveryGroup %q lock", s.DiscoveryGroup)
		return trace.Wrap(err)
	}
	return nil
}
```
Force-pushed from 323f93e to e69584f
Force-pushed from b5c7d4b to a4016d2
I don't think you can call s.Stop here.
There are several reasons for that:
- Once you lose a lock you can regain it after a while. This happens during auth server restarts: auth is restarted, long-lived connections are dropped, and the agent loses the lease even though the lock is still in its "name". Once auth is back up, the agent can resume the connection.
- This process is only valid if you have multiple discovery agents running. If you have just one, the discovery service will stop after the first auth disconnection and won't reconnect.
- Stopping the service will cause issues for the /healthz report: since the critical discovery service is no longer running, Teleport will report unhealthy, which causes Kubernetes to restart pods and load balancers to stop sending traffic.
Yeah, you are correct.
I'll remove the semaphore lock logic and rely entirely on the server name to ensure we don't duplicate the nodes.
I still think we should invest in preventing duplicate API calls, but I'll leave that to a future PR.
Force-pushed from a4016d2 to 5915468
@espadolini Can you please review again? I've removed the semaphore lock logic entirely.
tigrato left a comment:
Can you create an issue to track the service lock? It will be important to report the discovery group status, so it's worth doing.
Follow up on ensuring a single DiscoveryService is running per DiscoveryGroup: #40546
espadolini left a comment:
How worried are we that brand new instances will result in an upsert from both auth servers? Is the discovery loop jittered?
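For context, jittering the loop would mean randomizing each sync interval so that multiple instances don't hit the API at the same instant. Below is a standalone sketch of a half-jitter (mirroring the retryutils.NewHalfJitter helper that appears earlier in this review), not the service's actual loop:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// halfJitter returns a duration in [d/2, d), the "half jitter" strategy.
func halfJitter(d time.Duration) time.Duration {
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(half)))
}

func main() {
	pollInterval := 5 * time.Minute // illustrative interval, not the service's real one
	fmt.Println("next discovery sync in", halfJitter(pollInterval))
}
```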
Merge this with the subsequent conditional and do `if existingNode != nil && existingNode.Expiry().After(...) && ...`
I actually prefer the current format, but that's fine 👍
This PR ensures only one DiscoveryService is running per DiscoveryGroup. It does so using a SemaphoreLock. This should prevent any duplicate API calls and also prevent duplicate resources on the Teleport side.
Force-pushed from c8af625 to 8f24391
It will happen most of the time the first time we sync nodes.
No, it's not.
@marcoandredinis See the table below for backport results.
This PR does a couple of things to prevent duplicated nodes.
Run only one DiscoveryService per DiscoveryGroup
This should prevent any duplicate API calls and also prevent duplicate resources on the Teleport side.
Deterministic name for EICE Nodes
Instead of using a random UUID, we now build the Node's name from the AWS account ID and the instance ID of the EC2 instance.
This reduces load because we can now use the NodeWatcher's internal map for quicker access (instead of always listing and filtering all the Nodes).
Because the name is deterministic, it also ensures we don't get duplicate nodes.
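A hedged sketch of the deterministic-name idea; the exact format the discovery service uses may differ:

```go
package main

import "fmt"

// eiceNodeName builds a stable name for an EICE node from the AWS account ID
// and EC2 instance ID. Because the same instance always maps to the same name,
// repeated syncs upsert the existing resource instead of creating a duplicate.
// (Illustrative only; the real naming scheme in the service may differ.)
func eiceNodeName(accountID, instanceID string) string {
	return fmt.Sprintf("%s-%s", accountID, instanceID)
}

func main() {
	fmt.Println(eiceNodeName("123456789012", "i-0123456789abcdef0"))
}
```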