DiscoveryService: prevent EICE Node duplication #40220
Conversation
Force-pushed from f211007 to 9431698
Force-pushed from b56aa2e to 6cd8ccf
Force-pushed from 6cd8ccf to d5e98ea
```go
	}
}()

s.releaseGroupLockFn = func() {
```
Do we need to synchronize the release function, from a memory model perspective and from a logical one? What if we end up stopping the *Server before we managed to update the releaseGroupLockFn?
I moved the releaseGroupLockFn creation much earlier.
It is now the first thing Start does.
This should make it very unlikely that Stop is called before releaseGroupLockFn is set.
"Very unlikely" doesn't mean anything; we should set a flag in Stop (while holding the lock) that means "server has been stopped" and we need to check the flag (while holding the lock) as we set releaseGroupLockFn, and we should refuse to start if the flag is set (and thus nothing can ever stop the server again).
Can you please check again?
I've simplified the flow and I'm no longer using the releaseGroupLockFn.
AcquireSemaphoreLock receives the Server's context and should end as soon as we cancel that context.
Force-pushed from d5e98ea to 0725cc1
```go
s.releaseGroupLockFn = func() {
	if lease != nil {
```
From what I can tell there's a potential race condition on s.releaseGroupLockFn and another on lease.
The logical race of (*Server).Stop being called right before or during acquireDiscoveryGroup, leaving resources that never get cleaned up, also still seems to be there.
Can you please take another look?
I think I fixed both races, and go test -race ... seems to agree.
Force-pushed from 0725cc1 to e849c7b
```go
// Current:
func (s *Server) Start() error {
	if err := s.acquireDiscoveryGroup(); err != nil {
		return trace.Wrap(err)
	}
	if s.ec2Watcher != nil {
		go s.handleEC2Discovery()
		go s.reconciler.run(s.ctx)

// Suggested:
func (s *Server) Start() error {
	for {
		err := s.runWhileAcquiringDiscoveryGroup(s.start)
		switch {
		case errors.Is(...) /* lease was lost */:
			continue
		default:
			return trace.Wrap(err)
		}
	}
}

func (s *Server) start(ctx context.Context) error {
	if s.ec2Watcher != nil {
		go s.handleEC2Discovery()
		go s.reconciler.run(ctx)
	....
```
I'm not sure the service is ready to be started multiple times.
The runWhileAcquiringDiscoveryGroup would also need to wait for the service to clean up, e.g. watchers, fetchers, installers, and reconcilers would all need to stop before we can proceed.
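To make that cleanup requirement concrete, here is a minimal sketch under the assumption of hypothetical startWorker/waitForLease stand-ins; the real watchers, fetchers, installers, and reconcilers would take their place:

```go
package discovery

import (
	"context"
	"sync"
)

// runOnce is a hypothetical helper illustrating the cleanup requirement:
// the caller may only re-acquire the group lock and start again once every
// background worker from the previous run has returned.
func runOnce(ctx context.Context, startWorker func(context.Context), waitForLease func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		startWorker(ctx) // stand-in for the real workers; must return when ctx is canceled
	}()

	err := waitForLease(ctx) // stand-in: returns when the lease is lost or ctx is canceled
	cancel()                 // signal all workers to stop
	wg.Wait()                // block until they have actually stopped
	return err
}
```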
```go
// Current:
func (s *Server) acquireDiscoveryGroup() error {
	if s.DiscoveryGroup == "" {
		return nil
	}
	retry, err := retryutils.NewRetryV2(retryutils.RetryV2Config{
		First:  0,
		Driver: retryutils.NewExponentialDriver(defaults.HighResPollingPeriod),
		Max:    defaults.LowResPollingPeriod,
		Jitter: retryutils.NewHalfJitter(),
		Clock:  s.clock,
	})
	if err != nil {
		return trace.Wrap(err)
	}
	var lease *services.SemaphoreLock
	for range retry.After() {
		retry.Inc()
		s.Log.Debugf("Discovery service is trying to acquire lock for DiscoveryGroup %q", s.DiscoveryGroup)
		lease, err = s.tryAcquireDiscoveryGroupLease()
		if err != nil {
			if !strings.Contains(err.Error(), teleport.MaxLeases) {
				return trace.Wrap(err)
			}
			s.Log.Debugf("Discovery service is waiting on DiscoveryGroup %q lock: %v", s.DiscoveryGroup, err)
			continue
		}
		break
	}
	go func() {
		for {
			select {
			case <-lease.Renewed():
				continue
			case <-lease.Done():
				s.Log.WithError(lease.Wait()).Warnf("DiscoveryGroup %q lock was lost, stopping discovery service", s.DiscoveryGroup)
				s.Stop()
				return
			}
		}
	}()
	return nil
}

// Suggested:

// runWhileAcquiringDiscoveryGroup tries to acquire a lock if the Service has a DiscoveryGroup.
// It will retry using an exponential backoff algorithm.
func (s *Server) runWhileAcquiringDiscoveryGroup(run func(context.Context) error) error {
	if s.DiscoveryGroup == "" {
		s.Log.Warnf("DiscoveryGroup is not set, skipping semaphore lock. It is recommended to set a DiscoveryGroup")
		return trace.Wrap(run(s.ctx))
	}
	s.Log.Debugf("Discovery service is trying to acquire lock for DiscoveryGroup %q", s.DiscoveryGroup)
	lease, err := services.AcquireSemaphoreLock(s.ctx, services.SemaphoreLockConfig{
		Service: s.Config.AccessPoint,
		Expiry:  time.Minute,
		Params: types.AcquireSemaphoreRequest{
			SemaphoreKind: types.SemaphoreKindDiscoveryServiceGroup,
			SemaphoreName: s.DiscoveryGroup,
			MaxLeases:     1,
			Holder:        s.ServerID,
		},
		Clock: s.clock,
	})
	switch {
	case err == nil:
		s.Log.Debugf("Discovery service acquired lock for DiscoveryGroup %q", s.DiscoveryGroup)
		defer lease.Wait()
		defer lease.Stop()
		ctx, cancel := context.WithCancel(s.ctx)
		defer cancel()
		go func() {
			select {
			case <-lease.Done():
				cancel()
			case <-ctx.Done():
				return
			}
		}()
		err := run(ctx)
		switch {
		case errors.Is(err, context.Canceled):
			select {
			case <-s.ctx.Done():
				return trace.Wrap(err)
			default:
				return nil
			}
		}
	case trace.IsAlreadyExists(err) ....:
		s.Log.Debugf("Discovery service is waiting on DiscoveryGroup %q lock", s.DiscoveryGroup)
		return trace.Wrap(err)
	}
	return nil
}
```
Force-pushed from 323f93e to e69584f
Force-pushed from b5c7d4b to a4016d2
I don't think you can call s.Stop here.
There are several reasons for that:
- Once you lose a lock you can regain it after a while. This happens during auth server restarts: auth is restarted, long-lived connections are dropped, and the agent loses the lease even though the lock is still in its "name". Once auth is back up, the agent can resume the connection.
- This process is only valid if you have multiple discovery agents running. If you have just one, the discovery service will stop after the first auth disconnection and won't reconnect.
- Stopping the service will cause issues for the /healthz report: since the critical discovery service is no longer running, Teleport will report unhealthy, which causes Kubernetes to restart pods and load balancers to stop sending traffic.
Yeah, you are correct.
I'll remove the semaphore lock logic and rely entirely on the server name to ensure we don't duplicate the nodes.
I still think we should invest in preventing duplicate API calls, but I'll leave that to a future PR.
Force-pushed from a4016d2 to 5915468
@espadolini Can you please review again? I've removed the semaphore lock logic entirely.
tigrato left a comment:
Can you create an issue to track the service lock? It will be important to report the discovery group status, so it's worth doing.
Follow up on ensuring a single DiscoveryService is running per DiscoveryGroup: #40546
espadolini left a comment:
How worried are we that brand new instances will result in an upsert from both auth servers? Is the discovery loop jittered?
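For context, jittering the loop would mean randomizing each sync interval so that multiple instances don't hit the API at the same instant. Below is a standalone sketch of a half-jitter (mirroring the retryutils.NewHalfJitter helper that appears earlier in this review), not the service's actual loop:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// halfJitter returns a duration in [d/2, d), the "half jitter" strategy.
func halfJitter(d time.Duration) time.Duration {
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(half)))
}

func main() {
	pollInterval := 5 * time.Minute // illustrative interval, not the service's real one
	fmt.Println("next discovery sync in", halfJitter(pollInterval))
}
```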
Merge this with the subsequent conditional and do `if existingNode != nil && existingNode.Expiry().After(...) && ...`
I actually prefer the current format, but that's fine 👍
This PR ensures only one DiscoveryService is running per DiscoveryGroup. It does so using a SemaphoreLock. This should prevent any duplicate API calls and also prevent duplicate resources on the Teleport side.
Force-pushed from c8af625 to 8f24391
It will happen most of the time the first time we sync nodes.
No, it's not.
@marcoandredinis See the table below for backport results.
This PR does a couple of things to prevent duplicated nodes.
Run only one DiscoveryService per DiscoveryGroup
This should prevent any duplicate API calls and also prevent duplicate resources on the Teleport side.
Deterministic name for EICE Nodes
Instead of using a random UUID, we now build the Node's name from the AWS account ID and the instance ID of the EC2 instance.
This reduces load because we can now use the NodeWatcher's internal map for quicker access (instead of always listing and filtering all the Nodes).
Because the name is deterministic, it also ensures we don't get duplicate nodes.
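A hedged sketch of the deterministic-name idea; the exact format the discovery service uses may differ:

```go
package main

import "fmt"

// eiceNodeName builds a stable name for an EICE node from the AWS account ID
// and EC2 instance ID. Because the same instance always maps to the same name,
// repeated syncs upsert the existing resource instead of creating a duplicate.
// (Illustrative only; the real naming scheme in the service may differ.)
func eiceNodeName(accountID, instanceID string) string {
	return fmt.Sprintf("%s-%s", accountID, instanceID)
}

func main() {
	fmt.Println(eiceNodeName("123456789012", "i-0123456789abcdef0"))
}
```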