Expect the auth backend/cache to be initialized before turning ready #59907
```go
// @@ -2606,6 +2606,72 @@ func (process *TeleportProcess) initAuthService() error {
		}
	}

	// We mark the process state as starting until the auth backend is ready.
	// If the cache is enabled, this waits for the cache to be populated.
	// We don't want auth to report readiness until its cache is populated,
	// else a rollout might progress too quickly and cause backend pressure
	// and outages.
	{
		component := teleport.Component(teleport.ComponentAuth, "backend")
		process.ExpectService(component)
		process.RegisterFunc("auth.wait-for-event-stream", func() error {
			start := process.Clock.Now()

			retry, err := retryutils.NewRetryV2(retryutils.RetryV2Config{
				First:  0,
				Driver: retryutils.NewLinearDriver(5 * time.Second),
				Max:    time.Minute,
				Jitter: retryutils.DefaultJitter,
				Clock:  process.Clock,
			})
			if err != nil {
				return trace.Wrap(err, "creating the backend watch retry (this is a bug)")
			}
			log := process.logger.With(teleport.ComponentLabel, component)
			var attempt int

			// The wait logic is retried until it succeeds or a termination is triggered.
			for {
				attempt++
				select {
				case <-process.GracefulExitContext().Done():
					return trace.Errorf("context canceled while backing off (attempt %d)", attempt)
				case <-retry.After():
					retry.Inc()
				}

				w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
					Name: "auth.wait-for-backend",
					Kinds: []types.WatchKind{
						{Kind: types.KindClusterName},
					},
				})
				if err != nil {
					log.ErrorContext(process.ExitContext(), "Failed to create watcher", "kind", types.KindClusterName, "attempt", attempt, "error", err)
					continue
				}

				select {
				case evt := <-w.Events():
					w.Close()
					if evt.Type != types.OpInit {
						log.ErrorContext(process.ExitContext(), "expected init event, got something else (this is a bug)", "kind", types.KindClusterName, "attempt", attempt, "event_type", evt.Type)
						continue
					}
					log.InfoContext(process.ExitContext(), "auth backend initialized", "duration", process.Clock.Since(start).String())
					process.OnHeartbeat(component)(nil)
					return nil
				case <-w.Done():
					log.ErrorContext(process.ExitContext(), "watcher closed while waiting for backend init", "kind", types.KindClusterName, "attempt", attempt, "error", w.Error())
					continue
				case <-process.GracefulExitContext().Done():
					w.Close()
					return trace.Errorf("context canceled while waiting for backend init event")
				}
			}
		})
	}

	// Register TLS endpoint of the auth service
	tlsConfig, err := connector.ServerTLSConfig(cfg.CipherSuites)
	if err != nil {
```
I wonder if we can add a fallback safety timeout that forces the service to start even if the cache is not fully healthy, e.g. 5 or 10 minutes.

Imagine a scenario where a collection that is not critical to Teleport's core functionality is broken, e.g. one of the `db_server` elements exceeds the max gRPC message size. If such a bad resource is created, auth keeps running but the db access flow is degraded. If the auth pods are then restarted during troubleshooting, the whole flow breaks and users lose access to the Teleport cluster entirely.
Auth cannot break because of the max gRPC message size: it doesn't send gRPC messages when building its cache. The gRPC message size is checked on the receiver side (proxy, node, ...), and auth doesn't establish a gRPC client against itself.

I would argue the opposite: if an auth instance cannot build a valid cache, we should never let the rollout proceed and send clients to it.

This happened three months ago with a very large financial customer: they ran with a broken cache for a week and complained about constantly being disconnected from the UI and being unable to access any resource. A Teleport auth server that cannot establish a watcher is completely dysfunctional and should not be allowed to serve requests. Keeping the pod unready would have made the issue visible.