Expect the auth backend/cache to be initialized before turning ready#59907
Conversation
Force-pushed from 45355c9 to 1483fd0
lib/service/service.go (outdated):

```go
process.RegisterFunc("auth.wait-for-backend", func() error {
	start := process.Clock.Now()

	w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
```
I think we must have a retry loop around this, or we might end up with a process that's forever unready even if things are resolved later on.
lib/service/service.go (outdated):

```go
w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
	Name: "auth.wait-for-backend",
	Kinds: []types.WatchKind{
		{Kind: types.KindClusterName},
```
Services like Relay don't have permission to get types.KindClusterName or to create a watcher for that kind:
https://github.com/gravitational/teleport/blob/master/lib/cache/cache.go#L283-L290
https://github.com/gravitational/teleport/blob/master/lib/authz/permissions.go#L1138
As a result, process startup for that role's service will be broken. We need to make sure the kind used for init is always supported for the service's Teleport role, or have a more flexible way to pick the WatchKind used to receive the Init event.
This change does not apply to the Relay service. This is in initAuthService, which is only called to start an auth service, so we have auth credentials and the ability to read types.KindClusterName.
rosstimothy
left a comment
Does this PR need to be rebased, or was it incorrectly rebased after #59667 merged?
Do we need to use a shorter data directory to ensure that this test passes on darwin? We've got a number of tests that either turn off the debug service or craft a custom temp directory to work around golang/go#62614.
I thought about this, but this test runs on my laptop without tuning the data dir, so 🤷
Force-pushed from 1f0e14f to e06a24f
```go
	return nil
case <-w.Done():
	log.ErrorContext(process.ExitContext(), "watcher closed while waiting for backend init", "kind", types.KindClusterName, "attempt", attempt, "error", w.Error())
	continue
```
Should we OnHeartbeat(component)(w.Error()) or is it better to not actively unready the process in this scenario?
I'm worried that w.Error() could return nil for some reason and we'd accidentally turn ready.
syncRotationStateCycle uses trace.ConnectionProblem(watcher.Error(), "watcher has disconnected") (probably because the watcher from storage always returns nil from Error 😳), which is guaranteed to be non-nil, so that would be an option. I don't think it would be a good idea though: from what I'm seeing in the state machine logic, we would never recover without at least two TeleportOKEvents spaced at least 10 seconds apart, so anything that only ever sends one TeleportOKEvent must not send anything but TeleportOKEvent (or exactly one TeleportDegradedEvent and then die).
```go
}

select {
case evt := <-w.Events():
```
I wonder if we can add a fallback safety timeout that forces the service to start even if the cache is not fully healthy, like 5m or 10m.
Imagine a scenario where a collection that is not important for Teleport functionality is broken, like one of the db_server elements exceeding the max gRPC message size.
So if a bad resource is created, auth is still running but the db access flow is degraded. If the auth pods are then restarted during troubleshooting, the whole flow breaks and users lose access to the Teleport cluster entirely.
> Imagine a scenario where a collection that is not important for Teleport functionality is broken, like one of the db_server elements exceeding the max gRPC message size.
Auth cannot break because of the max gRPC message size: it doesn't send gRPC messages when creating its cache. gRPC message size is checked on the receiver side (proxy, node, ...), and auth doesn't establish a gRPC client against itself.
I would argue the opposite: if we have an auth that cannot build a valid cache, we should never let the rollout proceed and send clients to this auth.
This happened 3 months ago to a very large financial customer: they ran with a broken cache for a week and complained about being constantly disconnected from the UI and being unable to access any resource. A Teleport auth unable to establish watchers is a completely dysfunctional Teleport instance and should not be allowed to serve requests. Keeping the pod unready would have made the issue visible.
smallinsky
left a comment
The flow LGTM, but I wonder if we should add a hard safety fallback #59907 (comment) to start auth even with a broken cache.
Recently we had a few issues where the cache was degraded due to a collection reaching the max gRPC message size.
With this change, if the cache is degraded the customer will lose access to the whole cluster permanently when the auth pod is restarted, so I think we should hardcode some fallback timeout behavior, like 5m or 10m, to start auth even with a broken cache.
Force-pushed from f22c817 to c742310
@hugoShaka See the table below for backport results.
…ready (#61620) * Add a way to announce which services should be expected (#59667) I was looking into tying the auth readiness to its cache health (so we don't end up in a state with no ready cache across all auths during a rollout) and I saw that we currently report ready as soon as one of the services heartbeats. We don't keep track of which services should be heartbeating/reporting ready. This PR introduces a new `process.ExpectService(component)` function to declare early that we are starting a service and that the process should not be ready without it. * disable expect service in backport * Expect the auth backend/cache to be initialized before turning ready (#59907)
This PR makes auth unready until it has functional watchers. With cache enabled, this means everything has been loaded into the cache.
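The expected-services mechanism from #59667 that this builds on can be sketched as a small tracker: the process only turns ready once every declared component has heartbeated. The type and method names below are illustrative (only `ExpectService` appears in the PR description), not Teleport's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// readiness tracks which components were declared via ExpectService and
// which have since heartbeated; the process reports ready only once
// every expected service has checked in.
type readiness struct {
	mu       sync.Mutex
	services map[string]bool // component -> has heartbeated
}

func newReadiness() *readiness {
	return &readiness{services: make(map[string]bool)}
}

// ExpectService declares early that a component is starting, so the
// process cannot turn ready without it.
func (r *readiness) ExpectService(component string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.services[component]; !ok {
		r.services[component] = false
	}
}

// OnHeartbeat records that a component reported healthy.
func (r *readiness) OnHeartbeat(component string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.services[component] = true
}

// Ready returns true only when all expected components have heartbeated.
func (r *readiness) Ready() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, heartbeated := range r.services {
		if !heartbeated {
			return false
		}
	}
	return true
}

func main() {
	r := newReadiness()
	r.ExpectService("auth")
	r.ExpectService("proxy")
	r.OnHeartbeat("auth")
	fmt.Println(r.Ready()) // still waiting on proxy
	r.OnHeartbeat("proxy")
	fmt.Println(r.Ready())
}
```

This captures the behavioral change described in the commit message: previously the process reported ready as soon as any one service heartbeated, whereas with declared expectations a single missing service keeps the whole process unready.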
Why this change is needed:
Depends on: #59667
Changelog: Auth readiness tuned to wait for cache initialization.