
Expect the auth backend/cache to be initialized before turning ready#59907

Merged
hugoShaka merged 1 commit into master from hugo/ready-wait-for-backend-init
Oct 8, 2025

Conversation

Contributor

@hugoShaka hugoShaka commented Oct 3, 2025

This PR makes auth report unready until it has functional watchers. With the cache enabled, this means that everything is loaded into the cache.

Why this change is needed:

  • cache init might be slow if we have millions of entries or backend problems; starting to operate without a cache sends all queries to the backend, causing pressure, slower cache population, and potentially more backend issues
    • side note: in a separate PR I am delaying the start of read-intensive routines until the cache is initialized
  • many features rely on the ability to establish watchers, including access plugins, dynamic app/db/kube access, and most of the okta/entra sync logic
  • currently, Teleport can start and become ready without a cache or functioning watchers. In the most extreme cases, a rollout can proceed and break all watchers, leaving the cluster in a broken state

Depends on: #59667
Changelog: Auth readiness tuned to wait for cache initialization.

process.RegisterFunc("auth.wait-for-backend", func() error {
    start := process.Clock.Now()

    w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
Contributor


I think we must have a retry loop around this, or we might end up with a process that's forever unready even if things are resolved later on.

Contributor Author


done

w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
    Name: "auth.wait-for-backend",
    Kinds: []types.WatchKind{
        {Kind: types.KindClusterName},
Contributor


Services like Relay don't have permission to get types.KindClusterName or to create a watcher for this type: https://github.com/gravitational/teleport/blob/master/lib/cache/cache.go#L283-L290
https://github.com/gravitational/teleport/blob/master/lib/authz/permissions.go#L1138

As a result, process startup for services running under such a role will be broken.

We need to make sure the Kind used for init is always supported by the service's Teleport role, or have a more flexible way to pick the WatchKind used to receive the Init event.

Contributor Author

@hugoShaka hugoShaka Oct 6, 2025


This change does not apply to the relay service. This is in initAuthService which is only called to start an auth service, so we have auth credentials and the ability to read types.KindClusterName.

Base automatically changed from hugo/expect-services-before-ready to master October 6, 2025 18:13
Contributor

@rosstimothy rosstimothy left a comment


Does this PR need to be rebased, or was it incorrectly rebased after #59667 merged?

Contributor


Do we need to use a shorter data directory to ensure that this test passes on darwin? We've got a number of tests that either turn off the debug service or craft a custom temp directory to work around golang/go#62614.

Contributor Author


I thought about this but this test runs on my laptop without tuning the data dir, so 🤷

@hugoShaka hugoShaka force-pushed the hugo/ready-wait-for-backend-init branch from 1f0e14f to e06a24f Compare October 7, 2025 16:38
        return nil
    case <-w.Done():
        log.ErrorContext(process.ExitContext(), "watcher closed while waiting for backend init", "kind", types.KindClusterName, "attempt", attempt, "error", w.Error())
        continue
Contributor


Should we OnHeartbeat(component)(w.Error()) or is it better to not actively unready the process in this scenario?

Contributor Author


I'm worried that w.Error() could return nil for some reason and we'd accidentally turn ready.

Contributor


The syncRotationStateCycle uses trace.ConnectionProblem(watcher.Error(), "watcher has disconnected") (probably because the watcher from storage always returns nil from Error() 😳), which is guaranteed to be non-nil, so that would be an option. I don't think it would be a good idea though: from what I'm seeing in the state machine logic, we would never recover without at least two TeleportOKEvents spaced at least 10 seconds apart, so anything that only ever sends one TeleportOKEvent must not send anything but TeleportOKEvent (or exactly one TeleportDegradedEvent and then die).

}

select {
case evt := <-w.Events():
Contributor


I wonder if we can add a fallback safety timeout that forces the service to start even if the cache is not fully healthy, like 5m or 10m.

Imagine a scenario where a collection that is not important to Teleport's core functionality is broken, e.g. one of the db_server elements exceeds the max gRPC message size.

If such a bad resource is created, auth keeps running but the db access flow is degraded. If the auth pods are then restarted during troubleshooting, the whole flow breaks and users lose access to the Teleport cluster entirely.

Contributor Author

@hugoShaka hugoShaka Oct 7, 2025


Imagine scenario when one of not important collection for the teleport functionality is broken like one of the db_server element exceeds the max gRPC message size.

Auth cannot break because of the max gRPC message size, because it doesn't send gRPC messages when creating its cache. gRPC message size is checked on the receiver side (proxy, node, ...); auth doesn't establish a gRPC client against itself.

I would argue the opposite: if we have an auth that cannot build a valid cache, we should never let the rollout proceed and send clients to this auth.

This happened 3 months ago for a very large financial customer: they were running with a broken cache for a week and complained about constantly being disconnected from the UI and being unable to access any resource. A Teleport auth unable to establish watchers is a completely dysfunctional Teleport instance and should not be allowed to serve requests. Keeping the pod unready would have made the issue visible.

Contributor

@smallinsky smallinsky left a comment


The flow LGTM, but I wonder if we should add a hard safety fallback #59907 (comment) to start the auth even with a broken cache.

Recently we had a few issues where the cache was degraded due to a collection reaching the max gRPC message size.

With this change, if the cache is degraded the customer will lose access to the whole cluster permanently when the auth pod is restarted, so I think we should hardcode some fallback timeout behavior, like 5m or 10m, to start the auth even with a broken cache.

@hugoShaka hugoShaka force-pushed the hugo/ready-wait-for-backend-init branch from f22c817 to c742310 Compare October 8, 2025 15:22
@hugoShaka hugoShaka enabled auto-merge October 8, 2025 15:22
@hugoShaka hugoShaka added this pull request to the merge queue Oct 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 8, 2025
@hugoShaka hugoShaka added this pull request to the merge queue Oct 8, 2025
Merged via the queue into master with commit 1d6aa82 Oct 8, 2025
40 checks passed
@hugoShaka hugoShaka deleted the hugo/ready-wait-for-backend-init branch October 8, 2025 19:48
@backport-bot-workflows
Contributor

@hugoShaka See the table below for backport results.

Branch Result
branch/v18 Failed

github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
…ready (#61620)

* Add a way to announce which services should be expected (#59667)

I was looking into tying the auth readiness with its cache health (so we
don't end up in a state with no ready cache across all auths during a
rollout) and I saw that we are currently reporting ready as soon as one
of the sevrice heartbeats. We don't keep track of which services should
be heartbeating/reporting ready.

This PR introduces a new `process.ExpectService(component)` function to
declare early that we are starting a service and that the process should
not be ready without it.

* disable expect service in backport

* Expect the auth backend/cache to be initialized before turning ready (#59907)
