
Expect the auth backend/cache to be initialized before turning ready#59907

Merged
hugoShaka merged 1 commit into master from hugo/ready-wait-for-backend-init
Oct 8, 2025

Conversation

Contributor

@hugoShaka hugoShaka commented Oct 3, 2025

This PR makes auth report unready until it has functional watchers. With the cache enabled, this means that everything is loaded into the cache.

Why this change is needed:

  • cache init might be slow if we have millions of entries or backend problems; starting to operate without a cache sends all queries to the backend, causing pressure, slower cache population, and potentially more backend issues
    • side note: in a separate PR I am delaying the start of read-intensive routines until the cache is initialized
  • many features rely on the ability to establish watchers, including access plugins, dynamic app/db/kube access, and most of the okta/entra sync logic
  • currently, Teleport can start and become ready without a cache or functioning watchers. In the most extreme cases, a rollout can proceed and break all watchers, leaving the cluster in a broken state

Depends on: #59667
Changelog: Auth readiness tuned to wait for cache initialization.

process.RegisterFunc("auth.wait-for-backend", func() error {
    start := process.Clock.Now()

    w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
Contributor


I think we must have a retry loop around this, or we might end up with a process that's forever unready even if things are resolved later on.

Contributor Author


done

w, err := authServer.NewWatcher(process.ExitContext(), types.Watch{
    Name: "auth.wait-for-backend",
    Kinds: []types.WatchKind{
        {Kind: types.KindClusterName},
Contributor


Services like Relay don't have permission to get types.KindClusterName or to create a watcher for this type: https://github.com/gravitational/teleport/blob/master/lib/cache/cache.go#L283-L290
https://github.com/gravitational/teleport/blob/master/lib/authz/permissions.go#L1138

As a result, process startup for services running under such a role will be broken.

We need to make sure the Kind used for init is always supported by the service's Teleport role, or have a more flexible way to pick the WatchKind used to receive the Init event.

Contributor Author

@hugoShaka hugoShaka Oct 6, 2025


This change does not apply to the relay service. This is in initAuthService which is only called to start an auth service, so we have auth credentials and the ability to read types.KindClusterName.

Base automatically changed from hugo/expect-services-before-ready to master October 6, 2025 18:13
Contributor

@rosstimothy rosstimothy left a comment


Does this PR need to be rebased, or was it incorrectly rebased after #59667 merged?

Contributor


Do we need to use a shorter data directory to ensure that this test passes on darwin? We've got a number of tests that either turn off the debug service or craft a custom temp directory to work around golang/go#62614.

Contributor Author


I thought about this but this test runs on my laptop without tuning the data dir, so 🤷

@hugoShaka hugoShaka force-pushed the hugo/ready-wait-for-backend-init branch from 1f0e14f to e06a24f Compare October 7, 2025 16:38
        return nil
    case <-w.Done():
        log.ErrorContext(process.ExitContext(), "watcher closed while waiting for backend init", "kind", types.KindClusterName, "attempt", attempt, "error", w.Error())
        continue
Contributor


Should we OnHeartbeat(component)(w.Error()) or is it better to not actively unready the process in this scenario?

Contributor Author


I'm worried that w.Error() could return nil for some reason and we'd accidentally turn ready.

Contributor


The syncRotationStateCycle uses trace.ConnectionProblem(watcher.Error(), "watcher has disconnected") (probably because the watcher from storage always returns nil from Error() 😳), which is guaranteed to be non-nil, so that would be an option. I don't think it would be a good idea though: from what I'm seeing in the state machine logic, we would never recover without at least two TeleportOKEvents spaced at least 10 seconds apart, so anything that only ever sends one TeleportOKEvent must not send anything but TeleportOKEvent (or exactly one TeleportDegradedEvent and then die).

}

select {
case evt := <-w.Events():
Contributor


I wonder if we can add a fallback safety timeout that forces the service to start even if the cache is not fully healthy, like 5m or 10m.

Imagine a scenario where a collection that is not important to Teleport's core functionality is broken, e.g. one of the db_server elements exceeds the max gRPC message size.

If such a bad resource is created, auth keeps running but the db access flow is degraded. If the auth pods are then restarted during troubleshooting, the whole flow breaks and users lose access to the Teleport cluster entirely.

Contributor Author

@hugoShaka hugoShaka Oct 7, 2025


Imagine scenario when one of not important collection for the teleport functionality is broken like one of the db_server element exceeds the max gRPC message size.

Auth cannot break because of the max gRPC message size, because it doesn't send gRPC messages when creating its cache. gRPC message size is checked on the receiver side (proxy, node, ...); auth doesn't establish a gRPC client against itself.

I would argue the opposite: if we have an auth that cannot build a valid cache, we should never let the rollout proceed and send clients to this auth.

This happened 3 months ago for a very large financial customer: they were running with a broken cache for a week and complained about constantly being disconnected from the UI and being unable to access any resource. A Teleport auth unable to establish watchers is a completely dysfunctional Teleport instance and should not be allowed to serve requests. Keeping the pod unready would have made the issue visible.

Contributor

@smallinsky smallinsky left a comment


The flow LGTM, but I wonder if we should add a hard safety fallback #59907 (comment) to start the auth even with a broken cache.

Recently we had a few issues where the cache was degraded due to a collection reaching the max gRPC message size.

With this change, if the cache is degraded the customer will lose access to the whole cluster permanently when the auth pod is restarted, so I think we should hardcode some fallback timeout behavior, like 5m or 10m, to start the auth even with a broken cache.

@hugoShaka hugoShaka force-pushed the hugo/ready-wait-for-backend-init branch from f22c817 to c742310 Compare October 8, 2025 15:22
@hugoShaka hugoShaka enabled auto-merge October 8, 2025 15:22
@hugoShaka hugoShaka added this pull request to the merge queue Oct 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 8, 2025
@hugoShaka hugoShaka added this pull request to the merge queue Oct 8, 2025
Merged via the queue into master with commit 1d6aa82 Oct 8, 2025
40 checks passed
@hugoShaka hugoShaka deleted the hugo/ready-wait-for-backend-init branch October 8, 2025 19:48
@backport-bot-workflows
Contributor

@hugoShaka See the table below for backport results.

Branch Result
branch/v18 Failed

github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
…ready (#61620)

* Add a way to announce which services should be expected (#59667)

I was looking into tying the auth readiness with its cache health (so we
don't end up in a state with no ready cache across all auths during a
rollout) and I saw that we are currently reporting ready as soon as one
of the sevrice heartbeats. We don't keep track of which services should
be heartbeating/reporting ready.

This PR introduces a new `process.ExpectService(component)` function to
declare early that we are starting a service and that the process should
not be ready without it.

* disable expect service in backport

* Expect the auth backend/cache to be initialized before turning ready (#59907)
