
Database authorization resource consumption enhancements#63878

Merged
wethreetrees merged 9 commits into master from wethreetrees/getdatabaseservers-oom
Mar 5, 2026

Conversation

@wethreetrees
Contributor

@wethreetrees wethreetrees commented Feb 17, 2026

This change addresses reported performance issues related to GetDatabaseServers in specific scenarios, e.g. many database servers with concurrent database connection attempts, or a database server set larger than available memory.

Contributes to #63728

In this change, we add a watcher which acts as an in-memory view of all database servers. The call to CachingAccessPoint.GetDatabaseServers in ProxyServer.Authorize has been replaced with a DatabaseServerWatcher lookup using CurrentResourcesWithFilter. Previously, every inbound connection allocated and iterated a full copy of all database servers, then discarded all but the matching entries. Under high concurrency with large numbers of databases this caused significant GC pressure and OOM conditions. The new watcher lookup only allocates the servers matching the requested database name.

Note: This approach has been discussed at length and was chosen primarily because it satisfies the short term needs, greatly improves performance at scale, and minimally impacts public interfaces. The intention is to holistically refactor this approach at a later date.

Manual Test Plan

Test Environment

Cloud tenant created in staging

Name: tyler-dbwatcher-testing
Features:

  • RollingAuthUpdate
  • WildcardDNS
  • AccessMonitoring
  • BootstrapInit
  • InjectAWSCredSh
  • R53HealthCheck
  • TLSCertReload
  • KMS
  • EnablePyroscope
  • AutoUpdates
  • BootstrapCNC
  • SetEnvGOMEMLIMIT90
  • ACKDedicatedS3Buckets
  • EKSPodIdentity
  • UnmountServiceAccountToken
  • IAMJoinToken
  • DisableStaticJoinToken
  • DeferConfigChanges
  • QUICProxyPeering
  • DisableAWSCredentialsFile
  • VPADeployment
  • EnableDefaultBedrockSummarizer

Trusted cluster tests were run locally with two instances of teleport.

Test Cases

  • Register a new database
  • Connect to the newly registered database and confirm the database is found with tsh db connect (if you registered a fake database, you should get a “failed to connect” error)
  • Restart the proxy and confirm a database connection works immediately after startup, verifying the watcher initializes correctly on startup
  • Confirm that connecting to an unregistered database returns an appropriate error (e.g. database not found)
  • (Optional - can be skipped if a functional database was registered in the first test case) Register a functional database and confirm successful connection
  • Remove a database and confirm it is no longer connectable (e.g. database not found)
  • Register multiple database agents proxying the same database and confirm successful connection
  • (Optional - this is technically covered by the go test benchmark) Register >3000 databases and confirm connection to one database, verify that no significant memory increase occurs
  • (Trusted cluster - requires self-hosted setup, not testable on cloud) Connect to a database registered on a leaf cluster via the root proxy and confirm the database is found (if you registered a fake database, you should get a “failed to connect” error)

Automated Tests and Benchmarks

  • go test -bench=BenchmarkConnectGetDatabaseServers ./lib/srv/db/common/connect/...
    goos: darwin
    goarch: arm64
    pkg: github.com/gravitational/teleport/lib/srv/db/common/connect
    cpu: Apple M4 Pro
    BenchmarkConnectGetDatabaseServers/total=1000-12                   80637             13659 ns/op            5552 B/op         25 allocs/op
    PASS
    ok      github.com/gravitational/teleport/lib/srv/db/common/connect     2.989s
    
  • go test ./lib/srv/db/common/connect/...
  • go test ./lib/services/... -run TestDatabaseServerWatcher

Benchmark Results

The benchmark was run against master (base) and wethreetrees/getdatabaseservers-oom (watcher). The benchstat comparison is pretty dramatic, but I assure you it's real.

goos: darwin
goarch: arm64
pkg: github.com/gravitational/teleport/lib/srv/db/common/connect
cpu: Apple M4 Pro
                                         │      base      │               watcher               │
                                         │     sec/op     │   sec/op     vs base                │
ConnectGetDatabaseServers/total=1000-12    14339.15µ ± 1%   14.01µ ± 1%  -99.90% (p=0.000 n=10)
ConnectGetDatabaseServers/total=5000-12    71059.63µ ± 4%   61.75µ ± 1%  -99.91% (p=0.000 n=10)
ConnectGetDatabaseServers/total=10000-12   142094.2µ ± 1%   141.9µ ± 2%  -99.90% (p=0.000 n=10)
geomean                                       52.51m        49.69µ       -99.91%

                                         │       base       │               watcher                │
                                         │       B/op       │     B/op      vs base                │
ConnectGetDatabaseServers/total=1000-12     8843.241Ki ± 0%   5.422Ki ± 0%  -99.94% (p=0.000 n=10)
ConnectGetDatabaseServers/total=5000-12    44219.700Ki ± 0%   5.422Ki ± 0%  -99.99% (p=0.000 n=10)
ConnectGetDatabaseServers/total=10000-12   87767.027Ki ± 0%   5.422Ki ± 0%  -99.99% (p=0.000 n=10)
geomean                                        31.74Mi        5.422Ki       -99.98%

                                         │      base       │               watcher               │
                                         │    allocs/op    │ allocs/op   vs base                 │
ConnectGetDatabaseServers/total=1000-12     134049.00 ± 0%   25.00 ± 0%   -99.98% (p=0.000 n=10)
ConnectGetDatabaseServers/total=5000-12     670076.00 ± 0%   25.00 ± 0%  -100.00% (p=0.000 n=10)
ConnectGetDatabaseServers/total=10000-12   1340097.00 ± 0%   25.00 ± 0%  -100.00% (p=0.000 n=10)
geomean                                        493.8k        25.00        -99.99%

changelog: Improved performance and reduced resource usage of the database proxy for clusters with large numbers of registered databases.

@wethreetrees
Contributor Author

There is still discussion around this; opening as a draft for now.

@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch from 505bb26 to 907a367 on February 19, 2026 01:37
Comment thread lib/services/watcher.go
@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch 2 times, most recently from 6af5798 to 15ea459 on February 19, 2026 14:50
@wethreetrees wethreetrees marked this pull request as ready for review February 19, 2026 15:38
@github-actions github-actions bot added the database-access and size/md labels on Feb 19, 2026
Comment thread lib/services/readonly/readonly.go Outdated
Comment thread lib/services/watcher.go
@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch from 8e699c9 to abe95fc on February 19, 2026 17:37
Contributor

@rosstimothy rosstimothy left a comment


Could you update the benchmark results in the PR description with a before and after comparison using benchstat?

Could you also add test cases to the test plan that exercise connecting to real databases in a Teleport cluster? We should ideally add a similar number of databases as when the OOM was triggered, to have confidence that it is no longer an issue.

Comment thread lib/service/service.go
Comment thread lib/srv/db/common/connect/connect_bench_test.go Outdated
Comment thread lib/srv/db/common/connect/connect_bench_test.go Outdated
@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch from d8090eb to 6bf1512 on February 19, 2026 20:54
Comment thread lib/srv/db/common/connect/connect_bench_test.go Outdated
Comment thread lib/reversetunnel/peer.go
@greedy52
Contributor

As Tim mentioned, do you mind adding some manual testing on real databases? Since it touches trusted clusters, it may be good to test that too.

@wethreetrees
Contributor Author

As Tim mentioned, do you mind adding some manual testing on real databases? Since it touches trusted clusters, it may be good to test that too.

For sure. Was hoping to get that updated this morning, got caught up in something else. I will update shortly!

Comment thread lib/services/watcher.go Outdated
Comment thread lib/services/watcher_test.go Outdated
Comment thread lib/services/watcher_test.go Outdated
Comment thread lib/services/readonly/readonly.go Outdated
Comment on lines +295 to +296
// DatabaseServer is a read only variant of [types.DatabaseServer]
type DatabaseServer interface {
Contributor


We could get rid of this and just have the watcher use a string containing the database name for the "read only" version of the resource. This is not something that just applies to DatabaseServer, but plenty of things returned by this supposedly "read only" variant are straight up mutable: metadata and database, for example.

Contributor Author


Love this. I removed the readonly interface.

Contributor


I would rather follow the same pattern as every other watcher. This is also limiting if any future use cases need more than the name.

Contributor


If the pattern is bad we shouldn't keep using it while waiting for some unspecified time in the future when we get to fix it, because that means that more and more code will accumulate using the bad pattern.

If we need more than the name in the future we can change the watcher to return a struct rather than just the name, or maybe a wrapper that actively prohibits access to mutable things like we do with sealedClusterNetworkingConfig and sealedSessionRecordingConfig. But since there's no requirement for the R parameter to be a subset of the interface of the T anymore (that was an unenforced requirement of the GenericWatcher before we introduced ReadOnlyFunc), we can add readonly.DatabaseServer as some type that has a dedicated GetDatabaseName() string method instead. Would you be ok with that?

Contributor


Yes I would rather make readonly.Foo actually readonly and be consistent with other watchers.

Contributor Author


Thanks for the discussion, looking at this now.

Contributor Author


I've reverted the removal of the readonly DatabaseServer and implemented a pared down version with just a GetDatabaseName() method. Hopefully this is a good middle ground.

@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch from 8fdd60f to 6734659 on March 4, 2026 03:36
@wethreetrees
Contributor Author

@rosstimothy @greedy52 I've updated the test plan, please let me know if I missed anything!

@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch from 6734659 to 879e857 on March 4, 2026 04:22
Contributor

@okraport okraport left a comment


Code looks good.

Looking over the test plan, let's add one more manual test to sanity-check a connection to one real DB, then I'm happy to land.

@greedy52 greedy52 requested review from Tener and removed request for gabrielcorado March 4, 2026 15:23
@greedy52
Contributor

greedy52 commented Mar 4, 2026

I've updated the test plan, please let me know if I missed anything!

Have we tested multiple database agents that proxy the same database in manual testing? LGTM otherwise. Thank you 🙏

Replace the call to CachingAccessPoint.GetDatabaseServers in
ProxyServer.Authorize with a DatabaseServerWatcher lookup using
CurrentResourcesWithFilter. Previously, every inbound database connection
allocated and iterated a full copy of all database servers in the cache,
then discarded all but the matching entries. Under high concurrency with
large numbers of registered databases this caused significant GC pressure
and OOM. The watcher lookup only allocates the servers matching the
requested database name.

The watcher is initialized once at startup in service.go and plumbed
through the Cluster interface, following the same pattern
as AppServerWatcher.
Remove auth server and fake cluster. Instantiate an in-memory backend,
presence service, event service, and watcher directly in the benchmark.
@wethreetrees wethreetrees force-pushed the wethreetrees/getdatabaseservers-oom branch from 879e857 to 45855cc on March 4, 2026 19:58
@wethreetrees
Contributor Author

I've updated the test plan, please let me know if I missed anything!

Have we tested multiple database agents that proxy the same database in manual testing? LGTM otherwise. Thank you 🙏

This afternoon, I spun up multiple database agents proxying the same database and confirmed a successful connection. Added a test case above as well.

Contributor

@espadolini espadolini left a comment


I like the new approach for readonly.DatabaseServer.

@wethreetrees wethreetrees added this pull request to the merge queue Mar 5, 2026
Merged via the queue into master with commit f6f2352 Mar 5, 2026
43 checks passed
@wethreetrees wethreetrees deleted the wethreetrees/getdatabaseservers-oom branch March 5, 2026 00:56
@backport-bot-workflows
Contributor

@wethreetrees See the table below for backport results.

Branch Result
branch/v18 Failed


Labels

backport/branch/v18, database-access, size/md
