Database authorization resource consumption enhancements #63878
wethreetrees merged 9 commits into master from wethreetrees/getdatabaseservers-oom
Conversation
There is still discussion around this, opening as draft for now
force-pushed from 505bb26 to 907a367
force-pushed from 6af5798 to 15ea459
force-pushed from 8e699c9 to abe95fc
rosstimothy left a comment
Could you update the benchmark results in the PR description with a before and after comparison using benchstat?
Could you add test cases to the test plan that exercise connecting to real databases in a Teleport cluster? We should ideally be adding a similar number of databases as when the OOM was triggered, to have confidence that it is no longer an issue.
force-pushed from d8090eb to 6bf1512
As Tim mentioned, do you mind adding some manual testing on real databases? Since it touches trusted clusters, it may be good to test that out too.

For sure. Was hoping to get that updated this morning, but got caught up in something else. I will update shortly!
// DatabaseServer is a read only variant of [types.DatabaseServer]
type DatabaseServer interface {
We could get rid of this and just have the watcher use a string containing the database name for the "read only" version of the resource. This is not something that just applies to DatabaseServer, but plenty of things returned by this supposedly "read only" variant are straight up mutable: metadata and database, for example.
Love this. I removed the readonly interface.
I would rather follow the same pattern as every other watcher. This is also limiting if any future use cases need more than the name.
If the pattern is bad we shouldn't keep using it while waiting for some unspecified time in the future when we get to fix it, because that means that more and more code will accumulate using the bad pattern.
If we need more than the name in the future we can change the watcher to return a struct rather than just the name, or maybe a wrapper that actively prohibits access to mutable things like we do with sealedClusterNetworkingConfig and sealedSessionRecordingConfig, but since there's no requirement for the R parameter to be a subset of the interface of the T anymore (that was an unenforced requirement of the GenericWatcher before we introduced ReadOnlyFunc) we can add readonly.DatabaseServer as some type that has a dedicated GetDatabaseName() string method instead. Would you be ok with that?
Yes I would rather make readonly.Foo actually readonly and be consistent with other watchers.
Thanks for the discussion, looking at this now.
I've reverted the removal of the readonly DatabaseServer and implemented a pared-down version with just a GetDatabaseName() method. Hopefully this is a good middle ground.
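The agreed-upon shape can be sketched roughly like this. This is a minimal, self-contained illustration of the "interface exposes only the name" pattern; `dbServer`, `names`, and the field layout are hypothetical stand-ins, not the real Teleport types:

```go
package main

import "fmt"

// DatabaseServer is a pared-down read-only view: the only thing it
// exposes is the database name, so mutable state (metadata, database
// spec) cannot leak out through it.
type DatabaseServer interface {
	GetDatabaseName() string
}

// dbServer is a hypothetical concrete resource satisfying the view.
type dbServer struct {
	name string
	// ...mutable fields stay hidden behind the interface above
}

func (s *dbServer) GetDatabaseName() string { return s.name }

// names reads database names strictly through the read-only view.
func names(servers []DatabaseServer) []string {
	out := make([]string, 0, len(servers))
	for _, s := range servers {
		out = append(out, s.GetDatabaseName())
	}
	return out
}

func main() {
	servers := []DatabaseServer{&dbServer{name: "postgres-prod"}, &dbServer{name: "mysql-dev"}}
	fmt.Println(names(servers)) // prints [postgres-prod mysql-dev]
}
```

Because the interface has no setters and the concrete fields are unexported, callers holding a `DatabaseServer` cannot mutate the underlying resource, which is the property the discussion above is after.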
force-pushed from 8fdd60f to 6734659
@rosstimothy @greedy52 I've updated the test plan, please let me know if I missed anything!
force-pushed from 6734659 to 879e857
okraport left a comment
Code looks good.
Looking over the test plan, let's add one more manual test to sanity-check a connection to one real DB, then I'm happy to land.
Have we tested multiple database agents that proxy the same database in manual testing? LGTM otherwise. Thank you 🙏
Replace the call to CachingAccessPoint.GetDatabaseServers in ProxyServer.Authorize with a DatabaseServerWatcher lookup using CurrentResourcesWithFilter. Previously, every inbound database connection allocated and iterated a full copy of all database servers in the cache, then discarded all but the matching entries. Under high concurrency with large numbers of registered databases this caused significant GC pressure and OOM. The watcher lookup only allocates the servers matching the requested database name. The watcher is initialized once at startup in service.go and plumbed through the Cluster interface, following the same pattern as AppServerWatcher.
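The difference between the old full-copy scan and the new filtered lookup can be sketched as follows. This is a simplified stand-in, assuming a watcher that holds an in-memory view; the names `server`, `watcher`, and the filter signature are illustrative, not the real Teleport `DatabaseServerWatcher` API:

```go
package main

import "fmt"

// server is a stand-in for a registered database server resource.
type server struct {
	HostID string
	DBName string
}

// watcher keeps an in-memory view of current resources, roughly like
// the DatabaseServerWatcher described above.
type watcher struct {
	current []server
}

// CurrentResourcesWithFilter returns only the matching entries, so a
// lookup for one database name allocates a slice proportional to the
// matches, not a full copy of the entire fleet.
func (w *watcher) CurrentResourcesWithFilter(match func(server) bool) []server {
	var out []server
	for _, s := range w.current {
		if match(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	w := &watcher{current: []server{
		{HostID: "agent-1", DBName: "postgres-prod"},
		{HostID: "agent-2", DBName: "postgres-prod"}, // multiple agents can proxy the same DB
		{HostID: "agent-3", DBName: "mysql-dev"},
	}}
	matches := w.CurrentResourcesWithFilter(func(s server) bool { return s.DBName == "postgres-prod" })
	fmt.Println(len(matches)) // prints 2
}
```

With thousands of registered databases, per-connection allocations under the filtered lookup stay bounded by the handful of matching agents rather than the whole server set, which is what removes the GC pressure described above.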
Remove auth server and fake cluster. Instantiate an in-memory backend, presence service, event service, and watcher directly in the benchmark.
force-pushed from 879e857 to 45855cc
This afternoon, I spun up multiple database agents proxying the same database and confirmed connections. Added a test case above as well.
espadolini left a comment
I like the new approach for readonly.DatabaseServer.
@wethreetrees See the table below for backport results.
This change sets out to address reported performance issues related to `GetDatabaseServers` in specific scenarios, e.g. many database servers with concurrent database connection attempts, or a database server set larger than available memory.

Contributes to #63728

In this change, we add a watcher which acts as an in-memory view of all database servers. The call to `CachingAccessPoint.GetDatabaseServers` in `ProxyServer.Authorize` has been replaced with a `DatabaseServerWatcher` lookup using `CurrentResourcesWithFilter`. Previously, every inbound connection allocated and iterated a full copy of all database servers, then discarded all but the matching entries. Under high concurrency with large numbers of databases this caused significant GC pressure and OOM conditions. The new watcher lookup only allocates the servers matching the requested database name.

Note: This approach has been discussed at length and was chosen primarily because it satisfies the short-term needs, greatly improves performance at scale, and minimally impacts public interfaces. The intention is to holistically refactor this approach at a later date.
Manual Test Plan
Test Environment
Cloud tenant created in staging
Name: tyler-dbwatcher-testing
Features:
Trusted cluster tests were run locally with two instances of teleport.
Test Cases
- tsh db connect (if you registered a fake database, you should get a "failed to connect" error)
- … (database not found)
- … (database not found)

Automated Tests and Benchmarks
go test -bench=BenchmarkConnectGetDatabaseServers ./lib/srv/db/common/connect/...
go test ./lib/srv/db/common/connect/...
go test ./lib/services/... -run TestDatabaseServerWatcher

Benchmark Results
The benchmark was run against master (base) and wethreetrees/getdatabaseservers-oom (watcher). The benchstat comparison is pretty dramatic, but I assure you it's real.

changelog: Improved performance and reduced resource usage of the database proxy for clusters with large numbers of registered databases.