[grpctmclient] Add support for (bounded) per-host connection reuse #8368
ajm188 merged 13 commits into vitessio:main
Conversation
	rpcClientMap map[string]chan *tmc
}

type dialer interface {
Note that for the cachedConnDialer we don't implement poolDialer, because every call to dial is by definition "pooled", just under a slightly different definition of pooled.
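The split being discussed might look like the following sketch. The `dialer` interface name comes from the diff above; the `conn` type and `oneshotDialer` are hypothetical stand-ins (the real code deals in `*grpc.ClientConn` values and tabletmanager client interfaces):

```go
package main

// Minimal sketch of the dialer split; conn and oneshotDialer are
// illustrative stand-ins, not the actual vitess types.

import (
	"fmt"
	"io"
)

// conn stands in for *grpc.ClientConn in this sketch.
type conn struct{ addr string }

// dialer returns the connection plus an io.Closer: the plain client just
// closes the conn, while a caching dialer can decrement a refcount instead.
type dialer interface {
	dial(addr string) (*conn, io.Closer, error)
}

// closeFunc adapts a plain function to io.Closer.
type closeFunc func() error

func (f closeFunc) Close() error { return f() }

// oneshotDialer models the original behavior: a fresh conn per RPC, torn
// down as soon as the caller invokes Close.
type oneshotDialer struct{}

func (oneshotDialer) dial(addr string) (*conn, io.Closer, error) {
	c := &conn{addr: addr}
	return c, closeFunc(func() error { return nil }), nil
}

func main() {
	var d dialer = oneshotDialer{}
	c, closer, err := d.dial("zone1-0000000101:15000")
	if err != nil {
		panic(err)
	}
	defer closer.Close()
	fmt.Println("dialed", c.addr)
}
```

Returning an `io.Closer` rather than the raw conn is what lets the cached implementation swap "close the connection" for "release a reference" without changing any call sites.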
*/

- package grpctmserver
+ package grpctmserver_test
Made this change so I can call RegisterForTest without creating an import cycle between here and the grpctmclient tests.
// possible values in all APIs
type fakeRPCTM struct {
-	t *testing.T
+	t testing.TB
These changes exist to support the one lil' benchmark I wrote for the cached client.
Looking at this today! Sorry for the delay, I was busy fighting GRPC itself. 👀

OK! Finished reading through the PR! It looks pretty good, I like the direction this is heading. I have two main concerns:

I've put these two hypotheses to the test by, huh, well, implementing and benchmarking them. You can see my changes in be8fede, and the benchmark before/after. Would you mind reviewing my commit and cherry-picking it into this PR? I'm hoping you'll agree the code is meaningfully simpler! That's from a perf/correctness point of view. Besides that, the other global refactorings look good to me, including changing the return for the methods into the client + an io.Closer.
I left a couple comments on that commit! Basically I think there's one place where we're risking deadlock (so I'm guessing your benchmark happened to not hit that case), and that there are a few places where we're adjusting an entry's priority without re-sorting. There's both a performance and correctness concern with that last point: correctness because we may end up evicting the wrong connection and get the cache into a weird state, and performance because there are places where we should be paying the O(n log n) cost of re-sorting, but we're not. That might also explain the perf difference between your patch and this. I'm also curious what pool size you're testing on; I was able to run this with a pool size of 1000 and got the following stats after a few rounds of benchmarking.
Oh yeah, those two comments are on point. Thanks for double checking. I can't push to this branch to fix them, though! I'll push a new commit tomorrow morning with proper fixes and benchmark again. I hope fixing the correctness won't degrade performance.
if os.Getenv("VT_PPROF_TEST") != "" {
	file, err := os.Create(fmt.Sprintf("%s.profile.out", t.Name()))
	require.NoError(t, err)
	defer file.Close()
	if err := pprof.StartCPUProfile(file); err != nil {
		t.Errorf("failed to start cpu profile: %v", err)
		return
	}
	defer pprof.StopCPUProfile()
}
By the way, you can get this exact same behavior by simply passing the -cpuprofile flag to go test 👍
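For reference, an equivalent invocation using go test's built-in flag might look like this (the package path and benchmark name are illustrative, not taken from the PR):

```shell
# Same effect as the VT_PPROF_TEST block above, via go test's built-in flag.
# Package path and benchmark name are illustrative.
go test -run=NONE -bench=BenchmarkCachedConnClient \
  -cpuprofile=cpu.out ./go/vt/vttablet/grpctmclient/

# Inspect the resulting profile interactively.
go tool pprof cpu.out
```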
Unfortunately I didn't have time to look at this, and I'm going to be out on vacation until Tuesday, but I will try this out in some testing environments when I'm back!!
Sounds great. Have a good time!
@@ -88,10 +119,11 @@ func (client *Client) dial(tablet *topodatapb.Tablet) (*grpc.ClientConn, tabletm
	if err != nil {
Now that you've properly wired up a context.Context to this API, we can change the grpcclient.Dial call here to grpcclient.DialContext, which is going to fix an annoying issue, as seen here: #8387 (comment)
Can you update this PR accordingly? That'll unblock the other PR. :)
Happy to swap to DialContext, but per my comment on the other PR there's more work needed to remove that hack
Alright, back! I'm going to be testing this out todayyyy
Okay!! Very happy to report that this works just as well in practice. I was somewhat concerned that dialing outside of the lock and throwing the conn away would result in too much connection churn (reducing churn being my main goal), but I can't make this meaningfully happen in real workloads. I then went ahead and realized that if we evict from the front (it works the same in reverse, it's just easier for me to think about this way 🤷), then when we append to the end we don't need to re-sort, so multiple successive evictions can be faster. Looking forward to your thoughts; I'm thinking tomorrow we can polish this up and get it merged!!
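The front-eviction idea can be sketched with a simple slice-based queue. This is an illustration of the invariant being described (most-evictable conns at the front, fresh conns appended at the back), not the PR's actual implementation:

```go
package main

// Sketch of front-eviction: evict at index 0, append new conns at the end.
// cachedConn and evictionQueue are illustrative stand-ins.

import "fmt"

type cachedConn struct {
	addr string
	refs int
}

type evictionQueue []*cachedConn

// evictOne frees the front conn if it is unreferenced, reporting whether a
// slot was made available. Because the front holds the best candidate, a
// refs > 0 front conn means nothing in the queue is evictable.
func (q *evictionQueue) evictOne() bool {
	if len(*q) == 0 || (*q)[0].refs > 0 {
		return false
	}
	evicted := (*q)[0]
	*q = (*q)[1:]
	fmt.Println("evicted", evicted.addr)
	return true
}

// add appends a newly dialed conn at the back; it starts with one reference
// held by the caller, so no re-sort is needed on insert.
func (q *evictionQueue) add(addr string) *cachedConn {
	c := &cachedConn{addr: addr, refs: 1}
	*q = append(*q, c)
	return c
}

func main() {
	q := evictionQueue{}
	a := q.add("a")
	a.refs-- // release a, making it evictable
	q.evictOne()
	q.add("b")
	fmt.Println("queue length:", len(q))
}
```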
I'll clean up all these comments to reflect the actual state of the code tomorrow
Similarly, this comment is flat-out not true in the simpler implementation, so I'll clean that up as well
Ah, I lied, this is partially true: since the closer func still locks dialer.m, it will still contend with actual dials if you reuse a cachedConnDialer after close, but it won't actually mess with the state of the queue. So it's "safe" but not ideal.
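The closer-as-closure pattern under discussion might look like this sketch: releasing a cached conn takes the dialer lock to update its bookkeeping, which is why a release can contend with concurrent dials. Type and field names here are illustrative:

```go
package main

// Sketch of a release closure that locks the dialer mutex; cachedConn and
// cachedConnDialer are illustrative stand-ins for the PR's types.

import (
	"io"
	"sync"
	"time"
)

type cachedConn struct {
	refs           int
	lastAccessTime time.Time
}

type cachedConnDialer struct {
	m sync.Mutex
}

// closeFunc adapts a plain function to io.Closer.
type closeFunc func() error

func (f closeFunc) Close() error { return f() }

// release builds the io.Closer handed back from dial; calling Close
// decrements the conn's refcount under the dialer lock rather than closing
// the underlying connection. Holding d.m here is the source of the
// contention mentioned above.
func (d *cachedConnDialer) release(c *cachedConn) io.Closer {
	return closeFunc(func() error {
		d.m.Lock()
		defer d.m.Unlock()
		c.refs--
		c.lastAccessTime = time.Now()
		return nil
	})
}

func main() {
	d := &cachedConnDialer{}
	c := &cachedConn{refs: 1}
	closer := d.release(c)
	closer.Close()
}
```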
reminder for me to make a final pass to clean up these tests as well
This is great! Code looks ready to me. Let's fix the last few outdated comments and merge.
Signed-off-by: Andrew Mason <amason@slack-corp.com>
- Move grpctmserver tests to a test package to allow grpctmclient tests to import grpctmserver (this prevents an import cycle)
- Change fakeRPCTM to take a testing.TB, which both benchmark and test structs satisfy

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Okay! I've finished cleaning up; 1151c6d and b5af9f6 contain those changes if you want to view them in isolation. I'm going to clean up the commit history because there are a bunch of commits in this branch related to an older implementation that I threw away, so there's no reason to keep it around. (I am keeping a branch locally pointed at the pre-rebase version in case we want to do anything differently.)
Two bugfixes:

- one to prevent leaking open connections (by checking the map again after getting the write lock)
- one to prevent leaving around closed connections, resulting in errors on later uses (by deleting the freed conn's addr and not the addr we're attempting to dial)

Refactor to not duplicate dialing/queue management; I don't want mistaken copy-paste to result in a bug between sections that should be otherwise identical.

Add one more missing "another goroutine dialed" check.

Add some stats to the cached conn dialer.

Remove everything from the old, slower cache implementation; pqueue is the way to go.

Lots and lots and lots and lots of comments.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
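The first bugfix (re-checking the map after acquiring the write lock) is the classic double-checked pattern; a minimal sketch, with a hypothetical cache/conn pair standing in for the dialer's real state:

```go
package main

// Sketch of "check the map again after getting the write lock"; cache and
// conn are illustrative stand-ins for the dialer's real state.

import "sync"

type conn struct{ addr string }

type cache struct {
	mu    sync.RWMutex
	conns map[string]*conn
	dials int // counts real dials, to show the re-check prevents duplicates
}

func (c *cache) get(addr string) *conn {
	// Fast path: read lock only.
	c.mu.RLock()
	if existing, ok := c.conns[addr]; ok {
		c.mu.RUnlock()
		return existing
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have dialed while we waited for the
	// write lock. Without this, the earlier conn would leak.
	if existing, ok := c.conns[addr]; ok {
		return existing
	}
	c.dials++
	newc := &conn{addr: addr}
	c.conns[addr] = newc
	return newc
}

func main() {
	c := &cache{conns: map[string]*conn{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.get("tablet-101") }()
	}
	wg.Wait()
}
```

With the re-check in place, eight concurrent callers still produce exactly one dial; removing it would let several goroutines each dial and then overwrite each other's map entries, leaking connections.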
…recation easier

Refactor sections of the main `dial` method to allow all unlocks to be deferrals
Add a test case to exercise evictions and full caches
Refactor heap operations to make timing them easier

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
By putting evictable conns at the front, we know that when we append to the end we don't need to resort, so multiple successive evictions can be faster Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
b5af9f6 to b5908a8
…t_conns [grpctmclient] Add support for (bounded) per-host connection reuse Signed-off-by: Andrew Mason <amason@slack-corp.com>
Description
(Issue to come, I mainly want to see how this does in CI, even though I know it works in practice).
tl;dr: the oneshot dialing strategy of the existing grpc implementation of TabletManagerClient results in a lot of connection churn when doing a high volume of tabletmanager RPCs (which happens when doing VReplicationExec calls on workflows for larger keyspaces, for example). In addition, this can cause the number of OS threads to spike, as the go runtime has to park more and more goroutines on the syscall queue in order to wait for those connections, which can result in crashes under high enough load.
This aims to fix that by introducing a per-tablet connection cache. The main reason I don't want to expand the use of the `usePool` flag in the old implementation is that it maintains N (`tablet_manager_grpc_concurrency`, default 8) connections per tablet, rather than using one connection across all RPCs.

Strategy:
Phase 1 - Prepare for a new dialer
I refactored the existing client to separate the dialing of a connection from the actual making of the RPC call, and then defined an interface for both the `dial` and `dialPool` methods. These now return both the client interface and an `io.Closer`, which in the original case just calls `Close` on the grpc ClientConn, but we're going to use this in the cached-conn case to manage reference counting the connections and updating the eviction queue.

Phase 2 - Adding the priority queue
I went through a few iterations to arrive at this, but it's fairly performant. It's based on the priority queue example found in the `container/heap` docs, and sorts connections by fewest "refs" (the number of times they have been acquired but not yet released), breaking ties by older lastAccessTime. [Note: I had the idea to include a background process to sweep up connections with `refs == 0 && time.Since(lastAccessTime) > idleTimeout`, but in the first couple passes it was resulting in way too much lock contention, and the current implementation is good enough for my uses.]

Related Issue(s)
None yet.
Checklist
Deployment Notes