[grpctmclient] Add support for (bounded) per-host connection reuse #8368
ajm188 merged 13 commits into vitessio:main
Conversation
	rpcClientMap map[string]chan *tmc
}

type dialer interface {
Note that for the cachedConnDialer we don't implement poolDialer, because every call to dial is by definition "pooled", just under a slightly different definition of pooled.
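The split being discussed might look like the following sketch. The `dialer` interface name comes from the diff above; the `conn` type and `oneshotDialer` are hypothetical stand-ins (the real code deals in `*grpc.ClientConn` values and tabletmanager client interfaces):

```go
package main

// Minimal sketch of the dialer split; conn and oneshotDialer are
// illustrative stand-ins, not the actual vitess types.

import (
	"fmt"
	"io"
)

// conn stands in for *grpc.ClientConn in this sketch.
type conn struct{ addr string }

// dialer returns the connection plus an io.Closer: the plain client just
// closes the conn, while a caching dialer can decrement a refcount instead.
type dialer interface {
	dial(addr string) (*conn, io.Closer, error)
}

// closeFunc adapts a plain function to io.Closer.
type closeFunc func() error

func (f closeFunc) Close() error { return f() }

// oneshotDialer models the original behavior: a fresh conn per RPC, torn
// down as soon as the caller invokes Close.
type oneshotDialer struct{}

func (oneshotDialer) dial(addr string) (*conn, io.Closer, error) {
	c := &conn{addr: addr}
	return c, closeFunc(func() error { return nil }), nil
}

func main() {
	var d dialer = oneshotDialer{}
	c, closer, err := d.dial("zone1-0000000101:15000")
	if err != nil {
		panic(err)
	}
	defer closer.Close()
	fmt.Println("dialed", c.addr)
}
```

Returning an `io.Closer` rather than the raw conn is what lets the cached implementation swap "close the connection" for "release a reference" without changing any call sites.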
*/

- package grpctmserver
+ package grpctmserver_test
Made this change so I can call RegisterForTest without creating an import cycle between here and the grpctmclient tests.
// possible values in all APIs
type fakeRPCTM struct {
-	t *testing.T
+	t testing.TB
These changes exist to support the one lil' benchmark I wrote for the cached client.
Looking at this today! Sorry for the delay, I was busy fighting GRPC itself. 👀

OK! Finished reading through the PR! It looks pretty good, I like the direction this is heading. I have two main concerns:

I've put these two hypotheses to the test by, huh, well, implementing and benchmarking them. You can see my changes in be8fede, and the benchmark before/after. Would you mind reviewing my commit and cherry-picking it into this PR? I'm hoping you'll agree the code is meaningfully simpler! That's from a perf/correctness point of view. Besides that, the other global refactorings look good to me, including changing the return for the methods into the client + an io.Closer.
I left a couple comments on that commit! Basically I think there's one place where we're risking deadlock (so I'm guessing your benchmark happened to not hit that case), and that there are a few places where we're adjusting an entry's priority without re-sorting. There's both a performance and correctness concern with that last point: correctness because we may end up evicting the wrong connection and get the cache into a weird state, and performance because there are places where we should be paying the O(n log n) cost of re-sorting, but we're not. That might also explain the perf difference between your patch and this. I'm also curious what pool size you're testing on; I was able to run this with a pool size of 1000 and got the following stats after a few rounds of benchmarking.
Oh yeah, those two comments are on point. Thanks for double checking. I can't push to this branch to fix them, though! I'll push a new commit tomorrow morning with proper fixes and benchmark again. I hope fixing the correctness won't degrade performance.
if os.Getenv("VT_PPROF_TEST") != "" {
	file, err := os.Create(fmt.Sprintf("%s.profile.out", t.Name()))
	require.NoError(t, err)
	defer file.Close()
	if err := pprof.StartCPUProfile(file); err != nil {
		t.Errorf("failed to start cpu profile: %v", err)
		return
	}
	defer pprof.StopCPUProfile()
}
By the way, you can get this exact same behavior by simply passing the -cpuprofile flag to go test 👍
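For reference, an equivalent invocation using go test's built-in flag might look like this (the package path and benchmark name are illustrative, not taken from the PR):

```shell
# Same effect as the VT_PPROF_TEST block above, via go test's built-in flag.
# Package path and benchmark name are illustrative.
go test -run=NONE -bench=BenchmarkCachedConnClient \
  -cpuprofile=cpu.out ./go/vt/vttablet/grpctmclient/

# Inspect the resulting profile interactively.
go tool pprof cpu.out
```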
Unfortunately I didn't have time to look at this, and I'm going to be out on vacation until Tuesday, but I will try this out in some testing environments when I'm back!!
Sounds great. Have a good time!
@@ -88,10 +119,11 @@ func (client *Client) dial(tablet *topodatapb.Tablet) (*grpc.ClientConn, tabletm
	if err != nil {
Now that you've properly wired up a context.Context to this API, we can change the grpcclient.Dial call here to grpcclient.DialContext, which is going to fix an annoying issue, as seen here: #8387 (comment)
Can you update this PR accordingly? That'll unblock the other PR. :)
Happy to swap to DialContext, but per my comment on the other PR there's more work needed to remove that hack
Alright, back! I'm going to be testing this out todayyyy
Okay!! Very happy to report that this works just as well in practice. I was somewhat concerned that dialing outside of the lock and throwing the conn away would result in too much connection churn (reducing churn being my main goal), but I can't make this meaningfully happen in real workloads. I then went ahead and realized that if we evict from the front (it works the same in reverse, it's just easier for me to think about this way 🤷), then when we append to the end we don't need to re-sort, so multiple successive evictions can be faster. Looking forward to your thoughts; I'm thinking tomorrow we can polish this up and get it merged!!
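The front-eviction idea can be sketched with a simple slice-based queue. This is an illustration of the invariant being described (most-evictable conns at the front, fresh conns appended at the back), not the PR's actual implementation:

```go
package main

// Sketch of front-eviction: evict at index 0, append new conns at the end.
// cachedConn and evictionQueue are illustrative stand-ins.

import "fmt"

type cachedConn struct {
	addr string
	refs int
}

type evictionQueue []*cachedConn

// evictOne frees the front conn if it is unreferenced, reporting whether a
// slot was made available. Because the front holds the best candidate, a
// refs > 0 front conn means nothing in the queue is evictable.
func (q *evictionQueue) evictOne() bool {
	if len(*q) == 0 || (*q)[0].refs > 0 {
		return false
	}
	evicted := (*q)[0]
	*q = (*q)[1:]
	fmt.Println("evicted", evicted.addr)
	return true
}

// add appends a newly dialed conn at the back; it starts with one reference
// held by the caller, so no re-sort is needed on insert.
func (q *evictionQueue) add(addr string) *cachedConn {
	c := &cachedConn{addr: addr, refs: 1}
	*q = append(*q, c)
	return c
}

func main() {
	q := evictionQueue{}
	a := q.add("a")
	a.refs-- // release a, making it evictable
	q.evictOne()
	q.add("b")
	fmt.Println("queue length:", len(q))
}
```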
I'll clean up all these comments to reflect the actual state of the code tomorrow
Similarly, this comment is flat-out not true in the simpler implementation, so I'll clean that up as well
Ah, I lied, this is partially true: since the closer func still locks dialer.m, it will still contend with actual dials if you reuse a cachedConnDialer after close, but it won't actually mess with the state of the queue. So it's "safe" but not ideal.
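The closer-as-closure pattern under discussion might look like this sketch: releasing a cached conn takes the dialer lock to update its bookkeeping, which is why a release can contend with concurrent dials. Type and field names here are illustrative:

```go
package main

// Sketch of a release closure that locks the dialer mutex; cachedConn and
// cachedConnDialer are illustrative stand-ins for the PR's types.

import (
	"io"
	"sync"
	"time"
)

type cachedConn struct {
	refs           int
	lastAccessTime time.Time
}

type cachedConnDialer struct {
	m sync.Mutex
}

// closeFunc adapts a plain function to io.Closer.
type closeFunc func() error

func (f closeFunc) Close() error { return f() }

// release builds the io.Closer handed back from dial; calling Close
// decrements the conn's refcount under the dialer lock rather than closing
// the underlying connection. Holding d.m here is the source of the
// contention mentioned above.
func (d *cachedConnDialer) release(c *cachedConn) io.Closer {
	return closeFunc(func() error {
		d.m.Lock()
		defer d.m.Unlock()
		c.refs--
		c.lastAccessTime = time.Now()
		return nil
	})
}

func main() {
	d := &cachedConnDialer{}
	c := &cachedConn{refs: 1}
	closer := d.release(c)
	closer.Close()
}
```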
reminder for me to make a final pass to clean up these tests as well
This is great! Code looks ready to me. Let's fix the last few outdated comments and merge.
Signed-off-by: Andrew Mason <amason@slack-corp.com>
- Move grpctmserver tests to a test package to allow grpctmclient tests to import grpctmserver (this prevents an import cycle)
- Change fakeRPCTM to take a testing.TB, which both benchmark and test structs satisfy

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Okay! I've finished cleaning up; 1151c6d and b5af9f6 contain those changes if you want to view them in isolation. I'm going to clean up the commit history because there are a bunch of commits in this branch related to an older implementation that I threw away, so there's no reason to keep it around. (I am keeping a branch locally pointed at the pre-rebase version in case we want to do anything differently.)
Two bugfixes:

- one to prevent leaking open connections (by checking the map again after getting the write lock)
- one to prevent leaving around closed connections, resulting in errors on later uses (by deleting the freed conn's addr and not the addr we're attempting to dial)

Refactor to not duplicate dialing/queue management; I don't want mistaken copy-paste to result in a bug between sections that should be otherwise identical.

Add one more missing "another goroutine dialed" check.

Add some stats to the cached conn dialer.

Remove everything from the old, slower cache implementation; pqueue is the way to go.

Lots and lots and lots and lots of comments.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
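The first bugfix (re-checking the map after acquiring the write lock) is the classic double-checked pattern; a minimal sketch, with a hypothetical cache/conn pair standing in for the dialer's real state:

```go
package main

// Sketch of "check the map again after getting the write lock"; cache and
// conn are illustrative stand-ins for the dialer's real state.

import "sync"

type conn struct{ addr string }

type cache struct {
	mu    sync.RWMutex
	conns map[string]*conn
	dials int // counts real dials, to show the re-check prevents duplicates
}

func (c *cache) get(addr string) *conn {
	// Fast path: read lock only.
	c.mu.RLock()
	if existing, ok := c.conns[addr]; ok {
		c.mu.RUnlock()
		return existing
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have dialed while we waited for the
	// write lock. Without this, the earlier conn would leak.
	if existing, ok := c.conns[addr]; ok {
		return existing
	}
	c.dials++
	newc := &conn{addr: addr}
	c.conns[addr] = newc
	return newc
}

func main() {
	c := &cache{conns: map[string]*conn{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.get("tablet-101") }()
	}
	wg.Wait()
}
```

With the re-check in place, eight concurrent callers still produce exactly one dial; removing it would let several goroutines each dial and then overwrite each other's map entries, leaking connections.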
…recation easier

Refactor sections of the main `dial` method to allow all unlocks to be deferrals
Add a test case to exercise evictions and full caches
Refactor heap operations to make timing them easier

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
By putting evictable conns at the front, we know that when we append to the end we don't need to resort, so multiple successive evictions can be faster Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
b5af9f6 to b5908a8
…t_conns [grpctmclient] Add support for (bounded) per-host connection reuse Signed-off-by: Andrew Mason <amason@slack-corp.com>
Description
(Issue to come, I mainly want to see how this does in CI, even though I know it works in practice).
tl;dr: the oneshot dialing strategy of the existing grpc implementation of TabletManagerClient results in a lot of connection churn when doing a high volume of tabletmanager RPCs (which happens when doing VReplicationExec calls on workflows for larger keyspaces, for example). In addition, this can cause the number of OS threads to spike, as the go runtime has to park more and more goroutines on the syscall queue in order to wait for those connections, which can result in crashes under high enough load.
This aims to fix that by introducing a per-tablet connection cache. The main reason I don't want to expand the use of the `usePool` flag in the old implementation is that it maintains N (`tablet_manager_grpc_concurrency`, default 8) connections per tablet, rather than using one connection across all RPCs.

Strategy:
Phase 1 - Prepare for a new dialer
I refactored the existing client to separate the dialing of a connection from the actual making of the RPC call, and then defined an interface for both the `dial` and `dialPool` methods. These now return both the client interface and an `io.Closer`, which in the original case just calls `Close` on the grpc ClientConn, but we're going to use this in the cached-conn case to manage reference counting the connections and updating the eviction queue.

Phase 2 - Adding the priority queue
I went through a few iterations to arrive at this, but it's fairly performant. It's based on the priority queue example found in the `container/heap` docs, and sorts connections by fewest "refs" (the number of times they have been acquired but not yet released), breaking ties by older lastAccessTime. [Note: I had the idea to include a background process to sweep up connections with `refs == 0 && time.Since(lastAccessTime) > idleTimeout`, but in the first couple passes it was resulting in way too much lock contention, and the current implementation is good enough for my uses.]

Related Issue(s)
None yet.
Checklist
Deployment Notes