Use GetTabletsByCell in healthcheck by deepthi · Pull Request #14693 · vitessio/vitess

deepthi · 2023-12-06T00:25:36Z

Description

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead, we can use GetTabletsByCell to fetch all tablets for a cell at once.

This PR does a few more things:

* GetTabletsByCell now handles the case where the response size violates gRPC limits by falling back to fetching one tablet at a time in case of error.
* Previously, the one-tablet-at-a-time method had unlimited concurrency. This PR introduces a configuration option to limit that concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsByCell.
* The behavior of the --refresh_known_tablets flag is different now. Previously we would not read known tablets at all; now we do read them, but ignore any changes if they are already known.

The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in #3965 are alleviated by fetching all tablets in one call to the topo.

Related Issue(s)

Fixes #14277

Checklist

"Backport to:" labels have been added if this change should be back-ported
Tests were added or are not required
Did the new or modified tests pass consistently locally and on the CI
Documentation was added or is not required

Deployment Notes

vitess-bot · 2023-12-06T00:25:39Z

mattlord

Nice work on this! Makes for a great improvement.

go/vt/topo/tablet.go

go/vt/topo/tablet_test.go

go/stats/counters.go

go/vt/topo/tablet.go

deepthi · 2023-12-07T18:07:14Z

I'm working on a test case as suggested by @mattlord, so we'll merge after that is done.

go/vt/topo/tablet.go

go/vt/topo/tablet_test.go

mattlord

I had some minor comments. The most consequential ones -- although still relatively minor -- were related to using a semaphore and deferring mutex unlocking. Let me know what you think. I can come back to this quickly and re-approve at any time.

go/vt/discovery/topology_watcher.go

mattlord · 2023-12-11T14:04:54Z

go/vt/discovery/topology_watcher_test.go

-		}
-		if _, err := ts.UpdateTabletFields(context.Background(), tablet.Alias, func(t *topodatapb.Tablet) error {
+		})
+		require.Nil(t, err, "UpdateTabletFields failed")


Any reason to use require.Nil instead of require.NoError here and in a few other places? It's fine, just curious.

Because the func has two return values, not just a single return value of type error. I'd have preferred the NoError form, but it doesn't work for this case.

I meant that you can use this instead (in the other cases you're just using the error return value directly from the function):

require.NoError(t, err, "UpdateTabletFields failed")

AFAIK they are equivalent, but it's a little more explicit. Not a problem at all either way.

mattlord · 2023-12-11T14:09:19Z

go/vt/discovery/topology_watcher_test.go

+	require.Nil(t, err, "FixShardReplication failed")
 	tw.loadTablets()
-	checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "GetTablet": 0, "RemoveTablet": 1})
+	checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "RemoveTablet": 1})


It looks like this new call is equivalent to the old one. I tend to prefer the previous as it's more explicit about what we expect (w/o looking at how checkOpCounts works), but up to you.

We don't call GetTablet at all any more, so this change reflects that fact. I thought about deprecating and eventually deleting the GetTablet metric, but it didn't seem worth it.
Maybe it is worth changing the tests to explicitly expect 0 calls to GetTablet as a way of catching regressions in this behavior.
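
To illustrate why an explicit zero can catch regressions, here is a small sketch. It does not reproduce the real checkOpCounts helper; `mismatches` is a hypothetical checker that assumes only the keys listed in the expectation are inspected.

```go
package main

import "fmt"

// mismatches reports every key in want whose observed count differs.
// Keys absent from got read as zero, so {"GetTablet": 0} passes only while
// no GetTablet call has been recorded.
func mismatches(got, want map[string]int64) []string {
	var out []string
	for op, w := range want {
		if g := got[op]; g != w {
			out = append(out, fmt.Sprintf("%s: got %d, want %d", op, g, w))
		}
	}
	return out
}

func main() {
	// Suppose a regression reintroduced per-tablet reads.
	got := map[string]int64{"ListTablets": 1, "RemoveTablet": 1, "GetTablet": 3}

	implicit := map[string]int64{"ListTablets": 1, "RemoveTablet": 1}
	explicit := map[string]int64{"ListTablets": 1, "RemoveTablet": 1, "GetTablet": 0}

	fmt.Println(len(mismatches(got, implicit))) // the extra GetTablet calls go unnoticed
	fmt.Println(len(mismatches(got, explicit))) // the extra GetTablet calls are reported
}
```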

go/vt/discovery/topology_watcher_test.go

go/vt/topo/tablet_test.go

go/vt/topo/memorytopo/memorytopo.go

go/vt/topo/tablet_test.go

deepthi · 2023-12-12T00:34:24Z

@mattlord I think I've addressed all the comments. This is ready for final review. If it looks good to you, please remove the "Do Not Merge" label.

mattlord

I have some minor notes -- sorry to keep going back and forth on this. IMO it's worth being extra cautious in this case though.

mattlord · 2023-12-12T01:49:58Z

go/vt/topo/tablet.go

 			tabletInfo, err := ts.GetTablet(ctx, tabletAlias)
-			mutex.Lock()
+			<-sem
+			mu.Lock()


I think that we can get rid of this mutex usage here and replace the related returnError logic with sync.Once?

The other option is to use a concurrency.FirstErrorRecorder. Those both feel more standard to me.

I'm being paranoid (probably overly so) as it's easy to end up with deadlocks and panics. I guess we have unit test coverage though and thus -race tests for the function?

mattlord · 2023-12-12T02:01:29Z

go/vt/topo/tablet.go

 			tabletInfo, err := ts.GetTablet(ctx, tabletAlias)
-			mutex.Lock()
+			<-sem
+			mu.Lock()


Whenever we do end up locking it, we should unlock it in a defer here as we release it at the end of the closure/goroutine anyway.

mattlord · 2023-12-12T02:21:34Z

go/vt/topo/tablet.go

 			defer wg.Done()
+			if err := sem.Acquire(ctx, 1); err != nil {
+				// Only happens if context is cancelled.
+				mu.Lock()


I think that we should do this:

mu.Lock()
defer mu.Unlock()
...
return

We definitely want to return in this block as we didn't acquire the semaphore so we don't want to continue on in the goroutine.

Might be getting hard to follow at this point, so this is what I was thinking: https://gist.github.com/mattlord/089986f60bffe33d1d7b3faa21942e06

Oh yeah I was looking at the code again and realized I forgot a return 😮‍💨
Glad you caught it too. I think we actually need a lot more test coverage than we have, but I'm calling that debt at this point. We are slightly better off than before given that at least I added 4 unit tests.
Error conditions are usually the hardest to test, and that is where most of the missing coverage tends to be.

You can pass different functions to sync.Once.Do; only the first call on a given Once actually executes its function. But it doesn't matter. It's not much different than what is there now.
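
The sync.Once behavior mentioned here can be checked with a few lines. `callsRun` is a hypothetical helper written only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// callsRun passes two different functions to the same sync.Once and reports
// which of them actually ran.
func callsRun() (first, second bool) {
	var once sync.Once
	once.Do(func() { first = true })
	once.Do(func() { second = true }) // skipped: this Once has already fired
	return first, second
}

func main() {
	fmt.Println(callsRun()) // prints "true false"
}
```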

mattlord · 2023-12-12T04:15:32Z

go/vt/topo/tablet.go

+// Server.FindAllShardsInKeyspace.
+type GetTabletsByCellOptions struct {
+	// Concurrency controls the maximum number of concurrent calls to GetTablet.
+	// For backwards compatibility, concurrency of 0 is considered unlimited.


I don't think this comment is right anymore, is it? I say that because we currently only use the value if it's > 0 (at least in GetTabletMap).

Yeah.. this has gone through 3 different implementations at this point, and somewhere in there we lost the infinite concurrency. I'm inclined to think 32 is fine as a default and we don't need to worry about infinite concurrency at all 🤷
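
One way to make the comment and the behavior agree is to normalize the option up front. The sketch below is an assumption, not the merged code: `Options`, `effectiveConcurrency`, and the fallback of 32 (the value floated in the comment above) are all hypothetical.

```go
package main

import "fmt"

// defaultConcurrency is an assumed fallback, taken from the value suggested
// in the review thread rather than from the Vitess source.
const defaultConcurrency = 32

// Options is a hypothetical stand-in for the options struct under review.
type Options struct {
	// Concurrency is the maximum number of concurrent single-tablet reads.
	// Values <= 0 mean "use the default" in this sketch.
	Concurrency int
}

// effectiveConcurrency returns the limit that would actually be applied.
func effectiveConcurrency(opt *Options) int {
	if opt == nil || opt.Concurrency <= 0 {
		return defaultConcurrency
	}
	return opt.Concurrency
}

func main() {
	fmt.Println(effectiveConcurrency(nil))
	fmt.Println(effectiveConcurrency(&Options{Concurrency: 8}))
}
```

With this shape, the doc comment can state a single rule (non-positive means default) instead of promising unlimited concurrency.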

mattlord

LGTM (again)! 🙂 I'm glad to see that we finally got this done. Thank you for that.

This backports upstream PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows:

VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsByCell to fetch all tablets for a cell at once.

This PR does a few more things:

* GetTabletsByCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsByCell.
* The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known.

The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1).

We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.

deepthi added 4 commits December 5, 2023 13:44

remove unused interface

e56cbbe

Signed-off-by: deepthi <deepthi@planetscale.com>

topo: implement optional concurrency limit for GetTabletsByCell

645aeef

Signed-off-by: deepthi <deepthi@planetscale.com>

topo: add new error type ResourceExhausted and use it in GetTabletsByCell

7fa919c

Signed-off-by: deepthi <deepthi@planetscale.com>

healthcheck: get all tablets in one topo call if possible

6a7cf27

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi requested review from ajm188, frouioui, harshit-gangal, mattlord, notfelineit, rohit-nayak-ps, shlomi-noach and systay as code owners December 6, 2023 00:25

vitess-bot bot added the NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels Dec 6, 2023

github-actions bot added this to the v19.0.0 milestone Dec 6, 2023

deepthi added the Type: Performance and Component: Query Serving labels and removed the NeedsWebsiteDocsUpdate and NeedsIssue labels Dec 6, 2023

mattlord removed the NeedsDescriptionUpdate label Dec 6, 2023

mattlord approved these changes Dec 6, 2023

deepthi added 2 commits December 6, 2023 11:14

stats: revert ZeroAll to not clear the map, change reset to use clear from stdlib

8578326

Signed-off-by: deepthi <deepthi@planetscale.com>

healthcheck: clean up topology watcher unit tests

9ab528e

Signed-off-by: deepthi <deepthi@planetscale.com>

vmg approved these changes Dec 7, 2023

go/stats/counters.go

go/vt/topo/tablet.go

deepthi added 3 commits December 8, 2023 14:29

healthcheck: remove unnecessary indirection via NewTopologyWatcher

433acb6

Signed-off-by: deepthi <deepthi@planetscale.com>

topo: add unit test for GetTablets fallback to one tablet at a time

aff3952

Signed-off-by: deepthi <deepthi@planetscale.com>

GetTablets: go back to using waitGroup so that we try to Get every tablet even if one or more fail

fd960dd

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi commented Dec 8, 2023

go/vt/topo/tablet.go

deepthi commented Dec 8, 2023

go/vt/topo/tablet_test.go

mattlord reviewed Dec 11, 2023

mattlord self-requested a review December 11, 2023 14:38

deepthi added the Do Not Merge label Dec 12, 2023

topo watcher: remove some more indirection, make behavior on error match previous behavior, address review

70ff39c

Signed-off-by: deepthi <deepthi@planetscale.com>

mattlord reviewed Dec 12, 2023

mattlord self-requested a review December 12, 2023 03:58

mattlord reviewed Dec 12, 2023

topo: add missing return, lock/unlock more idiomatically

76fe861

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi removed the Do Not Merge label Dec 12, 2023

mattlord approved these changes Dec 12, 2023

deepthi merged commit 5d05612 into vitessio:main Dec 12, 2023

deepthi deleted the ds-get-tablets branch December 12, 2023 21:01

ejortegau pushed a commit to slackhq/vitess that referenced this pull request Dec 13, 2023

Use GetTabletsByCell in healthcheck (vitessio#14693)

a5cbcd1

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi mentioned this pull request Jan 29, 2024

Improve TopoServer Performance and Efficiency For Keyspace Shards #15047

Merged

5 tasks

wangweicugw mentioned this pull request Feb 2, 2024

Use GetTabletsByCell in healthcheck jd-opensource/vtdriver#149

Open

deepthi mentioned this pull request Sep 12, 2024

RFC: improve efficiency of tablet filtering in go/vt/discovery and topo #16761

Open

ejortegau mentioned this pull request Sep 16, 2024

Backport 14693 - Use GetTabletsByCell in healthcheck slackhq/vitess#514

Merged

timvaillancourt mentioned this pull request Oct 25, 2024

Improve efficiency of vtorc topo calls #17071

Merged

5 tasks

timvaillancourt mentioned this pull request Nov 22, 2024

Feature Request: ensure topo calls use --topo_read_concurrency #17275

Closed
