Use GetTabletsByCell in healthcheck#14693
Conversation
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
…Cell Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
mattlord
left a comment
There was a problem hiding this comment.
Nice work on this! Makes for a great improvement.
… from stdlib Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
|
I'm working on a test case as suggested by @mattlord, so we'll merge after that is done. |
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
…blet even if one or more fail Signed-off-by: deepthi <deepthi@planetscale.com>
mattlord
left a comment
There was a problem hiding this comment.
I had some minor comments. The most consequential ones -- although still relatively minor -- were related to using a semaphore and deferring mutex unlocking. Let me know what you think. I can come back to this quickly and re-approve at any time.
| } | ||
| if _, err := ts.UpdateTabletFields(context.Background(), tablet.Alias, func(t *topodatapb.Tablet) error { | ||
| }) | ||
| require.Nil(t, err, "UpdateTabletFields failed") |
There was a problem hiding this comment.
Any reason to use require.Nil instead of require.NoError here and in a few other places? It's fine, just curious.
There was a problem hiding this comment.
Because the func has two return values, not just a single return value of type error. I'd have preferred the NoError form, but it doesn't work for this case.
There was a problem hiding this comment.
I meant that you can use this instead (in the other cases you're just using the error return value directly from the function):
require.NoError(t, err, "UpdateTabletFields failed")
AFAIK they are equivalent, but it's a little more explicit. Not a problem at all either way.
| require.Nil(t, err, "FixShardReplication failed") | ||
| tw.loadTablets() | ||
| checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "GetTablet": 0, "RemoveTablet": 1}) | ||
| checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "RemoveTablet": 1}) |
There was a problem hiding this comment.
It looks like this new call is equivalent to the old one. I tend to prefer the previous as it's more explicit about what we expect (w/o looking at how checkOpCounts works), but up to you.
There was a problem hiding this comment.
We don't call GetTablet at all any more, so this change reflects that fact. I thought about deprecating and eventually deleting the GetTablet metric, but it didn't seem worth it.
Maybe it is worth changing the tests to explicitly expect 0 calls to GetTablet as a way of catching regressions in this behavior.
…tch previous behavior, address review Signed-off-by: deepthi <deepthi@planetscale.com>
|
@mattlord I think I've addressed all the comments. This is ready for final review. If it looks good to you, please remove the "Do Not Merge" label. |
| tabletInfo, err := ts.GetTablet(ctx, tabletAlias) | ||
| mutex.Lock() | ||
| <-sem | ||
| mu.Lock() |
There was a problem hiding this comment.
I think that we can get rid of this mutex usage here and replace the related returnError logic with sync.Once?
The other option is to use a concurrency.FirstErrorRecorder. Those both feel more standard to me.
I'm being paranoid (probably overly so) as it's easy to end up with deadlocks and panics. I guess we have unit test coverage though and thus -race tests for the function?
| tabletInfo, err := ts.GetTablet(ctx, tabletAlias) | ||
| mutex.Lock() | ||
| <-sem | ||
| mu.Lock() |
There was a problem hiding this comment.
Whenever we do end up locking it, we should unlock it in a defer here as we release it at the end of the closure/goroutine anyway.
| defer wg.Done() | ||
| if err := sem.Acquire(ctx, 1); err != nil { | ||
| // Only happens if context is cancelled. | ||
| mu.Lock() |
There was a problem hiding this comment.
I think that we should do this:
mu.Lock()
defer mu.Unlock()
...
return
We definitely want to return in this block as we didn't acquire the semaphore so we don't want to continue on in the goroutine.
There was a problem hiding this comment.
Might be getting hard to follow at this point, so this is what I was thinking: https://gist.github.com/mattlord/089986f60bffe33d1d7b3faa21942e06
There was a problem hiding this comment.
Oh yeah I was looking at the code again and realized I forgot a return 😮💨
Glad you caught it too. I think we actually need a lot more test coverage than we have, but I'm calling that debt at this point. We are slightly better off than before given that at least I added 4 unit tests.
Error conditions are usually the hardest to test, and that is where most the missing coverage tends to be.
There was a problem hiding this comment.
You can use different functions in sync.Once.Do, it will only execute the Do method once. But it doesn't matter. It's not much different than what is there now.
go/vt/topo/tablet.go
Outdated
| // Server.FindAllShardsInKeyspace. | ||
| type GetTabletsByCellOptions struct { | ||
| // Concurrency controls the maximum number of concurrent calls to GetTablet. | ||
| // For backwards compatibility, concurrency of 0 is considered unlimited. |
There was a problem hiding this comment.
I don't think this comment is right anymore, is it? I say that because we currently only use the value if it's > 0 (at least in GetTabletMap).
There was a problem hiding this comment.
Yeah.. this has gone through 3 different implementations at this point, and somewhere in there we lost the infinite concurrency. I'm inclined to think 32 is fine as a default and we don't need to worry about infinite concurrency at all 🤷
Signed-off-by: deepthi <deepthi@planetscale.com>
mattlord
left a comment
There was a problem hiding this comment.
LGTM (again)! 🙂 I'm glad to see that we finally got this done. Thank you for that.
Signed-off-by: deepthi <deepthi@planetscale.com>
This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows: VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things: * GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error. * Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency. * We pass topoReadConcurrency from healthcheck into GetTabletsForCell. * The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known. The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
* Backport Use GetTabletsByCell in healthcheck This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows: VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things: * GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error. * Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency. * We pass topoReadConcurrency from healthcheck into GetTabletsForCell. * The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known. The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
Description
VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.
This PR does a few more things:
topoReadConcurrencyfrom healthcheck into GetTabletsForCell.--refresh_known_tabletsflag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known.The basic fix has already been tried in production and shown to reduce the number of
Getcalls from vtgate -> topo from O(n) to O(1).We can consider deprecating and deleting
--refresh_known_tabletsin a future release. The concerns that originally motivated adding that flag in #3965 are alleviated by fetching all tablets in one call to the topo.Related Issue(s)
Fixes #14277
Checklist
Deployment Notes