Improve TopoServer Performance and Efficiency For Keyspace Shards#15047
deepthi merged 22 commits into vitessio:main
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord, I think it would be great to add a benchmark test for this. Is it feasible to add an E2E benchmark test, maybe not with 128 shards, but at least with a significant number?
ajm188 left a comment
this rules, just one question, but happy to approve whenever you're ready for a final pass
if len(result) == 0 {
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "%v has no serving shards", keyspace)
}
// Sort the shards by KeyRange for deterministic results.
Codecov Report
Attention:
Additional details and impacted files

Coverage Diff
              main      #15047    +/-
Coverage      47.49%    47.70%    +0.21%
Files         1149      1155      +6
Lines         239387    240181    +794
Hits          113692    114577    +885
Misses        117102    117001    -101
Partials      8593      8603      +10
01a2344 to c874653
deepthi left a comment
Nice optimization.
As a follow-up, we can probably do another PR to replace the call to GetServingShards in materializer.go with a special-purpose func that returns only the first shard, because that's all we ever use in that function.
go/vt/topo/keyspace.go
Outdated
if IsErrType(err, NoNode) {
	// The path doesn't exist, let's see if the keyspace exists.
	_, kerr := ts.GetKeyspace(ctx, keyspace)
	if kerr == nil {
There was a problem hiding this comment.
nit: I believe it is idiomatic to handle the not-nil case first. In fact we should probably link to the standard golang style guide on our website for developers to use as guidance. It will be even better if we can somehow enforce golang style in CI so that it doesn't depend on reviewers' preferences.
There was a problem hiding this comment.
I couldn't find anything in the style guide on this and there are places we do it this way, even within this same file (Alain in GetShardNames()), but it's definitely more typical and there's no reason NOT to do that here so I'll change it.
There was a problem hiding this comment.
generally true, yes (exception is in a switch-case)
i like the idea of a CI check but it's probably not really feasible since a lot of the guidelines take the form of "prefer X" or "avoid Y" without outright banning many things. a lot (all?) of the mandates are covered by gofmt + go vet
go/vt/topo/keyspace.go
Outdated
// This uses a heuristic based on the number of vCPUs available -- where it's
// assumed that as larger machines are used for Vitess deployments they will
// be able to do more concurrently.
var DefaultConcurrency = runtime.NumCPU()
There was a problem hiding this comment.
We are defaulting this to 32 for GetTablets so we now have two separate defaults depending on whether we are getting tablets or shards. The reasoning for 32 is covered by this comment. #14693 (comment)
I'm not completely opposed to changing the default for GetTablets to this value, but I'd prefer to change this to 32 for the reasons listed in that comment.
There was a problem hiding this comment.
OK, I see that you were basing it on fs.Int64Var(&topoReadConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads."), but you didn't reference the value. I'll see if I can use the variable somehow in both places.
There was a problem hiding this comment.
I addressed both of your comments here: c7a27d0
ajm188 left a comment
approving, pending resolution of the concurrency default (which i don't have strong feelings on)
shards, err := exec.ts.FindAllShardsInKeyspace(ctx, keyspace, &topo.FindAllShardsInKeyspaceOptions{
	Concurrency: topo.DefaultConcurrency, // Limit concurrency to avoid overwhelming the topo server.
})
There was a problem hiding this comment.
Nit picking here but we are defining a default &topo.FindAllShardsInKeyspaceOptions in three different places, perhaps we should create a global default variable and use that here and in the two other places.
There was a problem hiding this comment.
In this file, in go/vt/topo/test/shard.go and in go/vt/vtctl/workflow/utils.go
go/vt/topo/keyspace.go
Outdated
result := make(map[string]*ShardInfo, len(listResults))
for _, entry := range listResults {
	// The shard key looks like this: /vitess/global/keyspaces/commerce/shards/-80/Shard
	shardKey := string(entry.Key)
	shardName := path.Base(path.Dir(shardKey)) // The base part of the dir is "-80"
	// Validate the extracted shard name.
There was a problem hiding this comment.
nit picking here too, it might read better if we extract the content of this entire if block. it seems like we only need to send in listResults as an argument and return results, err, something like:
	if err == nil {
		return handleResults(listResults)
	}
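For illustration, the shard-name extraction discussed in this thread can be sketched as a small standalone helper. This is only a sketch under assumptions: `shardNameFromKey` is a hypothetical name, and the validation shown (checking for the `Shard` leaf and the `shards` parent directory) is an illustrative guess, not the check the PR performs.

```go
package main

import (
	"fmt"
	"path"
)

// shardNameFromKey is a hypothetical helper. It extracts the shard name from
// a topo key shaped like <root>/keyspaces/<keyspace>/shards/<shard>/Shard.
// The validation here is illustrative only.
func shardNameFromKey(key string) (string, error) {
	dir := path.Dir(key)               // e.g. .../shards/-80
	name := path.Base(dir)             // e.g. -80
	parent := path.Base(path.Dir(dir)) // e.g. shards
	if path.Base(key) != "Shard" || parent != "shards" {
		return "", fmt.Errorf("unexpected shard key %q", key)
	}
	return name, nil
}

func main() {
	name, err := shardNameFromKey("/vitess/global/keyspaces/commerce/shards/-80/Shard")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

Pulling the per-entry parsing into a helper like this is one way to get the smaller, more readable block the review comment asks for.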
deepthi left a comment
It will be nice to address the int64 vs int issue. Rest LGTM.
go/vt/topo/keyspace.go
Outdated
var DefaultConcurrency int64

func registerFlags(fs *pflag.FlagSet) {
	fs.Int64Var(&DefaultConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads.")
There was a problem hiding this comment.
I like this! Instead of hard-coding the default for all but healthcheck, we now allow this to be customized for any binary that imports the topo package.
go/vt/topo/keyspace.go
Outdated
// This file contains keyspace utility functions.

// Default concurrency to use in order to avoid overwhelming the topo server.
var DefaultConcurrency int64
There was a problem hiding this comment.
Ugh, I just realized the flag value was always int64, and there's no good reason for that. Should we change this to int? Is there any reason not to?
There was a problem hiding this comment.
No reason not to. int is int64 on 64-bit platforms, and for 32-bit they can't really be using bigger numbers anyway 🤷
There was a problem hiding this comment.
You can use int64 on 32 bit machines -- two 32 bit pieces are used. We (the royal we) were using long long / int64 when needed when most machines were still 32 bit architectures.
There was a problem hiding this comment.
I'm fine moving this to int though. I initially started there but then preferred not to change the existing flag.
There was a problem hiding this comment.
Changed it here (and now using it in another location where we should be): 99a7f86
DefaultHealthCheckTimeout = 1 * time.Minute

// DefaultTopoReadConcurrency is used as the default value for the topoReadConcurrency parameter of a TopologyWatcher.
DefaultTopoReadConcurrency int = 5
There was a problem hiding this comment.
Good catch. This was unused 🤦
Signed-off-by: deepthi <deepthi@planetscale.com>
Failures in the Code coverage workflow are being caused by flaky tests which will be fixed separately.
Description
There are various cases (e.g. when working with VReplication workflows) where we get all of the [serving] shards in a keyspace. To do this we were first getting a list of all shard names, then getting each shard record serially. This resulted in many topo server calls, and when the topo server has high latency it could cause various commands to time out and have knock-on effects elsewhere, because long-running reads block other operations (especially with older etcd versions).
For example, if you had a keyspace with 128 shards then you would make 129 topo server calls when getting all [serving] shards: 1 call to list the shard names, then 128 calls to get each shard record.
With this PR you would go from 129 topo server calls in this example down to 1 when the topo server supports key prefix scans (all but ZooKeeper do). When the topo server does not support them (or the response message exceeds the max message size), we fall back to the shard-by-shard method but do so concurrently. Either way, the total time taken to get all of the shards in a keyspace should improve dramatically.
Related Issues
Checklist