[vtadmin] Aggregate schema sizes #7751
Conversation
Signed-off-by: Andrew Mason <amason@slack-corp.com>
Changes are to make it behave more similarly to the old `GetSchema`, so that I could repurpose those tests. At this point, we can delete `GetSchema` and rename `GetSchemaForKeyspace`.
doeg
left a comment
This looks great to me!!! A handful of lil non-blocking comments. Thank you again because I know this was a ton of work.
So excited to use this on the front-end. 😈
go/vt/vtadmin/cluster/cluster.go
Outdated
	BaseRequest *vtctldatapb.GetSchemaRequest
	// SizeOpts control whether the (*Cluster).GetSchema method performs
	// cross-shard table size aggregation (via the AggregateSizes field), and
	// the behavior of that size aggregation (via the IncludeNonServingShards
Could you help me understand the use cases for the two values of IncludeNonServingShards? (Thinking of vtadmin-web but maybe there are others!) An entirely non-blocking question, of course. :)
Like, maybe it's as simple as a checkbox in the UI, and operators will already know when they care about one vs the other....? And/or is the diff between the two values interesting to show? (Also, is true the best default value, you think?)
I had the same question. I'm wondering if we need this right now; maybe it is not worth having as part of the API at the moment. My initial intuition is that aggregating only serving shards seems sufficient. What other use cases did you have in mind?
Going to tag @rafael just to consolidate the discussion on this thread, since it's in at least two places now.
I'm actually having trouble articulating why we would ever want this. Maybe if we were mid-split and wanted to see the true total size of provisioned storage vs the logical size (in a split you would be double-counting between a source and its destination shards), but beyond that I'm not sure.
I think, though, that if we are going to keep it, false is the right default here, because it's probably confusing to get the potential double-counting without explicitly opting in.
Buttttt like I just said, maybe we should remove it entirely and just always skip non-serving shards. I'm curious to hear what you both think!
I'd be in favor of always skipping non-serving shards for now. If we see a use case in the future, then we can iterate and add it.
Omitting it sounds good to me, especially since it'll simplify the front-end a lil bit too. :)
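The "always skip non-serving shards" behavior agreed on above can be sketched as a simple filter. The `Shard` struct here is a hypothetical, flattened stand-in for the real topodata protobuf types (where the flag lives at `Shard.Shard.IsMasterServing`); only the filtering idea is from the thread.

```go
package main

// Shard is a simplified, hypothetical stand-in for the real topodata shard
// types; the actual field in Vitess is nested as Shard.Shard.IsMasterServing.
type Shard struct {
	Name            string
	IsMasterServing bool
}

// servingShards returns only the shards whose primary is serving, mirroring
// the decision in this thread to always skip non-serving shards when
// aggregating table sizes.
func servingShards(shards []Shard) []Shard {
	out := make([]Shard, 0, len(shards))
	for _, s := range shards {
		if s.IsMasterServing {
			out = append(out, s)
		}
	}
	return out
}
```

With this in place there is no `IncludeNonServingShards` knob at all; during a resharding operation the source and destination shards can never both be counted.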
go/vt/vtadmin/cluster/cluster.go
Outdated
	span, ctx := trace.NewSpan(ctx, "Cluster.FindAllShardsInKeyspace")

	AnnotateSpan(c, span)
	span.Annotate("keyspace", keyspace)

	resp, err := c.Vtctld.FindAllShardsInKeyspace(ctx, &vtctldatapb.FindAllShardsInKeyspaceRequest{
		Keyspace: keyspace,
	})

	span.Finish()
Really minor + non-blocking point -- did you consider a (c *Cluster) FindAllShardsInKeyspace function? That might simplify this a tiny bit, especially with the annotations. (It took me 0.2 seconds to realize that in this case we don't want a defer span.Finish() call.) 😊
yeah, that sounds good! (i didn't because i didn't want to write more tests :shamebell:)
go/vt/vtadmin/cluster/cluster.go
Outdated
	if len(keyspaceTablets) == 0 {
		// consider how to include info about the tablets we looked at, but
		// that's also potentially a very long list .... maybe we should
		// just log it (yes, do that).
Should this be a formal TODO? Haha 🌻
I uhh meant for this to be for my eyes only 😬
go/vt/vtadmin/cluster/cluster.go
Outdated
	if _, ok = tableSize.ByShard[tablet.Tablet.Shard]; ok {
		// We managed to query for the same shard twice, that's ...
		// weird. but do we care? maybe just log? idk!
lol I do truly love this comment, but it is somewhat out of character with the rest so I wanted to point it out just in case
Maybe we should error? That seems like something that shouldn't happen, so if it does happen, we'd better investigate.
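The error-instead-of-log suggestion could look like the sketch below. The `tableSize`/`shardSize` shapes are loose, hypothetical approximations of the aggregation structures discussed in the diff (the real code keys `ByShard` off `tablet.Tablet.Shard`); only the fail-loudly behavior is the point.

```go
package main

import "fmt"

// shardSize holds one shard's contribution to a table's aggregate size
// (hypothetical simplification of the real vtadmin types).
type shardSize struct {
	RowCount   uint64
	DataLength uint64
}

// tableSize accumulates per-shard sizes for a single table.
type tableSize struct {
	ByShard map[string]shardSize
}

// addShardSize records one shard's sizes, returning an error if the shard
// was already recorded -- per the reviewer's suggestion to fail loudly on a
// duplicate rather than silently log it.
func (t *tableSize) addShardSize(shard string, size shardSize) error {
	if _, ok := t.ByShard[shard]; ok {
		return fmt.Errorf("duplicate GetSchema result for shard %s", shard)
	}
	t.ByShard[shard] = size
	return nil
}
```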
	// Instead of starting at false, we start with whatever the base request
	// specified. If we have exactly one tablet to query (i.e. we're not
	// doing multi-shard aggregation), it's possible the request was to
	// literally just get the table sizes; we shouldn't assume. If we have
	// more than one tablet to query, then we are doing size aggregation,
	// and we'll flip this to true after spawning the first GetSchema rpc.
	sizesOnly = opts.BaseRequest.TableSizesOnly
This makes a lot of sense + I appreciate you annotating it inline!
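The flip-to-true pattern annotated in that hunk can be distilled into a tiny helper. `sizesOnlyPerTablet` is a hypothetical illustration, not real vtadmin code: it returns the `TableSizesOnly` flag that each successive GetSchema rpc would be issued with, given the base request's value.

```go
package main

// sizesOnlyPerTablet illustrates the pattern from the annotated hunk: the
// first tablet is queried with whatever TableSizesOnly the base request
// specified (it might legitimately be a sizes-only request), and every
// subsequent tablet is queried with TableSizesOnly=true, because once we are
// aggregating across shards we only need sizes from the extra tablets.
// (Hypothetical helper, for illustration only.)
func sizesOnlyPerTablet(baseSizesOnly bool, numTablets int) []bool {
	flags := make([]bool, 0, numTablets)
	sizesOnly := baseSizesOnly
	for i := 0; i < numTablets; i++ {
		flags = append(flags, sizesOnly)
		// After "spawning" the first GetSchema rpc, flip to sizes-only.
		sizesOnly = true
	}
	return flags
}
```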
rafael
left a comment
I did my initial pass! Let me know your thoughts.
go/vt/vtadmin/cluster/cluster.go
Outdated
	// (described above) to find one SERVING tablet for each shard in the
	// keyspace. If IncludeNonServingShards is false, then we will skip any
	// shards for which IsMasterServing is false.
	SizeOpts *vtadminpb.GetSchemaTableSizeOptions
Names are always hard. SizeOpts is a bit opaque to me. Maybe TableSizeAggregationOpts? Though that's a bit of a mouthful...
I think we should come up with a better name.
How about just TableSizeOptions and have it mirror the type name? I think it's a reasonable tradeoff between clarity and verbosity. Let me know what you think!
go/vt/vtadmin/cluster/cluster.go
Outdated
		sizesOnly = opts.BaseRequest.TableSizesOnly
	)

	for _, tablet := range tabletsToQuery {
Similar to my other comment about throttling: I'm not too concerned, because I think this should scale pretty well since these are rpcs to different hosts. But it does make me a bit nervous that for large keyspaces we could be making too many of these in parallel.
I'm wondering if a guardrail that caps the parallelism at some X would be good to have in place early on.
Yeah, I think when we do the "bound topo RPCs to max concurrency" there should be two limits: tablet RPC max concurrency, and topo RPC max concurrency (maybe others). Then each of those gets configured (per cluster, so larger clusters could in theory have higher limits) and we spawn a waitgroup/pool/etc with that value as the capacity.
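A standard-library sketch of the guardrail being discussed: a buffered channel used as a semaphore caps how many rpcs are in flight at once. (`boundedFanOut` is a hypothetical helper, not vtadmin code; `golang.org/x/sync`'s semaphore or errgroup would be real-world alternatives.)

```go
package main

import "sync"

// boundedFanOut runs fn over each item with at most maxConcurrency goroutines
// in flight, sketching the per-cluster "tablet RPC max concurrency" guardrail
// discussed above. A buffered channel acts as a counting semaphore.
func boundedFanOut(items []string, maxConcurrency int, fn func(string)) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxConcurrency)
	for _, item := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrency workers are in flight
		go func(it string) {
			defer func() { <-sem; wg.Done() }()
			fn(it)
		}(item)
	}
	wg.Wait()
}
```

Configuring `maxConcurrency` per cluster, as suggested, would let larger clusters run with higher limits.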
	type FindAllShardsInKeyspaceOptions struct {
		// SkipDial indicates that the cluster can assume the vtctldclient has
		// already dialed up a connection to a vtctld.
		SkipDial bool
I'm curious why SkipDial is exposed as an option given that all uses (so far) are true. Requiring the caller to dial a vtctld connection seems like a fair contract to me + it seems like it'd simplify the interface quite a bit. Let me know if I missed something though!
I'm of two minds about requiring callers to know the details enough to call Dial (though, this option just does the same thing, only one thin layer removed).
I think long term what I want is for a cluster to be smart about the last time it dialed, and "just do the right thing", and the caller doesn't need to know or care. I'd like to leave this as-is for now and noodle on that some more.
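The "cluster just does the right thing" idea could eventually look like the sketch below, where the first rpc triggers the dial and callers never think about it. This is a hypothetical shape (`lazyDialer`, `ensureDialed`), not the current vtadmin implementation, which exposes `SkipDial` instead.

```go
package main

import "sync"

// lazyDialer sketches the long-term idea from this thread: instead of a
// SkipDial option, the cluster dials on first use and callers never need to
// know. (Hypothetical shape; the real client would also track staleness and
// redial as needed.)
type lazyDialer struct {
	once   sync.Once
	dialed bool
}

// ensureDialed is called at the top of every rpc wrapper; sync.Once makes the
// dial happen exactly once even under concurrent callers.
func (d *lazyDialer) ensureDialed() {
	d.once.Do(func() {
		// A real implementation would establish the vtctld connection here.
		d.dialed = true
	})
}
```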
Description
This PR enhances the Get/Find Schema(s) endpoints to aggregate row counts and data lengths (more broadly, "sizes") across tablets from all shards in a given keyspace, so that vtadmin can display them to the user.
To make this work, I refactored a lot (but not all) of the logic around orchestrating and making `GetSchema` vtctld rpcs from the API layer to the Cluster layer, including the addition of a `cluster.GetSchemaOptions` type to control the behavior of these cluster-level methods, as well as to allow us to continue to prefetch tablets as an optimization. We use that `GetSchemaOptions` type to also plumb through the size-aggregation options. There's extensive inline documentation about how the different options in this type affect the behavior, so I won't repeat them here.
I also pulled out our tablet filtering code to a public function as `vtadminproto.FilterTablets` and added some additional trace annotation helper functions.
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: