
[vtadmin] schema cache #10120

Merged
ajm188 merged 39 commits into vitessio:main from planetscale:andrew/vtadmin/schema-cache
May 22, 2022

Conversation

ajm188 (Contributor) commented Apr 20, 2022

Description

This PR introduces a caching mechanism for vtadmin, with our first use-case being schemas.

It does not use the go/cache package, primarily because I wanted a mechanism to enqueue backfills as a first-class concern, which I'll talk about a bit later.

Heeeeeeeeeeere we go:

Cache Design

The cache design is pretty much what you would expect from generics: we store a mapping of keys (of type K) to values (of type V, which can be any-thing).

You might expect the cache to be defined as Cache[string, V any], but that requires every callsite to know how to turn something into a key.
By pushing this into an interface (Keyer) that keys have to implement, there can only be one place (the Key() method) where cache keys are created, which reduces duplication of code and therefore the chance two places accidentally use different key schemes for the same cache.
Further, I chose not to leverage the existing Stringer interface, in case we ever need to turn a struct into a string for reasons other than "unique cache key".
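The Keyer idea can be sketched like this; the key type and its fields below are illustrative stand-ins, not the exact vtadmin definitions:

```go
package main

import "fmt"

// Keyer is implemented by anything that can be used as a cache key.
// Centralizing key construction in Key() means there is exactly one
// place where a given cache's key scheme is defined.
type Keyer interface {
	Key() string
}

// schemaCacheKey is an illustrative key type scoped to a cluster and
// keyspace (not the real vtadmin type).
type schemaCacheKey struct {
	Cluster  string
	Keyspace string
}

func (k schemaCacheKey) Key() string {
	return fmt.Sprintf("schema/%s/%s", k.Cluster, k.Keyspace)
}

func main() {
	k := schemaCacheKey{Cluster: "c1", Keyspace: "commerce"}
	fmt.Println(k.Key()) // schema/c1/commerce
}
```

Because callers only ever go through Key(), two call sites can never accidentally disagree on the key scheme for the same cache.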

Anyway.

Our generic cache is built on top of go-cache and provides only a subset of the functionality.
We can expand this set if we end up needing more features, but for now Add() and Get() are sufficient.
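A minimal sketch of that wrapper, exposing only Add and Get. The real cache delegates to go-cache (with expirations); since that is an external dependency, this stand-in uses a mutex-guarded map and omits TTLs so the example is self-contained:

```go
package main

import (
	"fmt"
	"sync"
)

type Keyer interface{ Key() string }

// Cache maps keys of type K (any Keyer) to values of type V.
type Cache[K Keyer, V any] struct {
	mu    sync.Mutex
	items map[string]V
}

func New[K Keyer, V any]() *Cache[K, V] {
	return &Cache[K, V]{items: map[string]V{}}
}

// Add stores val under key's serialized form.
func (c *Cache[K, V]) Add(key K, val V) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key.Key()] = val
}

// Get returns the cached value and whether it was present.
func (c *Cache[K, V]) Get(key K) (V, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key.Key()]
	return v, ok
}

// strKey is a trivial Keyer for the demo.
type strKey string

func (s strKey) Key() string { return string(s) }

func main() {
	c := New[strKey, int]()
	c.Add("answer", 42)
	v, ok := c.Get("answer")
	fmt.Println(v, ok) // 42 true
}
```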

The Backfill Queue

Or: "The Reason We Aren't Using vitess.io/vitess/go/cache".

The schema endpoints in vtadmin have a bunch of different options which control the size and shape of the data returned to the caller.
We do cross-shard size aggregations, can request subsets of tables in a keyspace (including fetching just a single table's TableDefinition), include or exclude views, and so on.

We could build a cache that incorporates all of these options into the key, and cache each permutation of possible request/response pairs for every keyspace in a cluster, but:

  1. We would be storing a ton of duplicate data.

  2. We would get really poor cache reuse.

    a. Imagine requesting /schemas/ks1/ and then /schemas/ks1?tables=t1,t2 — we would have to make two full round-trips from vtadmin-api to the cluster to first fetch all the table definitions and then again to get t1 and t2 even though we literally just fetched them.

So, what we do instead is cache only the largest response payload (in other words, the "give me all of everything" request equivalent) in the cache.
Then, when pulling data out of the cache, we look at the particular request's filtering options to whittle down the full payload into the subset that the caller actually wants.
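The whittling-down step can be sketched as follows, using hypothetical stand-ins for the real payload types (the actual vtadmin protos carry more fields and options):

```go
package main

import "fmt"

// Hypothetical stand-ins for the real schema payload types.
type TableDefinition struct{ Name string }

type Schema struct {
	Keyspace string
	Tables   []TableDefinition
}

// filterTables whittles the full cached payload down to the subset of
// tables a particular request asked for; an empty filter means "all".
func filterTables(full Schema, wanted []string) Schema {
	if len(wanted) == 0 {
		return full
	}
	want := make(map[string]bool, len(wanted))
	for _, name := range wanted {
		want[name] = true
	}
	out := Schema{Keyspace: full.Keyspace}
	for _, td := range full.Tables {
		if want[td.Name] {
			out.Tables = append(out.Tables, td)
		}
	}
	return out
}

func main() {
	full := Schema{Keyspace: "ks1", Tables: []TableDefinition{{Name: "t1"}, {Name: "t2"}, {Name: "t3"}}}
	sub := filterTables(full, []string{"t1", "t2"})
	fmt.Println(len(sub.Tables)) // 2
}
```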

However, this means that if the first request (i.e. "nothing cached") is not for the full payload, we won't be able to fill the cache before returning.
So instead, we make one blocking round-trip to the cluster to get the payload for that request, and then in the background instruct the cache to go get the full payload, via EnqueueBackfill().

Then, future requests will have the fully-cached payload, which they can extract subsets of depending on their specific options.
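The enqueue side of that flow might look like the following sketch; the queue, worker, and type names are illustrative, not the actual vtadmin implementation:

```go
package main

import (
	"fmt"
	"time"
)

// backfillRequest names the full payload to fetch (illustrative).
type backfillRequest struct{ keyspace string }

// schemaCache sketches the backfill queue: requests go onto a buffered
// channel drained by a background worker goroutine.
type schemaCache struct {
	queue chan backfillRequest
}

func newSchemaCache() *schemaCache {
	c := &schemaCache{queue: make(chan backfillRequest, 16)}
	go func() {
		for req := range c.queue {
			// The real worker makes the "give me all of everything"
			// round-trip to the cluster and stores the result.
			fmt.Println("backfilling", req.keyspace)
		}
	}()
	return c
}

// EnqueueBackfill hands the request to the worker without blocking the
// caller, dropping it if the queue is full.
func (c *schemaCache) EnqueueBackfill(req backfillRequest) bool {
	select {
	case c.queue <- req:
		return true
	default:
		return false
	}
}

func main() {
	c := newSchemaCache()
	c.EnqueueBackfill(backfillRequest{keyspace: "ks1"})
	time.Sleep(50 * time.Millisecond) // let the worker run before exit
}
```

The key property is that the caller's request path never blocks on the backfill; the slow full fetch happens entirely off the hot path.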

Cache Busting

Finally, we need a way to force a cache eviction/bypass/refresh/whatever, so we aren't stuck serving stale data waiting for responses to get evicted organically.

I wanted to solve this globally so that we would not need to update every single request proto message when we got around to adding caching for it.

So!
vtadmin-api now takes a global flag called --cache-refresh-key (despite using it as the title of this section, I dislike the term "cache busting" and this was the best alternative I could come up with — suggestions welcome!).

This gets incorporated as an HTTP header called X-<the-flag-value> and a gRPC metadata key, each of which is used to propagate "please bypass and refresh the cache for this request" from the caller down to a particular cluster method.

This is covered by the functions ShouldRefreshFromIncomingContext (gRPC metadata case) and ShouldRefreshFromRequest (HTTP case), as well as in our HTTP-to-gRPC adapter layer, which propagates from an HTTP header to gRPC metadata.

Outstanding TODOs

The main thing missing from schema caching in particular is FindSchemas.
Because we need to know not only the schema definition(s) but also which cluster(s) those definitions live in, we cannot cache solely at the cluster level.

We'll need to add a second cache at the API level to track which clusters have which schemas, and then look to just those clusters for cached schemas.
This PR is already a pretty significant chunk of work, so I am punting on that for now.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Andrew Mason added 4 commits April 20, 2022 15:04
Signed-off-by: Andrew Mason <andrew@planetscale.com>
@ajm188 ajm188 force-pushed the andrew/vtadmin/schema-cache branch from 2c8b071 to 1db42c0 Compare April 20, 2022 19:04
Andrew Mason added 6 commits April 21, 2022 11:50
@ajm188 ajm188 force-pushed the andrew/vtadmin/schema-cache branch from eab3ae3 to e92f40c Compare April 21, 2022 19:55
Andrew Mason added 12 commits April 21, 2022 16:07
@ajm188 ajm188 requested a review from vmg April 25, 2022 13:41
@ajm188 ajm188 marked this pull request as ready for review April 25, 2022 13:42
@ajm188 ajm188 requested a review from doeg as a code owner April 25, 2022 13:42
Andrew Mason added 7 commits April 25, 2022 14:37
ajm188 (Contributor, Author) commented Apr 26, 2022

I think I've fixed the race-y tests (which kinda blew up the diff a bit more 😅), so this is ready for review! cc @vmg @doeg @notfelineit

Andrew Mason added 2 commits April 26, 2022 14:07
@ajm188 ajm188 force-pushed the andrew/vtadmin/schema-cache branch from 0f87cc0 to 1904198 Compare April 26, 2022 18:07
vmg (Collaborator) commented Apr 27, 2022

I'll be 👁️ this today. Lots of code!

vmg (Collaborator) left a comment:

Sorry for the delay on this! I took a look this morning. I don't fully understand the double-cache usage, so I left a suggestion. Also some changes to racy stuff with contexts. Cheers!

    type Cache[Key Keyer, Value any] struct {
        cache     *cache.Cache
        fillcache *cache.Cache
vmg (Collaborator):

I don't fully understand the behavior of fillcache. It seems to be keeping track of the last time we enqueued a fill, so it doesn't happen too often, but if that's the only purpose, it doesn't need to be a cache. It can simply be a map[Key]time.Time.

vmg (Collaborator):

(just make sure to delete from the map after each read)

ajm188 (Contributor, Author):

yep, you have the intended purpose exactly right! i was using the cache because it's threadsafe, if i take your suggestion i'll need to manually manage a mutex for it (which is fine, i think in hindsight i was just being a little lazy and "well, i've already got caches on the brain")

ajm188 (Contributor, Author):

> (just make sure to delete from the map after each read)

I actually don't think so, but double check my thinking:

Imagine your backfill queue was [k1, k1, k1] and your wait interval was 10 minutes.

On the first loop, you fill k1, and record time.Now().
On the second loop, you try to fill k1 again, but it's been less than 10 minutes, so you drop the request.
If you delete the key from fillcache, the third loop would then re-fill k1 even though it hasn't been 10 minutes.

So I think the map has to grow forever. The only reason to delete would be if, when you read, the fill time is outside the wait period, but then you're going to re-fill the cache, which will just put the key right back in the map. The only "space" you would save is if your re-fills are failing (i.e. fillFunc returns err != nil), which should (ideally) happen rarely.

vmg (Collaborator):

Huh, wouldn't that also apply to the cache though?

ajm188 (Contributor, Author):

yeah in most practical cases, probably (but we also don't explicitly delete from the cache currently, we just expire things)

ajm188 (Contributor, Author):

well, okay, we never explicitly delete things, but if you specify a non-zero CleanupInterval, go-cache runs a background thread that deletes expired entries, so infrequently-fetched schemas will get purged over time

Andrew Mason added 4 commits May 5, 2022 06:36
vmg (Collaborator) left a comment:

Looks good! You can reduce the locking in Get though.

    // Record the time we last cached this key, to check against
    c.lastFill[key] = time.Now().UTC()
    // Then cache the actual value.
    return c.cache.Add(key, val, d)
vmg (Collaborator):

I would keep c.cache.Add outside of the locking for c.m. It adds unnecessary contention.

key := req.k.Key()

c.m.Lock()
if t, ok := c.lastFill[key]; ok {
vmg (Collaborator):

I see you've decided not to clean up lastFill, which probably makes sense for most growth patterns. I would comment this behavior explicitly.

Andrew Mason added 3 commits May 21, 2022 06:30
@ajm188 ajm188 merged commit 24452e1 into vitessio:main May 22, 2022
@ajm188 ajm188 deleted the andrew/vtadmin/schema-cache branch May 22, 2022 10:26

Successfully merging this pull request may close these issues.

RFC: VTAdmin schema caching

2 participants