
[vtadmin] schema cache #10120

Merged
ajm188 merged 39 commits into vitessio:main from planetscale:andrew/vtadmin/schema-cache
May 22, 2022

Conversation

ajm188 (Contributor) commented Apr 20, 2022

Description

This PR introduces a caching mechanism for vtadmin, with our first use-case being schemas.

It does not use the go/cache package, primarily because I wanted a mechanism to enqueue backfills as a first-class concern, which I'll talk about a bit later.

Heeeeeeeeeeere we go:

Cache Design

The cache design is pretty much what you would expect from generics: we store a mapping of keys (of type K) to values (of type V, which can be any-thing).

You might expect the cache to be defined as Cache[string, V any], but that requires every callsite to know how to turn something into a key.
By pushing this into an interface (Keyer) that keys have to implement, there can only be one place (the Key() method) where cache keys are created, which reduces duplication of code and therefore the chance two places accidentally use different key schemes for the same cache.
Further, I chose not to leverage the existing Stringer interface, in case we ever need to turn a struct into a string for reasons other than "unique cache key".
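The Keyer idea can be sketched like this; the key type and its fields below are illustrative stand-ins, not the exact vtadmin definitions:

```go
package main

import "fmt"

// Keyer is implemented by anything that can be used as a cache key.
// Centralizing key construction in Key() means there is exactly one
// place where a given cache's key scheme is defined.
type Keyer interface {
	Key() string
}

// schemaCacheKey is an illustrative key type scoped to a cluster and
// keyspace (not the real vtadmin type).
type schemaCacheKey struct {
	Cluster  string
	Keyspace string
}

func (k schemaCacheKey) Key() string {
	return fmt.Sprintf("schema/%s/%s", k.Cluster, k.Keyspace)
}

func main() {
	k := schemaCacheKey{Cluster: "c1", Keyspace: "commerce"}
	fmt.Println(k.Key()) // schema/c1/commerce
}
```

Because callers only ever go through Key(), two call sites can never accidentally disagree on the key scheme for the same cache.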

Anyway.

Our generic cache is built on top of go-cache and provides only a subset of the functionality.
We can expand this set if we end up needing more features, but for now Add() and Get() are sufficient.
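A minimal sketch of that wrapper, exposing only Add and Get. The real cache delegates to go-cache (with expirations); since that is an external dependency, this stand-in uses a mutex-guarded map and omits TTLs so the example is self-contained:

```go
package main

import (
	"fmt"
	"sync"
)

type Keyer interface{ Key() string }

// Cache maps keys of type K (any Keyer) to values of type V.
type Cache[K Keyer, V any] struct {
	mu    sync.Mutex
	items map[string]V
}

func New[K Keyer, V any]() *Cache[K, V] {
	return &Cache[K, V]{items: map[string]V{}}
}

// Add stores val under key's serialized form.
func (c *Cache[K, V]) Add(key K, val V) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key.Key()] = val
}

// Get returns the cached value and whether it was present.
func (c *Cache[K, V]) Get(key K) (V, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key.Key()]
	return v, ok
}

// strKey is a trivial Keyer for the demo.
type strKey string

func (s strKey) Key() string { return string(s) }

func main() {
	c := New[strKey, int]()
	c.Add("answer", 42)
	v, ok := c.Get("answer")
	fmt.Println(v, ok) // 42 true
}
```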

The Backfill Queue

Or: "The Reason We Aren't Using vitess.io/vitess/go/cache".

The schema endpoints in vtadmin have a bunch of different options which control the size and shape of the data returned to the caller.
We do cross-shard size aggregations, can request subsets of tables in a keyspace (including fetching just a single table's TableDefinition), include or exclude views, and so on.

We could build a cache that incorporates all of these options into the key, and cache each permutation of possible request/response pairs for every keyspace in a cluster, but:

  1. We would be storing a ton of duplicate data.

  2. We would get really poor cache reuse.

    a. Imagine requesting /schemas/ks1/ and then /schemas/ks1?tables=t1,t2 — we would have to make two full round-trips from vtadmin-api to the cluster to first fetch all the table definitions and then again to get t1 and t2 even though we literally just fetched them.

So, what we do instead is cache only the largest response payload (in other words, the "give me all of everything" request equivalent) in the cache.
Then, when pulling data out of the cache, we look at the particular request's filtering options to whittle down the full payload into the subset that the caller actually wants.
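The whittling-down step can be sketched as follows, using hypothetical stand-ins for the real payload types (the actual vtadmin protos carry more fields and options):

```go
package main

import "fmt"

// Hypothetical stand-ins for the real schema payload types.
type TableDefinition struct{ Name string }

type Schema struct {
	Keyspace string
	Tables   []TableDefinition
}

// filterTables whittles the full cached payload down to the subset of
// tables a particular request asked for; an empty filter means "all".
func filterTables(full Schema, wanted []string) Schema {
	if len(wanted) == 0 {
		return full
	}
	want := make(map[string]bool, len(wanted))
	for _, name := range wanted {
		want[name] = true
	}
	out := Schema{Keyspace: full.Keyspace}
	for _, td := range full.Tables {
		if want[td.Name] {
			out.Tables = append(out.Tables, td)
		}
	}
	return out
}

func main() {
	full := Schema{Keyspace: "ks1", Tables: []TableDefinition{{Name: "t1"}, {Name: "t2"}, {Name: "t3"}}}
	sub := filterTables(full, []string{"t1", "t2"})
	fmt.Println(len(sub.Tables)) // 2
}
```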

However, this means that if the first request (i.e. "nothing cached") is not for the full payload, we won't be able to fill the cache before returning.
So instead, we make one blocking round-trip to the cluster to get the payload for that request, and then in the background instruct the cache to go get the full payload, via EnqueueBackfill().

Then, future requests will have the fully-cached payload, which they can extract subsets of depending on their specific options.
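The enqueue side of that flow might look like the following sketch; the queue, worker, and type names are illustrative, not the actual vtadmin implementation:

```go
package main

import (
	"fmt"
	"time"
)

// backfillRequest names the full payload to fetch (illustrative).
type backfillRequest struct{ keyspace string }

// schemaCache sketches the backfill queue: requests go onto a buffered
// channel drained by a background worker goroutine.
type schemaCache struct {
	queue chan backfillRequest
}

func newSchemaCache() *schemaCache {
	c := &schemaCache{queue: make(chan backfillRequest, 16)}
	go func() {
		for req := range c.queue {
			// The real worker makes the "give me all of everything"
			// round-trip to the cluster and stores the result.
			fmt.Println("backfilling", req.keyspace)
		}
	}()
	return c
}

// EnqueueBackfill hands the request to the worker without blocking the
// caller, dropping it if the queue is full.
func (c *schemaCache) EnqueueBackfill(req backfillRequest) bool {
	select {
	case c.queue <- req:
		return true
	default:
		return false
	}
}

func main() {
	c := newSchemaCache()
	c.EnqueueBackfill(backfillRequest{keyspace: "ks1"})
	time.Sleep(50 * time.Millisecond) // let the worker run before exit
}
```

The key property is that the caller's request path never blocks on the backfill; the slow full fetch happens entirely off the hot path.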

Cache Busting

Finally, we need a way to force a cache eviction/bypass/refresh/whatever, so we aren't stuck serving stale data waiting for responses to get evicted organically.

I wanted to solve this globally so that we would not need to update every single request proto message when we got around to adding caching for it.

So!
vtadmin-api now takes a global flag called --cache-refresh-key (despite using it as the title of this section, I dislike the term "cache busting" and this was the best alternative I could come up with — suggestions welcome!).

This gets incorporated as an HTTP header called X-<the-flag-value> and a gRPC metadata key, each of which is used to propagate "please bypass and refresh the cache for this request" from the caller down to a particular cluster method.

This is covered by the functions ShouldRefreshFromIncomingContext (gRPC metadata case) and ShouldRefreshFromRequest (HTTP case), as well as in our HTTP-to-gRPC adapter layer, which propagates from an HTTP header to gRPC metadata.

Outstanding TODOs

The main thing missing from schema caching in particular is FindSchemas.
Because we need to know not only the schema definition(s) but also which cluster(s) those definitions live in, we cannot cache solely at the cluster level.

We'll need to add a second cache at the API level to track which clusters have which schemas, and then look to just those clusters for cached schemas.
This PR is already a pretty significant chunk of work, so I am punting on that for now.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Andrew Mason added 4 commits April 20, 2022 15:04
Signed-off-by: Andrew Mason <andrew@planetscale.com>
@ajm188 ajm188 force-pushed the andrew/vtadmin/schema-cache branch from 2c8b071 to 1db42c0 Compare April 20, 2022 19:04
Andrew Mason added 6 commits April 21, 2022 11:50
@ajm188 ajm188 force-pushed the andrew/vtadmin/schema-cache branch from eab3ae3 to e92f40c Compare April 21, 2022 19:55
Andrew Mason added 12 commits April 21, 2022 16:07
@ajm188 ajm188 requested a review from vmg April 25, 2022 13:41
@ajm188 ajm188 marked this pull request as ready for review April 25, 2022 13:42
@ajm188 ajm188 requested a review from doeg as a code owner April 25, 2022 13:42
Andrew Mason added 7 commits April 25, 2022 14:37
ajm188 (Contributor, Author) commented Apr 26, 2022

I think I've fixed the race-y tests (which kinda blew up the diff a bit more 😅), so this is ready for review! cc @vmg @doeg @notfelineit

Andrew Mason added 2 commits April 26, 2022 14:07
@ajm188 ajm188 force-pushed the andrew/vtadmin/schema-cache branch from 0f87cc0 to 1904198 Compare April 26, 2022 18:07
vmg (Collaborator) commented Apr 27, 2022

I'll be 👁️ this today. Lots of code!

vmg (Collaborator) left a comment:

Sorry for the delay on this! I took a look this morning. I don't fully understand the double-cache usage, so I left a suggestion. Also some changes to racy stuff with contexts. Cheers!

    type Cache[Key Keyer, Value any] struct {
        cache     *cache.Cache
        fillcache *cache.Cache
vmg (Collaborator):

I don't fully understand the behavior of fillcache. It seems to be keeping track of the last time we enqueued a fill, so it doesn't happen too often, but if that's the only purpose, it doesn't need to be a cache. It can simply be a map[Key]time.Time.

vmg (Collaborator):

(just make sure to delete from the map after each read)

ajm188 (Contributor, Author):

yep, you have the intended purpose exactly right! i was using the cache because it's threadsafe, if i take your suggestion i'll need to manually manage a mutex for it (which is fine, i think in hindsight i was just being a little lazy and "well, i've already got caches on the brain")

ajm188 (Contributor, Author):

> (just make sure to delete from the map after each read)

I actually don't think so, but double check my thinking:

Imagine your backfill queue was [k1, k1, k1] and your wait interval was 10 minutes.

On the first loop, you fill k1, and record time.Now().
On the second loop, you try to fill k1 again, but it's been less than 10 minutes, so you drop the request.
If you delete the key from fillcache, the third loop would then re-fill k1 even though it hasn't been 10 minutes.

So I think the map has to grow forever. The only reason to delete would be if, when you read, the fill time is outside the wait period, but then you're going to re-fill the cache, which will just put the key right back in the map. The only "space" you would save is if your re-fills are failing (i.e. fillFunc returns err != nil), which should (ideally) happen rarely.

vmg (Collaborator):

Huh, wouldn't that also apply to the cache though?

ajm188 (Contributor, Author):

yeah in most practical cases, probably (but we also don't explicitly delete from the cache currently, we just expire things)

ajm188 (Contributor, Author):

well, okay, we never explicitly delete things, but if you specify a non-zero CleanupInterval, go-cache runs a background thread that deletes expired entries, so infrequently-fetched schemas will get purged over time

Andrew Mason added 4 commits May 5, 2022 06:36
vmg (Collaborator) left a comment:

Looks good! You can reduce the locking in Get though.

    // Record the time we last cached this key, to check against
    c.lastFill[key] = time.Now().UTC()
    // Then cache the actual value.
    return c.cache.Add(key, val, d)
vmg (Collaborator):

I would keep c.cache.Add outside of the locking for c.m. It adds unnecessary contention.

key := req.k.Key()

c.m.Lock()
if t, ok := c.lastFill[key]; ok {
vmg (Collaborator):

I see you've decided not to clean up lastFill, which probably makes sense for most growth patterns. I would comment this behavior explicitly.

Andrew Mason added 3 commits May 21, 2022 06:30
@ajm188 ajm188 merged commit 24452e1 into vitessio:main May 22, 2022
@ajm188 ajm188 deleted the andrew/vtadmin/schema-cache branch May 22, 2022 10:26

Successfully merging this pull request may close these issues.

RFC: VTAdmin schema caching

2 participants