make the resilient topo cache even more resilient and informative #3641
Conversation
|
Oh and one more thing - I think that for completeness we should also rework the logic that sets up the watch to also follow this pattern to perform the topo call asynchronously and after dropping the mutex, again to ensure that queries can continue to flow if there is a valid cached value but the topo service is unresponsive. |
|
I got stuck on the use of the WaitGroup to synchronize the background processing. I suggest changing that to use a channel instead:
I think that would be more go-like to do it this way. A WaitGroup somewhat implies you start a bunch of tasks, and wait for them all to finish. This is not what is done here, and is confusing (at least I was confused hehe). Because then you have to think and make sure you can't Add(1) to a WG of value 0 at the same time as a Wait() is running (which doesn't happen here, but you have to think a lot about it). |
|
I suggest that if entry.value has a value, we do not set "entry.value = nil" even if it has expired. |
|
If that is the behavior you want then you can effectively get it by setting an extremely long TTL like 1 year. |
|
When the topo service is healthy, we want a short TTL. But if the topo service is completely down, we want to keep using the cache even after it expires, until the topo service recovers. If the cache has expired and the topo service is down, then entry.value = nil means we get nothing back, and users can't connect to vtgate normally. That is not what we want. |
|
Ah yes. Same for us :) That’s why I added a second flag to control the refresh interval separate from the TTL. That way you can update the state more frequently but still keep the cached value for a long time. |
|
Oh, I see. server.cacheTTL is the more important setting: it decides cacheValid, which controls whether users can get the entry from the cache. |
|
LGTM |
929c0ef to 46b149b Compare
|
@alainjobart I think this should be ready for final review assuming Travis passes. I made the following additional changes:
|
|
This looks good. The only reservation I have is the Wait() on the sync.Cond may block for a long time, even after ctx.Done(). That means a go routine waiting to execute may be dangling, even when the caller has timed out. That's the main reason I was suggesting a channel, so you can stop waiting on either the channel, or ctx.Done(). With a sync.Cond, you can't wait on either the Cond or ctx.Done(). |
Previously there was no way to tell if the resilient server had a valid cached value from the debug UI, as any error would trump the cached value. To address this, display the current cached value (if any), the last error (if any), as well as the cache freshness / TTL for all portions of the debug status UI.
This enables simulation of an unresponsive topo service.
Although the existing code already checks the cache TTL before returning the value, the cached entry could be displayed in the status UI past its TTL which isn't what we want.
Previously the lastValueTime was only updated when the watch detected a change to the value. This meant that when the watch failed, it would not necessarily cache the value for the full TTL, since the time in cache was computed from the last change, not from the time when the watch stopped. To address this, always update lastValueTime when the watch was running but stops, regardless of whether or not it got a value or an error. Also make a couple of other changes to the tracking of lastErrorTime and the last value to make the status UI more reflective of the underlying state of the cache.
46b149b to 170a542 Compare
|
@alainjobart Oh of course! Thanks for the feedback. I reworked to use a channel like you originally suggested and added tests to ensure that in fact it does the right thing when the context times out. I also rebased and squashed things down so the guts of the new logic is in 170a542 |
|
(Oh and this is ready for final review). |
Rework GetSrvKeyspaceNames to asynchronously spin up a separate goroutine and drop the mutex before calling out to the topo service. This ensures that a valid cached value will be returned immediately even if it is time to refresh the value, and that the mutex isn't held while fetching. Also rework GetSrvKeyspace so that the goroutine used for the watch is spun up before the initial watch is established, following the same pattern. This way we can continue to return cached entries and the debug UI doesn't get locked up even if the topo service is unresponsive.
170a542 to 13cfb4e Compare
alainjobart
left a comment
A few minor things, but looks great!
go/vt/srvtopo/resilient_server.go
Outdated
refreshingChan := entry.refreshingChan
entry.mutex.Unlock()
select {
case _, _ = <-refreshingChan:
this can just be:
case <-refreshingChan:
(same as next line)
go/vt/srvtopo/resilient_server.go
Outdated
watchStartingChan := entry.watchStartingChan
entry.mutex.Unlock()
select {
case _, _ = <-watchStartingChan:
same comment:
case <-watchStartingChan:
go/vt/srvtopo/resilient_server.go
Outdated
entry.insertionTime = time.Time{}
entry.value = nil
}
// Rrefresh the state in a background goroutine if no refresh is already
go/vt/srvtopo/resilient_server.go
Outdated
watchStateRunning
)

type watchState int
nit: you can declare that one before the const, and then type the consts:
const (
watchStateIdle watchState = iota
watchStateStarting
watchStateRunning
)
go/vt/srvtopo/resilient_server.go
Outdated
// valueTime is the time when the watch last obtained a non-nil value.
// watchStartingCond is used to serialize callers for the first attempt
// to establish the watch
watchStartingChan chan bool
nit: these are usually: chan struct{}
and not bool (applies to the other one too)
(ctx.Done() is <-chan struct{})
go/vt/srvtopo/resilient_server.go
Outdated
// Lock the entry, and do everything holding the lock except
// querying the underlying topo server.
//
// This means that unless the topo server is very slow, two concurrent
I'm not sure about the 'unless the topo server is very slow' part any more, is that still true?
83611cf to bb3939e Compare
bb3939e to fdbbc67 Compare
Overview
Improve the status UI for the resilient topo cache, make GetSrvKeyspaceNames fetch asynchronously, and fix a logic bug in the watch-based caching for GetSrvKeyspace.
This incorporates the ideas first proposed by @yangxuanjia in #3640 while addressing my concerns about the refresh interval and properly managing locks and state.
Details
This first came up because I wanted better indication in the status UI about the state of the cache. In particular, it used to either show the cached value or show the most recent error, but would not indicate if the cache was valid or not.
To address this, I changed the status UI to display the current cached value (if any), the last error (if any), as well as information about the cache freshness / TTL.
Next, as discussed in Slack and in #3640, I moved the topo call in GetSrvKeyspaceNames into a separate goroutine, and kept tracking state in the entry to make sure only one topo request is outstanding at a time. This ensures that a cached value can always be returned immediately, even if the topo service is unresponsive or slow.
However, unlike in the other proposal, if there is no cached value, all requests block waiting for the fetcher goroutine to complete. This seems like an acceptable tradeoff to me since it avoids errors on the initial query and, for the most part, should insulate callers from topo service unavailability or slowness.
Finally, I noticed a logic bug in the earlier changes I made to cache the watched SrvKeyspace object. Previously, lastValueTime was updated only on each detected change, which meant that if the watch failed after a period with no changes, the cache did not keep the last value for the full TTL as designed. To address this, I changed it to update lastValueTime whenever an established watch stops, regardless of whether it returned a value or an error.
Testing
I added a unit test for the last case and did a decent amount of manual testing using both zookeeper and consul.
I'd like to add some tests for the logic in the status UI since that turned out to be surprisingly tricky to get right, as well as the case where the topo service is blocked.