
make the resilient topo cache even more resilient and informative #3641

Merged
alainjobart merged 6 commits into vitessio:master from tinyspeck:resilient-topo-more-resilient-and-informative
Feb 14, 2018

Conversation

@demmer
Member

@demmer demmer commented Feb 8, 2018

Overview

Improve the status UI for the resilient topo cache, make GetSrvKeyspaceNames fetch asynchronously, and fix a logic bug in the watch-based caching for GetSrvKeyspace.

This incorporates the ideas first proposed by @yangxuanjia in #3640 while addressing my concerns about the refresh interval and properly managing locks and state.

Details

This first came up because I wanted better indication in the status UI about the state of the cache. In particular, it used to either show the cached value or show the most recent error, but would not indicate if the cache was valid or not.

To address this, I changed the status UI to display the current cached value (if any), the last error (if any), as well as information about the cache freshness / TTL.

Next, as discussed in Slack and in #3640, I moved the topo call in GetSrvKeyspaceNames into a separate goroutine, and track state in the entry to make sure only one topo request is outstanding at a time. This ensures that a cached value can always be returned immediately even if the topo service is unresponsive or slow.

However, unlike in the other proposal, if there is no cached value, all requests block waiting for the fetcher goroutine to complete. This seems like an acceptable tradeoff to me since it avoids errors on the initial query, and for the most part should insulate callers from topo service unavailability or slowness.

Finally, I noticed a logic bug in the earlier changes I made to cache the watched SrvKeyspace object. Previously lastValueTime was updated on every change, which meant that if the watch failed after a period in which no changes were detected, the cache did not keep the last value for the full TTL as designed. To address this, I changed it to update lastValueTime whenever an established watch stops, regardless of whether it ends with a value or an error.

Testing

I added a unit test for the last case and did a decent amount of manual testing using both ZooKeeper and Consul.

I'd like to add some tests for the logic in the status UI since that turned out to be surprisingly tricky to get right, as well as the case where the topo service is blocked.

@demmer
Member Author

demmer commented Feb 8, 2018

Oh and one more thing - I think that for completeness we should also rework the logic that sets up the watch to also follow this pattern to perform the topo call asynchronously and after dropping the mutex, again to ensure that queries can continue to flow if there is a valid cached value but the topo service is unresponsive.

@alainjobart
Contributor

I got stuck on the use of the WaitGroup to synchronize the background processing. I suggest changing that to use a channel instead:

  • if the channel is non-nil, it means someone is already asking for the data in the background.
  • if the channel is nil, create one. When the fetching is done, retake the mutex, close the channel, nil the value.
  • the wait code (outside the mutex, where the WG.Wait() is now) would then instead be a select(), and the added bonus is that you can wait on either <-ctx.Done() or the channel being closed.

I think that would be more go-like to do it this way. A WaitGroup somewhat implies you start a bunch of tasks, and wait for them all to finish. This is not what is done here, and is confusing (at least I was confused hehe). Because then you have to think and make sure you can't Add(1) to a WG of value 0 at the same time as a Wait() is running (which doesn't happen here, but you have to think a lot about it).
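A minimal sketch of the channel-based pattern described above, with hypothetical names (`entry`, `refreshingChan`, `fetch`) standing in for the real resilient-server types: a non-nil channel marks an in-flight background fetch, the fetch runs without the mutex held, and closing the channel wakes any waiters.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry caches a value and tracks an in-flight background refresh.
// All names here are illustrative, not the actual vitess types.
type entry struct {
	mutex          sync.Mutex
	value          string
	refreshingChan chan struct{} // non-nil while a refresh is in flight
}

// get returns the cached value immediately, kicking off at most one
// background refresh at a time. fetch stands in for the (possibly slow)
// topo call and runs without the mutex held.
func (e *entry) get(fetch func() string) string {
	e.mutex.Lock()
	if e.refreshingChan == nil {
		ch := make(chan struct{})
		e.refreshingChan = ch
		go func() {
			v := fetch() // topo call happens outside the lock
			e.mutex.Lock()
			e.value = v
			e.refreshingChan = nil // allow the next refresh
			e.mutex.Unlock()
			close(ch) // wake up anyone waiting for a first value
		}()
	}
	v := e.value
	e.mutex.Unlock()
	return v
}

func main() {
	e := &entry{value: "cached"}
	// Returns the stale cached value right away while the slow fetch runs.
	fmt.Println(e.get(func() string { time.Sleep(10 * time.Millisecond); return "fresh" }))
	time.Sleep(50 * time.Millisecond)
	// By now the background refresh has landed.
	fmt.Println(e.get(func() string { return "fresh2" }))
}
```

Callers that find no cached value would select on the captured channel (and on ctx.Done()) instead of returning immediately.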

@yangxuanjia
Contributor

I suggest that if entry.value holds a value, we do not set entry.value = nil even after it expires.

@demmer
Member Author

demmer commented Feb 9, 2018

If that is the behavior you want then you can effectively get it by setting an extremely long TTL like 1 year.

@yangxuanjia
Contributor

yangxuanjia commented Feb 9, 2018

If the topo is healthy, we want a short TTL. But if the topo is entirely down, we want to keep using the cache even if it's expired, until the topo recovers. If the cache expires while the topo is down and entry.value = nil, we get nothing back and users can't connect to vtgate normally. That is not what we want.

@demmer
Member Author

demmer commented Feb 9, 2018

Ah yes. Same for us :)

That’s why I added a second flag to control the refresh interval separate from the TTL.

That way you can update the state more frequently but still keep the cached value for a long time.

@yangxuanjia
Contributor

Oh, I see. server.cacheTTL is the more important one: it decides cacheValid, which determines whether a user can get an entry from the cache.
Setting cacheTTL = 7 days and server.cacheRefresh = 1 second looks like it works well.

@yangxuanjia
Contributor

LGTM

@demmer demmer force-pushed the resilient-topo-more-resilient-and-informative branch 2 times, most recently from 929c0ef to 46b149b Compare February 12, 2018 14:54
@demmer
Member Author

demmer commented Feb 12, 2018

@alainjobart I think this should be ready for final review assuming Travis passes.

I made the following additional changes:

  • Move the GetSrvKeyspace watch setup into the separate goroutine so that it doesn't hold the lock or block cache access while calling out to the topo service.
  • Convert from WaitGroup to sync.Cond for the cases when there is no cached value and we are waiting for a topo service response. This seemed like a cleaner abstraction than a channel as you suggested.
  • Add a test that simulates an unresponsive topo service by explicitly grabbing the mutex on the memorytopo during the test to block out other requests.

@alainjobart
Contributor

This looks good. The only reservation I have is that the Wait() on the sync.Cond may block for a long time, even after ctx.Done(). That means a goroutine waiting to execute may be left dangling even when the caller has timed out. That's the main reason I was suggesting a channel: you can stop waiting on either the channel or ctx.Done(). With a sync.Cond, you can't wait on the Cond and ctx.Done() at the same time.

Previously there was no way to tell if the resilient server had a
valid cached value from the debug UI, as any error would trump the
cached value.

To address this, display the current cached value (if any), the last
error (if any), as well as the cache freshness / TTL for all portions
of the debug status UI.
This enables simulation of an unresponsive topo service.
Although the existing code already checks the cache TTL before returning
the value, the cached entry could be displayed in the status UI past its
TTL which isn't what we want.
Previously the lastValueTime was only updated when the watch detected
a change to the value. This meant that when the watch failed, it would
not necessarily cache the value for the full TTL, since the time in
cache was computed from the last change, not from the time when the
watch stopped.

To address this, always update lastValueTime when a running watch stops,
regardless of whether it got a value or an error.
Also make a couple of other changes to the tracking of lastErrorTime
and the last value to make the status UI more reflective of the
underlying state of the cache.
@demmer demmer force-pushed the resilient-topo-more-resilient-and-informative branch from 46b149b to 170a542 Compare February 14, 2018 00:04
@demmer
Member Author

demmer commented Feb 14, 2018

@alainjobart Oh of course!

Thanks for the feedback. I reworked to use a channel like you originally suggested and added tests to ensure that in fact it does the right thing when the context times out.

I also rebased and squashed things down so the guts of the new logic are in 170a542

@demmer
Member Author

demmer commented Feb 14, 2018

(Oh and this is ready for final review).

Rework GetSrvKeyspaceNames to asynchronously spin up a separate goroutine
and drop the mutex before calling out to the topo service. This ensures
that a valid cached value will be returned immediately even if it is
time to refresh the value, and that the mutex isn't held while fetching.

Also rework GetSrvKeyspace so that the goroutine used for the watch is
spun up before the initial watch is established, following the same
pattern.

This way we can continue to return cached entries and the debug UI
doesn't get locked up even if the topo service is unresponsive.
@demmer demmer force-pushed the resilient-topo-more-resilient-and-informative branch from 170a542 to 13cfb4e Compare February 14, 2018 00:35
Contributor

@alainjobart alainjobart left a comment


A few minor things, but looks great!

refreshingChan := entry.refreshingChan
entry.mutex.Unlock()
select {
case _, _ = <-refreshingChan:
Contributor


this can just be:
case <-refreshingChan:
(same as next line)

watchStartingChan := entry.watchStartingChan
entry.mutex.Unlock()
select {
case _, _ = <-watchStartingChan:
Contributor


same comment:
case <-watchStartingChan:

entry.insertionTime = time.Time{}
entry.value = nil
}
// Rrefresh the state in a background goroutine if no refresh is already
Contributor


Rrefresh -> Refresh

Member Author


Done

watchStateRunning
)

type watchState int
Contributor


nit: you can declare that one before the const, and then type the consts:

const (
watchStateIdle watchState = iota
watchStateStarting
watchStateRunning
)

Member Author


Done.

// valueTime is the time when the watch last obtained a non-nil value.
// watchStartingCond is used to serialize callers for the first attempt
// to establish the watch
watchStartingChan chan bool
Contributor


nit: these are usually: chan struct{}
and not bool (applies to the other one too)
(ctx.Done() is <-chan struct{})

Member Author


Ah good to know. Done.

// Lock the entry, and do everything holding the lock except
// querying the underlying topo server.
//
// This means that unless the topo server is very slow, two concurrent
Contributor


I'm not sure about the 'unless the topo server is very slow' part any more, is that still true?

Member Author


Nope. Updated.

@demmer demmer force-pushed the resilient-topo-more-resilient-and-informative branch from 83611cf to bb3939e Compare February 14, 2018 00:55
@demmer demmer force-pushed the resilient-topo-more-resilient-and-informative branch from bb3939e to fdbbc67 Compare February 14, 2018 00:56
@alainjobart
Contributor

alainjobart commented Feb 14, 2018

LGTM, will merge after Travis passes

Approved with PullApprove

@alainjobart alainjobart merged commit 5933ffa into vitessio:master Feb 14, 2018