make the resilient topo cache even more resilient and informative #3641
Conversation
|
Oh and one more thing - I think that for completeness we should also rework the logic that sets up the watch to also follow this pattern to perform the topo call asynchronously and after dropping the mutex, again to ensure that queries can continue to flow if there is a valid cached value but the topo service is unresponsive. |
|
I got stuck on the use of the WaitGroup to synchronize the background processing. I suggest changing that to use a channel instead:
I think that would be more go-like to do it this way. A WaitGroup somewhat implies you start a bunch of tasks, and wait for them all to finish. This is not what is done here, and is confusing (at least I was confused hehe). Because then you have to think and make sure you can't Add(1) to a WG of value 0 at the same time as a Wait() is running (which doesn't happen here, but you have to think a lot about it). |
|
I suggest that if entry.value has a value, we do not set "entry.value = nil" even if it has expired. |
|
If that is the behavior you want then you can effectively get it by setting an extremely long TTL like 1 year. |
|
When the topo service is healthy, we want a short TTL. But if the topo service is completely down, we want to keep using the cache even after it expires, until the topo service recovers. If the cache has expired and the topo service is down, then entry.value = nil means we get nothing back, and users can't connect to vtgate normally. That is not what we want. |
|
Ah yes. Same for us :) That’s why I added a second flag to control the refresh interval separate from the TTL. That way you can update the state more frequently but still keep the cached value for a long time. |
|
Oh, I see. server.cacheTTL is the more important setting: it decides cacheValid, which controls whether users can get the entry from the cache. |
|
LGTM |
929c0ef to 46b149b Compare
|
@alainjobart I think this should be ready for final review assuming Travis passes. I made the following additional changes:
|
|
This looks good. The only reservation I have is the Wait() on the sync.Cond may block for a long time, even after ctx.Done(). That means a go routine waiting to execute may be dangling, even when the caller has timed out. That's the main reason I was suggesting a channel, so you can stop waiting on either the channel, or ctx.Done(). With a sync.Cond, you can't wait on either the Cond or ctx.Done(). |
Previously there was no way to tell if the resilient server had a valid cached value from the debug UI, as any error would trump the cached value. To address this, display the current cached value (if any), the last error (if any), as well as the cache freshness / TTL for all portions of the debug status UI.
This enables simulation of an unresponsive topo service.
Although the existing code already checks the cache TTL before returning the value, the cached entry could be displayed in the status UI past its TTL which isn't what we want.
Previously the lastValueTime was only updated when the watch detected a change to the value. This meant that when the watch failed, it would not necessarily cache the value for the full TTL, since the time in cache was computed from the last change, not from the time when the watch stopped. To address this, always update lastValueTime when the watch was running but stops, regardless of whether or not it got a value or an error. Also make a couple of other changes to the tracking of lastErrorTime and the last value to make the status UI more reflective of the underlying state of the cache.
46b149b to 170a542 Compare
|
@alainjobart Oh of course! Thanks for the feedback. I reworked to use a channel like you originally suggested and added tests to ensure that in fact it does the right thing when the context times out. I also rebased and squashed things down so the guts of the new logic is in 170a542 |
|
(Oh and this is ready for final review). |
Rework GetSrvKeyspaceNames to asynchronously spin up a separate goroutine and drop the mutex before calling out to the topo service. This ensures that a valid cached value will be returned immediately even if it is time to refresh the value, and that the mutex isn't held while fetching. Also rework GetSrvKeyspace so that the goroutine used for the watch is spun up before the initial watch is established, following the same pattern. This way we can continue to return cached entries and the debug UI doesn't get locked up even if the topo service is unresponsive.
170a542 to 13cfb4e Compare
alainjobart
left a comment
A few minor things, but looks great!
go/vt/srvtopo/resilient_server.go
Outdated
refreshingChan := entry.refreshingChan
entry.mutex.Unlock()
select {
case _, _ = <-refreshingChan:
this can just be:
case <-refreshingChan:
(same as next line)
go/vt/srvtopo/resilient_server.go
Outdated
watchStartingChan := entry.watchStartingChan
entry.mutex.Unlock()
select {
case _, _ = <-watchStartingChan:
same comment:
case <-watchStartingChan:
go/vt/srvtopo/resilient_server.go
Outdated
entry.insertionTime = time.Time{}
entry.value = nil
}
// Rrefresh the state in a background goroutine if no refresh is already
go/vt/srvtopo/resilient_server.go
Outdated
watchStateRunning
)

type watchState int
nit: you can declare that one before the const, and then type the consts:
const (
watchStateIdle watchState = iota
watchStateStarting
watchStateRunning
)
go/vt/srvtopo/resilient_server.go
Outdated
// valueTime is the time when the watch last obtained a non-nil value.
// watchStartingCond is used to serialize callers for the first attempt
// to establish the watch
watchStartingChan chan bool
nit: these are usually: chan struct{}
and not bool (applies to the other one too)
(ctx.Done() is <-chan struct{})
go/vt/srvtopo/resilient_server.go
Outdated
// Lock the entry, and do everything holding the lock except
// querying the underlying topo server.
//
// This means that unless the topo server is very slow, two concurrent
I'm not sure about the 'unless the topo server is very slow' part any more, is that still true?
83611cf to bb3939e Compare
bb3939e to fdbbc67 Compare
Overview
Improve the status UI for the resilient topo cache, make GetSrvKeyspaceNames fetch asynchronously, and fix a logic bug in the watch-based caching for GetSrvKeyspace.
This incorporates the ideas first proposed by @yangxuanjia in #3640 while addressing my concerns about the refresh interval and properly managing locks and state.
Details
This first came up because I wanted better indication in the status UI about the state of the cache. In particular, it used to either show the cached value or show the most recent error, but would not indicate if the cache was valid or not.
To address this, I changed the status UI to display the current cached value (if any), the last error (if any), as well as information about the cache freshness / TTL.
Next, as discussed in Slack and in #3640, I moved the topo call in GetSrvKeyspaceNames into a separate goroutine, and kept tracking state in the entry to make sure only one topo request is outstanding at a time. This ensures that a cached value can always be returned immediately, even if the topo service is unresponsive or slow.
However, unlike in the other proposal, if there is no cached value, all requests block waiting for the fetcher goroutine to complete. This seems like an acceptable tradeoff to me since it avoids errors on the initial query and, for the most part, should insulate callers from topo service unavailability or slowness.
Finally, I noticed a logic bug in the earlier changes I made to cache the watched SrvKeyspace object. Previously, lastValueTime was updated only on each detected change, which meant that if the watch failed after a period with no changes, the cache did not keep the last value for the full TTL as designed. To address this, I changed it to update lastValueTime whenever an established watch stops, regardless of whether it returned a value or an error.
Testing
I added a unit test for the last case and did a decent amount of manual testing using both zookeeper and consul.
I'd like to add some tests for the logic in the status UI since that turned out to be surprisingly tricky to get right, as well as the case where the topo service is blocked.