add resilient topo server caching for the full srv keyspace object #3610
Conversation
Force-pushed from 94d008f to cce0d3e.
alainjobart left a comment:
This looks good, thanks for doing this.
I agree with your future work list too, good stuff.
Probably extra code here that needs to go?
Force-pushed from 0165c35 to 6f87074.
@alainjobart I finished up the future thoughts, so this is ready for review. Note that I ended up making a bunch of changes to try to make the test comprehensive and reliable without slowing it down too much, which was a challenge since many of these behaviors depend on real clocks. To be safe I marked it as flaky to try to appease Travis, but this might end up needing more work if it is too unreliable.
alainjobart left a comment:
Looks good overall. A few minor comments.
Thanks for doing this!
go/vt/srvtopo/resilient_server.go (Outdated)
if c.Err == topo.ErrNoNode here?
Ah yes, good catch, thanks.
Fixed, but I should probably add a test for this.
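For readers following along, here is a rough, self-contained sketch of the special case being discussed. The types and names below are stand-ins modeled on the topo package, not the actual vitess code:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the topo types under discussion; illustrative only.
var errNoNode = errors.New("node doesn't exist")

type watchData struct {
	Contents []byte
	Err      error
}

type cacheEntry struct {
	value []byte // last known good value served while the topo is down
}

// applyWatchData sketches the fix: a watch error of ErrNoNode means
// the node was deleted, so the cached value is dropped instead of
// being served until the TTL expires.
func applyWatchData(e *cacheEntry, wd *watchData) {
	if wd.Err != nil {
		if errors.Is(wd.Err, errNoNode) {
			e.value = nil // deleted node: invalidate the cache
		}
		return // other errors: keep the last known value
	}
	e.value = wd.Contents
}

func main() {
	e := &cacheEntry{value: []byte("SrvKeyspace")}
	applyWatchData(e, &watchData{Err: errNoNode})
	fmt.Printf("cached value after ErrNoNode: %q\n", e.value)
}
```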
go/vt/srvtopo/resilient_server.go (Outdated)
maybe: defer cancel() right at the beginning of the go func() would be clearer?
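As an illustration of the suggested pattern, a minimal runnable sketch (the goroutine body is a placeholder for the real watch loop):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The suggested pattern: create the context before starting the
	// goroutine and defer cancel() on its very first line, so every
	// return path releases the context.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		defer cancel()
		time.Sleep(10 * time.Millisecond) // stand-in for the watch loop
	}()
	<-ctx.Done() // unblocks as soon as the goroutine returns
	fmt.Println("watch goroutine exited:", ctx.Err())
}
```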
go/vt/srvtopo/resilient_server.go (Outdated)
You made that change because the test was giving you a warning in the unit_race?
If it's just for the test, I would change the line that sets the value to be much earlier in the test, before it's used, so you don't need this.
Yeah, exactly. I just tried reverting that change to reproduce it, and for some reason it doesn't happen any more. Yay races.
Regardless, your idea is way better, so I just did that.
go/vt/srvtopo/resilient_server.go (Outdated)
I wonder where to add this, but it would be nice to have scenarios that describe the behavior, something like:
- for entries we don't watch (like the list of Keyspaces), we refresh the list only after srv_topo_cache_ttl.
- for entries we watch, and setting the watch fails, we will use the last known value for srv_topo_cache_ttl.
Good idea. I added a block comment describing the behavior in the code itself. I can try to edit the flag description to be clearer, but I worry that would be too long to be useful.
@alainjobart This is now passing tests, and I think it should be ready to go.
In order to simulate an infrastructure failure of the topo service, add a SetError hook to the memorytopo factory that returns the given error for all calls and propagates it to all active watches.
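As a rough, self-contained sketch of what such a hook looks like (the real memorytopo types and method signatures may differ):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Illustrative stand-in for the memorytopo factory described above.
type factory struct {
	mu  sync.Mutex
	err error // when set, every call fails with this error
}

// SetError simulates a full topo outage: all subsequent calls return
// err (the real hook also propagates it to active watches).
func (f *factory) SetError(err error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.err = err
}

func (f *factory) Get(path string) ([]byte, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.err != nil {
		return nil, f.err
	}
	return []byte("data for " + path), nil
}

func main() {
	f := &factory{}
	if _, err := f.Get("keyspaces/ks1"); err == nil {
		fmt.Println("before SetError: call succeeds")
	}
	f.SetError(errors.New("topo unavailable"))
	if _, err := f.Get("keyspaces/ks1"); err != nil {
		fmt.Println("after SetError:", err)
	}
}
```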
Add support to cache the last value obtained for the SrvKeyspace so vtgate can continue to route queries for the duration of the cache TTL even when the local topo service is unavailable.
Separate the cache TTL from the refresh interval for the keyspace names caching. This way, during an error state, vtgate will not hammer the topo server, while a long cache TTL can still cover potential topo service unavailability.
Piggybacking on the srv_topo_cache_refresh flag, keep track of the last time the watch was attempted and don't try to re-establish the watch until the refresh interval expires. Also add more error logging and metrics for when the cache is used for GetSrvKeyspace.
The logic depends on the refresh interval being smaller than or equal to the TTL, so add a startup assertion that it is configured that way. Also update all test cases that override the TTL to also override the refresh interval, to avoid hitting this assertion.
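A minimal sketch of such a startup assertion (the flag names follow the PR, but the defaults and wiring here are illustrative):

```go
package main

import (
	"flag"
	"log"
	"time"
)

// Flag names follow the PR; the defaults here are illustrative.
var (
	srvTopoCacheTTL     = flag.Duration("srv_topo_cache_ttl", 1*time.Second, "how long to cache srvtopo values")
	srvTopoCacheRefresh = flag.Duration("srv_topo_cache_refresh", 1*time.Second, "how often to refresh cached srvtopo values")
)

func main() {
	flag.Parse()
	// The startup assertion described above: the cache only works if
	// a value can be refreshed at least once before its TTL expires.
	if *srvTopoCacheRefresh > *srvTopoCacheTTL {
		log.Fatalf("srv_topo_cache_refresh (%v) must be less than or equal to srv_topo_cache_ttl (%v)", *srvTopoCacheRefresh, *srvTopoCacheTTL)
	}
	log.Printf("using ttl=%v refresh=%v", *srvTopoCacheTTL, *srvTopoCacheRefresh)
}
```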
Force-pushed from 7bad15c to faea906.
@alainjobart As per our conversation in Slack, I've cleaned up the commit history and rebased to remove the various back and forth trying to get the tests to pass. The remaining commits seem logical enough to me to keep as-is.
What
Add support to the ResilientServer to cache the whole SrvKeyspace object for the configured TTL, with a refresh interval that is configurable separately from the TTL.
Why
This change fixes errors that we have observed where vtgate fails to route queries during intermittent consul cluster issues.
Details
For entries we don't watch (like the list of Keyspaces), we refresh the cached list from the topo after srv_topo_cache_refresh elapses. If the fetch fails, we hold onto the cached value until srv_topo_cache_ttl elapses.
For entries we watch (like the SrvKeyspace for a given cell), if setting the watch fails, we will use the last known value until srv_topo_cache_ttl elapses, and we only try to re-establish the watch once every srv_topo_cache_refresh interval.
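To make the first scenario concrete, here is a rough, self-contained sketch of the refresh-versus-TTL policy (names and structure are hypothetical, not the actual ResilientServer code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative sketch of the non-watched-entry policy described above.
type entry struct {
	value       string
	lastRefresh time.Time // last time a fetch was attempted
	lastSuccess time.Time // last time a fetch succeeded
}

const (
	cacheRefresh = 5 * time.Second  // stand-in for srv_topo_cache_refresh
	cacheTTL     = 30 * time.Second // stand-in for srv_topo_cache_ttl
)

// get refreshes after cacheRefresh elapses, but on failure keeps
// serving the cached value until cacheTTL elapses.
func (e *entry) get(now time.Time, fetch func() (string, error)) (string, error) {
	if now.Sub(e.lastRefresh) >= cacheRefresh {
		e.lastRefresh = now
		if v, err := fetch(); err == nil {
			e.value, e.lastSuccess = v, now
		}
	}
	if now.Sub(e.lastSuccess) >= cacheTTL {
		return "", errors.New("cached value expired and topo is unavailable")
	}
	return e.value, nil
}

func main() {
	e := &entry{}
	now := time.Now()
	v, _ := e.get(now, func() (string, error) { return "ks1,ks2", nil })
	fmt.Println("fresh:", v)
	// Topo down: still served from cache until the TTL elapses.
	v, err := e.get(now.Add(10*time.Second), func() (string, error) { return "", errors.New("down") })
	fmt.Println("cached:", v, err)
	_, err = e.get(now.Add(40*time.Second), func() (string, error) { return "", errors.New("down") })
	fmt.Println("expired:", err)
}
```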
I also reused some of the metrics that were being counted for the GetSrvKeyspaceNames handler for the GetSrvKeyspace watch, so we can better see what is going on under the hood.
Testing
In addition to the new unit tests added in this PR, I tested these changes on one of our dev vtgate hosts: I stopped the consul agent, causing all topo calls to fail, and confirmed that queries could still be routed for the duration of the cache TTL, but that once the TTL elapsed queries started failing as expected.