add resilient topo server caching for the full srv keyspace object #3610

Merged

alainjobart merged 11 commits into vitessio:master from tinyspeck:resilient-topo-server-should-be-more-resilient on Feb 5, 2018

Conversation

@demmer (Member) commented Jan 31, 2018

What

Add support to the ResilientServer to cache the whole SrvKeyspace object for a configured TTL, with a separately configurable refresh interval.

Why

This change fixes errors that we have observed where vtgate fails to route queries during intermittent consul cluster issues.

Details

For entries we don't watch (like the list of Keyspaces), we refresh the cached list from the topo after srv_topo_cache_refresh elapses. If the fetch fails, we hold onto the cached value until srv_topo_cache_ttl elapses.

For entries we watch (like the SrvKeyspace for a given cell), if setting the watch fails, we will use the last known value until srv_topo_cache_ttl elapses and we only try to re-establish the watch once every srv_topo_cache_refresh interval.

I also reused some of the metrics that were previously counted only for the GetSrvKeyspaceNames handler for the GetSrvKeyspace watch, so we can better see what is going on under the hood.

Testing

In addition to the new unit tests added in this PR, I tested these changes on one of our dev vtgate hosts. I stopped the consul agent, causing all topo calls to fail, and confirmed that queries could still be routed for the duration of the cache TTL. Once the TTL elapsed, queries started failing as expected.

@demmer requested a review from alainjobart on January 31, 2018 05:37
@demmer force-pushed the resilient-topo-server-should-be-more-resilient branch from 94d008f to cce0d3e on January 31, 2018 05:56
@alainjobart (Contributor) left a comment:

This looks good, thanks for doing this.

I agree with your future work list too, good stuff.

@alainjobart (Contributor) commented on the diff:

Probably extra code here that needs to go?

@demmer force-pushed the resilient-topo-server-should-be-more-resilient branch from 0165c35 to 6f87074 on February 1, 2018 05:51
@demmer (Member, Author) commented Feb 1, 2018

@alainjobart I finished up the future thoughts, so this is ready for review.

Note that I ended up making a bunch of changes to try to make the test comprehensive and reliable without slowing it down too much, which was a challenge since many of these behaviors depend on real clocks...

To be safe I marked it as flaky to try to appease Travis but this might end up needing more work if it is too unreliable.

@alainjobart (Contributor) left a comment:

Looks good overall. A few minor comments.

Thanks for doing this!

@alainjobart (Contributor) commented on the diff:

if c.Err == topo.ErrNoNode here?

@demmer (Member, Author) replied:

Ah yes, good catch, thanks.
Fixed, but I should probably add a test for this.

@alainjobart (Contributor) commented on the diff:

Maybe a defer cancel() right at the beginning of the go func() would be clearer?

@demmer (Member, Author) replied:

good idea - fixed.

@alainjobart (Contributor) commented on the diff:

You made that change because the test was giving you a warning in the unit_race?
If it's just for the test, I would change the line that sets the value to be much earlier in the test, before it's used, so you don't need this.

@demmer (Member, Author) replied:

Yeah exactly -- for some reason I just tried reverting that change to reproduce it and it doesn't happen any more. Yay races.

Regardless, your idea is way better, so I just did that.

@alainjobart (Contributor) commented on the diff:

I wonder where to add this, but it would be nice to have scenarios that describe the behavior, something like:

  • for entries we don't watch (like the list of Keyspaces), we refresh the list only after srv_topo_cache_ttl.
  • for entries we watch, and setting the watch fails, we will use the last known value for srv_topo_cache_ttl.

@demmer (Member, Author) replied:

Good idea. I added a block comment describing the behavior in the code itself. I can try to edit the flag description to be clearer, but I worry it would be too long to be useful.

@demmer (Member, Author) commented Feb 2, 2018

@alainjobart This is now passing tests and I think it should be ready to go.

demmer added 11 commits February 4, 2018 21:30
In order to simulate an infrastructure failure with the topo service,
add a SetError hook to the memorytopo factory that returns the given
error for all calls and propagates it to all active watches.
…names

Add support to cache the last value obtained for the SrvKeyspace so vtgate
can continue to route queries for the duration of the cache TTL even when
the local topo service is unavailable.
Separate out the cache TTL from the refresh interval for the keyspace
names caching.

This way during an error state it will not hammer the topo server and
at the same time we can set a long cache TTL to handle potential
topo service unavailability.
Piggybacking on the srv_topo_cache_refresh flag, keep track of the
last time the watch was attempted and don't try to re-establish the
watch until the refresh interval expires.

Also add more error logging and metrics for when the cache is used
for GetSrvKeyspace.
The logic depends on the refresh being smaller than or equal to the
TTL so add a startup assertion that it is configured that way.

Also update all test cases that override the ttl to also override
the refresh interval to avoid hitting this assertion.
@demmer force-pushed the resilient-topo-server-should-be-more-resilient branch from 7bad15c to faea906 on February 5, 2018 05:35
@demmer (Member, Author) commented Feb 5, 2018

@alainjobart As per conversation in Slack I've cleaned up the commit history and rebased to remove the various back and forth trying to get the tests to pass.

The remaining commits seem logical enough to me to keep them as-is.

@alainjobart (Contributor) commented Feb 5, 2018

LGTM

Approved with PullApprove

@alainjobart merged commit 638a807 into vitessio:master on Feb 5, 2018