add resilient topo server caching for the full srv keyspace object #3610

Merged

alainjobart merged 11 commits into vitessio:master from tinyspeck:resilient-topo-server-should-be-more-resilient on Feb 5, 2018

Conversation

@demmer (Member) commented Jan 31, 2018

What

Add support to the ResilientServer to cache the whole SrvKeyspace object for a configured TTL, with a separately configurable refresh interval.

Why

This change fixes errors that we have observed where vtgate fails to route queries during intermittent consul cluster issues.

Details

For entries we don't watch (like the list of Keyspaces), we refresh the cached list from the topo after srv_topo_cache_refresh elapses. If the fetch fails, we hold onto the cached value until srv_topo_cache_ttl elapses.

For entries we watch (like the SrvKeyspace for a given cell), if setting the watch fails, we will use the last known value until srv_topo_cache_ttl elapses and we only try to re-establish the watch once every srv_topo_cache_refresh interval.

I also reused some of the metrics that were previously counted only for the GetSrvKeyspaceNames handler for the GetSrvKeyspace watch, so we can better see what is going on under the hood.

Testing

In addition to the new unit tests added in this PR, I tested these changes on one of our dev vtgate hosts. I stopped the consul agent, causing all topo calls to fail, and confirmed that queries could still be routed for the duration of the cache TTL. Once the TTL elapsed, queries started failing as expected.

@demmer requested a review from alainjobart on January 31, 2018 05:37
@demmer force-pushed the resilient-topo-server-should-be-more-resilient branch from 94d008f to cce0d3e on January 31, 2018 05:56
@alainjobart (Contributor) left a comment:

This looks good, thanks for doing this.

I agree with your future work list too, good stuff.

@alainjobart (Contributor) commented on the diff:

Probably extra code here that needs to go?

@demmer force-pushed the resilient-topo-server-should-be-more-resilient branch from 0165c35 to 6f87074 on February 1, 2018 05:51
@demmer (Member, Author) commented Feb 1, 2018

@alainjobart I finished up the future thoughts, so this is ready for review.

Note that I ended up making a bunch of changes to try to make the test comprehensive and reliable without slowing it down too much, which was a challenge since many of these behaviors depend on real clocks...

To be safe I marked it as flaky to try to appease Travis but this might end up needing more work if it is too unreliable.

@alainjobart (Contributor) left a comment:

Looks good overall. A few minor comments.

Thanks for doing this!

@alainjobart (Contributor) commented on the diff:

if c.Err == topo.ErrNoNode here?

@demmer (Member, Author) replied:

Ah yes, good catch, thanks.
Fixed, but I should probably add a test for this.

@alainjobart (Contributor) commented on the diff:

Maybe a defer cancel() right at the beginning of the go func() would be clearer?

@demmer (Member, Author) replied:

good idea - fixed.

@alainjobart (Contributor) commented on the diff:

You made that change because the test was giving you a warning in the unit_race?
If it's just for the test, I would change the line that sets the value to be much earlier in the test, before it's used, so you don't need this.

@demmer (Member, Author) replied:

Yeah exactly -- for some reason I just tried reverting that change to reproduce it and it doesn't happen any more. Yay races.

Regardless, your idea is way better, so I just did that.

@alainjobart (Contributor) commented on the diff:

I wonder where to add this, but it would be nice to have scenarios that describe the behavior, something like:

  • for entries we don't watch (like the list of Keyspaces), we refresh the list only after srv_topo_cache_ttl.
  • for entries we watch, and setting the watch fails, we will use the last known value for srv_topo_cache_ttl.

@demmer (Member, Author) replied:

Good idea. I added a block comment describing the behavior in the code itself. I can try to edit the flag description to be clearer, but I worry it would be too long to be useful.

@demmer (Member, Author) commented Feb 2, 2018

@alainjobart This is now passing tests and I think it should be ready to go.

demmer added 11 commits February 4, 2018 21:30
In order to simulate an infrastructure failure with the topo service,
add a SetError hook to the memorytopo factory that returns the given
error for all calls and propagates it to all active watches.
…names

Add support to cache the last value obtained for the SrvKeyspace so vtgate
can continue to route queries for the duration of the cache TTL even when
the local topo service is unavailable.
Separate out the cache TTL from the refresh interval for the keyspace
names caching.

This way during an error state it will not hammer the topo server and
at the same time we can set a long cache TTL to handle potential
topo service unavailability.
Piggybacking on the srv_topo_cache_refresh flag, keep track of the
last time the watch was attempted and don't try to re-establish the
watch until the refresh interval expires.

Also add more error logging and metrics for when the cache is used
for GetSrvKeyspace.
The logic depends on the refresh being smaller than or equal to the
TTL so add a startup assertion that it is configured that way.

Also update all test cases that override the ttl to also override
the refresh interval to avoid hitting this assertion.
@demmer force-pushed the resilient-topo-server-should-be-more-resilient branch from 7bad15c to faea906 on February 5, 2018 05:35
@demmer (Member, Author) commented Feb 5, 2018

@alainjobart As per conversation in Slack I've cleaned up the commit history and rebased to remove the various back and forth trying to get the tests to pass.

The remaining commits seem logical enough to me to keep them as-is.

@alainjobart (Contributor) commented Feb 5, 2018

LGTM

Approved with PullApprove

@alainjobart merged commit 638a807 into vitessio:master on Feb 5, 2018