Skip to content

[vt/discovery] Add region context in GetAggregateStats#4459

Closed
rafael wants to merge 4 commits intovitessio:masterfrom
tinyspeck:add-region-context-aggr-stats
Closed

[vt/discovery] Add region context in GetAggregateStats#4459
rafael wants to merge 4 commits intovitessio:masterfrom
tinyspeck:add-region-context-aggr-stats

Conversation

@rafael
Copy link
Copy Markdown
Member

@rafael rafael commented Dec 16, 2018

Description

While testing a multi-cell setup using the regions feature, I discovered that there was a bug in the original implementation.

As part of that PR healthy tablet_stats was modified to take the region into consideration. However, this was not the case for the aggregates. This has the side effect that the resolver was not being able to find tablets that are in a different cell, but same region.

My first idea was to change the aggregates to not be by cell but by region, but that creates some complications as these aggregates are used as part of TargetStatsListener interface which broadcast changes of targets. At the moment,l2vtgates rely on this interface.

I tried different approaches and right now I'm leaning towards adding aggregatesPerRegion to tablet_stats_cache and change the implementation of GetAggregateStats so it takes region into consideration when looking for aggregates stats.

I was pondering if it would make sense to make the Target aware of the region, but that seems like a very invasive change.

Tests

  • I added some unit tests that were missing.
  • My initial approach was to build aggregates per region on the fly, but that introduced a performance regression in GetAggregateStats... I added some benchmarks to make sure these changes do not introduce a regression in the resolver.
  • Tested these changes in our dev cluster.

TODO

  • This implementation fixes this problem for vtgates. l2vtgates would need to be updated to have a similar fix. If we decide that the current approach is viable, I think we can do that as part of a different PR
  • Add an end-to-end test for multi-cell region functionality. Also, would like to add that once we decide on the approach here.

Rafael Chacon added 3 commits December 15, 2018 12:18
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
* This change allows to have constant time resolution of aggregates per region

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@rafael rafael requested a review from sougou as a code owner December 16, 2018 01:26
…ct def

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@rafael
Copy link
Copy Markdown
Member Author

rafael commented Dec 16, 2018

@alainjobart you know in more depth this part of the codebase. May I get your feedback on this?

@inexplicable could you take a look too please? You might also have some extra context here.

@sougou
Copy link
Copy Markdown
Contributor

sougou commented Dec 16, 2018

This should probably be a separate and more detailed write-up. But I think we l2vtgate is now obsolete. The reason is because there are now service-mesh tools like envoy that can perform the same work generically.

I believe this also obsoletes the need for aggregate stats, although I need to read through all the use cases.

@rafael
Copy link
Copy Markdown
Member Author

rafael commented Dec 19, 2018

@sougou - created PR with this fix that also removes l2vtgates.

@rafael
Copy link
Copy Markdown
Member Author

rafael commented Dec 21, 2018

Closing this in favor of #4476

@rafael rafael closed this Dec 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants