Topology watcher refreshKnownTablets option #3965
Merged
demmer merged 5 commits into vitessio:master on May 29, 2018
Using the newly added counters, add verification that the various operation counts occur as expected. This also required adding calls to topo.FixShardReplication in the tests to avoid differences in the operation counts between the two types of topology watchers. Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Unlike ResetAll, ZeroAll keeps all the same keys in the map but changes all the values to zero. Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Instead of tracking all the tablets by the TabletToMapKey value, use the alias as the key to all the data structures used in the scan comparisons. This leaves the behavior unchanged, with one exception: when a tablet with a known alias changes the value of its address key. Previously the watcher would call AddTablet, then RemoveTablet; now it explicitly calls ReplaceTablet, which has the same net effect and seems more correct. Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Add a refreshKnownTablets option for the TopologyWatcher and a corresponding flag in discovery gateway. The default behavior is unchanged, which means that each vtgate will periodically re-read the TabletInfo record for each tablet in case the address/port map changes. However, the new flag can disable these queries for environments in which the association between a tablet alias and the host/port map never changes. This greatly reduces the load on the topo service, since most of the k/v requests are for refreshing the TabletInfo and there's no efficient way to watch for this data. Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
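The difference between ResetAll and ZeroAll described in the commit above can be illustrated with a minimal sketch. The `counters` type and its methods here are hypothetical stand-ins written for this example, not the Vitess stats implementation; only the two method names come from the commit message.

```go
package main

import "fmt"

// counters is a hypothetical named-counter map used only for illustration.
// It is not the Vitess stats type and is not safe for concurrent use.
type counters struct {
	counts map[string]int64
}

func newCounters() *counters {
	return &counters{counts: make(map[string]int64)}
}

// Add increments the named counter, creating it if needed.
func (c *counters) Add(name string, delta int64) {
	c.counts[name] += delta
}

// ResetAll drops every key, so the map is empty afterwards.
func (c *counters) ResetAll() {
	c.counts = make(map[string]int64)
}

// ZeroAll keeps every existing key but sets each value to zero.
func (c *counters) ZeroAll() {
	for name := range c.counts {
		c.counts[name] = 0
	}
}

func main() {
	c := newCounters()
	c.Add("GetTablet", 3)

	c.ZeroAll()
	_, present := c.counts["GetTablet"]
	fmt.Println("after ZeroAll: present =", present, "value =", c.counts["GetTablet"])

	c.ResetAll()
	_, present = c.counts["GetTablet"]
	fmt.Println("after ResetAll: present =", present)
}
```

Keeping the keys is convenient in tests that compare expected and actual operation counts, because a counter that saw no operations still shows up with a value of zero rather than being absent.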
sougou
reviewed
May 23, 2018
tw.mu.Lock()
for _, tAlias := range tabletAliases {
	if !tw.refreshKnownTablets {
		aliasStr := topoproto.TabletAliasString(tAlias)
Contributor
Looks like you decided to not use the alias as the key instead. But healthcheck is still using TabletToMapKey. How do the two coordinate correctly?
Member
Author
The TopologyWatcher now uses the alias as the key for the internal tablets map and all the temporary data structures.
When it calls into healthcheck to add/remove/replace the tablet, it passes the full tablet record. At that point HC recomputes its own hash key from the address map. I think we could (and probably should) switch that to store tablets keyed by the alias as well, but it's not necessary as part of this change.
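A rough sketch of the two keying schemes described in this reply, and of how an address change on a known alias maps to ReplaceTablet. The `tablet` struct, `aliasKey`, `addressKey`, and `diffOp` are hypothetical names invented for this example; the real code uses topoproto.TabletAliasString and TabletToMapKey, whose exact output formats are not assumed here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// tablet is a simplified, hypothetical stand-in for the topo tablet record.
type tablet struct {
	Cell     string
	UID      uint32
	Hostname string
	PortMap  map[string]int32
}

// aliasKey identifies a tablet by its alias, which does not depend on
// where the tablet is currently running. The format is an assumption.
func aliasKey(t tablet) string {
	return fmt.Sprintf("%s-%010d", t.Cell, t.UID)
}

// addressKey derives a key from the hostname and the sorted port map,
// so it changes whenever the tablet's address or ports change.
func addressKey(t tablet) string {
	names := make([]string, 0, len(t.PortMap))
	for name := range t.PortMap {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := []string{t.Hostname}
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s:%d", name, t.PortMap[name]))
	}
	return strings.Join(parts, ",")
}

// diffOp reports which healthcheck call a watcher keyed by alias would
// make for a tablet seen in the latest scan.
func diffOp(known map[string]tablet, t tablet) string {
	old, ok := known[aliasKey(t)]
	switch {
	case !ok:
		return "AddTablet"
	case addressKey(old) != addressKey(t):
		return "ReplaceTablet"
	default:
		return "none"
	}
}

func main() {
	known := map[string]tablet{}
	t := tablet{Cell: "zone1", UID: 100, Hostname: "host1", PortMap: map[string]int32{"grpc": 15999}}
	fmt.Println(diffOp(known, t)) // first sighting of this alias
	known[aliasKey(t)] = t

	moved := t
	moved.Hostname = "host2"
	fmt.Println(diffOp(known, moved)) // same alias, different address
}
```

Because the watcher passes the full tablet record on each call, the receiving side is free to compute whatever key it wants from that record, which is why the two components can use different keys without coordinating.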
Contributor
This is good for me. @alainjobart can you eyeball? If you don't have the time, we can just merge.
Member
Author
I also want to caveat that we haven’t tried this in an actual test deployment (yet).
…-m
sougou
approved these changes
May 29, 2018
ejortegau added a commit to slackhq/vitess that referenced this pull request Sep 16, 2024
This backports upstream PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code.
The description of the original PR follows:
VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things:
* GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
* The behavior of the --refresh_known_tablets flag is different now. Previously we would not read those tablets at all; now we do read them, but ignore any changes if they are already known.
The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
ejortegau added a commit to slackhq/vitess that referenced this pull request Oct 4, 2024
* Backport Use GetTabletsByCell in healthcheck
This backports upstream PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code.
The description of the original PR follows:
VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things:
* GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error.
* Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency.
* We pass topoReadConcurrency from healthcheck into GetTabletsForCell.
* The behavior of the --refresh_known_tablets flag is different now. Previously we would not read those tablets at all; now we do read them, but ignore any changes if they are already known.
The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
Overview
Adds an option to the topology watcher to reduce the topo k/v polling load in environments where the tablets never change their address/port information once launched.
Motivation
The TopologyWatcher module in vtgate is responsible for periodically checking the topo service to find out whether new tablets have been provisioned so that HealthCheck can be notified.
To support environments in which the tablet may change address/port information, i.e. when the association between a tablet alias and the host/port map isn't stable over time, the default behavior gets a list of all the tablet aliases and then re-reads the topo k/v for each tablet.
This operation accounts for by far the majority of the k/v polling load from a vtgate. The rate of k/v requests is
NumVtgates * NumVttablets / PollingInterval, which grows quickly as the cluster grows.
Changes
To reduce this load, this PR adds a refreshKnownTablets option to the TopologyWatcher and a corresponding flag in discovery gateway. The default behavior is unchanged, which means that each vtgate will periodically re-read the TabletInfo record for each tablet in case the address/port map changes.
However, the new flag can disable these queries for environments in which the association between a tablet alias and the host/port map never changes. This greatly reduces the load on the topo service, since most of the k/v requests are for refreshing the TabletInfo and there's no efficient way to watch for this data.
Testing
I added extensive unit tests for this but have not (yet) verified in a real environment.