Only refresh required tablet's information in VTOrc #11220

Merged
GuptaManan100 merged 10 commits into vitessio:main from planetscale:vtorc-conservative-refresh
Sep 15, 2022

@GuptaManan100 GuptaManan100 commented Sep 14, 2022

Description

In #10115 and #10150 we added the capability of refreshing VTOrc's ephemeral information before it ran any fix. This was required to help guarantee safety; however, the first iteration of this change was inefficient.

We used to refresh all the tablets in the VTOrc instance's purview for each and every recovery. As part of those PRs, a TODO for fixing this was also added:

// TODO (@GuptaManan100): Refresh only the shard tablet information instead of all the tablets

This change couldn't be immediately accomplished because we first required the cleanup of cluster_alias, cluster_name, and suggested_cluster_alias. This cleanup was addressed in #11193.

This PR addresses that TODO in order to make recoveries more efficient and faster. Instead of refreshing all tablets in a shard, as the TODO proposed, this PR takes it a step further and refreshes only the tablets that are required.

To this end, all the recoveries have been categorized into two types: cluster-wide recoveries and non-cluster-wide recoveries.

If we are about to run a cluster-wide recovery like electNewPrimary or recoverDeadPrimary, then it is imperative to first refresh all the tablets of a shard because a new tablet could have been promoted, and we need to have this visibility before we run a cluster operation of our own.

Non-cluster-wide recoveries are only concerned with the specific tablet on which the failure occurred and the primary instance of the shard. For example, the ConnectedToWrongPrimary analysis only cares about which tablet is the current primary and the host-port set on the tablet in question. So, we only need to refresh the tablet info records (to know if the primary tablet has changed), and the replication data of the new primary and this tablet.
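
The split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `refreshScope` type, the `scopeFor` function, and the `fixReplica` recovery name are hypothetical, and only `electNewPrimary` and `recoverDeadPrimary` are taken from the description as cluster-wide recoveries (the real set in VTOrc may be larger).

```go
package main

import "fmt"

// refreshScope is a hypothetical type describing how much state is reloaded
// before a recovery runs.
type refreshScope int

const (
	// refreshWholeShard reloads every tablet in the keyspace/shard.
	refreshWholeShard refreshScope = iota
	// refreshPrimaryAndAnalyzed reloads the tablet info records, then the
	// replication data of only the shard primary and the analyzed tablet.
	refreshPrimaryAndAnalyzed
)

// clusterWide lists recoveries that operate on the whole shard. The two names
// come from the PR description; treating this as the full set is an assumption.
var clusterWide = map[string]bool{
	"electNewPrimary":    true,
	"recoverDeadPrimary": true,
}

// scopeFor picks the refresh scope for a recovery name.
func scopeFor(recovery string) refreshScope {
	if clusterWide[recovery] {
		return refreshWholeShard
	}
	return refreshPrimaryAndAnalyzed
}

func main() {
	// "fixReplica" is a hypothetical non-cluster-wide recovery name.
	for _, r := range []string{"electNewPrimary", "fixReplica"} {
		fmt.Printf("%s -> scope %d\n", r, scopeFor(r))
	}
}
```

The point of the sketch is only that the decision is made per recovery type, so the expensive whole-shard refresh is paid only when a promotion may have happened.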

These changes make VTOrc recoveries much faster, while still guaranteeing correctness like they used to 🧞

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

… code and use it for finding if their is an actionable recovery and the recovery function

Signed-off-by: Manan Gupta <manan@planetscale.com>
…e when getting replication analysis

Signed-off-by: Manan Gupta <manan@planetscale.com>
…n instead of the big-hammer approach of refreshing everything

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 added the Type: Enhancement (Logical improvement, somewhere between a bug and feature) and Component: VTOrc (Vitess Orchestrator integration) labels Sep 14, 2022
vitess-bot bot commented Sep 14, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive. Additionally, flag names should use dashes (-) as word separators rather than underscores (_).
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

AnalyzedInstanceDataCenter string
AnalyzedInstanceRegion string
AnalyzedKeyspace string
AnalyzedShard string
Contributor Author

We now also need the name of the shard that was analyzed, since we want to restrict the set of tablets we refresh.

Comment on lines +259 to +261
if forceDiscovery {
	log.Infof("Force discovered - %+v", instance)
}
Contributor Author

This addition of logging is intentional. Until we have a metrics page where we export the internal database information of VTOrc, this is going to be very useful in debugging. I had it in my mind to add this log and I am just piggy-backing on this PR.

Comment on lines +313 to +336
func shardPrimary(keyspace string, shard string) (primary *topodatapb.Tablet, err error) {
	query := `SELECT
		info,
		hostname,
		port,
		tablet_type,
		primary_timestamp
	FROM
		vitess_tablet
	WHERE
		keyspace = ? AND shard = ?
	ORDER BY
		tablet_type ASC,
		primary_timestamp DESC
	LIMIT 1
`
	err = db.Db.QueryOrchestrator(query, sqlutils.Args(keyspace, shard), func(m sqlutils.RowMap) error {
		if primary == nil {
			primary = &topodatapb.Tablet{}
			return prototext.Unmarshal([]byte(m.GetString("info")), primary)
		}
		return nil
	})
	return primary, err
Contributor Author

This is another small enhancement that was made in this PR, wherein we can use the tablet information we have collected to find the shard primary tablet. We use something similar in GetReplicationAnalysis to find who the primary is.
Previously we used to do a topo-server call to read the shard and tablet record, but it isn't required.

Contributor Author

I have added tests for this function, and while adding them I also found a bug in my original implementation 🤣. Sorting on tablet_type is not enough; we need to filter on it!
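
To illustrate why filtering matters, here is a small in-memory sketch. It does not query the VTOrc backend; `tabletRow` and `pickPrimary` are hypothetical stand-ins for rows of `vitess_tablet` and for the selection that the query performs. With ordering alone plus `LIMIT 1`, a shard that has no primary would still return some row (for example a replica); with a filter, it returns nothing.

```go
package main

import "fmt"

// tabletRow is a hypothetical in-memory stand-in for one row of vitess_tablet.
type tabletRow struct {
	Alias            string
	IsPrimary        bool  // stands in for tablet_type = PRIMARY
	PrimaryTimestamp int64 // stands in for primary_timestamp
}

// pickPrimary returns the PRIMARY-typed row with the newest timestamp.
// The boolean is false when no row is a primary.
func pickPrimary(rows []tabletRow) (tabletRow, bool) {
	var best tabletRow
	found := false
	for _, r := range rows {
		if !r.IsPrimary {
			continue // filter on type instead of merely ordering by it
		}
		if !found || r.PrimaryTimestamp > best.PrimaryTimestamp {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	// A shard with only replicas: ordering alone would still pick a row here.
	onlyReplicas := []tabletRow{{Alias: "zone1-101"}, {Alias: "zone1-102"}}
	_, ok := pickPrimary(onlyReplicas)
	fmt.Println("primary found in replica-only shard:", ok)

	// Two rows claim PRIMARY; the one with the newer timestamp wins.
	twoPrimaries := []tabletRow{
		{Alias: "zone1-100", IsPrimary: true, PrimaryTimestamp: 10},
		{Alias: "zone1-101", IsPrimary: true, PrimaryTimestamp: 20},
		{Alias: "zone1-102"},
	}
	p, _ := pickPrimary(twoPrimaries)
	fmt.Println("chosen primary:", p.Alias)
}
```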

// Can't do this now since SuggestedClusterAlias, ClusterName, ClusterAlias aren't consistent
// and passing any one causes issues in some failures
- analysisEntries, err := inst.GetReplicationAnalysis("", &inst.ReplicationAnalysisHints{})
+ analysisEntries, err := inst.GetReplicationAnalysis(analysisEntry.ClusterDetails.ClusterName, &inst.ReplicationAnalysisHints{})
Contributor Author

This is a change that could have happened in #11193 but I am piggy-backing on this PR to not create a separate one just for this change.

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…tion

Signed-off-by: Manan Gupta <manan@planetscale.com>
Contributor

@rsajwani rsajwani left a comment

Overall LGTM.

return nil, err
// shardPrimary finds the primary of the given keyspace-shard by reading the orchestrator backend
func shardPrimary(keyspace string, shard string) (primary *topodatapb.Tablet, err error) {
query := `SELECT
Contributor

general comment: should we not put all query execution in some retryable template function?

Contributor Author

@GuptaManan100 GuptaManan100 Sep 15, 2022

Yes, that would be a good addition, but so far we have not really needed it, because even if the read fails, we just fail the recovery and retry later.
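
For reference, a retryable wrapper of the kind suggested here could look roughly like the sketch below. Nothing like it exists in this PR; `withRetry`, `errTransient`, and the attempt and delay parameters are all hypothetical, and the query is represented by a plain `func() error` rather than any real VTOrc call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error used only to simulate a failing read.
var errTransient = errors.New("transient read failure")

// withRetry is a hypothetical helper that runs fn up to attempts times,
// sleeping for delay between failed attempts. It returns nil on the first
// success and otherwise wraps the last error.
func withRetry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated query that fails twice before succeeding.
	err := withRetry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

As the reply notes, the current behavior of failing the recovery and retrying it later already acts as a coarse retry, which is why a helper like this has not been necessary so far.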

primary_timestamp DESC
LIMIT 1
`
err = db.Db.QueryOrchestrator(query, sqlutils.Args(keyspace, shard, topodatapb.TabletType_PRIMARY), func(m sqlutils.RowMap) error {
Collaborator

As part of cleaning up, we should rename some of these functions. QueryOrchestrator is not the right name for this function. Similarly we have OpenOrchestrator too.

Contributor Author

Yes, I had it in mind to get rid of the Orchestrator references everywhere: parameters, flags, package names, function names, etc. I'll do it in a follow-up PR so that it is easier to review.

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 merged commit b5d8281 into vitessio:main Sep 15, 2022
@GuptaManan100 GuptaManan100 deleted the vtorc-conservative-refresh branch September 15, 2022 07:50
notfelineit pushed a commit to planetscale/vitess that referenced this pull request Sep 21, 2022
…itessio#1061)

* refactor: make recovery Function code as the identifier of a function code and use it for finding if their is an actionable recovery and the recovery function

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remvoe a TODO in checkIfAlreadyFixed by sending the cluster name when getting replication analysis

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: refactor refersh code logic to only refresh required information instead of the big-hammer approach of refreshing everything

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add logs and refresh for analyzed instance after a recovery

Signed-off-by: Manan Gupta <manan@planetscale.com>

* refactor: fix typing error in comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: use context.Background() instead of nil

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add testing for refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add tests for shardPrimary function and also fix its implementation

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: address review comments

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: use cmp with proto.Equal

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
Labels

Component: VTOrc (Vitess Orchestrator integration)
Type: Enhancement (Logical improvement, somewhere between a bug and feature)

3 participants