Skip to content

Refresh replicas and rdonly after MigrateServedTypes except on skipRefreshState.#7327

Merged
rafael merged 1 commit intovitessio:masterfrom
tinyspeck:ma-refresh-fix
Feb 25, 2021
Merged

Refresh replicas and rdonly after MigrateServedTypes except on skipRefreshState.#7327
rafael merged 1 commit intovitessio:masterfrom
tinyspeck:ma-refresh-fix

Conversation

@makinje16
Copy link
Contributor

@makinje16 makinje16 commented Jan 20, 2021

Description

This PR changes MigrateServedTypes to refresh both Replicas and Rdonly tablets after every migration. Ran into an issue during a migration in which a replica was promoted to primary but hadn't been refreshed since we initially only refreshed tablets with servedType. With the primaries migrated and refreshed but not the replicas, when there is a failover before the replicas have had their state refreshed, the replica will still believe it is in a split state causing it not to serve until RefreshState is done on the host.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build
  • VTAdmin

Copy link
Contributor

@ajm188 ajm188 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, just one comment that needs to be updated now. I restarted the failed jobs for you.

I can't tell if this is flaking or if your change broke these tests https://github.com/vitessio/vitess/pull/7327/checks?check_run_id=1737826998#step:5:1932

        	            	</html>" does not contain "TabletControl.DisableQueryService set"
        	Test:       	TestMergeShardingIntShardingKey

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would work for the master, but the discovery wouldn't be very efficient

This is no longer true, correct?

Copy link
Member

@rafael rafael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makinje16 did a first pass.

As discussed, this still needs to be squashed and sign the commits to comply with the DCO requirement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment should be something like:

// Refresh both source and destinations to make sure they all have the latest shardInfo data.

@rohit-nayak-ps
Copy link
Member

The fix we (@rafael, @sougou and I) discussed yesterday: to expand the check in CheckTabletQueryService() to also look for the additional status master tablet with filtered replication on for non-serving tablets allowed the test to pass further than where it was failing yesterday.

But now I get that error due to replicas and rdonly partitions appearing in the SrvKeyspace. I guess one of the tablet refreshes introduced by this PR is causing this.

@rafael , @makinje16 , I think you will need to log the srvkeyspaces in the master branch and in your PR and see at which point the partitions are deviating. I don't really know enough about the old workflows to debug this efficiently.

@rafael
Copy link
Member

rafael commented Feb 18, 2021

But now I get that error due to replicas and rdonly partitions appearing in the SrvKeyspace. I guess one of the tablet refreshes introduced by this PR is causing this.

Which error exactly are you seeing? I'm curious if it's part of the same. Is it really SrvKeyspace that is deviating or that the tablet is starting query service because all the tablet control metadata has been removed and we clean up the state when finishing the migration.

@rohit-nayak-ps
Copy link
Member

Which error exactly are you seeing? I'm curious if it's part of the same. Is it really SrvKeyspace that is deviating or that the tablet is starting query service because all the tablet control metadata has been removed and we clean up the state when finishing the migration.

if len(shardServedTypes) > 0 {
    return fmt.Errorf("cannot migrate MASTER away from %v/%v until everything else is migrated. Make sure that the following types are migrated first: %v", si.Keyspace(), si.ShardName(), strings.Join(shardServedTypes, ", "))
}

Looks like the replica/rdonly tablets might be serving again after a refresh.

@rafael
Copy link
Member

rafael commented Feb 24, 2021

After some debugging and refreshing a bit our memories of how this part of the system works, here is the main key-takeaway on why we had so much trouble getting this test to pass:

We noticed something that doesn’t look right in the canServe function in tm_state.go that could be problematic after cutover.

This is the function:

func (ts *tmState) canServe(tabletType topodatapb.TabletType) string {
	if !topo.IsRunningQueryService(tabletType) {
		return fmt.Sprintf("not a serving tablet type(%v)", tabletType)
	}
	if ts.tabletControls[tabletType] {
		return "TabletControl.DisableQueryService set"
	}
	if tabletType == topodatapb.TabletType_MASTER && ts.isResharding {
		return "master tablet with filtered replication on"
	}
	return ""
}

Currently, we remove tabletControls after finishing the migration. This means that if we refresh the source they could go back to serving.

Currently it works a bit by luck for the MASTER type: we create the reverse replication stream as part of the migration and query service can't start if this is present. However, RDONLY and REPLICA will go back to serving as soon as we refresh the state on those tablets.

The concrete proposal to improve this edge case is to add extra validations in canServe in tm_state.go so a tablet can't start the query service if it's range is not part of the SrvKeyspaceGraph. This change requires some extra refactoring and it's out of scope for the immediate concern of making sure we refresh tablets on the destination. @makinje16 is already working on this and will have a follow up PR for this.

In the meantime, this LGTM to merge as is.

Copy link
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks good, but I have a question on one of the comments.

@deepthi deepthi removed request for doeg and systay February 24, 2021 23:15
Signed-off-by: Malcolm Akinje <makinje@slack-corp.com>
@rafael
Copy link
Member

rafael commented Feb 25, 2021

All comments have been addressed. Merging!

@rafael rafael merged commit e40e4c5 into vitessio:master Feb 25, 2021
@rafael rafael deleted the ma-refresh-fix branch February 25, 2021 18:42
@askdba askdba added this to the v10.0 milestone Feb 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants