Turn off repairReplication if other reparenting events are going on #4024

Merged: sougou merged 6 commits into master from if-reparenting-turn-off-repair-replication on Jul 9, 2018

Turn off repairReplication if other reparenting events are going on #4024
sougou merged 6 commits intomasterfrom
if-reparenting-turn-off-repair-replication

Conversation

@zmagg (Contributor) commented Jun 13, 2018

Problem

Described in more detail in Slack: currently the tabletmanager's repairReplication races with Orchestrator's failover mechanisms, which results in a co-master situation in our topology during external reparenting events run by Orchestrator. The tabletmanager's repairReplication can also race with Vitess's internal reparenting actions.

Solution

Add support for turning off repairReplication as part of the tablet health check if other reparenting events are going on.

This implementation changes repairReplication to use the same shard lock in the topo as the PlannedReparent and EmergencyReparent operations do, eliminating the race condition between repairReplication and internal Vitess reparenting.
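The shard-lock half of this can be sketched with a toy in-memory lock manager (illustrative only; the real code takes the lock through the topo server, and the names here are made up for the sketch). The point is that repairReplication and the reparent operations now contend on the same lock, so they can no longer interleave:

```go
package main

import (
	"fmt"
	"sync"
)

// shardLocks is a toy stand-in for the topo server's shard lock,
// keyed by "keyspace/shard". Not the real Vitess API.
var shardLocks sync.Map

// lockShard blocks until the caller holds the shard's lock and
// returns the matching unlock function.
func lockShard(shard string) (unlock func()) {
	m, _ := shardLocks.LoadOrStore(shard, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	return mu.Unlock
}

func repairReplication(shard string) {
	unlock := lockShard(shard)
	defer unlock()
	fmt.Println("repairReplication holds lock on", shard)
}

func plannedReparent(shard string) {
	unlock := lockShard(shard)
	defer unlock()
	fmt.Println("PlannedReparent holds lock on", shard)
}

func main() {
	repairReplication("commerce/0")
	// Would block until the repair above released the lock.
	plannedReparent("commerce/0")
}
```

Because both paths go through the same `lockShard`, a repair can never observe (or create) a half-finished reparent.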

To minimize the race between repairReplication and Orchestrator-driven external reparenting, repairReplication first checks whether Orchestrator is actively reparenting, and if so, skips the repair.
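That check amounts to a guard in front of the repair. A minimal sketch, assuming a hypothetical `InActiveMaintenance` call standing in for the real Orchestrator HTTP API query:

```go
package main

import "fmt"

// MaintenanceChecker abstracts the Orchestrator call that reports whether
// a maintenance (reparenting) window is active for an instance. This is a
// hypothetical interface for illustration; the real code queries
// Orchestrator's HTTP API.
type MaintenanceChecker interface {
	InActiveMaintenance(instance string) (bool, error)
}

// fakeOrc is a stub used here in place of a live Orchestrator.
type fakeOrc struct{ active bool }

func (f fakeOrc) InActiveMaintenance(string) (bool, error) { return f.active, nil }

// shouldRepairReplication returns false when Orchestrator reports an
// active maintenance window, so the health check skips the replication
// repair instead of racing with an in-flight failover.
func shouldRepairReplication(orc MaintenanceChecker, instance string) bool {
	busy, err := orc.InActiveMaintenance(instance)
	if err != nil {
		// Can't tell whether a failover is running; skip this cycle.
		return false
	}
	return !busy
}

func main() {
	fmt.Println(shouldRepairReplication(fakeOrc{active: true}, "db-1"))  // false
	fmt.Println(shouldRepairReplication(fakeOrc{active: false}, "db-1")) // true
}
```

Failing closed on errors is deliberate: skipping one repair cycle is cheap, while repairing during a hidden failover is what produces co-masters.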

Before beginning any replication manipulation, take the Orchestrator-side lock via
BeginMaintenance/EndMaintenance to indicate that Orchestrator should not begin any
reparenting. There's still a race if Orchestrator begins failover before it receives the BeginMaintenance call. I'd love to hear suggestions for how to eliminate this race entirely.

As it turns out, PlannedReparenting and EmergencyReparenting already indicate to Orchestrator that they're manipulating the topology, calling out to Orchestrator's BeginMaintenance before beginning topo manipulations (i.e. https://github.com/vitessio/vitess/blob/master/go/vt/wrangler/reparent.go#L403). This helps reduce the risk of Orchestrator and Vitess-internal reparenting interfering with each other.

If we decide to go with this approach, there are a few other things I'd like to add as well:

- [ ] add more metrics about how often repairReplication is called and run
- [ ] write unit tests

@zmagg zmagg force-pushed the if-reparenting-turn-off-repair-replication branch 2 times, most recently from 37ebc32 to f6be9af on June 13, 2018 02:06
"<tablet alias>",
"Changes metadata in the topology server to acknowledge a shard master change performed by an external tool. See the Reparenting guide for more information:" +
"https://github.com/vitessio/vitess/blob/master/doc/Reparenting.md#external-reparents."},
{"PreExternallyReparentShard", commandPreExternallyReparentShard,
Contributor
I think it will be simpler to have repairReplication check with Orchestrator whether it's in the middle of a maintenance window instead. There are a few advantages:

  • Orchestrator has one less dependency.
  • You don't have to worry about the lifecycle of this state. What if Orc is not able to update this after it sets the value?

During repairReplication, lock the shard in the topo as it's locked for
reparenting events / delete events.

Also during repairReplication, check to see if Orchestrator is doing any
reparenting actively, and if so, do not repairReplication. Before
repairing replication, use the _orchestrator_ side lock of
BeginMaintenance to indicate that orchestrator should not begin any
reparenting.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg force-pushed the if-reparenting-turn-off-repair-replication branch from f6be9af to 353475b on June 20, 2018 06:01
zmagg added 2 commits June 19, 2018 23:04
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
```go
		log.Warningf("Orchestrator EndMaintenance failed: %v", err)
	}
}()
defer func() {
```
Contributor Author
This is perhaps not super clear. I'm removing the conditional `if agent.Tablet().Type == topodatapb.TabletType_MASTER` here because, in the case where the tablet was a REPLICA and we're calling into this function from repairReplication, we also want to end Orchestrator maintenance at the end of this function. Ending Orchestrator maintenance is still best-effort even in that case, as Orchestrator maintenance windows are time-bounded.

Contributor

I think this is ok for now. Ideally, whoever begins maintenance should be the one ending it (in a defer).
So, in the case of repairReplication, the block that calls beginMaintenance should defer endMaintenance.
But then, we should also move other call sites to the vtctld workflow (wrangler), where begin and end should be called as part of the reparent commands. This probably means that vtctld needs to know about Orc. So, this may be a more involved change.
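The pattern being suggested (acquire and release the maintenance window on the same code path) can be sketched like this. The `Orc` type here is a hypothetical in-memory stand-in for the Orchestrator maintenance API, not the real client:

```go
package main

import (
	"errors"
	"fmt"
)

// Orc is a hypothetical stand-in for Orchestrator's maintenance API;
// the real code talks to Orchestrator over HTTP.
type Orc struct{ inMaintenance bool }

func (o *Orc) BeginMaintenance(reason string) error {
	if o.inMaintenance {
		return errors.New("maintenance already active")
	}
	o.inMaintenance = true
	return nil
}

func (o *Orc) EndMaintenance() error {
	o.inMaintenance = false
	return nil
}

// repairReplicationSketch follows the review suggestion: the block that
// calls BeginMaintenance defers EndMaintenance, so the window is always
// released on the same path that acquired it, even if the repair fails
// partway through.
func repairReplicationSketch(orc *Orc) error {
	if err := orc.BeginMaintenance("vitess: repairReplication"); err != nil {
		// Another actor (e.g. a reparent) holds the window; skip the repair.
		return err
	}
	defer func() {
		if err := orc.EndMaintenance(); err != nil {
			fmt.Printf("Orchestrator EndMaintenance failed: %v\n", err)
		}
	}()
	// ... perform the replication fix-up under the shard lock here ...
	return nil
}

func main() {
	orc := &Orc{}
	fmt.Println(repairReplicationSketch(orc) == nil) // true: acquired and released
	orc.inMaintenance = true
	fmt.Println(repairReplicationSketch(orc) == nil) // false: window already held
}
```

Moving the begin/end pair into the reparent commands themselves (in the wrangler) would follow the same shape, just at a different call site.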

Contributor Author

I agree, that sounds a lot better. I'll do that refactoring in a quick followup to this PR.

another context (planned reparent / emergency reparent).

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@sougou (Contributor) commented Jun 29, 2018

Something weird going on here. The tests are failing, but logs are truncated. Restarting the tests hasn't helped. Can you try running these locally?

@zmagg (Contributor, Author) commented Jun 29, 2018

@sougou hmm they pass locally (or fail only when the flaky tests fail). Restarting the tests one more time...

zmagg added 2 commits June 29, 2018 13:45
to acquire the shard lock, as reparenting codepaths already have the
shard lock.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
audit-recovery

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg force-pushed the if-reparenting-turn-off-repair-replication branch from 515743b to f3ebce3 on June 29, 2018 23:59
@zmagg (Contributor, Author) commented Jul 1, 2018

TravisCI was failing accurately: it was the reparent integration tests that caught a bug. Fixed the bug in a4e7434.

In f3ebce3 I fixed another bug we caught while doing some internal splitting and reparenting with this build out in dev.

I did some more manual testing in our dev environment of this patch, which I describe below. I'd also be down to add some integration tests with some guidance, as there are a few moving parts to test.

Generally though, @sougou do the changes look good to you and are you 👍 on merging?

Some manual testing I did in dev:

  1. PlannedReparenting / EmergencyReparenting both work.
  2. Tested that external reparenting via Orchestrator works as expected when the master tablet hard-dies.
  3. @ameetkotian suggested that we test what happens if a master tablet dies and comes back. I tested this two ways, one by just rebooting the master tablet and the other by restarting the tablet/mysqld services manually.

In both cases, Orchestrator successfully performed an external reparent, and the restarted tablet came back online not as a co-master. 🎉

In the reboot case, the vttablet service came back online but the mysqld service did not, and the tablet's TabletType was RESTORE.

In the restarting-services case, in Vitess's view of the world the tablet came back online as a replica with replication stopped. In Orchestrator's view, it came back online as a master in its own shard without any replicas attached to it. A ReparentTablet API call (to let the restarted tablet know about the new master) followed by a ResetSlave to start back up replication puts the tablet back in the original shard as a replica with both the Vitess/Orchestrator views of the world in sync.

This is out of scope for fixing this issue and this PR, but ideally if a master tablet is restarted it could come back online as a running replica in the same shard without human intervention. I know there are good reasons not to do this in some setups (as @shlomi-noach says here: openark/orchestrator#190 (comment) ) but when running in semi-sync mode, it seems like it's not as dangerous to bring a restarted/rediscovered tablet back online as a replica if done carefully.

@sougou sougou merged commit 30463e2 into master Jul 9, 2018
sougou added a commit that referenced this pull request Jul 12, 2018
…tive-recovery-empty-response

Fix bug found in dev around repairReplication patch in PR #4024
@sougou sougou deleted the if-reparenting-turn-off-repair-replication branch December 8, 2018 01:50