Turn off repairReplication if other reparenting events are going on #4024

Merged: sougou merged 6 commits into master from if-reparenting-turn-off-repair-replication on Jul 9, 2018

Turn off repairReplication if other reparenting events are going on #4024
sougou merged 6 commits intomasterfrom
if-reparenting-turn-off-repair-replication

Conversation

@zmagg (Contributor) commented Jun 13, 2018

Problem

Described in more detail in Slack: currently the tabletmanager's repairReplication races with Orchestrator's failover mechanisms, which results in a co-master situation in our topology during external reparenting events run by Orchestrator. The tabletmanager's repairReplication can also race with Vitess's internal reparenting actions.

Solution

Add support for turning off repairReplication as part of the tablet health check if other reparenting events are going on.

This implementation changes repairReplication to use the same shard lock in the topo as the PlannedReparent and EmergencyReparent operations do, eliminating the race condition between repairReplication and internal Vitess reparenting.
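The shard-lock half of this can be sketched with a toy in-memory lock manager (illustrative only; the real code takes the lock through the topo server, and the names here are made up for the sketch). The point is that repairReplication and the reparent operations now contend on the same lock, so they can no longer interleave:

```go
package main

import (
	"fmt"
	"sync"
)

// shardLocks is a toy stand-in for the topo server's shard lock,
// keyed by "keyspace/shard". Not the real Vitess API.
var shardLocks sync.Map

// lockShard blocks until the caller holds the shard's lock and
// returns the matching unlock function.
func lockShard(shard string) (unlock func()) {
	m, _ := shardLocks.LoadOrStore(shard, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	return mu.Unlock
}

func repairReplication(shard string) {
	unlock := lockShard(shard)
	defer unlock()
	fmt.Println("repairReplication holds lock on", shard)
}

func plannedReparent(shard string) {
	unlock := lockShard(shard)
	defer unlock()
	fmt.Println("PlannedReparent holds lock on", shard)
}

func main() {
	repairReplication("commerce/0")
	// Would block until the repair above released the lock.
	plannedReparent("commerce/0")
}
```

Because both paths go through the same `lockShard`, a repair can never observe (or create) a half-finished reparent.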

To minimize the race between repairReplication and Orchestrator-driven external reparenting, repairReplication first checks whether Orchestrator is actively reparenting, and if so, skips the repair.
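That check amounts to a guard in front of the repair. A minimal sketch, assuming a hypothetical `InActiveMaintenance` call standing in for the real Orchestrator HTTP API query:

```go
package main

import "fmt"

// MaintenanceChecker abstracts the Orchestrator call that reports whether
// a maintenance (reparenting) window is active for an instance. This is a
// hypothetical interface for illustration; the real code queries
// Orchestrator's HTTP API.
type MaintenanceChecker interface {
	InActiveMaintenance(instance string) (bool, error)
}

// fakeOrc is a stub used here in place of a live Orchestrator.
type fakeOrc struct{ active bool }

func (f fakeOrc) InActiveMaintenance(string) (bool, error) { return f.active, nil }

// shouldRepairReplication returns false when Orchestrator reports an
// active maintenance window, so the health check skips the replication
// repair instead of racing with an in-flight failover.
func shouldRepairReplication(orc MaintenanceChecker, instance string) bool {
	busy, err := orc.InActiveMaintenance(instance)
	if err != nil {
		// Can't tell whether a failover is running; skip this cycle.
		return false
	}
	return !busy
}

func main() {
	fmt.Println(shouldRepairReplication(fakeOrc{active: true}, "db-1"))  // false
	fmt.Println(shouldRepairReplication(fakeOrc{active: false}, "db-1")) // true
}
```

Failing closed on errors is deliberate: skipping one repair cycle is cheap, while repairing during a hidden failover is what produces co-masters.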

Before beginning any replication manipulation, take the Orchestrator-side lock via
BeginMaintenance/EndMaintenance to indicate that Orchestrator should not begin any
reparenting. There's still a race if Orchestrator begins failover before it receives the BeginMaintenance call. I'd love to hear suggestions for how to eliminate this race entirely.

As it turns out, PlannedReparenting and EmergencyReparenting already indicate to Orchestrator that they're manipulating the topology, calling out to Orchestrator's BeginMaintenance before beginning topo manipulations (i.e. https://github.com/vitessio/vitess/blob/master/go/vt/wrangler/reparent.go#L403). This helps reduce the risk of Orchestrator and Vitess-internal reparenting interfering with each other.

If we decide to go with this approach, there are a few other things I'd like to add as well:

- [ ] add more metrics about how often repairReplication is called and run
- [ ] write unit tests

@zmagg zmagg force-pushed the if-reparenting-turn-off-repair-replication branch 2 times, most recently from 37ebc32 to f6be9af on June 13, 2018 02:06
"<tablet alias>",
"Changes metadata in the topology server to acknowledge a shard master change performed by an external tool. See the Reparenting guide for more information:" +
"https://github.com/vitessio/vitess/blob/master/doc/Reparenting.md#external-reparents."},
{"PreExternallyReparentShard", commandPreExternallyReparentShard,
Contributor
I think it will be simpler to have repairReplication check with Orchestrator whether it's in the middle of a maintenance window instead. There are a few advantages:

  • Orchestrator has one less dependency.
  • You don't have to worry about the lifecycle of this state. What if Orc is not able to update this after it sets the value?

During repairReplication, lock the shard in the topo as it's locked for
reparenting events / delete events.

Also during repairReplication, check to see if Orchestrator is doing any
reparenting actively, and if so, do not repairReplication. Before
repairing replication, use the _orchestrator_ side lock of
BeginMaintenance to indicate that orchestrator should not begin any
reparenting.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg force-pushed the if-reparenting-turn-off-repair-replication branch from f6be9af to 353475b on June 20, 2018 06:01
zmagg added 2 commits June 19, 2018 23:04
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
```go
		log.Warningf("Orchestrator EndMaintenance failed: %v", err)
	}
}()
defer func() {
```
Contributor Author
This is perhaps not super clear. I'm removing the conditional `if agent.Tablet().Type == topodatapb.TabletType_MASTER` here because, in the case where the tablet was a REPLICA and we're calling into this function from repairReplication, we also want to end Orchestrator maintenance at the end of this function. Ending Orchestrator maintenance is still best-effort even in that case, as Orchestrator maintenance windows are time-bounded.

Contributor

I think this is ok for now. Ideally, whoever begins maintenance should be the one ending it (in a defer).
So, in the case of repairReplication, the block that calls beginMaintenance should defer endMaintenance.
But then, we should also move other call sites to the vtctld workflow (wrangler), where begin and end should be called as part of the reparent commands. This probably means that vtctld needs to know about Orc. So, this may be a more involved change.
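The pattern being suggested (acquire and release the maintenance window on the same code path) can be sketched like this. The `Orc` type here is a hypothetical in-memory stand-in for the Orchestrator maintenance API, not the real client:

```go
package main

import (
	"errors"
	"fmt"
)

// Orc is a hypothetical stand-in for Orchestrator's maintenance API;
// the real code talks to Orchestrator over HTTP.
type Orc struct{ inMaintenance bool }

func (o *Orc) BeginMaintenance(reason string) error {
	if o.inMaintenance {
		return errors.New("maintenance already active")
	}
	o.inMaintenance = true
	return nil
}

func (o *Orc) EndMaintenance() error {
	o.inMaintenance = false
	return nil
}

// repairReplicationSketch follows the review suggestion: the block that
// calls BeginMaintenance defers EndMaintenance, so the window is always
// released on the same path that acquired it, even if the repair fails
// partway through.
func repairReplicationSketch(orc *Orc) error {
	if err := orc.BeginMaintenance("vitess: repairReplication"); err != nil {
		// Another actor (e.g. a reparent) holds the window; skip the repair.
		return err
	}
	defer func() {
		if err := orc.EndMaintenance(); err != nil {
			fmt.Printf("Orchestrator EndMaintenance failed: %v\n", err)
		}
	}()
	// ... perform the replication fix-up under the shard lock here ...
	return nil
}

func main() {
	orc := &Orc{}
	fmt.Println(repairReplicationSketch(orc) == nil) // true: acquired and released
	orc.inMaintenance = true
	fmt.Println(repairReplicationSketch(orc) == nil) // false: window already held
}
```

Moving the begin/end pair into the reparent commands themselves (in the wrangler) would follow the same shape, just at a different call site.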

Contributor Author

I agree, that sounds a lot better. I'll do that refactoring in a quick followup to this PR.

another context (planned reparent / emergency reparent).

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@sougou (Contributor) commented Jun 29, 2018

Something weird going on here. The tests are failing, but logs are truncated. Restarting the tests hasn't helped. Can you try running these locally?

@zmagg (Contributor, Author) commented Jun 29, 2018

@sougou hmm they pass locally (or fail only when the flaky tests fail). Restarting the tests one more time...

zmagg added 2 commits June 29, 2018 13:45
to acquire the shard lock, as reparenting codepaths already have the
shard lock.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
audit-recovery

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg force-pushed the if-reparenting-turn-off-repair-replication branch from 515743b to f3ebce3 on June 29, 2018 23:59
@zmagg (Contributor, Author) commented Jul 1, 2018

TravisCI was failing accurately: it was the reparent integration tests that caught a bug. Fixed the bug in a4e7434.

In f3ebce3 I fixed another bug we caught while doing some internal splitting and reparenting with this build out in dev.

I did some more manual testing in our dev environment of this patch, which I describe below. I'd also be down to add some integration tests with some guidance, as there are a few moving parts to test.

Generally though, @sougou do the changes look good to you and are you 👍 on merging?

Some manual testing I did in dev:

  1. PlannedReparenting / EmergencyReparenting both work.
  2. Tested that external reparenting via Orchestrator works as expected when the master tablet hard-dies.
  3. @ameetkotian suggested that we test what happens if a master tablet dies and comes back. I tested this two ways, one by just rebooting the master tablet and the other by restarting the tablet/mysqld services manually.

In both cases, Orchestrator successfully performed an external reparent, and the restarted tablet came back online not as a co-master. 🎉

In the reboot case, the vttablet service came back online but the mysqld service did not, and the tablet's TabletType was RESTORE.

In the restarting-services case, in Vitess's view of the world the tablet came back online as a replica with replication stopped. In Orchestrator's view, it came back online as a master in its own shard without any replicas attached to it. A ReparentTablet API call (to let the restarted tablet know about the new master) followed by a ResetSlave to start back up replication puts the tablet back in the original shard as a replica with both the Vitess/Orchestrator views of the world in sync.

This is out of scope for fixing this issue and this PR, but ideally if a master tablet is restarted it could come back online as a running replica in the same shard without human intervention. I know there are good reasons not to do this in some setups (as @shlomi-noach says here: openark/orchestrator#190 (comment) ) but when running in semi-sync mode, it seems like it's not as dangerous to bring a restarted/rediscovered tablet back online as a replica if done carefully.

@sougou sougou merged commit 30463e2 into master Jul 9, 2018
sougou added a commit that referenced this pull request Jul 12, 2018
…tive-recovery-empty-response

Fix bug found in dev around repairReplication patch in PR #4024
@sougou sougou deleted the if-reparenting-turn-off-repair-replication branch December 8, 2018 01:50