Turn off repairReplication if other reparenting events are going on #4024
(force-pushed 37ebc32 to f6be9af)
go/vt/vtctl/vtctl.go (outdated):

	"<tablet alias>",
	"Changes metadata in the topology server to acknowledge a shard master change performed by an external tool. See the Reparenting guide for more information:" +
	"https://github.com/vitessio/vitess/blob/master/doc/Reparenting.md#external-reparents."},
	{"PreExternallyReparentShard", commandPreExternallyReparentShard,
I think it will be simpler to have repairReplication check with Orchestrator whether it's in the middle of maintenance instead. There are a few advantages:
- Orchestrator has one less dependency.
- You don't have to worry about the lifecycle of this state. What if Orc is not able to update this after it sets the value?
During repairReplication, lock the shard in the topo, as it is locked for reparenting and delete events. Also during repairReplication, check whether Orchestrator is actively doing any reparenting, and if so, skip the repair. Before repairing replication, use the Orchestrator-side lock of BeginMaintenance to indicate that Orchestrator should not begin any reparenting.
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
(force-pushed f6be9af to 353475b)
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
		log.Warningf("Orchestrator EndMaintenance failed: %v", err)
		}
	}()
	defer func() {
This is perhaps not super clear. I'm removing the conditional if agent.Tablet().Type == topodatapb.TabletType_MASTER here because, in the case where the tablet was a REPLICA and we're calling into this function from repairReplication, we also want to end Orchestrator maintenance at the end of this function. Ending Orchestrator maintenance is still best effort even in that case, since Orchestrator maintenance windows are time bounded.
I think this is ok for now. Ideally, whoever begins maintenance should be the one ending it (in a defer).
So, in the case of repairReplication, the block that calls beginMaintenance should defer endMaintenance.
But then, we should also move the other call sites into the vtctld workflow (wrangler), where begin and end should be called as part of the reparent commands. This probably means that vtctld needs to know about Orc, so this may be a more involved change.
I agree, that sounds a lot better. I'll do that refactoring in a quick followup to this PR.
another context (planned reparent / emergency reparent). Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Something weird going on here. The tests are failing, but logs are truncated. Restarting the tests hasn't helped. Can you try running these locally?

@sougou hmm they pass locally (or fail only when the flaky tests fail). Restarting the tests one more time...
to acquire the shard lock, as reparenting codepaths already have the shardlock. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
audit-recovery Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
(force-pushed 515743b to f3ebce3)
TravisCI was failing accurately: it was the reparent integration tests which caught a bug. I fixed the bug in a4e7434. In f3ebce3 I fixed another bug we caught while doing some internal splitting & reparenting with this build out in dev. I did some more manual testing of this patch in our dev environment, which I describe below. I'd also be down to add some integration tests with some guidance, as there are a few moving parts to test. Generally though, @sougou do the changes look good to you and are you 👍 on merging?

Some manual testing I did in dev:

In both cases, Orchestrator successfully performed an external reparent, and the restarted tablet came back online not as a co-master. 🎉

In the reboot case, the vttablet service came back online but the mysqld service did not, and the tablet TabletType was in

In the restarting-services case, in Vitess's view of the world the tablet came back online as a replica with replication stopped. In Orchestrator's view, it came back online as a master in its own shard without any replicas attached to it. A

This is out of scope for fixing this issue and this PR, but ideally if a master tablet is restarted it could come back online as a running replica in the same shard without human intervention. I know there are good reasons not to do this in some setups (as @shlomi-noach says here: openark/orchestrator#190 (comment)) but when running in semi-sync mode, it seems like it's not as dangerous to bring a restarted/rediscovered tablet back online as a replica if done carefully.
…tive-recovery-empty-response Fix bug found in dev around repairReplication patch in PR #4024
Problem
Described in more detail in Slack: currently the tabletmanager's repairReplication races with Orchestrator's failover mechanisms, which results in a co-master situation in our topology during external reparenting events run by Orchestrator. The tabletmanager's repairReplication could also race with Vitess's internal reparenting actions.
Solution
Add support for turning off repairReplication as part of the tablet health check if other reparenting events are going on.
This implementation changes repairReplication to use the same shard lock in the topo as the PlannedReparent and EmergencyReparent operations do, eliminating the race condition between repairReplication and internal Vitess reparenting.
To minimize the race between repairReplication and Orchestrator driven external reparenting, during repairReplication, check to see if Orchestrator is doing any reparenting actively, and if so, do not repairReplication.
Before beginning any replication manipulation, use the Orchestrator-side lock of BeginMaintenance/EndMaintenance to indicate that Orchestrator should not begin any reparenting. There's still a race if Orchestrator begins failover before it receives the BeginMaintenance event. Would love to hear suggestions for how to eliminate this race entirely.
As it turns out, PlannedReparent and EmergencyReparent already indicate to Orchestrator that they're manipulating the topology and call out to Orchestrator's BeginMaintenance before beginning topo manipulations (i.e. https://github.com/vitessio/vitess/blob/master/go/vt/wrangler/reparent.go#L403). This helps reduce the risk of Orchestrator and Vitess-internal reparenting interfering with each other.

If we decide to go with this approach, there are a few other things I'd like to add as well:
[ ] add more metrics about how often repairReplication is called and run
[ ] write unit tests