Make Workflow SplitDiff task parallel across Shards#4126
sougou merged 4 commits into vitessio:master
Conversation
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Have y'all seen my MultiSplitDiff PR? It's been festering but that's what we use at Square. It's frickin' awesome.
@tirsen - If you mean #3781: yes! I've been waiting for that to merge. Do you think you'll have time soon to push it across the finish line? I think this change is a bit orthogonal to that one, as this one is meant to speed up the Workflow (i.e. to allow workflows to execute multiple splitdiff tasks at the same time). I think with your change in, we can update the horizontal_sharding_workflow to use MultiSplitDiff and things will be really good.
demmer left a comment
Overall I like the idea but I noticed a couple things that should be looked into.
}
defer agent.unlock()

if tabletType == topodatapb.TabletType_DRAINED && agent.Tablet().Type == topodatapb.TabletType_DRAINED {
Please add an explanatory comment for why this check exists.
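The guard being discussed can be sketched in isolation. This is a minimal, hypothetical simplification (the `TabletType` constants and `changeType` function below are stand-ins, not the real vitess `topodatapb`/tablet-manager API): asking to drain a tablet that is already `DRAINED` fails, which is what turns the type change into an optimistic lock between racing workers.

```go
package main

import (
	"errors"
	"fmt"
)

// TabletType is a simplified stand-in for topodatapb.TabletType.
type TabletType int

const (
	TypeRDONLY TabletType = iota
	TypeDrained
)

// errAlreadyDrained signals that another worker won the race for this tablet.
var errAlreadyDrained = errors.New("tablet is already DRAINED")

// changeType sketches the guard: requesting DRAINED on a tablet that is
// already DRAINED is treated as a failed optimistic lock, so the caller can
// go find another RDONLY tablet instead of sharing this one with a second
// worker.
func changeType(current, requested TabletType) error {
	if requested == TypeDrained && current == TypeDrained {
		return errAlreadyDrained
	}
	return nil
}

func main() {
	// First worker drains an RDONLY tablet: succeeds.
	fmt.Println(changeType(TypeRDONLY, TypeDrained))
	// Second worker races for the same tablet: gets an error and retries elsewhere.
	fmt.Println(changeType(TypeDrained, TypeDrained))
}
```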
go/vt/worker/split_diff.go
return fmt.Errorf("FindWorkerTablet() failed for %v/%v/%v: %v", sdw.cell, sdw.keyspace, ss.Shard, err)
// During a horizontal shard split, multiple workers could race to get
// a RDONLY tablet in the source shard. When this happens, one of the workers
// will fail to drain the tablet and FindWorkerTablet will return an error.
please clarify... something to the effect of "one of the workers will fail to set the DRAIN state"
if err != nil {
	return nil, err
}
Hmm... the tag comment right below this explicitly states that the intent is to put the tag on before changing the slave type, so either we need to rethink this part or at least change the comment now that it's no longer true.
Oh, good catch. I think we can solve this by refreshing tablet state after adding this tag. Let's get @sougou's input on this.
I looked at the implementation of ChangeSlaveType and it has a built-in refresh, which is probably why the tag change is done before.
The refresh that @rafael talks about is a higher level one, and I think it's for reacting to change in serving state. But since this is a lower level function that others can use independent of serving state, we should preserve the original order.
* Also fixes a cleanup bug: it should revert to the tabletType provided, not RDONLY. Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
// one of these calls to FindWorkerTablet will succeed and the rest will fail.
// The following makes sure we keep trying to find a worker tablet when this error occurs.
shortCtx, cancel := context.WithTimeout(ctx, *remoteActionsTimeout)
for {
I'm not sure this loop will help.
The context gets passed in, and the downstream calls are themselves looping waiting for the context to expire. I see one difference in waitForHealthyTablets, where a waitForHealthyTabletsTimeout gets added, but its default value is the same as remoteActionsTimeout.
Can you check the error that caused the race? Then we can identify the code path and, ideally, fix it at the point where the error originated.
Hey Sugu - I've validated in my local environment that this fixes the problem.
The loop by itself does not help. The fix is the loop plus the change where calling drain on a tablet that is already drained now returns an error.
Before this change, the race allowed two vtworkers to use the same tablet, and if I recall correctly one of the actual errors you get shows up when trying to synchronizeReplication: both workers try to StopBlp and they get errors.
sougou left a comment
It all makes sense now. Thanks for the explanation.
@rafael I think that PR is mostly waiting for @sougou and @michael-berlin to make up their minds on whether it should replace the current SplitDiff, but I think we can probably first merge it as a simple alternative to the existing SplitDiff and then look into whether we want to fully replace it.
At this point, it's mostly up to me. There are a few reasons why I've been sitting on this:
I'm thinking we should merge vreplication and implement this as a change to SplitDiff after that merge.
Description

This PR allows SplitDiff tasks in a workflow to run in parallel. The change is for the most part straightforward, but it required some coordination between the workers so they don't race when trying to choose a RDONLY tablet in the source shards. I addressed this with some optimistic locking, making ChangeType fail if a tablet is already drained. If a worker gets this error, it will try to find another RDONLY tablet.

There is room for improvement here: we could change SplitDiff to not require a destination shard as a parameter, but rather work more like SplitClone and figure out by itself what to do. I think we should do that and I'll be opening an issue for it, but for now this is a good compromise and it gets us unblocked with parallel SplitDiff tasks in Workflows.

Tests

Added unit tests for the ChangeType change and updated some of the unit tests for Workflows to reflect the new validations.