Make Workflow SplitDiff task parallel across Shards#4126
sougou merged 4 commits into vitessio:master
Conversation
Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
Have y'all seen my MultiSplitDiff PR? It's been festering but that's what we use at Square. It's frickin' awesome.
@tirsen - If you mean #3781: yes! I've been waiting for that to merge. Do you think you'll have time soon to push it across the finish line? I think this change is a bit orthogonal to that one, as this one is meant to speed up the Workflow (i.e. to allow workflows to execute multiple splitdiff tasks at the same time). I think with your change in, we can update the horizontal_sharding_workflow to use MultiSplitDiff and things will be really good.
demmer left a comment
Overall I like the idea but I noticed a couple things that should be looked into.
}
defer agent.unlock()

if tabletType == topodatapb.TabletType_DRAINED && agent.Tablet().Type == topodatapb.TabletType_DRAINED {
Please add an explanatory comment for why this check exists.
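The guard being discussed can be sketched in isolation. This is a minimal, hypothetical simplification (the `TabletType` constants and `changeType` function below are stand-ins, not the real vitess `topodatapb`/tablet-manager API): asking to drain a tablet that is already `DRAINED` fails, which is what turns the type change into an optimistic lock between racing workers.

```go
package main

import (
	"errors"
	"fmt"
)

// TabletType is a simplified stand-in for topodatapb.TabletType.
type TabletType int

const (
	TypeRDONLY TabletType = iota
	TypeDrained
)

// errAlreadyDrained signals that another worker won the race for this tablet.
var errAlreadyDrained = errors.New("tablet is already DRAINED")

// changeType sketches the guard: requesting DRAINED on a tablet that is
// already DRAINED is treated as a failed optimistic lock, so the caller can
// go find another RDONLY tablet instead of sharing this one with a second
// worker.
func changeType(current, requested TabletType) error {
	if requested == TypeDrained && current == TypeDrained {
		return errAlreadyDrained
	}
	return nil
}

func main() {
	// First worker drains an RDONLY tablet: succeeds.
	fmt.Println(changeType(TypeRDONLY, TypeDrained))
	// Second worker races for the same tablet: gets an error and retries elsewhere.
	fmt.Println(changeType(TypeDrained, TypeDrained))
}
```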
go/vt/worker/split_diff.go
return fmt.Errorf("FindWorkerTablet() failed for %v/%v/%v: %v", sdw.cell, sdw.keyspace, ss.Shard, err)
// During a horizontal shard split, multiple workers could race to get
// a RDONLY tablet in the source shard. When this happens, one of the workers
// will fail to drain the tablet and FindWorkerTablet will return an error.
please clarify... something to the effect of "one of the workers will fail to set the DRAIN state"
if err != nil {
	return nil, err
}
Hmm... the tag comment right below this explicitly states that the intent is to put the tag on before changing the slave type, so either we need to rethink this part or at least change the comment now that it's no longer true.
Oh, good catch. I think we can solve this by refreshing tablet state after adding this tag. Let's get @sougou's input on this.
I looked at the implementation of ChangeSlaveType and it has a built-in refresh, which is probably why the tag change is done before.
The refresh that @rafael talks about is a higher level one, and I think it's for reacting to change in serving state. But since this is a lower level function that others can use independent of serving state, we should preserve the original order.
* Also fixes a cleanup bug: it should revert to the tabletType provided, not RDONLY. Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
// one of these calls to FindWorkerTablet will succeed and the rest will fail.
// The following makes sure we keep trying to find a worker tablet when this error occurs.
shortCtx, cancel := context.WithTimeout(ctx, *remoteActionsTimeout)
for {
I'm not sure this loop will help.
The context gets passed in, and the downstream calls are themselves looping waiting for the context to expire. I see one difference in waitForHealthyTablets, where a waitForHealthyTabletsTimeout gets added, but its default value is the same as remoteActionsTimeout.
Can you check the error that caused the race? Then we can identify the code path and, ideally, fix it at the point where the error originated.
Hey Sugu - I've validated in my local environment that this fixes the problem.
The loop by itself does not help. The fix is the loop plus the change where calling drain on a tablet that is already drained now returns an error.
Before this change, the race allowed two vtworkers to use the same tablet, and if I recall correctly one of the actual errors you get shows up when trying to synchronizeReplication: both workers try to StopBlp and they get errors.
sougou left a comment
It all makes sense now. Thanks for the explanation.
@rafael I think that PR is mostly waiting for @sougou and @michael-berlin to make up their minds on whether it should replace the current SplitDiff, but I think we can probably first merge it as a simple alternative to the existing SplitDiff and then look into whether we want to fully replace it.
At this point, it's mostly up to me. There are a few reasons why I've been sitting on this:
I'm thinking we should merge vreplication and implement this as a change to SplitDiff after that merge.
Description

This PR allows SplitDiff tasks in a workflow to run in parallel. The change is for the most part straightforward, but it required some coordination between the workers so they don't race when trying to choose a RDONLY tablet in the source shards. I addressed this with some optimistic locking, making ChangeType fail if a tablet is already drained. If a worker gets this error, it will try to find another RDONLY tablet.

There is room for improvement here: we could change SplitDiff to not require a destination shard as a parameter, but rather work more like SplitClone and figure out by itself what to do. I think we should do that and I'll be opening an issue for it, but for now this is a good compromise and it gets us unblocked with parallel SplitDiff tasks in Workflows.

Tests

Added unit tests for the ChangeType change and updated some of the unit tests for Workflows to reflect the new validations.