
Make Workflow SplitDiff task parallel across Shards #4126

Merged
sougou merged 4 commits into vitessio:master from tinyspeck:make-split-diff-parallel
Aug 12, 2018

Conversation

@rafael rafael (Member) commented Aug 9, 2018

Description

  • This PR changes SplitDiff tasks in a workflow to run in parallel. The change is for the most part straightforward, but it required some coordination between the workers so they don't race when trying to choose an RDONLY tablet in the source shards. I addressed this with optimistic locking, making ChangeType fail if a tablet is already drained. If a worker gets this error, it tries to find another RDONLY tablet.
  • @demmer / @sougou - We discussed that a better high-level approach would be to refactor SplitDiff to not require a destination shard as a parameter, but rather work more like SplitClone and figure out by itself what to do. I think we should do that and I'll open an issue for it, but for now I think this is a good compromise that gets us unblocked with parallel SplitDiff tasks in Workflows.
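
The optimistic-locking idea from the first bullet can be sketched as below. This is a minimal illustration, not the real vttablet code: `Tablet`, `TabletType`, and `ChangeType` here are simplified stand-ins for the topodata types and the tablet agent's ChangeType, and a plain mutex stands in for the agent's action lock.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// TabletType mirrors the two tablet types relevant here (illustrative names,
// not the real topodatapb constants).
type TabletType int

const (
	TypeRdonly TabletType = iota
	TypeDrained
)

// Tablet is a minimal stand-in for a tablet record guarded by its agent lock.
type Tablet struct {
	mu   sync.Mutex
	Type TabletType
}

// ChangeType sketches the optimistic-locking rule from this PR: a request to
// drain a tablet fails if the tablet is already DRAINED, so a racing worker
// gets a clean error instead of silently sharing the tablet.
func (t *Tablet) ChangeType(newType TabletType) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if newType == TypeDrained && t.Type == TypeDrained {
		return errors.New("tablet is already drained")
	}
	t.Type = newType
	return nil
}

func main() {
	tab := &Tablet{Type: TypeRdonly}
	fmt.Println(tab.ChangeType(TypeDrained)) // first worker wins: <nil>
	fmt.Println(tab.ChangeType(TypeDrained)) // racing worker gets an error
}
```

The first worker to drain the tablet succeeds; any racing worker gets an error and knows to go look for another RDONLY tablet.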

Tests

  • I added an integration test for the ChangeType change and updated some of the unit tests for Workflows to reflect the new validations.
  • I verified this change in my local environment.

Rafael Chacon added 3 commits August 7, 2018 17:03 (each signed off by: Rafael Chacon <rafael@slack-corp.com>)
@rafael rafael requested review from demmer and sougou August 9, 2018 15:27
@tirsen tirsen (Collaborator) commented Aug 9, 2018

Have y'all seen my MultiSplitDiff PR? It's been festering but that's what we use at Square. It's frickin' awesome.

@rafael rafael (Member, Author) commented Aug 9, 2018

@tirsen - If you mean: #3781. Yes! I've been waiting for that to merge. Do you think you'll have time soon to push it across the finish line?

I think this change is a bit orthogonal to that one, as this one speeds up the Workflow (i.e. it allows workflows to execute multiple SplitDiffs at the same time).

I think with your change in, we can update the horizontal_sharding_workflow to use MultiSplitDiff and things will be really good.

@demmer demmer (Member) left a comment


Overall I like the idea but I noticed a couple things that should be looked into.

}
defer agent.unlock()

if tabletType == topodatapb.TabletType_DRAINED && agent.Tablet().Type == topodatapb.TabletType_DRAINED {

Please add an explanatory comment for why this check exists.

return fmt.Errorf("FindWorkerTablet() failed for %v/%v/%v: %v", sdw.cell, sdw.keyspace, ss.Shard, err)
// During a horizontal shard split, multiple workers could race to get
// an RDONLY tablet in the source shard. When this happens, one of the workers
// will fail to drain the tablet and FindWorkerTablet will return an error.

please clarify... something to the effect of "one of the workers will fail to set the DRAIN state"

if err != nil {
return nil, err
}


Hmm... the tag comment right below this explicitly states that the intent is to put the tag on before changing the slave type, so either we need to rethink this part or at least change the comment now that it's no longer true.

@rafael (Member, Author):

Oh good catch. I think we can solve this by refreshing tablet state after adding this tag. Let's get @sougou input on this.

Contributor:

I looked at the implementation of ChangeSlaveType and it has a built-in refresh, which is probably why the tag change is done before.

The refresh that @rafael talks about is a higher-level one, and I think it's for reacting to a change in serving state. But since this is a lower-level function that others can use independent of serving state, we should preserve the original order.

* Also fixes a cleanup bug: it should revert to the provided tabletType, not RDONLY.

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@rafael rafael changed the title Make Workflor SplitDiff task parallel across Shards Make Workflow SplitDiff task parallel across Shards Aug 12, 2018
// one of these calls to FindWorkerTablet will succeed and the rest will fail.
// The following makes sure we keep trying to find a worker tablet when this error occurs.
shortCtx, cancel := context.WithTimeout(ctx, *remoteActionsTimeout)
for {
@sougou (Contributor):

I'm not sure this loop will help.

The context gets passed in, and the downstream calls themselves loop waiting for the context to expire. I see one difference in waitForHealthyTablets, where a waitForHealthyTabletsTimeout gets added, but its default value is the same as remoteActionsTimeout.

Can you check the error that caused the race? Then we can identify the code path and, ideally, fix it at the point where the error originated.

@rafael (Member, Author):

Hey Sugu - I've validated in my local environment that this fixes the problem.

The loop by itself does not help. The fix is the loop plus the change where calling drain on a tablet that is already drained now returns an error.

Before this change, the race allowed two vtworkers to use the same tablet, and if I recall correctly one of the actual errors showed up when trying to synchronizeReplication: both workers try to StopBlp and get errors.

@sougou sougou (Contributor) left a comment

It all makes sense now. Thanks for the explanation.

@sougou sougou merged commit a1d5a98 into vitessio:master Aug 12, 2018
@tirsen tirsen (Collaborator) commented Aug 15, 2018

@rafael I think that PR is mostly waiting for @sougou and @michael-berlin to make up their minds about whether it should replace the current SplitDiff. But I think we can probably first merge it as a simple alternative to the existing SplitDiff and then look into whether we want to fully replace it.

@sougou sougou (Contributor) commented Aug 15, 2018

At this point, it's mostly up to me. There are a few reasons why I've been sitting on this:

  1. It duplicates a lot of SplitDiff, so I've been on the fence about merging the two into one.
  2. My vreplication PR will likely break this (or vice versa).
  3. The tests were failing, and I haven't been able to figure out why in the time I've spent so far.

I'm thinking we should merge vreplication and implement this as a change to SplitDiff after that merge.

4 participants