Skip to content

Multi split diff#3781

Closed
tirsen wants to merge 1 commit intovitessio:masterfrom
tirsen:multi-split-diff
Closed

Multi split diff#3781
tirsen wants to merge 1 commit intovitessio:masterfrom
tirsen:multi-split-diff

Conversation

@tirsen
Copy link
Copy Markdown
Collaborator

@tirsen tirsen commented Mar 27, 2018

A new implementation of split diff that diffs against all the
destination servers at the same time. Since we only have to read out of
the source a single time this could give significant time savings.

Signed-off-by: Jon Tirsen jontirsen@squareup.com


This change is Reviewable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm guessing you don't want that upstreamed

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eep!

@tirsen tirsen force-pushed the multi-split-diff branch from 55cd85d to 3ea87d8 Compare March 27, 2018 13:57
@enisoc enisoc requested a review from michael-berlin March 27, 2018 17:58
@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented May 9, 2018

@michael-berlin We've been using this on the last couple of splits and it's decreased out split time quite dramatically (by as much as 70-80% on a 1:4 split). Any chance you want to start looking at this again?

@sougou
Copy link
Copy Markdown
Contributor

sougou commented May 9, 2018

YouTubers have been busy. I'm going through the backlog of pending PRs. So, I'll take a look when I get to it. The unit test failure in Travis looks real. So, that may need fixing.

@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented May 10, 2018 via email

@sougou
Copy link
Copy Markdown
Contributor

sougou commented May 13, 2018

Finally got around to reviewing this. Some high level comments:

  • There's too much duplication between multi and simple. I think this is ok only if we plan to deprecate the simple diff. I believe we should deprecate the simple diff if there are no downsides to always using multi.
  • I might have found an orthogonal issue: I don't think synchronize replication should have to do all the work it does now. It should be sufficient to stop binlog replication on the source at any point. Once it's stopped, this will result in filtered replication also pausing at the same point because there are no more events on the source. Then all we have to do is run the diff, and then resume binlog replication on the source tablet. Let me know if there is a problem with this approach.
  • I still need to find out why the tests are failing. More on this later.

Tagging @michael-berlin for possible feedback.

@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented May 28, 2018

Yeah I think this can replace the regular SplitDiff which is why I didn't put down the effort to de-duping the code. The only thing the regular SplitDiff can do that MultiSplitDiff can't is diffing a shard merge. I think it should be possible to add that though if necessary (it will complicate the code though).

Yeah you can stop the filtered replication but then you need to wait for replication to catch up in the target shards replica that we're going to use for the diff. That's what we're doing there. Maybe we can wait a fixed time or just until replication lag reaches 0?

@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented Aug 15, 2018

@sougou What say you, shall we try to get this merged? We've been using it for months and it's pretty solid.

@sougou
Copy link
Copy Markdown
Contributor

sougou commented Aug 15, 2018

Reposting this from the other PR:

There are a few reasons why I've been sitting on this:

  1. There's a lot of duplication of SplitDiff. So, I've been on the fence about making the two as one.
  2. My vreplication PR will likely break this (or vice versa).
  3. The tests were failing, and I haven't been able to find why for the time I've spent so far.

I'm thinking we should merge vreplication and implement this as a change to SplitDiff after that merge.

@michael-berlin
Copy link
Copy Markdown
Contributor

My comments on this one:

  • In general, I'm in support of this change and I agree that it's a huge win. The trick here is that the "ResultMerger" class is used to combine the stream of multiple shards into a single one such that it's easy to diff it. I've initially created this class for the rewritten "SplitClone" vtworker command and it works well except for string-based primary key columns.

    In fact, for string-based columns the code will simply throw an error: https://github.com/vitessio/vitess/blob/master/go/vt/worker/result_merger.go#L72 In that case, the user would have to use the "SplitDiff" command. (btw: We discussed how to fix this properly for all collations: Instead of using the string column in the ResultMerger implementation, add WEIGHT_STRING(col) to the columns list and compare against that. I won't have time to work on this though. Once such a fix is in place, this MultiSplitDiff could completely replace SplitDiff and LegacySplitClone could be removed as well.)

  • this CL is missing a Python integration test e.g. ideally there should be a test/resharding_multi_split_diff.py which does the same thing as test/resharding.py but uses this command instead.

@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented Sep 6, 2018

FYI I have rebased this on the new vreplication changes and done some basic testing. We're going to use this for our upcoming shard splits.

@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented Sep 20, 2018

We've been using our rebase for a few shard splits now and it's solid. Going to clean it up a bit and re-push to this PR.

A new implementation of split diff that diffs against all the
destination servers at the same time. Since we only have to read out of
the source a single time this could give significant time savings.

Signed-off-by: Jon Tirsen <jontirsen@squareup.com>
@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented Sep 21, 2018

Ok rebased squashed and pushed. This is ready to be reviewed. There are still spurious test failures though. Given how many times we've used this in production they're likely false negatives so would appreciate any pointers on how to fix them.

@systay systay mentioned this pull request Oct 17, 2018
@tirsen
Copy link
Copy Markdown
Collaborator Author

tirsen commented Oct 17, 2018

@systay is taking over this work: #4281

@tirsen tirsen closed this Oct 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants