Skip to content

VTGR: Vitess + MySQL group replication#8387

Merged
deepthi merged 15 commits intovitessio:mainfrom
5antelope:ywu/vtgr
Jul 13, 2021
Merged

VTGR: Vitess + MySQL group replication#8387
deepthi merged 15 commits intovitessio:mainfrom
5antelope:ywu/vtgr

Conversation

@5antelope
Copy link
Copy Markdown
Member

@5antelope 5antelope commented Jun 26, 2021

Description

This adds single primary mysql group replication support for Vitess. Specifically it does the followings:

  • VTTablet needs to understand and report the lag under MySQL group replication
  • VTGR handles group bootstrap and maintain the group memberships
  • VTGR handles necessary Vitess and MySQL failover and reconcile between the two

The structure of the PR is:

directory purpose
vtgr/config The configurations for VTGR
vtgr/controller/diagnose.go VTGR diagnose the status for a Vitess shard
vtgr/controller/repair.go VTGR repair the shard based on diagnose result
vtgr/controller/group.go VTGR store MySQL group information
vtgr/controller/refresh.go VTGR refresh tablet records from topo server
vtgr/db DB client that talks to MySQL

Related Issue(s)

#8386

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
@5antelope 5antelope requested a review from deepthi June 26, 2021 02:08
@5antelope 5antelope changed the title Ywu/vtgr VTGR: Vitess + MySQL group replication Jun 26, 2021
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
@deepthi deepthi requested a review from vmg June 27, 2021 21:28
@vmg vmg marked this pull request as ready for review June 28, 2021 10:07
Comment on lines +338 to +348
go func() { c <- shard.tmc.Ping(pingCtx, instance.tablet) }()
select {
case <-pingCtx.Done():
log.Errorf("Ping abort timeout %v", *pingTabletTimeout)
return false
case err := <-c:
if err != nil {
log.Errorf("Ping error host=%v: %v", instance.instanceKey.Hostname, err)
}
return err == nil
}
Copy link
Copy Markdown
Collaborator

@vmg vmg Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad but it's actually our fault. Once #8368 is merged we should be able to fix the TMC API so that the initial dial is context-aware and we don't need this hack.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think switching to DialContext there will actually fix this issue, though. From the docs

By default, it's a non-blocking dial (the function won't wait for connections to be established, and connecting happens in the background). To make it a blocking dial, use WithBlock() dial option.

In the non-blocking case, the ctx does not act against the connection. It only controls the setup steps.

I don't think we should unilaterally use WithBlock as that will result in slowdowns in the rest of the tmc use-cases, so what I think we should do is a follow-up refactor to the tmc api (and merge both #8368 (pending review / further changes) and this PR as-is) to allow callers to specify dial options to use, something like:

tmc := tmclient.TabletManagerClient().WithDialOptions(grpc.WithBlock())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sounds about right: the comment that @narcsfz added to this method was a bit confusing, because it complains about dial/retry timeouts when (in theory) we shouldn't be seeing dial timeouts at all, since this API is not called with WithBlock and hence the actual connection will only happen during the actual call to Ping, and will be context-aware. I think what @narcsfz was seeing are the basic setup steps which by default are not controlled by context but will be if we switch to DialContext.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vmg @ajm188. I'm happy to leave a TODO here and refactor it later if that makes sense. Since it is hard to reproduce in unit test, I also need to set up the test case that can reproduce the problem to test the change out.

Signed-off-by: crowu <y.wu4515@gmail.com>
5antelope added 4 commits July 5, 2021 17:10
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
@5antelope
Copy link
Copy Markdown
Member Author

Addressed your comments ptal @systay

Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-written and well-documented contribution.
These comments are from a first pass. I will be doing another pass over the PR.

Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
@5antelope
Copy link
Copy Markdown
Member Author

There is a test failure after a lot of retries. Looks like it's related to some network issue: https://github.com/vitessio/vitess/pull/8387/checks?check_run_id=3023339855

I will try to rerun it later, but it should not affect the review cc @deepthi

Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. We can merge once feedback has been addressed and tests are passing.

5antelope added 2 commits July 8, 2021 16:50
Signed-off-by: crowu <y.wu4515@gmail.com>
Signed-off-by: crowu <y.wu4515@gmail.com>
Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@deepthi deepthi merged commit 9fb976a into vitessio:main Jul 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants