topology_watcher: Allow tablets to reuse old tablet addresses. #5244

Merged
sougou merged 1 commit into vitessio:master from planetscale:topo-watcher-race on Sep 27, 2019
Member

enisoc commented Sep 27, 2019

Fixes #5229.

The unit test failed before I added the check in RemoveTablet that we're deleting the tablet we think we are.
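To illustrate the kind of check described above, here is a minimal sketch, not the actual Vitess code. The `Tablet` and `HealthCheck` types, the `byAddr` map, and the string alias are simplified assumptions made for this example. The idea is that a removal keyed by address only proceeds when the entry at that address still belongs to the tablet being removed.

```go
package main

import (
	"fmt"
	"sync"
)

// Tablet is a simplified, hypothetical stand-in for the real tablet record.
type Tablet struct {
	Alias string // unique tablet identity
	Addr  string // host:port, which a different tablet may later reuse
}

// HealthCheck tracks one tablet per address (hypothetical, simplified).
type HealthCheck struct {
	mu     sync.Mutex
	byAddr map[string]Tablet
}

// AddTablet records t as the current owner of its address.
func (hc *HealthCheck) AddTablet(t Tablet) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	hc.byAddr[t.Addr] = t
}

// RemoveTablet deletes the entry at t.Addr only if it still belongs to t.
// It reports whether anything was removed.
func (hc *HealthCheck) RemoveTablet(t Tablet) bool {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	cur, ok := hc.byAddr[t.Addr]
	if !ok || cur.Alias != t.Alias {
		// The address is unknown or has been taken over by another tablet.
		return false
	}
	delete(hc.byAddr, t.Addr)
	return true
}

func main() {
	hc := &HealthCheck{byAddr: map[string]Tablet{}}
	oldTablet := Tablet{Alias: "cell1-100", Addr: "10.0.0.1:15999"}
	newTablet := Tablet{Alias: "cell1-200", Addr: "10.0.0.1:15999"}

	hc.AddTablet(newTablet)                 // the new tablet already owns the address
	fmt.Println(hc.RemoveTablet(oldTablet)) // false: stale removal is ignored
	fmt.Println(hc.RemoveTablet(newTablet)) // true
}
```

Without the alias comparison, a late removal for the old tablet would drop the new tablet that reused the address.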

I also removed the goroutines from RemoveTablet and ReplaceTablet because it's ludicrous that there was no guarantee whatsoever of what state you'd be in at the end of loadTablets(). Let me know if there was some reason we had tried to make those asynchronous.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
hc.AddTablet(new, name)
}()
hc.deleteConn(old)
hc.AddTablet(new, name)
Collaborator

Nice, yeah, this should hopefully eliminate a huge source of threading issues.

Collaborator

tirsen commented Sep 27, 2019

I'll patch this in and will test it in our environment. It was very easy to reproduce the issue there.

Collaborator

tirsen commented Sep 27, 2019

I can confirm this solves the issue in our environment, but I'll let someone else more familiar with the code do a deeper review.

Member Author

enisoc commented Sep 27, 2019

@demmer explained that the reason RemoveTablet and ReplaceTablet were in goroutines was likely because they had seen that sometimes the attempt to close the connection inside deleteConn() would hang forever, which deadlocked the healthchecker because it held the mutex forever.

However, it seems that now we don't close connections synchronously anyway. We just cancel the Context and move on. So I believe it should be safe to run RemoveTablet synchronously now.
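A rough sketch of why cancelling a Context under the mutex is safe to do synchronously, assuming a simplified model rather than the real healthcheck code. The `conn` type, the `addConn` helper, and the `done` channel are hypothetical names introduced for this example. Calling a `context.CancelFunc` returns immediately, so the lock is never held while a connection tears down.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// conn is a hypothetical per-tablet health check connection.
type conn struct {
	cancel context.CancelFunc
	done   chan struct{} // closed when the checker goroutine exits
}

// HealthCheck is a simplified stand-in for the real healthchecker.
type HealthCheck struct {
	mu    sync.Mutex
	conns map[string]*conn
}

// addConn starts a checker goroutine that runs until its Context is cancelled.
func (hc *HealthCheck) addConn(key string) *conn {
	ctx, cancel := context.WithCancel(context.Background())
	c := &conn{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(c.done)
		<-ctx.Done() // stand-in for the check loop; slow teardown would happen here
	}()
	hc.mu.Lock()
	hc.conns[key] = c
	hc.mu.Unlock()
	return c
}

// deleteConn cancels the Context and forgets the connection.
// It never waits for the goroutine, so the mutex is held only briefly.
func (hc *HealthCheck) deleteConn(key string) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	if c, ok := hc.conns[key]; ok {
		c.cancel() // returns immediately
		delete(hc.conns, key)
	}
}

func main() {
	hc := &HealthCheck{conns: map[string]*conn{}}
	c := hc.addConn("10.0.0.1:15999")
	hc.deleteConn("10.0.0.1:15999")
	<-c.done // the goroutine exits on its own after cancellation
	fmt.Println(len(hc.conns)) // 0
}
```

Because the map is updated before `deleteConn` returns, a caller that then adds a new tablet at the same address sees a consistent state.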

Member

derekperkins left a comment

LGTM

Contributor

sougou left a comment

This may not be watertight. But it's a big improvement over what was there before.

// If it's the same tablet, something is wrong.
if topoproto.TabletAliasEqual(th.latestTabletStats.Tablet.Alias, tablet.Alias) {
hc.mu.Unlock()
log.Warningf("refusing to add duplicate tablet %v for %v: %+v", name, tablet.Alias.Cell, tablet)
Contributor

This should eventually become an acceptable state, which will allow us to make "update if needed" kinds of calls. But let the warning remain for now. It will allow us to find out if this really happens in the wild.

The other future improvement on this will be to add a timestamp that tracks when a tablet acquired the address, so we can declare the true winner, similar to how we're solving the mastership story.
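One way the timestamp idea could look, purely as a hypothetical sketch: nothing like this exists in the PR. The `Claim` type, the `winner` function, and the rule that the most recent acquisition wins are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Claim is a hypothetical record of a tablet claiming an address.
type Claim struct {
	Alias      string
	AcquiredAt time.Time // when the tablet acquired the address
}

// winner returns the claim with the later acquisition time.
// On a tie it keeps the first argument. "Latest wins" is an assumed rule.
func winner(a, b Claim) Claim {
	if b.AcquiredAt.After(a.AcquiredAt) {
		return b
	}
	return a
}

func main() {
	t0 := time.Date(2019, 9, 27, 12, 0, 0, 0, time.UTC)
	older := Claim{Alias: "cell1-100", AcquiredAt: t0}
	newer := Claim{Alias: "cell1-200", AcquiredAt: t0.Add(time.Minute)}
	fmt.Println(winner(older, newer).Alias) // cell1-200
}
```

With such a rule, the order in which add and remove events arrive would matter less, since the comparison depends only on the recorded acquisition times.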

@sougou sougou merged commit e8b05e6 into vitessio:master Sep 27, 2019
@enisoc enisoc deleted the topo-watcher-race branch September 27, 2019 17:30
Development

Successfully merging this pull request may close these issues.

vtgate intermittently loses a tablet