topology_watcher: Allow tablets to reuse old tablet addresses.#5244
topology_watcher: Allow tablets to reuse old tablet addresses.#5244sougou merged 1 commit intovitessio:masterfrom
Conversation
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
| hc.AddTablet(new, name) | ||
| }() | ||
| hc.deleteConn(old) | ||
| hc.AddTablet(new, name) |
There was a problem hiding this comment.
Nice yeah this should hopefully eliminate a huge source of threading issues
|
I'll patch this in and will test it in our environment. It was very easy to reproduce the issue there. |
|
I can confirm this solves the issue in our environment but I'll let someone else more familiar with the code do a deeper review. |
|
@demmer explained that the reason However, it seems that now we don't close connections synchronously anyway. We just cancel the Context and move on. So I believe it should be safe to run |
sougou
left a comment
There was a problem hiding this comment.
This may not be watertight. But it's a big improvement over what was there before.
| // If it's the same tablet, something is wrong. | ||
| if topoproto.TabletAliasEqual(th.latestTabletStats.Tablet.Alias, tablet.Alias) { | ||
| hc.mu.Unlock() | ||
| log.Warningf("refusing to add duplicate tablet %v for %v: %+v", name, tablet.Alias.Cell, tablet) |
There was a problem hiding this comment.
This should eventually become an acceptable state, which will allow us to "update if needed" kind of calls. But let the warning remain for now. It will allow us to find out if this really happens in the wild.
The other future improvement on this will be to add a timestamp that tracks when a tablet acquired the address to declare the true winner. Similar to how we're solving the mastership story.
Fixes #5229.
The unit test failed before I added the check in RemoveTablet that we're deleting the tablet we think we are.
I also removed the goroutines from
RemoveTabletandReplaceTabletbecause it's ludicrous that there was no guarantee whatsoever of what state you'll be in at the end ofloadTablets(). Let me know if there was some reason we had tried to make those asynchronous.Signed-off-by: Anthony Yeh enisoc@planetscale.com