[FIXED] Duplicate Raft nodes during restart #6732
derekcollison
left a comment
Is the return value just for testing?
Yes.
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
35f112c to ff374d0
Hmm ok. I tend to shy away from production code changes that are only for tests. I've never been a fan of that pattern; it's confusing when looking at the code, since I could see that no one was actually using the return value.
@MauriceVanVeen @neilalexander @derekcollison I have seen this code go from "lock the whole function" to "fine grained" and back, again and again. So there may be a more fundamental issue that we are trying to solve, or the evolution of the code makes it OK to have it like that now. See the previous PR where I changed back from "whole function" to "fine grained"; you could even "git blame" and go back in time to see the flip-flop in this function. Please make sure that we are not re-introducing issues with this latest change.
The more fundamental issue is the 'reset/restart of a stream' with [...]. But the important part is that the test could reproduce having two Raft instances for the same underlying storage/group. The lock should be held during all of that (except while waiting for the stopping node) to not run into this issue. This method currently has a test that checks this issue can't happen anymore. So at least it seems the current locking is the way it should be. We've been looking a lot lately into Raft, stream, and consumer layer issues, and will need to look at the meta layer more in the future (especially related to issues with peer-remove and stream scale-move).
Includes the following (already cherry-picked) PRs:
- #6587
- #6607
- #6612
- #6609
- #6620
- #6668
- #6674
- #6647
- #6684
- #6691
- #6697
- #6705
- #6706
- #6704
- #6714
- #6720
- #6727
- #6730
- #6726
- #6732
- #6759
- #6753
- #6685
- #6769
- #6777
- #6785
- #6786
- #6778
- #6790
- #6791
- #6798
- #6794
- #6801

Signed-off-by: Neil Twigg <neil@nats.io>
Antithesis triggered duplicate Raft nodes with the following scenario:
- The stream reports "catchup aborted, no leader".
- mset.resetClusteredState(..) attempts to restart the Raft node.
- The mset.stop(..) call resets sa.Group.node = nil.
- One caller enters js.createRaftGroup(..) and waits for the Raft node to stop.
- Another caller sees sa.Group.node == nil, enters js.createRaftGroup(..), and also waits for the Raft node to stop.
- Both js.createRaftGroup(..) calls duplicate the Raft node.
- "Detected stream cluster node skew" follows, which deletes the node and resets.

This fix ensures we can't create duplicate Raft nodes for the same group.
This is unlikely to be hit normally, since it requires exact timing and a leaderless condition on the stream while the meta layer has a leader. But it was easily reproducible in a test by just calling js.createRaftGroup in parallel.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
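
To make the locking discussion concrete, here is a minimal Go sketch of the general pattern: hold the lock for the whole call except while waiting for the old node to stop, then re-check the state before creating anything. This is not the nats-server implementation. The `registry`, `node`, `createGroup`, and `runParallel` names and the "example-group" key are all hypothetical stand-ins, and the real js.createRaftGroup has a different signature and far more logic.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a Raft node.
type node struct {
	id      int
	stopped chan struct{} // closed once the node has fully stopped
}

// registry is a hypothetical stand-in for the state tracking one node per group.
type registry struct {
	mu      sync.Mutex
	nodes   map[string]*node
	created int // how many nodes were ever created (for the demonstration)
}

// createGroup returns the node for the group, creating it only if none exists.
// The lock is held throughout, except while waiting for the old node to stop.
func (r *registry) createGroup(group string, old *node) *node {
	r.mu.Lock()
	defer r.mu.Unlock()

	if old != nil {
		// Release the lock while waiting so other callers are not blocked.
		r.mu.Unlock()
		<-old.stopped
		r.mu.Lock()
	}

	// Re-check under the lock: another caller may have created the node
	// while this one was waiting.
	if n, ok := r.nodes[group]; ok {
		return n
	}
	r.created++
	n := &node{id: r.created, stopped: make(chan struct{})}
	r.nodes[group] = n
	return n
}

// runParallel calls createGroup from many goroutines at once and reports how
// many nodes were created and how many distinct nodes the callers received.
func runParallel(callers int) (created int, distinct int) {
	r := &registry{nodes: make(map[string]*node)}
	old := &node{stopped: make(chan struct{})}

	results := make([]*node, callers)
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = r.createGroup("example-group", old)
		}(i)
	}
	close(old.stopped) // the old node finishes stopping while callers wait
	wg.Wait()

	seen := make(map[*node]struct{})
	for _, n := range results {
		seen[n] = struct{}{}
	}
	return r.created, len(seen)
}

func main() {
	created, distinct := runParallel(16)
	fmt.Println("created:", created, "distinct:", distinct)
}
```

Without the re-check after re-acquiring the lock, every waiting caller would go on to create its own node, which corresponds to the duplicate-node scenario in the steps above. With it, all parallel callers end up sharing a single node.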