Skip to content

[FIXED] Duplicate Raft nodes during restart#6732

Merged
neilalexander merged 1 commit intomainfrom
maurice/duplicated-raft-nodes
Mar 28, 2025
Merged

[FIXED] Duplicate Raft nodes during restart#6732
neilalexander merged 1 commit intomainfrom
maurice/duplicated-raft-nodes

Conversation

@MauriceVanVeen
Copy link
Copy Markdown
Member

Antithesis triggered duplicate Raft nodes with the following scenario:

  • follower needs to catchup based on snapshot, but is leaderless, so results in: catchup aborted, no leader
    • call to mset.resetClusteredState(..) attempts to restart Raft node
    • mset.stop(..) call resets sa.Group.node = nil
    • goes into js.createRaftGroup(..) and waits for Raft node to stop
  • in the meantime this folllower gets a snapshot from the meta layer, which updates/checks the stream assignment
    • update notices sa.Group.node == nil
    • does another call to js.createRaftGroup(..), and also waits for Raft node to stop
  • now Raft node stops and both js.createRaftGroup(..) calls duplicate the Raft node
  • health check now notices Detected stream cluster node skew, deletes node and resets

This fix ensures we can't create duplicate Raft nodes for the same group.

It's unlikely to hit this normally due to requiring exact timing and leaderless condition on the stream while having a leader for the meta layer. But was easily reproducible in a test by just calling js.createRaftGroup in parallel.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner March 27, 2025 12:25
Copy link
Copy Markdown
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the return value just for testing?

@MauriceVanVeen
Copy link
Copy Markdown
Member Author

Is the return value just for testing?

Yes.
There seems no other way to know if there are multiple raft nodes. There is s.numRaftNodes() but that's a map only containing one entry per ID, same thing with the stream and assignment only containing one node, but the test shows we would have multiple Raft nodes for the same ID.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/duplicated-raft-nodes branch from 35f112c to ff374d0 Compare March 27, 2025 13:25
@derekcollison
Copy link
Copy Markdown
Member

Is the return value just for testing?

Yes. There seems no other way to know if there are multiple raft nodes. There is s.numRaftNodes() but that's a map only containing one entry per ID, same thing with the stream and assignment only containing one node, but the test shows we would have multiple Raft nodes for the same ID.

Hmm ok. I tend to shy away from production code changes that are only for tests. Never been a fan of that pattern, just confusing looking at the code since I could see that no one was actually using the return value.

Copy link
Copy Markdown
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@neilalexander neilalexander merged commit bfb8894 into main Mar 28, 2025
37 checks passed
@neilalexander neilalexander deleted the maurice/duplicated-raft-nodes branch March 28, 2025 10:22
@kozlovic
Copy link
Copy Markdown
Member

@MauriceVanVeen @neilalexander @derekcollison I have seen this code going from "lock the whole function" to "fine grained" and on and on. So there may be a more fundamental issue that we are trying to solve, or evolution of code make it ok to have it like that now. See previous PR where I changed back from "whole function" to "fine grained" and you could even "git blame" and go back in time to see the flip-flop in this function.

Please make sure that we are not re-introducing issues with this latest change.

@MauriceVanVeen
Copy link
Copy Markdown
Member Author

The more fundamental issue is the 'reset/restart of a stream' with mset.resetClusteredState. It updates the stream assignment async and outside of the control of the main meta apply entries loop, which inherently allows for having multiple callers to js.createRaftGroup. Doing any meta operations outside the meta loop in general results in various (ordering) issues.

But, the important part is that the test could reproduce having two Raft instances for the same underlying storage/group. And the lock should be held during all of that (except for waiting for stopping node) to not run into this issue.

This method currently has a test that checks this issue can't happen anymore. So at least seems the current locking is the way it should be. We've been looking a lot lately into Raft, stream, and consumer layer issues and will need to look at the meta layer more in the future (especially related to issues with peer-remove and stream scale-move).

neilalexander added a commit that referenced this pull request Apr 17, 2025
Includes the following (already cherry-picked) PRs:

- #6587
- #6607
- #6612
- #6609
- #6620
- #6668
- #6674
- #6647
- #6684
- #6691
- #6697
- #6705
- #6706
- #6704
- #6714
- #6720
- #6727
- #6730
- #6726
- #6732
- #6759
- #6753
- #6685
- #6769
- #6777
- #6785
- #6786
- #6778
- #6790
- #6791
- #6798
- #6794
- #6801

Signed-off-by: Neil Twigg <neil@nats.io>

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants