Skip to content

[FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb#6778

Merged
neilalexander merged 2 commits intonats-io:mainfrom
souravagrawal:main
Apr 11, 2025
Merged

[FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb#6778
neilalexander merged 2 commits intonats-io:mainfrom
souravagrawal:main

Conversation

@souravagrawal
Copy link
Copy Markdown
Contributor

Fixes an edge case where the stream state could reset during recovery in WorkQueue (WQ) streams.

In WQ streams, messages are removed immediately upon ACK. Once a message block becomes empty, NATS deletes the corresponding block file. After deleting the last message block (lmb), NATS also creates a new block file that carries the latest sequence and timestamp from lmb.

If NATS crashes between deleting the lmb and creating the new block file, on NATS reboot the recovery logic cannot restore the stream state due to the absence of any block files. This leads to a reset to sequence 0, potentially causing inconsistencies such as the consumer sequence being higher than the stream sequence.

This change updates the logic to first create the new block file before deleting the lmb. This ensures that there is always at least one valid block on disk, allowing for consistent recovery even if NATS is terminated unexpectedly during the transition.

PR fixes : #6600

Signed-off-by: Sourabh Agrawal souravagrawal1111@gmail.com

Signed-off-by: souravagrawal <souravagrawal1111@gmail.com>
@souravagrawal souravagrawal marked this pull request as ready for review April 9, 2025 13:58
@souravagrawal souravagrawal requested a review from a team as a code owner April 9, 2025 13:58
@shiv4289
Copy link
Copy Markdown

@souravagrawal Thank you for the patch. Can we put some debug logs when we delete and create new blk files please?

@shiv4289
Copy link
Copy Markdown

@neilalexander Can you please take a look at the code proposed here? We have a way to test this but it requires deleting blk file manually to trigger the recovery code path. Let us know if you have a better idea to write the test.

@MauriceVanVeen
Copy link
Copy Markdown
Member

MauriceVanVeen commented Apr 11, 2025

Thank you for the contribution!

Have opened #6789, to include some tests proving the server does the right thing with this behavior change.
We included some tests in this PR.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@neilalexander neilalexander reopened this Apr 11, 2025
@MauriceVanVeen MauriceVanVeen changed the title [FIXED] Consumer Sequence ahead of Stream Sequence [FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb Apr 11, 2025
Copy link
Copy Markdown
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the contribution (and thanks @MauriceVanVeen for the tests)!

@neilalexander neilalexander merged commit 85bcba7 into nats-io:main Apr 11, 2025
22 checks passed
@sourabhaggrawal
Copy link
Copy Markdown

Thank you for merging this PR! I really appreciate the prompt review and also the addition of the test case — that was super helpful and thoughtful. 🙌

neilalexander added a commit that referenced this pull request Apr 17, 2025
Includes the following (already cherry-picked) PRs:

- #6587
- #6607
- #6612
- #6609
- #6620
- #6668
- #6674
- #6647
- #6684
- #6691
- #6697
- #6705
- #6706
- #6704
- #6714
- #6720
- #6727
- #6730
- #6726
- #6732
- #6759
- #6753
- #6685
- #6769
- #6777
- #6785
- #6786
- #6778
- #6790
- #6791
- #6798
- #6794
- #6801

Signed-off-by: Neil Twigg <neil@nats.io>

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consumer Sequence ahead of Stream Sequence

5 participants