[FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb#6778
[FIXED] Lost sequences after hard kill in fs.removeMsgBlock of lmb#6778neilalexander merged 2 commits intonats-io:mainfrom
Conversation
Signed-off-by: souravagrawal <souravagrawal1111@gmail.com>
|
@souravagrawal Thank you for the patch. Can we put some debug logs when we delete and create new blk files please? |
|
@neilalexander Can you please take a look at the code proposed here? We have a way to test this but it requires deleting blk file manually to trigger the recovery code path. Let us know if you have a better idea to write the test. |
|
Thank you for the contribution!
|
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
neilalexander
left a comment
There was a problem hiding this comment.
LGTM, thanks for the contribution (and thanks @MauriceVanVeen for the tests)!
|
Thank you for merging this PR! I really appreciate the prompt review and also the addition of the test case — that was super helpful and thoughtful. 🙌 |
Includes the following (already cherry-picked) PRs: - #6587 - #6607 - #6612 - #6609 - #6620 - #6668 - #6674 - #6647 - #6684 - #6691 - #6697 - #6705 - #6706 - #6704 - #6714 - #6720 - #6727 - #6730 - #6726 - #6732 - #6759 - #6753 - #6685 - #6769 - #6777 - #6785 - #6786 - #6778 - #6790 - #6791 - #6798 - #6794 - #6801 Signed-off-by: Neil Twigg <neil@nats.io> Signed-off-by: Neil Twigg <neil@nats.io>
Fixes an edge case where the stream state could reset during recovery in WorkQueue (WQ) streams.
In WQ streams, messages are removed immediately upon ACK. Once a message block becomes empty, NATS deletes the corresponding block file. After deleting the last message block (lmb), NATS also creates a new block file that carries the latest sequence and timestamp from lmb.
If NATS crashes between deleting the lmb and creating the new block file, on NATS reboot the recovery logic cannot restore the stream state due to the absence of any block files. This leads to a reset to sequence 0, potentially causing inconsistencies such as the consumer sequence being higher than the stream sequence.
This change updates the logic to first create the new block file before deleting the lmb. This ensures that there is always at least one valid block on disk, allowing for consistent recovery even if NATS is terminated unexpectedly during the transition.
PR fixes : #6600
Signed-off-by: Sourabh Agrawal souravagrawal1111@gmail.com