Skip to content

[FIXED] NRG: New entries from old leader reset WAL during catchup#7209

Merged
neilalexander merged 1 commit intomainfrom
maurice/nrg-catchup-reset
Aug 26, 2025
Merged

[FIXED] NRG: New entries from old leader reset WAL during catchup#7209
neilalexander merged 1 commit intomainfrom
maurice/nrg-catchup-reset

Conversation

@MauriceVanVeen
Copy link
Copy Markdown
Member

A follower could be running catchup and then receive "new" very delayed entries from an old leader. This would result in one entry being allowed through to truncate or reset the WAL, while subsequent old entries would be rejected. This would result in desync if the current leader had already committed the entry at that index.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner August 25, 2025 15:34
@MauriceVanVeen MauriceVanVeen changed the title NRG: New entries from leader reset WAL during catchup NRG: New entries from old leader reset WAL during catchup Aug 25, 2025
n.debug("Term higher than ours and we are not a follower: %v, stepping down to %q", n.State(), ae.leader)
n.stepdownLocked(ae.leader)
}
} else if lterm < n.term && sub != nil && (isNew || ae.lterm != 0) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block was moved up, only this statement was changed from:

} else if lterm < n.term && sub != nil && !(catchingUp && ae.lterm == 0) {

Copy link
Copy Markdown
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@neilalexander neilalexander merged commit af0c23a into main Aug 26, 2025
89 of 92 checks passed
@neilalexander neilalexander deleted the maurice/nrg-catchup-reset branch August 26, 2025 08:35
neilalexander added a commit that referenced this pull request Sep 1, 2025
Responding to a lower term was removed in
#7209, as it assumed the
leader would already respond with this information. However, if this
server is a candidate, it would need to be the one rejecting and sending
back the response.

This PR fixes that, and adds a test to ensure it stays that way.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
MauriceVanVeen pushed a commit that referenced this pull request Sep 3, 2025
A follower could be running catchup and then receive "new" very delayed
entries from an old leader. This would result in one entry being allowed
through to truncate or reset the WAL, while subsequent old entries would
be rejected. This would result in desync if the current leader had
already committed the entry at that index.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
MauriceVanVeen pushed a commit that referenced this pull request Sep 3, 2025
Responding to a lower term was removed in
#7209, as it assumed the
leader would already respond with this information. However, if this
server is a candidate, it would need to be the one rejecting and sending
back the response.

This PR fixes that, and adds a test to ensure it stays that way.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@wallyqs wallyqs changed the title NRG: New entries from old leader reset WAL during catchup [FIXED] NRG: New entries from old leader reset WAL during catchup Sep 3, 2025
neilalexander added a commit that referenced this pull request Sep 8, 2025
Includes the following:
- #7200
- #7201
- #7202
- #7209
- #7210
- #7211
- #7213
- #7212
- #7216
- #7217
- #7230
- #7239
- #7246
- #7248
-
8241a15,
specifically delayed errors that are not JS API errors
- #7158 (not containing
2.12-specific changes)
- #7233
- #7255
- #7249
- #7259
- #7265
- #7273 (not including Go
1.25.x)
- #7258
- #7222

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Neil Twigg <neil@nats.io>
neilalexander added a commit that referenced this pull request Oct 13, 2025
Follow-up of a small bug introduced in
#7209 that could result in
desync under niche conditions.

If a follower is behind it will request catchup from the leader. The
leader however could send more catchup data than the follower requested
by sending it up to the last known entry. The follower will then mark
catchup as finished when it has received what it knew to be all entries,
but it may receive more messages from the leader still.

If multiple catchups were running in parallel, this could result in the
follower marking the catchup as complete but then still letting an
append entry from another catchup through that could truncate the WAL
for entries that were already responded to as "can be committed".

This PR simply adds a small guard to not allow truncation from a catchup
entry if catchup has completed. Later on we'll need to look at how to
improve catchup in general, especially with fast batch ingest that will
require Raft catchup to be faster than that as well.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants