Skip to content

NRG: Replay snapshot upon timeout instead of reset#7293

Merged
neilalexander merged 1 commit intomainfrom
maurice/stream-no-reset
Sep 29, 2025
Merged

NRG: Replay snapshot upon timeout instead of reset#7293
neilalexander merged 1 commit intomainfrom
maurice/stream-no-reset

Conversation

@MauriceVanVeen
Copy link
Copy Markdown
Member

mset.resetClusteredState can be the source of many issues due to race conditions. When something has already gone very wrong with replication then it's a good last effort way to try and get us out of this situation. However, we should never be resetting the whole state as a result of a stream snapshot timing out due to no leader or exceeding retries.

This PR changes this such that we "just" replay the snapshot and allow us to re-send entries into the apply queue afterward. When running under Antithesis this has shown to resolve many health-related issues such as "node skew" where the Raft node for the stream/consumer assignment is different than what is actually being used for the stream/consumer.

Closely related to #7249, this was very likely to be part of the cause of the monitor goroutine not being shutdown. Hypothetically due to the above node skew issue.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

@MauriceVanVeen
Copy link
Copy Markdown
Member Author

MauriceVanVeen commented Sep 10, 2025

Ready for review already. Marked as draft to do a couple more runs in Antithesis, but primarily due to releasing 2.12.0 very soon and not including this last-minute.

@wallyqs
Copy link
Copy Markdown
Member

wallyqs commented Sep 10, 2025

label that for v2.12.1 for now?

stype, tierName, replicas := mset.cfg.Storage, mset.tier, mset.cfg.Replicas
mset.mu.RUnlock()

assert.Unreachable("Reset clustered state", map[string]any{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really do not like these :/

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but there's no reason to not include them in the code base.

These are cases that should be unreachable, if we reach this code something has already gone terribly wrong. And we need to find out why and fix that bug. In an ideal world mset.resetClusteredState can be removed entirely.

Even if these asserts are called, they are noop in production. So there's no downside to even having it be called in the case where something else has gone wrong already to get us to be here in the first place.

More importantly, these asserts are already finding bugs: #7297

@MauriceVanVeen
Copy link
Copy Markdown
Member Author

label that for v2.12.1 for now?

Not needed, this can be backported into 2.11 as well. We're just waiting for the code freeze to clear.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review September 29, 2025 08:43
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner September 29, 2025 08:43
Copy link
Copy Markdown
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@neilalexander neilalexander merged commit 703c823 into main Sep 29, 2025
48 checks passed
@neilalexander neilalexander deleted the maurice/stream-no-reset branch September 29, 2025 09:00
neilalexander added a commit that referenced this pull request Sep 29, 2025
Includes the following:

- #7290
- #7295
- #7291
- #7287
- #7299
- #7300
- #7297
- #7303
- #7304
- #7305
- #7309
- #7307
- #7320
- #7337
- #7344
- #7345
- #7348
- #7349
- #7350
- #7357
- #7356
- #7358
- #7367
- #7293

Signed-off-by: Neil Twigg <neil@nats.io>
neilalexander added a commit that referenced this pull request Sep 29, 2025
Includes the following:

- #7337
- #7342
- #7344
- #7345
- #7347
- #7346
- #7348
- #7349
- #7350
- #7357
- #7356
- #7358
- #7359
- #7366
- #7367
- #7293
- #7368

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants