NRG: Replay snapshot upon timeout instead of reset by MauriceVanVeen · Pull Request #7293 · nats-io/nats-server

MauriceVanVeen · 2025-09-10T13:38:49Z

mset.resetClusteredState can be the source of many issues due to race conditions. When something has already gone very wrong with replication, it is a reasonable last-ditch way to try to get us out of that situation. However, we should never reset the whole state just because a stream snapshot timed out, whether due to having no leader or exceeding the retry limit.

This PR changes this so that we "just" replay the snapshot, which allows us to re-send entries into the apply queue afterward. Running under Antithesis has shown this to resolve many health-related issues such as "node skew", where the Raft node for the stream/consumer assignment differs from the one actually being used by the stream/consumer.
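
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the error values, the `recoveryAction` type, and `decideRecovery` are hypothetical names invented for the sketch, and treating every other error as a reason to reset is an assumption made only to keep the example small.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch; not the server's own values.
var (
	errNoLeader        = errors.New("snapshot catchup: no leader")
	errTooManyRetries  = errors.New("snapshot catchup: too many retries")
	errCorruptSnapshot = errors.New("snapshot catchup: corrupt snapshot")
)

// recoveryAction is a hypothetical label for what to do after a failure.
type recoveryAction int

const (
	actionNone recoveryAction = iota
	actionReplaySnapshot
	actionResetState
)

func (a recoveryAction) String() string {
	switch a {
	case actionNone:
		return "none"
	case actionReplaySnapshot:
		return "replay snapshot"
	case actionResetState:
		return "reset clustered state"
	default:
		return "unknown"
	}
}

// decideRecovery maps a snapshot-processing error to a recovery action.
// Timeouts caused by a missing leader or exhausted retries lead to a
// replay of the snapshot rather than a full reset.
func decideRecovery(err error) recoveryAction {
	switch {
	case err == nil:
		return actionNone
	case errors.Is(err, errNoLeader), errors.Is(err, errTooManyRetries):
		return actionReplaySnapshot
	default:
		// Assumption for this sketch only: anything else is treated as the
		// "something has gone very wrong" case and falls back to a reset.
		return actionResetState
	}
}

func main() {
	for _, err := range []error{nil, errNoLeader, errTooManyRetries, errCorruptSnapshot} {
		fmt.Printf("%v -> %s\n", err, decideRecovery(err))
	}
}
```

Using `errors.Is` rather than `==` means the same decision is reached when a caller wraps the error with additional context.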

Closely related to #7249: this was very likely part of the reason the monitor goroutine was not being shut down, hypothetically due to the node skew issue described above.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen · 2025-09-10T13:39:34Z

Ready for review already. Marked as draft to do a couple more runs in Antithesis, but primarily because 2.12.0 is being released very soon and this should not be included last-minute.

wallyqs · 2025-09-10T15:53:57Z

label that for v2.12.1 for now?

derekcollison · 2025-09-10T15:54:35Z

server/jetstream_cluster.go

 	stype, tierName, replicas := mset.cfg.Storage, mset.tier, mset.cfg.Replicas
 	mset.mu.RUnlock()

+	assert.Unreachable("Reset clustered state", map[string]any{


Really do not like these :/

MauriceVanVeen · 2025-09-11

I understand, but there's no reason not to include them in the code base.

These are cases that should be unreachable; if we reach this code, something has already gone terribly wrong, and we need to find out why and fix that bug. In an ideal world mset.resetClusteredState can be removed entirely.

Even if these asserts are called, they are no-ops in production, so there is no downside to having one called when something else has already gone wrong to get us here in the first place.

More importantly, these asserts are already finding bugs: #7297
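
The "no-op in production" property discussed above can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the Antithesis SDK and not the server's code: `reporter`, `unreachable`, and `resetState` are hypothetical names, and the nil-hook approach is just one simple way to get an assertion that costs nothing unless a test harness installs a handler.

```go
package main

import "fmt"

// reporter is a hypothetical hook. It stays nil in production and is only
// set by a test harness that wants to observe unreachable code paths.
var reporter func(msg string, details map[string]any)

// unreachable records that a supposedly unreachable path was hit.
// With no reporter installed it does nothing.
func unreachable(msg string, details map[string]any) {
	if reporter == nil {
		return // production: no-op
	}
	reporter(msg, details)
}

// resetState is a hypothetical stand-in for a last-resort recovery path.
// The assertion does not change its behavior; it only reports the call.
func resetState(stream string) string {
	unreachable("Reset clustered state", map[string]any{"stream": stream})
	return "reset " + stream
}

func main() {
	// No reporter installed: the assertion is silent.
	fmt.Println(resetState("ORDERS"))

	// A test harness installs a reporter and sees the same call.
	var seen []string
	reporter = func(msg string, details map[string]any) {
		seen = append(seen, fmt.Sprintf("%s (stream=%v)", msg, details["stream"]))
	}
	fmt.Println(resetState("ORDERS"))
	fmt.Println(seen)
}
```

The recovery path returns the same result in both cases, which is the point made in the comment: the assertion only adds visibility when something is listening.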

server/raft.go

MauriceVanVeen · 2025-09-11T08:09:15Z

> label that for v2.12.1 for now?

Not needed; this can be backported to 2.11 as well. We're just waiting for the code freeze to clear.

neilalexander

LGTM

Includes the following: #7290, #7295, #7291, #7287, #7299, #7300, #7297, #7303, #7304, #7305, #7309, #7307, #7320, #7337, #7344, #7345, #7348, #7349, #7350, #7357, #7356, #7358, #7367, #7293
Signed-off-by: Neil Twigg <neil@nats.io>

Includes the following: #7337, #7342, #7344, #7345, #7347, #7346, #7348, #7349, #7350, #7357, #7356, #7358, #7359, #7366, #7367, #7293, #7368
Signed-off-by: Neil Twigg <neil@nats.io>

derekcollison reviewed Sep 10, 2025

NRG: Replay snapshot upon timeout instead of reset

2d5385f

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen force-pushed the maurice/stream-no-reset branch from 85530c6 to 2d5385f Compare September 29, 2025 08:27

MauriceVanVeen marked this pull request as ready for review September 29, 2025 08:43

MauriceVanVeen requested a review from a team as a code owner September 29, 2025 08:43

neilalexander approved these changes Sep 29, 2025

neilalexander merged commit 703c823 into main Sep 29, 2025
48 checks passed

neilalexander deleted the maurice/stream-no-reset branch September 29, 2025 09:00

neilalexander mentioned this pull request Sep 29, 2025

Cherry-picks for v2.11.10-RC.1 #7370

Merged

neilalexander mentioned this pull request Sep 29, 2025

Cherry-picks for v2.12.1-RC.1 #7372

Merged
