[FIXED] Infinite wait on shutdown of monitor goroutine by MauriceVanVeen · Pull Request #7249 · nats-io/nats-server

MauriceVanVeen · 2025-09-02T18:09:25Z

This PR fixes an issue where the Raft node can be nil for a stream or consumer, causing `mset.monitorWg.Wait()` or `o.monitorWg.Wait()` to block forever when the stream or consumer is stopped or deleted.

The fix introduces an `mqch` (monitor quit channel) for the consumer as well; the stream already had one. It also updates the other call sites so that the monitor quit channel is closed before `monitorWg.Wait()` is called.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison · 2025-09-02T18:38:32Z

server/consumer.go

 	o.closed = true

+	// Signal to the monitor loop.
+	// Can't use qch here.


Why can't we?

Maybe we can for the consumer, but not for the stream?
I just copied that over from the stream, where it was introduced here: de89207#diff-2f4991438bb868a8587303cde9107f83127e88ad70bd19d5c6a31c238a20c299R4916-R4918

Turns out we close o.qch as a follower or when stepping down. That shouldn't stop the monitor goroutine, so we need the separate o.mqch channel.
Updated to:

// Signal to the monitor loop.
// Can't use only qch here, since that's used when stepping down as a leader.
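The distinction between the two channels can be sketched as follows. This is an illustrative model under stated assumptions, not the server's implementation: `work`, `processed`, `stepDown`, and `demo` are hypothetical names, and the monitor loop here only counts items. It shows that closing `qch` on step-down leaves the monitor goroutine running, while closing `mqch` on stop ends it.

```go
package main

import (
	"fmt"
	"sync"
)

// consumer is a hypothetical model with both quit channels.
type consumer struct {
	mu        sync.Mutex
	qch       chan struct{} // closed on step-down and on stop
	mqch      chan struct{} // closed only on stop
	work      chan int      // hypothetical input to the monitor loop
	processed int           // written only by the monitor goroutine
	monitorWg sync.WaitGroup
}

func newConsumer() *consumer {
	return &consumer{
		qch:  make(chan struct{}),
		mqch: make(chan struct{}),
		work: make(chan int),
	}
}

// startMonitor runs a loop that must survive leadership changes,
// so it watches mqch and deliberately ignores qch.
func (o *consumer) startMonitor() {
	mqch := o.mqch
	o.monitorWg.Add(1)
	go func() {
		defer o.monitorWg.Done()
		for {
			select {
			case <-o.work:
				o.processed++
			case <-mqch:
				return
			}
		}
	}()
}

// stepDown closes qch; leader-only loops would select on it and exit.
func (o *consumer) stepDown() {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.qch != nil {
		close(o.qch)
		o.qch = nil
	}
}

// stop signals both channels, then waits for the monitor to finish.
func (o *consumer) stop() {
	o.stepDown()
	o.mu.Lock()
	if o.mqch != nil {
		close(o.mqch)
		o.mqch = nil
	}
	o.mu.Unlock()
	o.monitorWg.Wait()
}

// demo returns how many work items the monitor handled.
func demo() int {
	o := newConsumer()
	o.startMonitor()
	o.work <- 1
	o.stepDown()
	o.work <- 2 // still accepted: the monitor survived the step-down
	o.stop()
	return o.processed // safe to read after Wait
}

func main() {
	fmt.Println("processed:", demo())
}
```

If the monitor loop selected on `qch` instead, the second send in `demo` would block forever, because the loop would already have exited at step-down.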

derekcollison · 2025-09-02T18:39:25Z

server/jetstream_cluster.go

 	if !isShuttingDown {
+		// Signal to the monitor loop. If there's no Raft node,
+		// this will be the only way to stop the monitor goroutine.
+		mset.mu.Lock()


Probably make this a function so as to not keep repeating the code. Compiler will inline.
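A helper along the lines suggested here might look like the sketch below. The names `signalMonitorQuit` and `monitorQuitC` are assumptions for illustration, as is the reduced `stream` type; the PR excerpt does not show the helper that was actually added. The point of the sketch is that the nil check under the lock makes the close idempotent, so every shutdown path can call it before `monitorWg.Wait()` without risking a double-close panic.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a hypothetical, reduced stand-in for the server's stream type.
type stream struct {
	mu   sync.Mutex
	mqch chan struct{} // monitor quit channel
}

// monitorQuitC returns the current monitor quit channel (may be nil).
func (mset *stream) monitorQuitC() <-chan struct{} {
	mset.mu.Lock()
	defer mset.mu.Unlock()
	return mset.mqch
}

// signalMonitorQuit closes the monitor quit channel at most once.
// Later calls are no-ops because the field is set to nil after closing.
func (mset *stream) signalMonitorQuit() {
	mset.mu.Lock()
	defer mset.mu.Unlock()
	if mset.mqch != nil {
		close(mset.mqch)
		mset.mqch = nil
	}
}

func main() {
	mset := &stream{mqch: make(chan struct{})}
	ch := mset.monitorQuitC()

	mset.signalMonitorQuit()
	mset.signalMonitorQuit() // second call must not panic

	select {
	case <-ch:
		fmt.Println("monitor quit channel is closed")
	default:
		fmt.Println("monitor quit channel is still open")
	}
}
```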

server/jetstream_cluster.go

server/stream.go

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison

LGTM

Includes the following:

- #7200
- #7201
- #7202
- #7209
- #7210
- #7211
- #7213
- #7212
- #7216
- #7217
- #7230
- #7239
- #7246
- #7248
- 8241a15, specifically delayed errors that are not JS API errors
- #7158 (not containing 2.12-specific changes)
- #7233
- #7255
- #7249
- #7259
- #7265
- #7273 (not including Go 1.25.x)
- #7258
- #7222

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Neil Twigg <neil@nats.io>

`mset.resetClusteredState` can be the source of many issues due to race conditions. When something has already gone very wrong with replication, it is a good last-effort way to try to get us out of that situation. However, we should never reset the whole state because a stream snapshot timed out due to no leader or exceeded its retries.

This PR changes this such that we "just" replay the snapshot and allow ourselves to re-send entries into the apply queue afterward.

When running under Antithesis, this has been shown to resolve many health-related issues such as "node skew", where the Raft node for the stream/consumer assignment is different from the one actually being used for the stream/consumer.

Closely related to #7249: this was very likely part of the cause of the monitor goroutine not being shut down, hypothetically due to the node skew issue above.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison reviewed Sep 2, 2025


[FIXED] Infinite wait on shutdown of monitor goroutine

5d838c5

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen force-pushed the maurice/inf-monitor-wait branch from 0e9a849 to 5d838c5 Compare September 2, 2025 19:38

MauriceVanVeen marked this pull request as ready for review September 4, 2025 09:02

MauriceVanVeen requested a review from a team as a code owner September 4, 2025 09:02

derekcollison approved these changes Sep 4, 2025


derekcollison merged commit 82054be into main Sep 4, 2025
89 of 92 checks passed

derekcollison deleted the maurice/inf-monitor-wait branch September 4, 2025 13:10

MauriceVanVeen mentioned this pull request Sep 4, 2025

Cherry-picks for 2.11.9-RC.3 #7252

Merged

MauriceVanVeen mentioned this pull request Sep 10, 2025

NRG: Replay snapshot upon timeout instead of reset #7293

Merged

derekcollison merged 1 commit into main from maurice/inf-monitor-wait