Skip to content

[FIXED] Infinite wait on shutdown of monitor goroutine#7249

Merged
derekcollison merged 1 commit intomainfrom
maurice/inf-monitor-wait
Sep 4, 2025
Merged

[FIXED] Infinite wait on shutdown of monitor goroutine#7249
derekcollison merged 1 commit intomainfrom
maurice/inf-monitor-wait

Conversation

@MauriceVanVeen
Copy link
Copy Markdown
Member

This PR fixes an issue where the Raft node is potentially nil for a stream or consumer, resulting in infinite waiting on the mset/o.monitorWg.Wait() to complete when stopping/deleting the stream/consumer.

This is done by introducing the mqch (monitor quit channel) for the consumer as well. The stream already had this. Also, updates other places to ensure the monitor quit channel is closed prior to calling monitorWg.Wait().

Signed-off-by: Maurice van Veen github@mauricevanveen.com

o.closed = true

// Signal to the monitor loop.
// Can't use qch here.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can for the consumer, but not for the stream?
Just copied that over from the stream, that was introduced here: de89207#diff-2f4991438bb868a8587303cde9107f83127e88ad70bd19d5c6a31c238a20c299R4916-R4918

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we close o.qch as a follower/when stepping down. Shouldn't stop the monitor routine then, so need this other o.mqch channel.
Updated to:

// Signal to the monitor loop.
// Can't use only qch here, since that's used when stepping down as a leader.

if !isShuttingDown {
// Signal to the monitor loop. If there's no Raft node,
// this will be the only way to stop the monitor goroutine.
mset.mu.Lock()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably make this a function so as to not keep repeating the code. Compiler will inline.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/inf-monitor-wait branch from 0e9a849 to 5d838c5 Compare September 2, 2025 19:38
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review September 4, 2025 09:02
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner September 4, 2025 09:02
Copy link
Copy Markdown
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 82054be into main Sep 4, 2025
89 of 92 checks passed
@derekcollison derekcollison deleted the maurice/inf-monitor-wait branch September 4, 2025 13:10
neilalexander added a commit that referenced this pull request Sep 8, 2025
Includes the following:
- #7200
- #7201
- #7202
- #7209
- #7210
- #7211
- #7213
- #7212
- #7216
- #7217
- #7230
- #7239
- #7246
- #7248
-
8241a15,
specifically delayed errors that are not JS API errors
- #7158 (not containing
2.12-specific changes)
- #7233
- #7255
- #7249
- #7259
- #7265
- #7273 (not including Go
1.25.x)
- #7258
- #7222

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Neil Twigg <neil@nats.io>
neilalexander added a commit that referenced this pull request Sep 29, 2025
`mset.resetClusteredState` can be the source of many issues due to race
conditions. When something has already gone very wrong with replication
then it's a good last effort way to try and get us out of this
situation. However, we should never be resetting the whole state as a
result of a stream snapshot timing out due to no leader or exceeding
retries.

This PR changes this such that we "just" replay the snapshot and allow
us to re-send entries into the apply queue afterward. When running under
Antithesis this has shown to resolve many health-related issues such as
"node skew" where the Raft node for the stream/consumer assignment is
different than what is actually being used for the stream/consumer.

Closely related to #7249,
this was very likely to be part of the cause of the monitor goroutine
not being shutdown. Hypothetically due to the above node skew issue.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants