NRG: Replay snapshot upon timeout instead of reset #7293
Ready for review already. Marked as draft to do a couple more runs in Antithesis, but primarily because 2.12.0 is being released very soon and this should not be included last-minute.
Label that for v2.12.1 for now?
stype, tierName, replicas := mset.cfg.Storage, mset.tier, mset.cfg.Replicas
mset.mu.RUnlock()

assert.Unreachable("Reset clustered state", map[string]any{
Really do not like these :/
I understand, but there's no reason to not include them in the code base.
These are cases that should be unreachable; if we reach this code, something has already gone terribly wrong, and we need to find out why and fix that bug. In an ideal world, mset.resetClusteredState can be removed entirely.
Even if these asserts are called, they are no-ops in production. So there's no downside to them being called when something else has already gone wrong to bring us here in the first place.
More importantly, these asserts are already finding bugs: #7297
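To illustrate the "no-op in production" point, here is a minimal sketch. It assumes a package-level switch (`assertsEnabled`) standing in for whatever build-time mechanism the real code uses; `unreachable`, `reported`, and this `resetClusteredState` are hypothetical names for illustration, not the server's or the SDK's actual API.

```go
package main

import "fmt"

// assertsEnabled is a hypothetical switch standing in for a build-time
// toggle. It is false in "production" and true under test tooling.
var assertsEnabled = false

// reported collects assertion hits so a test harness could inspect them.
var reported []string

// unreachable is a hypothetical stand-in for an "unreachable" assertion.
// When asserts are disabled it returns immediately and has no effect.
func unreachable(msg string, details map[string]any) {
	if !assertsEnabled {
		return // no-op in production
	}
	reported = append(reported, fmt.Sprintf("%s %v", msg, details))
}

// resetClusteredState is a hypothetical last-resort recovery path.
// Reaching it is flagged, but recovery still proceeds either way.
func resetClusteredState(stream string) bool {
	unreachable("Reset clustered state", map[string]any{"stream": stream})
	// ... the actual recovery work would happen here ...
	return true
}

func main() {
	resetClusteredState("ORDERS")
	fmt.Println("reports with asserts disabled:", len(reported))

	assertsEnabled = true
	resetClusteredState("ORDERS")
	fmt.Println("reports with asserts enabled:", len(reported))
}
```

The recovery path behaves identically in both modes; only the reporting differs, which is why leaving the assertion in place costs nothing when it is disabled.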
Not needed, this can be backported into 2.11 as well. We're just waiting for the code freeze to clear.
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
85530c6 to 2d5385f
mset.resetClusteredState can be the source of many issues due to race conditions. When something has already gone very wrong with replication, it's a good last-effort way to try to get us out of that situation. However, we should never reset the whole state as a result of a stream snapshot timing out due to no leader or exceeding retries.
This PR changes this so that we "just" replay the snapshot, which allows us to re-send entries into the apply queue afterward. When running under Antithesis, this has been shown to resolve many health-related issues such as "node skew", where the Raft node for the stream/consumer assignment is different from the one actually being used for the stream/consumer.
Closely related to #7249: this was very likely part of the cause of the monitor goroutine not being shut down, hypothetically due to the node skew issue above.
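The change in behavior described above can be sketched roughly as follows. This is an illustrative model only, not the server's code: the error values (`errNoLeader`, `errTooManyRetries`, `errCorruptState`), the `action` type, and `recoveryFor` are all hypothetical names, and treating every other error as a reason to reset is an assumption made for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for a failed stream snapshot catchup.
var (
	errNoLeader       = errors.New("catchup: no leader")
	errTooManyRetries = errors.New("catchup: too many retries")
	errCorruptState   = errors.New("catchup: corrupt state")
)

// action is the recovery step chosen after snapshot processing.
type action int

const (
	actionNone action = iota
	actionReplaySnapshot
	actionResetClusteredState
)

func (a action) String() string {
	switch a {
	case actionNone:
		return "none"
	case actionReplaySnapshot:
		return "replay snapshot"
	case actionResetClusteredState:
		return "reset clustered state"
	default:
		return "unknown"
	}
}

// recoveryFor maps a snapshot error to a recovery action. Timeouts caused
// by a missing leader or exhausted retries only replay the snapshot, so
// entries can be re-sent into the apply queue afterward. Resetting is kept
// as a last resort for other failures (an assumption in this sketch).
func recoveryFor(err error) action {
	switch {
	case err == nil:
		return actionNone
	case errors.Is(err, errNoLeader), errors.Is(err, errTooManyRetries):
		return actionReplaySnapshot
	default:
		return actionResetClusteredState
	}
}

func main() {
	for _, err := range []error{nil, errNoLeader, errTooManyRetries, errCorruptState} {
		fmt.Printf("%v -> %s\n", err, recoveryFor(err))
	}
}
```

Keeping the decision in one small function makes it straightforward to check that the two timeout cases never fall through to the reset path.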
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>