(2.11) Don't send meta snapshot when becoming metaleader by neilalexander · Pull Request #5700 · nats-io/nats-server

neilalexander · 2024-07-25T09:05:30Z

Antithesis testing has found that late or out-of-order delivery of the meta snapshots sent on becoming metaleader, likely due to latency or thread pauses, can cause stream assignments to be reverted, which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established, so we may end up generating snapshots that omit assets we don't know about yet.

We will want to audit all uses of SendSnapshot as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that SendSnapshot usage may have been papering over.

Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to processStreamRemoval.

We've also added a new unit test TestJetStreamClusterHardKillAfterStreamAdd for a long-known issue, as well as a couple of tweaks to the ghost consumer tests to make them reliable.

Signed-off-by: Neil Twigg <neil@nats.io>
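To make the failure mode in the description concrete, here is a minimal sketch of how a stale snapshot can revert an assignment. It is not the nats-server implementation: the `assignments` type, the `applySnapshot` function, and the stream names are all hypothetical, and a snapshot is assumed to replace the receiver's state wholesale.

```go
package main

import (
	"fmt"
	"sort"
)

// assignments is a hypothetical stand-in for the meta layer's view of
// stream assignments: stream name -> assigned group.
type assignments map[string]string

// applySnapshot replaces the current state with snap wholesale and reports
// which streams that implies removing and which it implies adding.
func applySnapshot(cur, snap assignments) (next assignments, removed, added []string) {
	next = make(assignments, len(snap))
	for name, group := range snap {
		next[name] = group
		if _, ok := cur[name]; !ok {
			added = append(added, name)
		}
	}
	for name := range cur {
		if _, ok := snap[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	sort.Strings(added)
	return next, removed, added
}

func main() {
	// Snapshot generated by a new leader that had not yet learned about ORDERS.
	stale := assignments{"EVENTS": "g1"}
	// Follower state: it already applied the entry that created ORDERS.
	cur := assignments{"EVENTS": "g1", "ORDERS": "g2"}

	cur, removed, _ := applySnapshot(cur, stale)
	fmt.Println("removed:", removed) // removed: [ORDERS]

	// When the assignment is delivered again later, the asset is recreated.
	cur["ORDERS"] = "g2"
	fmt.Println("streams after replay:", len(cur))
}
```

In this toy model the late snapshot triggers the equivalent of an unexpected stream removal followed by a re-add, which is the delete-and-recreate pattern the description reports.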


Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison · 2024-10-03T20:10:27Z

server/consumer.go

 			go func() {
 				const (
-					startInterval = 30 * time.Second
+					startInterval = 5 * time.Second


Why is this being changed in a PR about meta snapshots?

Sending the snapshot was forcing the ghost consumer tests to pass for the wrong reason.

When we removed it, the test started to flake unless it was given over a minute to run. Reducing the initial time to clean up those ghost consumers de-flaked those tests without having to extend the test to such a long runtime.

Yes, but that can dramatically impact production systems with tens or hundreds of thousands of these. Hence the 30s rather than something shorter.

If need be, make a var off of a const, set the var to what you need in the test, and reset it at the end of the test.

OK will do tomorrow.

In fact have just done now, as it was a quick change.

Just removed the accidental `const ()` that was left over too. Editor language server got me.

Signed-off-by: Neil Twigg <neil@nats.io> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison

LGTM


We could have an empty apply queue length, but have stored uncommitted entries. If we then call `SendSnapshot` when becoming consumer leader we would be reverting back to previous state. This was also an issue for meta leader changes, which was fixed in #5700. This PR fixes it for consumer leader changes. Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
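The condition described in that commit message can be illustrated with a small sketch. The `node` struct and its fields are hypothetical simplifications, not the server's Raft types; the point is only that an empty apply queue does not imply every stored entry has been committed.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a Raft participant.
type node struct {
	stored  uint64 // index of the last entry written to the log
	commit  uint64 // index of the last committed entry
	applied uint64 // index of the last entry applied to the state machine
	queued  int    // committed entries waiting in the apply queue
}

// applyQueueEmpty reports whether nothing is waiting to be applied.
func (n node) applyQueueEmpty() bool { return n.queued == 0 }

// hasUncommitted reports whether the log holds entries past the commit index.
func (n node) hasUncommitted() bool { return n.stored > n.commit }

func main() {
	// Everything committed has been applied, yet two entries are stored
	// but not committed. A snapshot taken now reflects index 10 only.
	n := node{stored: 12, commit: 10, applied: 10, queued: 0}
	fmt.Println(n.applyQueueEmpty(), n.hasUncommitted()) // true true
}
```

Under these assumptions, a snapshot sent at leader election would describe state as of index 10 and could revert whatever entries 11 and 12 go on to establish.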


Includes the following:

- #5661
- #5666
- #5671
- #5344
- #5684
- #5689
- #5691
- #5714
- #5717
- #5707
- #5792
- #5912
- #5957
- #5700
- #5975
- #5991
- #5987
- #6027
- #6038
- #6053
- #5848
- #6055
- #6056
- #6060
- #6061
- #6072
- #5832
- #6073
- #6107

Signed-off-by: Neil Twigg <neil@nats.io>

neilalexander requested a review from a team as a code owner July 25, 2024 09:05

neilalexander marked this pull request as draft July 25, 2024 11:36

neilalexander force-pushed the neil/jsmetasnap branch from c6cf93d to c20906a on July 30, 2024 15:13

neilalexander force-pushed the neil/jsmetasnap branch from c20906a to 5549baf on September 13, 2024 12:59

neilalexander force-pushed the neil/jsmetasnap branch from 5549baf to e576cf1 on October 2, 2024 09:16

Test hard kill after stream add should not remove stream

82c1371

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen force-pushed the neil/jsmetasnap branch from 0858231 to 9ef2ed2 on October 2, 2024 12:51

neilalexander marked this pull request as ready for review October 2, 2024 13:46

MauriceVanVeen mentioned this pull request Oct 2, 2024

Correct ae.commit on recovery to equal call to applyCommit(index) #5946

Closed

derekcollison reviewed Oct 3, 2024


neilalexander force-pushed the neil/jsmetasnap branch from b45f105 to 9109439 on October 3, 2024 20:51

derekcollison self-requested a review October 3, 2024 20:54

Factor out consumer cleanup times to deflake orphaned consumer tests

03ed9c1

Signed-off-by: Neil Twigg <neil@nats.io> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander force-pushed the neil/jsmetasnap branch from 9109439 to 03ed9c1 on October 3, 2024 20:56

derekcollison approved these changes Oct 3, 2024


derekcollison merged commit acbca0f into main Oct 3, 2024

derekcollison deleted the neil/jsmetasnap branch October 3, 2024 21:40

MauriceVanVeen mentioned this pull request Nov 20, 2024

[FIXED] Don't SendSnapshot on becoming consumer leader #6151

Merged

neilalexander mentioned this pull request Nov 25, 2024

Cherry-picks for 2.10.23-RC.5 #6171

Merged

MauriceVanVeen mentioned this pull request Nov 7, 2025

JetStream stream lost when processes are killed and restarted [v2.10.20] #6888

Closed

(2.11) Don't send meta snapshot when becoming metaleader #5700
derekcollison merged 3 commits into main from neil/jsmetasnap
3 participants