(2.11) Don't send meta snapshot when becoming metaleader #5700
derekcollison merged 3 commits into main
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity is fully established so we may end up generating snapshots that don't include assets we don't know about yet. We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over. Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`. Signed-off-by: Neil Twigg <neil@nats.io>
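The failure mode described in this commit message can be illustrated with a toy model. This is a minimal sketch and not the nats-server implementation: the `snapshot` and `follower` types, their methods, and the stream names are all hypothetical. It assumes a snapshot carries the sender's full set of stream names and that installing it replaces the receiver's state, so a snapshot delivered late removes anything created after it was taken.

```go
package main

import "fmt"

// snapshot is a toy full-state snapshot: every stream name the sender
// knew about at the moment the snapshot was taken. (Hypothetical type.)
type snapshot struct {
	streams []string
}

// follower is a toy model of one node's view of stream assignments.
type follower struct {
	streams map[string]bool
	removed []string // streams dropped while installing a snapshot
}

func newFollower() *follower {
	return &follower{streams: map[string]bool{}}
}

// addStream models a stream assignment arriving through the normal log.
func (f *follower) addStream(name string) {
	f.streams[name] = true
}

// applySnapshot replaces local state with the snapshot contents and
// records every stream that disappears as a result.
func (f *follower) applySnapshot(s snapshot) {
	next := make(map[string]bool, len(s.streams))
	for _, name := range s.streams {
		next[name] = true
	}
	for name := range f.streams {
		if !next[name] {
			f.removed = append(f.removed, name)
		}
	}
	f.streams = next
}

func main() {
	f := newFollower()
	f.addStream("ORDERS")

	// Snapshot taken before EVENTS exists.
	stale := snapshot{streams: []string{"ORDERS"}}

	f.addStream("EVENTS")

	// The snapshot is delivered late, after EVENTS was assigned.
	f.applySnapshot(stale)
	fmt.Println("removed by stale snapshot:", f.removed)
}
```

In this model the late snapshot removes `EVENTS` even though nothing ever asked for its deletion, which is the kind of unexpected removal the commit message attributes to out-of-order snapshot delivery.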
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
server/consumer.go
  go func() {
      const (
-         startInterval = 30 * time.Second
+         startInterval = 5 * time.Second
Why is this being changed in a PR about meta snapshots?
Sending the snapshot was forcing the ghost consumers tests to pass for the wrong reason.
When we removed it, the test started to flake unless it was given over a minute to run. Reducing the initial time to clean up those ghost consumers de-flaked those tests without having to extend the test to such a long runtime.
Yes, but that can dramatically impact production systems with tens or hundreds of thousands of these, hence 30s rather than something shorter.
If need be, make a var initialized from a const, set the var to what you need in the test, and reset it at the end of the test.
OK will do tomorrow.
In fact have just done now, as it was a quick change.
Just removed the accidental const () that was left over too. Editor language server got me.
Signed-off-by: Neil Twigg <neil@nats.io> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established so we may end up generating snapshots that don't include assets we don't know about yet. We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over. Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`. We've also added a new unit test `TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable. Signed-off-by: Neil Twigg <neil@nats.io> --------- Signed-off-by: Neil Twigg <neil@nats.io> Signed-off-by: Maurice van Veen <github@mauricevanveen.com> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
We could have an empty apply queue length, but have stored uncommitted entries. If we then call `SendSnapshot` when becoming consumer leader we would be reverting back to previous state. This was also an issue for meta leader changes, which was fixed in #5700. This PR fixes it for consumer leader changes. Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
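To make the condition in this commit message concrete, here is a toy model of a node whose apply queue is empty while its log still holds an uncommitted entry. It is only a sketch under simplifying assumptions: the `node` type, its fields, and its methods are hypothetical and do not mirror the nats-server Raft implementation.

```go
package main

import "fmt"

// node is a toy Raft-like node. All fields are illustrative.
type node struct {
	log     []string // stored entries, committed or not
	commit  int      // number of entries known to be committed
	applied int      // number of entries applied to the state machine
}

// applyQueueLen is the number of committed entries still waiting to be applied.
func (n *node) applyQueueLen() int {
	return n.commit - n.applied
}

// uncommitted is the number of stored entries that are not yet committed.
func (n *node) uncommitted() int {
	return len(n.log) - n.commit
}

// snapshotState returns a copy of the state a snapshot would capture:
// only the applied prefix of the log.
func (n *node) snapshotState() []string {
	return append([]string(nil), n.log[:n.applied]...)
}

func main() {
	n := &node{
		log:     []string{"ack 1", "ack 2", "ack 3"},
		commit:  2,
		applied: 2,
	}
	fmt.Println("apply queue length:", n.applyQueueLen())
	fmt.Println("stored uncommitted entries:", n.uncommitted())
	fmt.Println("snapshot would contain:", n.snapshotState())
}
```

Here the apply queue length is zero, yet a snapshot built from the applied state omits the stored `ack 3` entry. In this model, a peer that installs such a snapshot would go back to the state before that entry.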
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established so we may end up generating snapshots that don't include assets we don't know about yet.
We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over.
Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`.
We've also added a new unit test `TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable.
Signed-off-by: Neil Twigg <neil@nats.io>