[IMPROVED] Speed up a mirror or source consumer's resync across leafnode connections. #6981
derekcollison merged 2 commits into main, Jun 15, 2025
derekcollison added a commit that referenced this pull request on Sep 6, 2025
…are re-established. We previously improved this with PR #6981, but that approach was too rigid. It expected the LN to have JS enabled and to be in the same domain. The test also simulated a long time for the link to be down and manually changed the state to not in progress (si.sip). This worked for simpler setups, but if LNs were daisy chained and the GW Leafnode either did not have JS enabled or had it enabled with a different domain, the speedup would fail. Now we are much broader about the conditions to retry. I did look into checking for $JS.<DOMAIN>.API.INFO, but this was brittle and depended on timing and on doing retries or backoffs. Will revisit in the future (we do have the ability to register a callback for interest in a subject, which could be utilized). For now this works well and is simple, and the cost of being "wrong" in very complicated setups is minimal.
Signed-off-by: Derek Collison <derek@nats.io>
derekcollison added a commit that referenced this pull request on Sep 7, 2025
neilalexander pushed a commit that referenced this pull request on Sep 8, 2025
When a consumer for a source or a mirror is failing to be created, we back off creation attempts.
If the failures were due to a downed leafnode, meaning the mirror or source is across a leafnode connection, the resync could take more time than desired after the leafnode reconnects.
This improves resync time by hooking into the leafnode's reconnect logic (either via connect or async info).
Once we detect the reconnect, we search for streams that are leaders, are a mirror or are sourcing from another stream, and do not have an active sync consumer.
For each such stream we reset the consumer backoff and retry with just a small jitter backoff.
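
The selection and reset logic described above could be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `syncState`, `recordFailure`, `needsFastRetry`, `onLeafnodeReconnect`, and the backoff and jitter constants are all hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed tuning values for this sketch; the PR does not state the real ones.
const (
	baseBackoff = 2 * time.Second
	maxBackoff  = 2 * time.Minute
	maxJitter   = 200 * time.Millisecond
)

// syncState is a hypothetical stand-in for a stream's mirror/source
// consumer bookkeeping. It is not a nats-server type.
type syncState struct {
	name     string
	isLeader bool
	isMirror bool
	sources  int           // number of streams this stream sources from
	active   bool          // true when a sync consumer is already running
	fails    int           // consecutive failed creation attempts
	nextTry  time.Duration // delay before the next creation attempt
}

// recordFailure doubles the retry delay per consecutive failure, capped at maxBackoff.
func (s *syncState) recordFailure() {
	s.fails++
	d := baseBackoff
	for i := 1; i < s.fails && d < maxBackoff; i++ {
		d *= 2
	}
	if d > maxBackoff {
		d = maxBackoff
	}
	s.nextTry = d
}

// needsFastRetry applies the three conditions from the description:
// leader, mirror or sourcing stream, and no active sync consumer.
func (s *syncState) needsFastRetry() bool {
	return s.isLeader && (s.isMirror || s.sources > 0) && !s.active
}

// smallJitter returns a random delay in [0, maxJitter).
func smallJitter() time.Duration {
	return time.Duration(rand.Int63n(int64(maxJitter)))
}

// onLeafnodeReconnect resets the backoff of every qualifying stream and
// schedules a retry after a small jitter. It returns how many were reset.
func onLeafnodeReconnect(streams []*syncState) int {
	n := 0
	for _, s := range streams {
		if !s.needsFastRetry() {
			continue
		}
		s.fails = 0
		s.nextTry = smallJitter()
		n++
	}
	return n
}

func main() {
	mirror := &syncState{name: "MIRROR", isLeader: true, isMirror: true}
	follower := &syncState{name: "FOLLOWER", isMirror: true}
	// Simulate repeated failures while the leafnode link is down.
	for i := 0; i < 10; i++ {
		mirror.recordFailure()
		follower.recordFailure()
	}
	fmt.Println("delay before reconnect:", mirror.nextTry)
	reset := onLeafnodeReconnect([]*syncState{mirror, follower})
	fmt.Println("streams reset:", reset, "new delay:", mirror.nextTry)
}
```

In this sketch, only the leader that is a mirror and lacks an active sync consumer has its delay replaced by the jitter value; the follower keeps its accumulated backoff.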
Signed-off-by: Derek Collison <derek@nats.io>