[IMPROVED] Speed up a mirror or source consumer's resync across leafnode connections. #6981

Merged
derekcollison merged 2 commits into main from resync-faster on Jun 15, 2025

@derekcollison
Member

When creation of a consumer for a source or a mirror keeps failing, we back off further creation attempts.
If the failures were due to a downed leafnode, meaning the mirror or source is across a leafnode connection, the resync could take more time than desired after the leafnode reconnects.

This improves resync time by hooking into the leafnode's reconnect logic (either via connect or async info).
Once we detect the reconnect, we search for streams that are leaders and are mirrors or sources, and do not have an active sync consumer.

If we detect this we will reset the consumer backoff and retry with just a small jitter backoff.

Signed-off-by: Derek Collison <derek@nats.io>

…an extended downtime across a leafnode.

When creation of a consumer for a source or a mirror keeps failing, we back off further creation attempts.
If the failures were due to a downed leafnode, meaning the mirror or source is across a leafnode connection, the resync could take more time than desired after the leafnode reconnects.

This improves resync time by hooking into the leafnode's reconnect logic (either via connect or async info).
Once we detect the reconnect, we search for streams that are leaders, and are a mirror or sourcing from another stream, and do not have an active sync consumer.
If we detect this we will reset the consumer backoff and retry with just a small jitter backoff.

Signed-off-by: Derek Collison <derek@nats.io>
@derekcollison derekcollison requested a review from a team as a code owner June 15, 2025 22:12
Member

@neilalexander neilalexander left a comment

LGTM

@derekcollison derekcollison merged commit 0e2e28e into main Jun 15, 2025
90 of 92 checks passed
@derekcollison derekcollison deleted the resync-faster branch June 15, 2025 23:16
neilalexander added a commit that referenced this pull request Jun 17, 2025
Includes the following:

- #6922
- #6931
- #6933
- #6934
- #6939
- #6938
- #6940
- #6941
- #6942
- #6943
- #6945
- #6944
- #6947
- #6948
- #6949
- #6956
- #6960
- #6961
- #6951
- #6965
- #6968
- #6981
- #6983
- #6984

Signed-off-by: Neil Twigg <neil@nats.io>
derekcollison added a commit that referenced this pull request Sep 6, 2025
…are re-established.

We previously improved this with PR #6981, but that was too rigid. It expected the LN to have JS enabled and to have the same domain.
The test also simulated a long time for the link to be down and manually changed the state to not in progress (si.sip).

For simpler setups this worked, but if LNs were daisy-chained and the GW leafnode either did not have JS enabled or had it enabled with a different domain, the speedup would fail.

Now we are much more broad about the conditions to retry. I did look into checking for $JS.<DOMAIN>.API.INFO but this was brittle and depended on timing and doing retries or backoffs.
Will revisit in the future (We do have the ability to register for a callback for interest in a subject which could be utilized).
For now this works well, and is simple, and the cost of being "wrong" in very complicated setups is minimal.
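
To make the difference concrete, here is a rough sketch contrasting the earlier strict condition with a broad one. Everything here is an assumption for illustration: `leafRemote`, `strictRetry`, and `broadRetry` are hypothetical names, and the broad variant simply retries on any re-established leafnode connection, which may not match the exact conditions the server uses.

```go
package main

import "fmt"

// leafRemote is a hypothetical summary of what the directly connected
// leafnode advertises after a reconnect.
type leafRemote struct {
	jsEnabled bool
	domain    string
}

// strictRetry models the earlier behavior described above: retry only
// if the remote has JS enabled and shares our domain.
func strictRetry(r leafRemote, localDomain string) bool {
	return r.jsEnabled && r.domain == localDomain
}

// broadRetry models a permissive policy (an assumption): any leafnode
// reconnect triggers a retry. A "wrong" retry just costs one failed
// consumer-creation attempt.
func broadRetry(_ leafRemote, _ string) bool {
	return true
}

func main() {
	local := "hub"
	cases := []struct {
		label  string
		remote leafRemote
	}{
		{"same domain, JS on", leafRemote{jsEnabled: true, domain: "hub"}},
		{"daisy chain, JS off", leafRemote{jsEnabled: false}},
		{"daisy chain, other domain", leafRemote{jsEnabled: true, domain: "edge"}},
	}
	for _, c := range cases {
		fmt.Printf("%-26s strict=%-5v broad=%v\n",
			c.label, strictRetry(c.remote, local), broadRetry(c.remote, local))
	}
}
```

In the two daisy-chain cases the strict check declines to retry even though the stream being mirrored or sourced may now be reachable further along the chain.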

Signed-off-by: Derek Collison <derek@nats.io>
derekcollison added a commit that referenced this pull request Sep 7, 2025
neilalexander pushed a commit that referenced this pull request Sep 8, 2025