Skip to content

[IMPROVED] Resync for Mirrors and Sources on LN reconnect in complex topologies#7265

Merged
neilalexander merged 1 commit intomainfrom
resync2
Sep 7, 2025
Merged

[IMPROVED] Resync for Mirrors and Sources on LN reconnect in complex topologies#7265
neilalexander merged 1 commit intomainfrom
resync2

Conversation

@derekcollison
Copy link
Copy Markdown
Member

This improves our resync logic for source and mirrors when leafnodes are re-established.

We previously improved this with PR #6981 - but this was too rigid. It expected the LN to have JS enabled and have the same domain. The test also simulated a long time for the link to be down and manually changed the state to no in progress (si.sip).

For simpler setups this worked, but if LNs were daisy chained, and either the GW Leafnode did not have JS enabled, or if enabled it would have a different domain, meaning the speedup would fail.

Now we are much more broad about the conditions to retry. I did look into checking for $JS..API.INFO but this was brittle and depended on timing and doing retries or backoffs. Will revisit in the future (We do have the ability to register for a callback for interest in a subject which could be utilized). For now this works well, and is simple, and the cost of being "wrong" in very complicated setups is minimal.

Signed-off-by: Derek Collison derek@nats.io

@derekcollison derekcollison requested a review from a team as a code owner September 6, 2025 22:32
Copy link
Copy Markdown
Member

@MauriceVanVeen MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

})
if elapsed := time.Since(start); elapsed > 2*time.Second {
if elapsed := time.Since(start); elapsed > 3*time.Second {
t.Fatalf("Expected to resync all streams <2s but got %v", elapsed)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be updated to 3 seconds?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I upped it because I saw o ne failure running in a loop. Will update.

…are re-established.

We previously improved this with PR #6981 - but this ws too rigid. It expected the LN to have JS enabled and have the same domain.
The test also simulated a long time for the link to be down and manually changed the state to no in progress (si.sip).

For simpler setups this worked, but if LNs were daisy chained, and either the GW Leafnode did not have JS enabled, or if enabled it would have a different domain, meaning the speedup would fail.

Now we are much more broad about the conditions to retry. I did look into checking for $JS.<DOMAIN>.API.INFO but this was brittle and depended on timing and doing retries or backoffs.
Will revisit in the future (We do have the ability to register for a callback for interest in a subject which could be utilized).
For now this works well, and is simple, and the cost of being "wrong" in very complicated setups is minimal.

Signed-off-by: Derek Collison <derek@nats.io>
Copy link
Copy Markdown
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@neilalexander neilalexander merged commit d09bb5b into main Sep 7, 2025
68 of 70 checks passed
@neilalexander neilalexander deleted the resync2 branch September 7, 2025 12:51
neilalexander added a commit that referenced this pull request Sep 8, 2025
Includes the following:
- #7200
- #7201
- #7202
- #7209
- #7210
- #7211
- #7213
- #7212
- #7216
- #7217
- #7230
- #7239
- #7246
- #7248
-
8241a15,
specifically delayed errors that are not JS API errors
- #7158 (not containing
2.12-specific changes)
- #7233
- #7255
- #7249
- #7259
- #7265
- #7273 (not including Go
1.25.x)
- #7258
- #7222

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants