Skip to content

[IMPROVED] Routing: reduce chances of duplicate implicit routes#5602

Merged
derekcollison merged 1 commit intomainfrom
routes_gossip
Jul 22, 2024
Merged

[IMPROVED] Routing: reduce chances of duplicate implicit routes#5602
derekcollison merged 1 commit intomainfrom
routes_gossip

Conversation

@kozlovic
Copy link
Copy Markdown
Member

This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic ivan@synadia.com

This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic kozlovic requested a review from a team as a code owner June 27, 2024 17:47
@kozlovic kozlovic requested a review from derekcollison June 27, 2024 17:48
Copy link
Copy Markdown
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 5b5857f into main Jul 22, 2024
@derekcollison derekcollison deleted the routes_gossip branch July 22, 2024 15:37
bruth added a commit that referenced this pull request Jul 30, 2024
Includes the following:

* #5602
* #5672
* #5668
* #5607
* #5687
* #5695
* #5697
* #5704
* #5706
* #5709
* #5710
* #5713
* #5719

Some PRs specifically excluded:

* #5707 — based on a 2.11 NRG
PR
* #5690 — continue to allow
Go 1.20 for 2.10.x
* Various other NRG PRs which are higher risk and destined for 2.11
instead
@neilalexander
Copy link
Copy Markdown
Member

@kozlovic We noticed that TestStressChainedSolicitWorks fails pretty reliably on our local machines now and have bisected it back to this change, would you mind taking a look?

kozlovic added a commit that referenced this pull request Aug 3, 2024
The PR#5602 solved the issue for typical seed setups, but it was
found that the test `TestStressChainedSolicitWorks` would sometimes
fail. This is a situation where we have S4->S3->S2->S1 and all
servers start at the same type. This is not a typical setup, but
still the regular gossip would allow the creation of the full mesh,
so this PR helps in those situations.

Related to #5602

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
derekcollison added a commit that referenced this pull request Aug 3, 2024
#5746)

The PR#5602 solved the issue for typical seed setups, but it was found
that the test `TestStressChainedSolicitWorks` would sometimes fail. This
is a situation where we have S4->S3->S2->S1 and all servers start at the
same type. This is not a typical setup, but still the regular gossip
would allow the creation of the full mesh, so this PR helps in those
situations.

Related to #5602

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
neilalexander added a commit that referenced this pull request Aug 6, 2024
Includes the following:

- #5690
- #5725
- #5729
- #5734
- #5735
- #5736
- #5743
- #5744
- #5751
- #5755
- #5754
- #5732
- #5750
- #5756

The following were reverted before this PR:
- #5602

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

'Duplicate Route' connection floods on implicit routes

3 participants