Fix incorrect TCP connections. #12259
Open
bosilca wants to merge 1 commit into open-mpi:main from bosilca:fix/tcp_connection_s
+125
−11
Member
Author
This patch improves #12232 but does not solve it.
f146bf9 to 5e5db44
Member
Author
Can I get a review on this, please?
devreal
reviewed
Mar 5, 2024
Contributor
devreal
left a comment
oops, had a pending review that I never submitted...
bosilca
commented
Mar 5, 2024
934289b to c50329f
devreal
previously approved these changes
Mar 6, 2024
Member
@jsquyres @ggouaillardet please review when you have a chance
jsquyres
reviewed
Jan 9, 2026
289b46f to 76525e9
If nodes have the same IP addresses (for containers or other purposes) and these addresses get published as part of the modex, a remote peer might try to use one of the addresses to connect. As both nodes have the same IP, there are several cases:

- the "remote" port is not used by an OMPI process locally: the connection is refused or it times out. This is the "nicest" outcome, as a new IP will be used, resulting in a successful connection and the continuation of the application.
- the "remote" port is used by another OMPI process on the local node. A connection will be established, but the incorrect guid will be exchanged, leading to complaints, dropped connections, and/or deadlocks.
- the "remote" port is used by this process, basically resulting in a connection-to-self. Bad things happen, as we don't support TCP connections to self. Some output messages are generated, but the outcome is most likely a deadlock.

Up to now, users were expected to exclude such interfaces from the accepted interfaces, but this patch removes this need. If we discover a local IP as part of the IP list of a remote peer, we drop it and never try to use it. This does not apply to local processes, so we can still use these interfaces for node-level communications (which will work, as we will connect to the correct port according to the destination process).

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
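For readers skimming the thread, the filtering rule described in the commit message can be sketched roughly as below. This is only an illustration, not the code in the patch: the function names `is_local_addr` and `filter_peer_addrs`, the `peer_is_on_local_node` flag, and the use of plain strings to represent addresses are all assumptions made for the example (the real BTL works with socket address structures).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if `addr` is one of this node's own
 * addresses. Addresses are compared as strings purely for illustration. */
bool is_local_addr(const char *addr, const char *const *local, size_t n_local)
{
    for (size_t i = 0; i < n_local; i++) {
        if (0 == strcmp(addr, local[i])) {
            return true;
        }
    }
    return false;
}

/* Hypothetical sketch of the rule: for a remote peer, drop every published
 * address that is also a local address, so it is never tried for a
 * connection. For a peer on the local node, keep everything.
 * The array is compacted in place; the number of kept entries is returned. */
size_t filter_peer_addrs(const char **peer_addrs, size_t n_peer,
                         const char *const *local, size_t n_local,
                         bool peer_is_on_local_node)
{
    assert(NULL != peer_addrs || 0 == n_peer);

    if (peer_is_on_local_node) {
        /* Node-level communication: shared addresses remain usable. */
        return n_peer;
    }

    size_t kept = 0;
    for (size_t i = 0; i < n_peer; i++) {
        if (!is_local_addr(peer_addrs[i], local, n_local)) {
            peer_addrs[kept++] = peer_addrs[i];
        }
    }
    return kept;
}
```

With a local list of `10.0.0.1` and `172.17.0.1`, a remote peer publishing `172.17.0.1` and `10.0.0.2` would be left with only `10.0.0.2`, while a peer on the same node would keep both addresses.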
76525e9 to f2f671c