Conversation


@xk-huang commented Sep 20, 2025

Motivation

This PR fixes occasional NCCL rendezvous deadlocks and repeated nccl_port collisions when launching multi-GPU runs with --dp > 1 (the bug was introduced in #7418). The previous logic tweaked nccl_port by adding or subtracting small constants, which could still land on a busy or ephemeral port, or let different processes race for the same value.

Modifications

  • Avoid the hardcoded IPv4 localhost: some machines are IPv6-only, and initializing torch.distributed against the IPv4 loopback address makes them hang.
  • In python/sglang/srt/server_args.py, during PortArgs initialization:
    • Remove the previous ±(42|43) adjustments, which could still collide or select a port that is not actually available.
    • Also offset nccl_port when --dp > 1; otherwise the ranks collide on the same port. Add a simple availability probe loop that calls is_port_available(nccl_port) and increments until a free port is found (see the sketch below).

This keeps the selection anchored to the server's base port while adding jitter and an explicit availability check, avoiding both deadlocks and collisions. (The full diff is in the commit view.)
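A minimal sketch of the selection logic described above (illustrative only: the helper name pick_nccl_port, the dp_rank offset, and the +1 probe step are simplifications, not the exact code in the diff). It assumes the existing is_port_available helper from sglang.srt.utils:

```python
import random

from sglang.srt.utils import is_port_available  # existing availability check


def pick_nccl_port(base_port: int, dp_rank: int = 0) -> int:
    # Jittered starting point relative to the server's base port, shifted per
    # data-parallel rank so ranks do not begin probing from the same value.
    port = base_port + random.randint(100, 1000) + dp_rank

    # Probe until a free port is found (step shown as +1 for simplicity;
    # the actual diff may use a different step).
    while not is_port_available(port):
        port += 1
    return port
```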

Accuracy Tests

No changes to model kernels, scheduling, or outputs; only the rendezvous port selection changes. To be safe, I validated on the following setups:

  • Single node / 8 GPUs (NCCL): multiple back-to-back launches with varying --port showed no deadlocks; all runs reached dist.init_process_group() and completed a small all_reduce sanity test (sketched after this list).
  • Explicit --nccl-port: when provided, the code respects the user value (no changes), and initialization succeeds if the port is free; fails fast if it’s actually bound by another process (as expected when --dp > 1).
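For reference, the sanity test mentioned above was along these lines (an illustrative reconstruction, not a script shipped in this PR); it assumes dist.init_process_group(backend="nccl", ...) has already succeeded on every rank:

```python
import torch
import torch.distributed as dist


def all_reduce_sanity_check() -> None:
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # Each rank contributes its rank id; the summed result is known in advance.
    x = torch.full((1,), float(rank), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = world_size * (world_size - 1) / 2
    assert x.item() == expected, f"all_reduce mismatch: {x.item()} != {expected}"
```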

Observed identical token-level outputs across runs compared to main (no accuracy deltas).

Benchmarking and Profiling

Negligible overhead.

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @xk-huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical issues related to port management in multi-GPU distributed training environments, specifically focusing on NCCL rendezvous deadlocks and port collisions. It refines the port selection mechanism to ensure reliable communication initialization by dynamically finding available ports and utilizing the configured host address instead of a hardcoded localhost, thereby enhancing the stability and flexibility of distributed operations.

Highlights

  • Port Collision Fix: Resolved occasional NCCL rendezvous deadlocks and nccl_port collisions by improving the port selection logic, particularly when --dp > 1.
  • Dynamic Port Selection: Replaced the problematic ±(42|43) port adjustments with a robust mechanism that probes for available ports and increments nccl_port until a free one is found.
  • Host Address Configuration: Changed the hardcoded 127.0.0.1 for dist_init_method to use self.server_args.host, allowing for more flexible distributed setups beyond localhost.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces fixes for NCCL port finding to avoid deadlocks and collisions in multi-GPU setups, and also replaces a hardcoded IP address to support different hosts. The changes are generally good and address the described issues. I've identified a potential bug in the IP address handling for IPv6 and suggested a fix. I also proposed a simplification to the port finding logic to make it more robust and consistent.

dist_init_method = f"tcp://{self.server_args.dist_init_addr}"
else:
dist_init_method = f"tcp://127.0.0.1:{self.dist_port}"
dist_init_method = f"tcp://{self.server_args.host}:{self.dist_port}"

Severity: high

This change correctly replaces the hardcoded 127.0.0.1 with self.server_args.host. However, if self.server_args.host is an IPv6 address, it needs to be enclosed in square brackets (e.g., [::1]) to form a valid TCP URL for torch.distributed. Without the brackets, initialization will fail for IPv6 hosts. It's recommended to handle this case.

Suggested change

```diff
-            dist_init_method = f"tcp://{self.server_args.host}:{self.dist_port}"
+            host = self.server_args.host
+            if ":" in host and not host.startswith("["):
+                # Wrap IPv6 addresses in brackets for URL formatting.
+                host = f"[{host}]"
+            dist_init_method = f"tcp://{host}:{self.dist_port}"
```

Comment on lines 2788 to +2791
```diff
                 if nccl_port < 60000:
                     nccl_port += 42
                 else:
-                    nccl_port -= 43
+                    nccl_port = server_args.port + random.randint(100, 1000)
```

Severity: medium

The logic for finding an available port when one is not provided is a bit complex. Using nccl_port += 42 is arbitrary, and re-randomizing the port if it's above 60000 can lead to new collisions between processes. A simpler and more robust approach would be to use a linear probe (nccl_port += 1), which is consistent with the logic used when nccl_port is provided by the user. This ensures deterministic port finding after the initial random selection.

```python
                nccl_port += 1
```
