Conversation


@xk-huang commented Sep 20, 2025

Motivation

This PR fixes occasional NCCL rendezvous deadlocks and repeated nccl_port collisions when launching multi-GPU runs with --dp > 1 (the bug was introduced in #7418). The previous logic tweaked nccl_port by adding or subtracting small constants, which could still land on a busy or ephemeral port, or let different processes race for the same value.

Modifications

  • Avoid the hardcoded IPv4 localhost: some machines are IPv6-only, and initializing torch.distributed against the IPv4 loopback address makes them hang.
  • In python/sglang/srt/server_args.py, during PortArgs initialization:
    • Remove the previous ±(42|43) adjustments, which could still collide or select a port that is not actually available.
    • Also offset nccl_port when --dp > 1; otherwise the ranks collide on the same port. Add a simple availability probe loop that calls is_port_available(nccl_port) and increments until a free port is found (see the sketch below).

This keeps the selection anchored to the server's base port while adding jitter and an explicit availability check, avoiding both deadlocks and collisions. (The full diff is in the commit view.)
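A minimal sketch of the selection logic described above (illustrative only: the helper name pick_nccl_port, the dp_rank offset, and the +1 probe step are simplifications, not the exact code in the diff). It assumes the existing is_port_available helper from sglang.srt.utils:

```python
import random

from sglang.srt.utils import is_port_available  # existing availability check


def pick_nccl_port(base_port: int, dp_rank: int = 0) -> int:
    # Jittered starting point relative to the server's base port, shifted per
    # data-parallel rank so ranks do not begin probing from the same value.
    port = base_port + random.randint(100, 1000) + dp_rank

    # Probe until a free port is found (step shown as +1 for simplicity;
    # the actual diff may use a different step).
    while not is_port_available(port):
        port += 1
    return port
```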

Accuracy Tests

No changes to model kernels, scheduling, or outputs; only the rendezvous port selection changes. To be safe, I validated on the following setups:

  • Single node / 8 GPUs (NCCL): multiple back-to-back launches with varying --port showed no deadlocks; all runs reached dist.init_process_group() and completed a small all_reduce sanity test (sketched after this list).
  • Explicit --nccl-port: when provided, the code respects the user value (no changes), and initialization succeeds if the port is free; fails fast if it’s actually bound by another process (as expected when --dp > 1).
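For reference, the sanity test mentioned above was along these lines (an illustrative reconstruction, not a script shipped in this PR); it assumes dist.init_process_group(backend="nccl", ...) has already succeeded on every rank:

```python
import torch
import torch.distributed as dist


def all_reduce_sanity_check() -> None:
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # Each rank contributes its rank id; the summed result is known in advance.
    x = torch.full((1,), float(rank), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = world_size * (world_size - 1) / 2
    assert x.item() == expected, f"all_reduce mismatch: {x.item()} != {expected}"
```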

Observed identical token-level outputs across runs compared to main (no accuracy deltas).

Benchmarking and Profiling

Negligible overhead.

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @xk-huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical issues related to port management in multi-GPU distributed training environments, specifically focusing on NCCL rendezvous deadlocks and port collisions. It refines the port selection mechanism to ensure reliable communication initialization by dynamically finding available ports and utilizing the configured host address instead of a hardcoded localhost, thereby enhancing the stability and flexibility of distributed operations.

Highlights

  • Port Collision Fix: Resolved occasional NCCL rendezvous deadlocks and nccl_port collisions by improving the port selection logic, particularly when --dp > 1.
  • Dynamic Port Selection: Replaced the problematic ±(42|43) port adjustments with a robust mechanism that probes for available ports and increments nccl_port until a free one is found.
  • Host Address Configuration: Changed the hardcoded 127.0.0.1 for dist_init_method to use self.server_args.host, allowing for more flexible distributed setups beyond localhost.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces fixes for NCCL port finding to avoid deadlocks and collisions in multi-GPU setups, and also replaces a hardcoded IP address to support different hosts. The changes are generally good and address the described issues. I've identified a potential bug in the IP address handling for IPv6 and suggested a fix. I also proposed a simplification to the port finding logic to make it more robust and consistent.

dist_init_method = f"tcp://{self.server_args.dist_init_addr}"
else:
dist_init_method = f"tcp://127.0.0.1:{self.dist_port}"
dist_init_method = f"tcp://{self.server_args.host}:{self.dist_port}"

Severity: high

This change correctly replaces the hardcoded 127.0.0.1 with self.server_args.host. However, if self.server_args.host is an IPv6 address, it needs to be enclosed in square brackets (e.g., [::1]) to form a valid TCP URL for torch.distributed. Without the brackets, initialization will fail for IPv6 hosts. It's recommended to handle this case.

Suggested change

```diff
-            dist_init_method = f"tcp://{self.server_args.host}:{self.dist_port}"
+            host = self.server_args.host
+            if ":" in host and not host.startswith("["):
+                # Wrap IPv6 addresses in brackets for URL formatting.
+                host = f"[{host}]"
+            dist_init_method = f"tcp://{host}:{self.dist_port}"
```

Comment on lines 2788 to +2791
```diff
                 if nccl_port < 60000:
                     nccl_port += 42
                 else:
-                    nccl_port -= 43
+                    nccl_port = server_args.port + random.randint(100, 1000)
```

Severity: medium

The logic for finding an available port when one is not provided is a bit complex. Using nccl_port += 42 is arbitrary, and re-randomizing the port if it's above 60000 can lead to new collisions between processes. A simpler and more robust approach would be to use a linear probe (nccl_port += 1), which is consistent with the logic used when nccl_port is provided by the user. This ensures deterministic port finding after the initial random selection.

```python
                nccl_port += 1
```
