2 changes: 1 addition & 1 deletion python/sglang/srt/model_executor/model_runner.py
@@ -621,7 +621,7 @@ def init_torch_distributed(self):
if self.server_args.dist_init_addr:
dist_init_method = f"tcp://{self.server_args.dist_init_addr}"
else:
- dist_init_method = f"tcp://127.0.0.1:{self.dist_port}"
+ dist_init_method = f"tcp://{self.server_args.host}:{self.dist_port}"

high

This change correctly replaces the hardcoded 127.0.0.1 with self.server_args.host. However, if self.server_args.host is an IPv6 address, it needs to be enclosed in square brackets (e.g., [::1]) to form a valid TCP URL for torch.distributed. Without the brackets, initialization will fail for IPv6 hosts. It's recommended to handle this case.

Suggested change
- dist_init_method = f"tcp://{self.server_args.host}:{self.dist_port}"
+ host = self.server_args.host
+ if ":" in host and not host.startswith("["):
+     # Wrap IPv6 addresses in brackets for URL formatting.
+     host = f"[{host}]"
+ dist_init_method = f"tcp://{host}:{self.dist_port}"
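
As a standalone illustration of the bracketing rule described in this comment, here is a minimal sketch that is not part of the PR. The helper name `format_tcp_init_method` is hypothetical. It uses the standard-library `ipaddress` module to detect IPv6 literals instead of the `":" in host` check, on the assumption that hostnames and IPv4 addresses should pass through unchanged.

```python
import ipaddress


def format_tcp_init_method(host: str, port: int) -> str:
    """Build a tcp:// URL, wrapping bare IPv6 literals in brackets.

    Hypothetical helper for illustration; not an sglang or torch API.
    """
    # Drop any brackets the caller already supplied so we can parse the address.
    bare = host.strip("[]")
    try:
        is_ipv6 = ipaddress.ip_address(bare).version == 6
    except ValueError:
        # Not an IP literal (e.g. a hostname such as "localhost").
        is_ipv6 = False
    if is_ipv6:
        return f"tcp://[{bare}]:{port}"
    return f"tcp://{host}:{port}"
```

Parsing the address avoids bracketing a string that merely contains a colon, at the cost of one extra import.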

set_custom_all_reduce(not self.server_args.disable_custom_all_reduce)
set_mscclpp_all_reduce(self.server_args.enable_mscclpp)

7 changes: 6 additions & 1 deletion python/sglang/srt/server_args.py
@@ -2788,9 +2788,14 @@ def init_new(server_args, dp_rank: Optional[int] = None) -> "PortArgs":
if nccl_port < 60000:
nccl_port += 42
else:
nccl_port -= 43
nccl_port = server_args.port + random.randint(100, 1000)
Comment on lines 2788 to +2791

medium

The logic for finding an available port when one is not provided is a bit complex. Using nccl_port += 42 is arbitrary, and re-randomizing the port if it's above 60000 can lead to new collisions between processes. A simpler and more robust approach would be to use a linear probe (nccl_port += 1), which is consistent with the logic used when nccl_port is provided by the user. This ensures deterministic port finding after the initial random selection.

                nccl_port += 1
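
The linear probe suggested in this comment can be sketched in isolation as follows. This is an illustrative sketch, not the code in `server_args.py`: `is_port_free` is a hypothetical stand-in for the project's `is_port_available`, and the `probe_port` name, the `max_port` bound, and the injectable `is_free` parameter are assumptions added so the example is self-contained.

```python
import socket
from typing import Callable


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical stand-in for sglang's is_port_available.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def probe_port(
    start: int,
    max_port: int = 65535,
    is_free: Callable[[int], bool] = is_port_free,
) -> int:
    """Linear probe: return the first free port at or above `start`."""
    port = start
    while port <= max_port:
        if is_free(port):
            return port
        port += 1  # deterministic step, same as the user-provided branch
    raise RuntimeError(f"no free port in range {start}-{max_port}")
```

Because the step is always `+1`, two processes that start from the same port probe the same sequence. The check-then-bind race between them still exists, so the caller should be prepared for the later bind to fail.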

else:
nccl_port = server_args.nccl_port
# Check if the port is available
while True:
if is_port_available(nccl_port):
break
nccl_port += 1

if not server_args.enable_dp_attention:
# Normal case, use IPC within a single node