fix: add better port logic #2175

alec-flowers · 2025-07-29T20:41:23Z

Overview:

A number of users have reported port conflict errors. These are caused by a race condition: we reserve ports but then hand them off to vLLM to bind. We tried to fix this by adding a reservation system in etcd; however, vLLM uses many ephemeral ports that we cannot capture and reserve unless we modify vLLM directly.

Another problem is that, under the hood, vLLM was allocating more ports for NIXL than we realized. We now mimic this logic on the frontend so that we register and reserve the correct ports.

Here, we select a range of ports to allocate, expressed as a DynamoPortRange. Inside this range we assume that every port is either:

1. Reserved before Dynamo starts and kept reserved throughout startup (i.e. held by a long-running server or service), or
2. Reserved during Dynamo startup and recorded in etcd to prevent race conditions.

It is the user's responsibility to adapt this range so that both 1 and 2 remain true. If something allocates ports dynamically within this range during startup, race conditions can still occur.
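
The two assumptions above can be illustrated with a minimal sketch: bind a candidate port to hold it at the OS level, record the reservation in a shared registry while the socket is still held, and only then release it. This is not the code in ports.py. `InMemoryRegistry` is a hypothetical stand-in for etcd, and `reserve_port` and its parameters are invented for the example.

```python
import socket
from contextlib import closing
from typing import Dict, Optional


class InMemoryRegistry:
    """Hypothetical stand-in for etcd: maps port -> owner, first writer wins."""

    def __init__(self) -> None:
        self._owners: Dict[int, str] = {}

    def try_reserve(self, port: int, owner: str) -> bool:
        # Refuse the reservation if another owner already recorded this port.
        if port in self._owners:
            return False
        self._owners[port] = owner
        return True


def reserve_port(
    registry: InMemoryRegistry, port_min: int, port_max: int, owner: str
) -> Optional[int]:
    """Return the first port in [port_min, port_max] that is both bindable
    and not yet recorded in the registry, or None if there is no such port."""
    for port in range(port_min, port_max + 1):
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
            try:
                # Holding the bind keeps other local processes off the port
                # while the registry entry is written.
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # already bound by some other process
            if registry.try_reserve(port, owner):
                return port
    return None
```

The socket is released when the function returns, so a short window remains before the backend binds the port itself. In this sketch the registry entry is what keeps other cooperating workers away during that window, which is why anything allocating ports in the range without consulting the registry can still cause a conflict.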

coderabbitai · 2025-07-29T20:45:52Z

Walkthrough

This change introduces a new ports.py module for robust port allocation and management in Dynamo vLLM backends, refactors existing logic in args.py to use these abstractions, and adds comprehensive unit and integration tests for port utilities. Supporting documentation and configuration files for testing are also included, and .gitignore is updated for Pytest coverage artifacts.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Port Allocation Utilities `components/backends/vllm/src/dynamo/vllm/ports.py` | New module providing port range validation, ETCD-based reservation, exclusive socket binding, and metadata tracking for Dynamo service ports. Defines data classes, context managers, and async functions for safe, concurrent port allocation and reservation. |
| Port Logic Refactor `components/backends/vllm/src/dynamo/vllm/args.py` | Refactored to remove internal port allocation logic, now importing and using abstractions from the new `ports.py` module. Adds explicit port range configuration via CLI, updates environment variable setup, and improves side channel port block allocation and validation. |
| Port Utilities Testing `components/backends/vllm/src/dynamo/tests/test_ports.py` | New test module with comprehensive unit and integration tests for port allocation, ETCD reservation, port binding, metadata serialization, and utility functions. Includes both synchronous and asynchronous tests, using mocks and fixtures, and covers error and edge cases. |
| Test Suite Documentation `components/backends/vllm/src/dynamo/tests/README.md` | New README providing instructions for running tests, installing dependencies, and using pytest features for the Dynamo vLLM backend test suite. |
| Test Suite Initialization `components/backends/vllm/src/dynamo/tests/__init__.py` | New file with copyright/license header and module-level docstring for the Dynamo vLLM backend test suite. |
| Pytest Configuration `components/backends/vllm/src/dynamo/tests/pytest.ini` | New pytest configuration file specifying test discovery patterns, marker usage, path setup, verbosity, and asyncio mode for the test suite. |
| Pytest Coverage Ignore `.gitignore` | Appends `.coverage` to ignore Pytest coverage files and adds a newline for clarity. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI/User
    participant ArgsModule as args.py
    participant PortsModule as ports.py
    participant EtcdClient
    participant OS

    CLI/User->>ArgsModule: parse_args() with --dynamo-port-min/max
    ArgsModule->>PortsModule: Import DynamoPortRange, EtcdContext, etc.
    ArgsModule->>PortsModule: Create EtcdContext
    ArgsModule->>PortsModule: allocate_and_reserve_port_block(request)
    PortsModule->>OS: hold_ports(ports)
    PortsModule->>EtcdClient: reserve_port_in_etcd(context, port, metadata)
    PortsModule->>OS: check_port_available(port)
    PortsModule->>ArgsModule: Return allocated ports
    ArgsModule->>PortsModule: get_host_ip()
    ArgsModule->>OS: Set environment variables

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

In burrows deep, where sockets sleep,
A rabbit hops, reserving ports to keep.
With ETCD keys and binding tight,
Tests now ensure each port is right.
Refactored code, new docs in tow—
This backend’s ready, let the traffic flow!
🐇✨


coderabbitai

Actionable comments posted: 3

🔭 Outside diff range comments (1)

.gitignore (1)
87-93: Remove duplicate .build/ entry

The .build/ pattern is already present on line 87. The duplicate entry on line 92 should be removed.

Apply this diff to remove the duplicate:
 .build/
 **/.devcontainer/.env
 TensorRT-LLM
-
-# Local build artifacts for devcontainer
-.build/

🧹 Nitpick comments (3)

components/backends/vllm/src/dynamo/tests/test_ports.py (2)
66-81: Consider adding test coverage for basic metadata conversion.

While the test covers the block_info scenario well, consider adding a test case for metadata without block_info and verifying all fields in the output (worker_id, reason, reserved_at, pid).

Add this test method to the class:
def test_to_etcd_value_without_block_info(self):
    """Test converting metadata to ETCD value without block info."""
    metadata = PortMetadata(
        worker_id="test-worker",
        reason="test-reason",
    )
    
    with patch("time.time", return_value=1234567890):
        with patch("os.getpid", return_value=12345):
            value = metadata.to_etcd_value()
    
    assert value["worker_id"] == "test-worker"
    assert value["reason"] == "test-reason"
    assert value["reserved_at"] == 1234567890
    assert value["pid"] == 12345
    assert "block_index" not in value
    assert "block_size" not in value
    assert "block_start" not in value
232-259: Consider adding failure scenario tests for single port allocation.

While the success case is covered, consider adding tests for:

When all ports in the range are occupied

When ETCD reservation fails

When max_attempts is exceeded
components/backends/vllm/src/dynamo/vllm/args.py (1)
165-202: Well-implemented NIXL port block allocation with proper validation.

The logic correctly allocates a block of ports for tensor parallel workers and calculates the base port. The validation for negative base port is a valuable addition.

Consider including the actual port range in the error message for better debugging:
-            f"base_port={base_side_channel_port}. Consider using a higher port range."
+            f"base_port={base_side_channel_port}. Current range: {config.port_range.min}-{config.port_range.max}. "
+            f"Consider using a higher port range."
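
To make the base-port arithmetic concrete, here is a small sketch. It assumes each worker's side channel port is derived as base + dp_rank * tp_size + tp_rank; that layout is an assumption for illustration and is not spelled out in this review. The function names are hypothetical and do not come from args.py.

```python
def base_side_channel_port(block_start: int, dp_rank: int, tp_size: int) -> int:
    """Back out the base port so that this DP rank's workers land on the
    block that was actually reserved (assumed layout, see lead-in)."""
    base = block_start - dp_rank * tp_size
    if base < 0:
        # Mirrors the negative-base validation discussed in the comment above.
        raise ValueError(
            f"Computed base port is negative: base_port={base}. "
            "Consider using a higher port range."
        )
    return base


def worker_port(base: int, dp_rank: int, tp_size: int, tp_rank: int) -> int:
    """Port a given (dp_rank, tp_rank) worker would use under the assumed layout."""
    return base + dp_rank * tp_size + tp_rank
```

With a reserved block starting at 20000, dp_rank 2 and tp_size 4, the base is 19992 and the four tensor parallel workers map back onto 20000 through 20003, which is exactly the reserved block.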

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8cb655 and a864e91.

📒 Files selected for processing (7)

.gitignore (1 hunks)
components/backends/vllm/src/dynamo/tests/README.md (1 hunks)
components/backends/vllm/src/dynamo/tests/__init__.py (1 hunks)
components/backends/vllm/src/dynamo/tests/pytest.ini (1 hunks)
components/backends/vllm/src/dynamo/tests/test_ports.py (1 hunks)
components/backends/vllm/src/dynamo/vllm/args.py (5 hunks)
components/backends/vllm/src/dynamo/vllm/ports.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

components/backends/vllm/src/dynamo/tests/README.md (1)

Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.

components/backends/vllm/src/dynamo/tests/__init__.py (1)

Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.

🧬 Code Graph Analysis (1)

components/backends/vllm/src/dynamo/tests/test_ports.py (1)

components/backends/vllm/src/dynamo/vllm/ports.py (13)

DynamoPortRange (44-58)

EtcdContext (62-71)

PortAllocationRequest (96-103)

PortBinding (106-138)

PortMetadata (75-92)

allocate_and_reserve_port (277-308)

allocate_and_reserve_port_block (186-274)

check_port_available (159-167)

get_host_ip (311-338)

hold_ports (142-156)

reserve_port_in_etcd (170-183)

make_port_key (68-71)

to_etcd_value (82-92)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: pre-merge-rust (.)
GitHub Check: Build and Test - vllm

🔇 Additional comments (22)

.gitignore (1)

94-96: LGTM!

The addition of .coverage to gitignore is appropriate for excluding pytest coverage data files.

components/backends/vllm/src/dynamo/vllm/ports.py (4)

51-54: Consider using inclusive boundary check for clarity

The current validation allows max=49151, which is correct, and the condition self.max > 49151 already expresses that inclusive upper bound. This is fine as-is.
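
For readers without the diff at hand, a range check of this kind might look like the sketch below. Only the 49151 upper bound comes from the comment above. The 1024 lower bound, the strict min < max rule, and the class name `PortRange` are assumptions made for the example, not the contents of DynamoPortRange.

```python
from dataclasses import dataclass

PORT_LOWER_BOUND = 1024   # assumption: start of the registered port range
PORT_UPPER_BOUND = 49151  # upper bound quoted in the review comment


@dataclass(frozen=True)
class PortRange:
    """Hypothetical validated port range; both ends are inclusive."""

    min: int
    max: int

    def __post_init__(self) -> None:
        if self.min < PORT_LOWER_BOUND or self.max > PORT_UPPER_BOUND:
            raise ValueError(
                f"Port range {self.min}-{self.max} is outside "
                f"{PORT_LOWER_BOUND}-{PORT_UPPER_BOUND}"
            )
        if self.min >= self.max:
            raise ValueError(
                f"Port range min ({self.min}) must be less than max ({self.max})"
            )
```

Because the comparison is `self.max > 49151`, a range ending exactly at 49151 is accepted and 49152 is the first rejected value.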

106-139: LGTM!

The PortBinding class correctly implements the context manager protocol with proper resource cleanup in both success and error cases.

186-275: LGTM!

The allocate_and_reserve_port_block function implements a robust allocation strategy with proper race condition prevention by holding socket bindings during ETCD reservation. The retry logic and error handling are well-designed.

311-339: LGTM!

The get_host_ip function implements a robust IP resolution strategy with proper fallbacks and bindability testing. The error handling and logging are comprehensive.
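
A resolve-then-verify strategy of the kind described here can be sketched as follows. This is an illustration under assumptions, not the get_host_ip implementation: `can_bind` and `resolve_host_ip` are hypothetical names, and falling back to the loopback address is an assumed policy.

```python
import socket


def can_bind(ip: str) -> bool:
    """Return True if a TCP socket can be bound to `ip` on an OS-chosen port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((ip, 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        return False


def resolve_host_ip(fallback: str = "127.0.0.1") -> str:
    """Resolve this host's IPv4 address, falling back if the hostname does not
    resolve or the resolved address cannot be bound locally."""
    try:
        ip = socket.gethostbyname(socket.gethostname())
    except OSError:  # socket.gaierror is a subclass of OSError
        return fallback
    return ip if can_bind(ip) else fallback
```

Checking bindability matters because a hostname can resolve to an address that no local interface owns, in which case a later bind by the backend would fail.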

components/backends/vllm/src/dynamo/tests/__init__.py (1)

1-5: LGTM!

Standard package initialization file with appropriate copyright header and docstring.

components/backends/vllm/src/dynamo/tests/README.md (1)

19-21: Coverage module path is correct—no adjustment needed

The dynamo.vllm.ports module exists at components/backends/vllm/src/dynamo/vllm/ports.py and is properly packaged (there’s an __init__.py in the vllm directory). The --cov=dynamo.vllm.ports flag accurately targets this module and requires no change.

Likely an incorrect or invalid review comment.

components/backends/vllm/src/dynamo/tests/pytest.ini (1)

1-25: LGTM!

The pytest configuration is well-structured with appropriate settings for async tests, import paths, and output formatting.

components/backends/vllm/src/dynamo/tests/test_ports.py (9)

1-26: Well-structured test module with comprehensive imports.

The imports are properly organized and include all necessary testing utilities and components from the ports module.

27-51: Comprehensive validation tests for DynamoPortRange.

The test class thoroughly covers all validation scenarios including valid ranges, out-of-bounds ranges, and invalid min/max relationships.

53-64: Good test coverage for ETCD key generation.

The test properly mocks the hostname and validates the expected key format.

83-152: Excellent test coverage for PortBinding context manager.

The tests thoroughly validate single and multiple port binding scenarios, proper cleanup on exit, and error handling for partial binding failures. Good use of actual socket operations to ensure real-world behavior.

154-178: Good coverage for hold_ports wrapper function.

Tests properly verify both single and multiple port scenarios and the correct return type.

180-200: Complete test coverage for port availability checks.

Both positive and negative scenarios are properly tested.

202-230: Well-structured async test for ETCD port reservation.

Properly uses AsyncMock for async methods and regular Mock for synchronous methods. Good validation of the serialized JSON value.

260-313: Good test coverage for port block allocation.

The tests cover successful allocation and validation for insufficient port range. The assertions properly verify contiguous port allocation.

331-333: Standard pytest execution block.

components/backends/vllm/src/dynamo/vllm/args.py (5)

15-26: Clean import refactoring to use the new ports module.

All imported components from the ports module are properly utilized in the refactored code.

44-44: Good addition of typed port_range attribute.

Using the DynamoPortRange type provides proper validation and encapsulation.

75-86: Well-documented CLI arguments for port range configuration.

The arguments have clear help text and use appropriate defaults from the ports module.

133-136: Proper instantiation of DynamoPortRange with validation.

The DynamoPortRange class will validate the port range constraints automatically.

149-164: Clean refactoring for KV port allocation.

Good use of the new port allocation abstractions with proper metadata and conditional allocation based on prefix caching.

components/backends/vllm/src/dynamo/tests/test_ports.py

components/backends/vllm/src/dynamo/vllm/ports.py

alec-flowers · 2025-07-30T06:58:22Z

I've tested this manually with both TP and DP, for the NIXL ports as well as the KV events ports, and it looks good to me.

components/backends/vllm/src/dynamo/vllm/args.py

ptarasiewiczNV

LGTM, pipeline to test: https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo-ci/-/pipelines/32526987

alec-flowers requested review from a team, GuanLuo, PeaBrane, biswapanda, grahamking, ishandhanani, jthomson04, kkranen, nnshah1, paulhendricks, piotrm-nvidia, ptarasiewiczNV, rmccorm4, ryanolson, tanmayv25, tedzhouhk and tmonty12 as code owners July 29, 2025 20:41

pull-request-size bot added the size/XL label Jul 29, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 20:41 Inactive

github-actions bot added the fix label Jul 29, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 20:42 Inactive

coderabbitai bot reviewed Jul 29, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 21:53 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 21:58 Inactive

alec-flowers force-pushed the aflowers/fix-port-race-2 branch from a7b70a0 to 28b51c6 Compare July 29, 2025 22:00

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 22:00 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 22:03 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 22:06 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 29, 2025 22:08 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 30, 2025 00:37 Inactive

add new port reservation behavior

d9c09e6

alec-flowers force-pushed the aflowers/fix-port-race-2 branch from c47a1fc to d9c09e6 Compare July 30, 2025 06:29

pull-request-size bot added size/L and removed size/XL labels Jul 30, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 30, 2025 06:29 Inactive

ptarasiewiczNV reviewed Jul 30, 2025

ptarasiewiczNV approved these changes Jul 30, 2025

ishandhanani approved these changes Jul 30, 2025

bug fix

fc6d2a1

copy-pr-bot bot temporarily deployed to GITLAB July 30, 2025 15:56 Inactive

fix mypy

afa5a98

copy-pr-bot bot temporarily deployed to GITLAB July 30, 2025 16:15 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 30, 2025 16:19 Inactive

alec-flowers merged commit b69c507 into main Jul 30, 2025
13 checks passed

alec-flowers deleted the aflowers/fix-port-race-2 branch July 30, 2025 17:21

coderabbitai bot mentioned this pull request Jul 30, 2025

chore(vllm): port range, tests #2187

Closed

alec-flowers added a commit that referenced this pull request Jul 30, 2025

fix: add better port logic (#2175)

d5a204e

alec-flowers mentioned this pull request Jul 30, 2025

fix: add better port logic (#2175) #2192

Merged

dmitry-tokarev-nv pushed a commit that referenced this pull request Jul 30, 2025

fix: add better port logic (#2175) (#2192)

992adfb

coderabbitai bot mentioned this pull request Aug 22, 2025

fix: prevent crash looping hello world #2625 #2670

Closed

coderabbitai bot mentioned this pull request Sep 18, 2025

feat: Port vllm port allocator to Rust in bindings #3125

Merged
