@alec-flowers alec-flowers commented Jul 29, 2025

Overview:

A number of users have reported port conflict errors. These are caused by a race condition: we reserve ports but then hand them off to vLLM to bind. I tried to fix this by adding a reservation system in ETCD; however, vLLM uses many ephemeral ports that we cannot capture and reserve unless we modify vLLM directly.

Another problem is that, under the hood, vLLM was allocating more ports for NIXL than we realized. We now mimic this logic on the frontend so that we register and reserve the correct ports.

Here, we select a range of ports to allocate as a DynamoPortRange. Within this range, we assume that ports are either:

  1. Reserved before Dynamo starts and kept reserved throughout startup (i.e., a long-running server or service), or
  2. Reserved during Dynamo startup and added to etcd to prevent race conditions.

It is the user's responsibility to adapt this range so that both 1 and 2 remain true. If something allocates ports dynamically within this range during startup, race conditions can occur.
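The range assumption can be sketched as a small validated value type. This is a minimal sketch, not the PR's `ports.py`: the class name `PortRangeSketch` and the lower bound of 1024 are assumptions made for illustration. The `min`/`max` field names and the 49151 upper bound are the only details taken from this conversation.

```python
from dataclasses import dataclass

# Assumed bounds of the registered (non-ephemeral) port range.
MIN_PORT = 1024   # assumption: not confirmed anywhere in this PR
MAX_PORT = 49151  # upper bound mentioned in the review of ports.py


@dataclass(frozen=True)
class PortRangeSketch:
    """Hypothetical stand-in for DynamoPortRange."""

    min: int
    max: int

    def __post_init__(self) -> None:
        # Reject ranges that leave the assumed registered-port window.
        if self.min < MIN_PORT or self.max > MAX_PORT:
            raise ValueError(
                f"Port range {self.min}-{self.max} is outside {MIN_PORT}-{MAX_PORT}"
            )
        # Reject empty or inverted ranges.
        if self.min >= self.max:
            raise ValueError(f"min ({self.min}) must be less than max ({self.max})")

    def __contains__(self, port: int) -> bool:
        return self.min <= port <= self.max
```

Validating once at construction means every later allocation step can rely on the range being well-formed instead of re-checking it.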

coderabbitai bot commented Jul 29, 2025

Walkthrough

This change introduces a new ports.py module for robust port allocation and management in Dynamo vLLM backends, refactors existing logic in args.py to use these abstractions, and adds comprehensive unit and integration tests for port utilities. Supporting documentation and configuration files for testing are also included, and .gitignore is updated for Pytest coverage artifacts.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Port Allocation Utilities: components/backends/vllm/src/dynamo/vllm/ports.py | New module providing port range validation, ETCD-based reservation, exclusive socket binding, and metadata tracking for Dynamo service ports. Defines data classes, context managers, and async functions for safe, concurrent port allocation and reservation. |
| Port Logic Refactor: components/backends/vllm/src/dynamo/vllm/args.py | Refactored to remove internal port allocation logic, now importing and using abstractions from the new ports.py module. Adds explicit port range configuration via CLI, updates environment variable setup, and improves side channel port block allocation and validation. |
| Port Utilities Testing: components/backends/vllm/src/dynamo/tests/test_ports.py | New test module with comprehensive unit and integration tests for port allocation, ETCD reservation, port binding, metadata serialization, and utility functions. Includes both synchronous and asynchronous tests, using mocks and fixtures, and covers error and edge cases. |
| Test Suite Documentation: components/backends/vllm/src/dynamo/tests/README.md | New README providing instructions for running tests, installing dependencies, and using pytest features for the Dynamo vLLM backend test suite. |
| Test Suite Initialization: components/backends/vllm/src/dynamo/tests/__init__.py | New file with copyright/license header and module-level docstring for the Dynamo vLLM backend test suite. |
| Pytest Configuration: components/backends/vllm/src/dynamo/tests/pytest.ini | New pytest configuration file specifying test discovery patterns, marker usage, path setup, verbosity, and asyncio mode for the test suite. |
| Pytest Coverage Ignore: .gitignore | Appends .coverage to ignore Pytest coverage files and adds a newline for clarity. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI/User
    participant ArgsModule as args.py
    participant PortsModule as ports.py
    participant EtcdClient
    participant OS

    CLI/User->>ArgsModule: parse_args() with --dynamo-port-min/max
    ArgsModule->>PortsModule: Import DynamoPortRange, EtcdContext, etc.
    ArgsModule->>PortsModule: Create EtcdContext
    ArgsModule->>PortsModule: allocate_and_reserve_port_block(request)
    PortsModule->>OS: hold_ports(ports)
    PortsModule->>EtcdClient: reserve_port_in_etcd(context, port, metadata)
    PortsModule->>OS: check_port_available(port)
    PortsModule->>ArgsModule: Return allocated ports
    ArgsModule->>PortsModule: get_host_ip()
    ArgsModule->>OS: Set environment variables

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

In burrows deep, where sockets sleep,
A rabbit hops, reserving ports to keep.
With ETCD keys and binding tight,
Tests now ensure each port is right.
Refactored code, new docs in tow—
This backend’s ready, let the traffic flow!
🐇✨

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
.gitignore (1)

87-93: Remove duplicate .build/ entry

The .build/ pattern is already present on line 87. The duplicate entry on line 92 should be removed.

Apply this diff to remove the duplicate:

 .build/
 **/.devcontainer/.env
 TensorRT-LLM
-
-# Local build artifacts for devcontainer
-.build/
🧹 Nitpick comments (3)
components/backends/vllm/src/dynamo/tests/test_ports.py (2)

66-81: Consider adding test coverage for basic metadata conversion.

While the test covers the block_info scenario well, consider adding a test case for metadata without block_info and verifying all fields in the output (worker_id, reason, reserved_at, pid).

Add this test method to the class:

def test_to_etcd_value_without_block_info(self):
    """Test converting metadata to ETCD value without block info."""
    metadata = PortMetadata(
        worker_id="test-worker",
        reason="test-reason",
    )
    
    with patch("time.time", return_value=1234567890):
        with patch("os.getpid", return_value=12345):
            value = metadata.to_etcd_value()
    
    assert value["worker_id"] == "test-worker"
    assert value["reason"] == "test-reason"
    assert value["reserved_at"] == 1234567890
    assert value["pid"] == 12345
    assert "block_index" not in value
    assert "block_size" not in value
    assert "block_start" not in value

232-259: Consider adding failure scenario tests for single port allocation.

While the success case is covered, consider adding tests for:

  • When all ports in the range are occupied
  • When ETCD reservation fails
  • When max_attempts is exceeded
components/backends/vllm/src/dynamo/vllm/args.py (1)

165-202: Well-implemented NIXL port block allocation with proper validation.

The logic correctly allocates a block of ports for tensor parallel workers and calculates the base port. The validation for negative base port is a valuable addition.

Consider including the actual port range in the error message for better debugging:

-            f"base_port={base_side_channel_port}. Consider using a higher port range."
+            f"base_port={base_side_channel_port}. Current range: {config.port_range.min}-{config.port_range.max}. "
+            f"Consider using a higher port range."
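The base-port calculation and its negative-value check can be illustrated as follows. This is a sketch under stated assumptions, not the code in `args.py`: it assumes each data-parallel rank owns a contiguous block of `tp_size` ports and that the consumer derives a worker's port as `base + dp_rank * tp_size + tp_rank`. The function names are hypothetical.

```python
def base_side_channel_port(first_port: int, dp_rank: int, tp_size: int) -> int:
    """Back out the base port so that this rank's block starts at first_port.

    Assumption: the consumer adds dp_rank * tp_size to the base to find the
    start of this data-parallel rank's block.
    """
    base = first_port - dp_rank * tp_size
    if base < 0:
        raise ValueError(
            f"Side channel base port would be negative: base_port={base}. "
            "Consider using a higher port range."
        )
    return base


def worker_port(base: int, dp_rank: int, tp_size: int, tp_rank: int) -> int:
    """Port a given tensor-parallel worker would use under the same assumption."""
    return base + dp_rank * tp_size + tp_rank
```

For example, with an allocated block starting at 20000, `dp_rank=2`, and `tp_size=4`, the base is 19992 and the four workers land on 20000 through 20003. A low allocated port combined with a high rank is what makes the negative check necessary.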
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8cb655 and a864e91.

📒 Files selected for processing (7)
  • .gitignore (1 hunks)
  • components/backends/vllm/src/dynamo/tests/README.md (1 hunks)
  • components/backends/vllm/src/dynamo/tests/__init__.py (1 hunks)
  • components/backends/vllm/src/dynamo/tests/pytest.ini (1 hunks)
  • components/backends/vllm/src/dynamo/tests/test_ports.py (1 hunks)
  • components/backends/vllm/src/dynamo/vllm/args.py (5 hunks)
  • components/backends/vllm/src/dynamo/vllm/ports.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
components/backends/vllm/src/dynamo/tests/README.md (1)

Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.

components/backends/vllm/src/dynamo/tests/__init__.py (1)

Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.

🧬 Code Graph Analysis (1)
components/backends/vllm/src/dynamo/tests/test_ports.py (1)
components/backends/vllm/src/dynamo/vllm/ports.py (13)
  • DynamoPortRange (44-58)
  • EtcdContext (62-71)
  • PortAllocationRequest (96-103)
  • PortBinding (106-138)
  • PortMetadata (75-92)
  • allocate_and_reserve_port (277-308)
  • allocate_and_reserve_port_block (186-274)
  • check_port_available (159-167)
  • get_host_ip (311-338)
  • hold_ports (142-156)
  • reserve_port_in_etcd (170-183)
  • make_port_key (68-71)
  • to_etcd_value (82-92)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (22)
.gitignore (1)

94-96: LGTM!

The addition of .coverage to gitignore is appropriate for excluding pytest coverage data files.

components/backends/vllm/src/dynamo/vllm/ports.py (4)

51-54: Consider using inclusive boundary check for clarity

The current validation correctly allows max=49151: the condition self.max > 49151 rejects only values above that bound, so the upper boundary is inclusive. This is fine as-is.


106-139: LGTM!

The PortBinding class correctly implements the context manager protocol with proper resource cleanup in both success and error cases.
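A context manager of this kind might look like the sketch below. It is not the PR's `PortBinding`: the class name, constructor arguments, and the choice of `127.0.0.1` are assumptions for illustration. The point it shows is that sockets bound before a failure are closed before the error propagates.

```python
import socket
from typing import List


class PortBindingSketch:
    """Hypothetical stand-in for PortBinding: holds ports by keeping bound sockets open."""

    def __init__(self, ports: List[int], host: str = "127.0.0.1") -> None:
        self.ports = ports
        self.host = host
        self._sockets: List[socket.socket] = []

    def __enter__(self) -> "PortBindingSketch":
        try:
            for port in self.ports:
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                # Track the socket before binding so it is closed on failure too.
                self._sockets.append(sock)
                # No SO_REUSEADDR: a second bind to the same port must fail.
                sock.bind((self.host, port))
        except OSError:
            # Partial failure: release everything bound so far, then re-raise.
            self._close_all()
            raise
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._close_all()

    def _close_all(self) -> None:
        for sock in self._sockets:
            sock.close()
        self._sockets.clear()
```

Entering the context with a port that another socket already holds raises `OSError`, and no sockets remain open afterwards.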


186-275: LGTM!

The allocate_and_reserve_port_block function implements a robust allocation strategy with proper race condition prevention by holding socket bindings during ETCD reservation. The retry logic and error handling are well-designed.
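The hold-while-reserving idea can be sketched without etcd. In the sketch below a plain dictionary stands in for the etcd keyspace and the function is synchronous, so it is an illustration of the ordering only, not the PR's async `allocate_and_reserve_port_block`. The function name and parameters are hypothetical.

```python
import socket
from typing import Dict, Optional


def reserve_block(
    start: int, end: int, size: int, store: Dict[int, str], owner: str
) -> Optional[range]:
    """Reserve `size` contiguous ports in [start, end], or return None.

    `store` is an in-memory stand-in for etcd (assumption for illustration).
    """
    for first in range(start, end - size + 2):
        block = range(first, first + size)
        if any(port in store for port in block):
            continue  # already reserved by another worker
        held = []
        try:
            for port in block:
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                held.append(sock)
                sock.bind(("127.0.0.1", port))
            # The sockets are still bound here, so no local process can take
            # these ports while the reservation is being recorded.
            for port in block:
                store[port] = owner
            return block
        except OSError:
            continue  # some port in the block is in use; try the next block
        finally:
            for sock in held:
                sock.close()
    return None
```

Once the sockets are released, only the recorded reservation protects the block, which is why the PR description asks that nothing outside Dynamo allocates dynamically within the configured range.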


311-339: LGTM!

The get_host_ip function implements a robust IP resolution strategy with proper fallbacks and bindability testing. The error handling and logging are comprehensive.

components/backends/vllm/src/dynamo/tests/__init__.py (1)

1-5: LGTM!

Standard package initialization file with appropriate copyright header and docstring.

components/backends/vllm/src/dynamo/tests/README.md (1)

19-21: Coverage module path is correct—no adjustment needed

The dynamo.vllm.ports module exists at components/backends/vllm/src/dynamo/vllm/ports.py and is properly packaged (there’s an __init__.py in the vllm directory). The --cov=dynamo.vllm.ports flag accurately targets this module and requires no change.

Likely an incorrect or invalid review comment.

components/backends/vllm/src/dynamo/tests/pytest.ini (1)

1-25: LGTM!

The pytest configuration is well-structured with appropriate settings for async tests, import paths, and output formatting.

components/backends/vllm/src/dynamo/tests/test_ports.py (9)

1-26: Well-structured test module with comprehensive imports.

The imports are properly organized and include all necessary testing utilities and components from the ports module.


27-51: Comprehensive validation tests for DynamoPortRange.

The test class thoroughly covers all validation scenarios including valid ranges, out-of-bounds ranges, and invalid min/max relationships.


53-64: Good test coverage for ETCD key generation.

The test properly mocks the hostname and validates the expected key format.


83-152: Excellent test coverage for PortBinding context manager.

The tests thoroughly validate single and multiple port binding scenarios, proper cleanup on exit, and error handling for partial binding failures. Good use of actual socket operations to ensure real-world behavior.


154-178: Good coverage for hold_ports wrapper function.

Tests properly verify both single and multiple port scenarios and the correct return type.


180-200: Complete test coverage for port availability checks.

Both positive and negative scenarios are properly tested.
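The positive and negative cases amount to checking whether a bind succeeds. A minimal sketch follows, with a hypothetical function name and an assumed `127.0.0.1` default, rather than the PR's `check_port_available`:

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if `port` can be bound on `host` right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False
```

A `True` result is only a snapshot: the port is released as soon as the function returns, so another process can still take it before the caller binds. That gap is the race condition described in the PR overview.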


202-230: Well-structured async test for ETCD port reservation.

Properly uses AsyncMock for async methods and regular Mock for synchronous methods. Good validation of the serialized JSON value.


260-313: Good test coverage for port block allocation.

The tests cover successful allocation and validation for insufficient port range. The assertions properly verify contiguous port allocation.


331-333: Standard pytest execution block.

components/backends/vllm/src/dynamo/vllm/args.py (5)

15-26: Clean import refactoring to use the new ports module.

All imported components from the ports module are properly utilized in the refactored code.


44-44: Good addition of typed port_range attribute.

Using the DynamoPortRange type provides proper validation and encapsulation.


75-86: Well-documented CLI arguments for port range configuration.

The arguments have clear help text and use appropriate defaults from the ports module.
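For readers without the diff at hand, the two flags could be declared roughly as follows. The flag names come from the sequence diagram in this review; the default values and the `build_parser` helper are assumptions, since the real defaults live in `ports.py` and are not quoted here.

```python
import argparse

# Assumed defaults for illustration; the real values are defined in ports.py.
DEFAULT_PORT_MIN = 20000
DEFAULT_PORT_MAX = 30000


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Dynamo port range sketch")
    parser.add_argument(
        "--dynamo-port-min",
        type=int,
        default=DEFAULT_PORT_MIN,
        help="Lowest port Dynamo may allocate (inclusive).",
    )
    parser.add_argument(
        "--dynamo-port-max",
        type=int,
        default=DEFAULT_PORT_MAX,
        help="Highest port Dynamo may allocate (inclusive).",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--dynamo-port-min", "21000", "--dynamo-port-max", "21100"]
)
```

argparse converts the dashes to underscores, so the values are read as `args.dynamo_port_min` and `args.dynamo_port_max`.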


133-136: Proper instantiation of DynamoPortRange with validation.

The DynamoPortRange class will validate the port range constraints automatically.


149-164: Clean refactoring for KV port allocation.

Good use of the new port allocation abstractions with proper metadata and conditional allocation based on prefix caching.

@alec-flowers
Contributor Author

I've tested this manually with both TP and DP, for the NIXL ports as well as the KV events ports, and it looks good to me.

@ptarasiewiczNV ptarasiewiczNV left a comment


4 participants