-
Notifications
You must be signed in to change notification settings - Fork 675
fix: add better port logic #2175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change introduces a new Changes
Sequence Diagram(s)sequenceDiagram
participant CLI/User
participant ArgsModule as args.py
participant PortsModule as ports.py
participant EtcdClient
participant OS
CLI/User->>ArgsModule: parse_args() with --dynamo-port-min/max
ArgsModule->>PortsModule: Import DynamoPortRange, EtcdContext, etc.
ArgsModule->>PortsModule: Create EtcdContext
ArgsModule->>PortsModule: allocate_and_reserve_port_block(request)
PortsModule->>OS: hold_ports(ports)
PortsModule->>EtcdClient: reserve_port_in_etcd(context, port, metadata)
PortsModule->>OS: check_port_available(port)
PortsModule->>ArgsModule: Return allocated ports
ArgsModule->>PortsModule: get_host_ip()
ArgsModule->>OS: Set environment variables
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🔭 Outside diff range comments (1)
.gitignore (1)
87-93: Remove duplicate.build/entryThe
.build/pattern is already present on line 87. The duplicate entry on line 92 should be removed.Apply this diff to remove the duplicate:
.build/ **/.devcontainer/.env TensorRT-LLM - -# Local build artifacts for devcontainer -.build/
🧹 Nitpick comments (3)
components/backends/vllm/src/dynamo/tests/test_ports.py (2)
66-81: Consider adding test coverage for basic metadata conversion.While the test covers the block_info scenario well, consider adding a test case for metadata without block_info and verifying all fields in the output (worker_id, reason, reserved_at, pid).
Add this test method to the class:
def test_to_etcd_value_without_block_info(self): """Test converting metadata to ETCD value without block info.""" metadata = PortMetadata( worker_id="test-worker", reason="test-reason", ) with patch("time.time", return_value=1234567890): with patch("os.getpid", return_value=12345): value = metadata.to_etcd_value() assert value["worker_id"] == "test-worker" assert value["reason"] == "test-reason" assert value["reserved_at"] == 1234567890 assert value["pid"] == 12345 assert "block_index" not in value assert "block_size" not in value assert "block_start" not in value
232-259: Consider adding failure scenario tests for single port allocation.While the success case is covered, consider adding tests for:
- When all ports in the range are occupied
- When ETCD reservation fails
- When max_attempts is exceeded
components/backends/vllm/src/dynamo/vllm/args.py (1)
165-202: Well-implemented NIXL port block allocation with proper validation.The logic correctly allocates a block of ports for tensor parallel workers and calculates the base port. The validation for negative base port is a valuable addition.
Consider including the actual port range in the error message for better debugging:
- f"base_port={base_side_channel_port}. Consider using a higher port range." + f"base_port={base_side_channel_port}. Current range: {config.port_range.min}-{config.port_range.max}. " + f"Consider using a higher port range."
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.gitignore(1 hunks)components/backends/vllm/src/dynamo/tests/README.md(1 hunks)components/backends/vllm/src/dynamo/tests/__init__.py(1 hunks)components/backends/vllm/src/dynamo/tests/pytest.ini(1 hunks)components/backends/vllm/src/dynamo/tests/test_ports.py(1 hunks)components/backends/vllm/src/dynamo/vllm/args.py(5 hunks)components/backends/vllm/src/dynamo/vllm/ports.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
components/backends/vllm/src/dynamo/tests/README.md (1)
Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.
components/backends/vllm/src/dynamo/tests/__init__.py (1)
Learnt from: biswapanda
PR: #1412
File: lib/bindings/python/src/dynamo/runtime/logging.py:100-100
Timestamp: 2025-06-06T21:48:35.214Z
Learning: In the Dynamo codebase, BentoML has been completely removed from all executable code, with only documentation and attribution references remaining. The error_loggers configuration in lib/bindings/python/src/dynamo/runtime/logging.py should not include "bentoml" since those modules no longer exist.
🧬 Code Graph Analysis (1)
components/backends/vllm/src/dynamo/tests/test_ports.py (1)
components/backends/vllm/src/dynamo/vllm/ports.py (13)
DynamoPortRange(44-58)EtcdContext(62-71)PortAllocationRequest(96-103)PortBinding(106-138)PortMetadata(75-92)allocate_and_reserve_port(277-308)allocate_and_reserve_port_block(186-274)check_port_available(159-167)get_host_ip(311-338)hold_ports(142-156)reserve_port_in_etcd(170-183)make_port_key(68-71)to_etcd_value(82-92)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (22)
.gitignore (1)
94-96: LGTM!The addition of
.coverageto gitignore is appropriate for excluding pytest coverage data files.components/backends/vllm/src/dynamo/vllm/ports.py (4)
51-54: Consider using inclusive boundary check for clarityThe current validation allows
max=49151which is correct, but the conditionself.max > 49151could be more clearly expressed asself.max > 49151(which is what you have). This is fine as-is.
106-139: LGTM!The
PortBindingclass correctly implements the context manager protocol with proper resource cleanup in both success and error cases.
186-275: LGTM!The
allocate_and_reserve_port_blockfunction implements a robust allocation strategy with proper race condition prevention by holding socket bindings during ETCD reservation. The retry logic and error handling are well-designed.
311-339: LGTM!The
get_host_ipfunction implements a robust IP resolution strategy with proper fallbacks and bindability testing. The error handling and logging are comprehensive.components/backends/vllm/src/dynamo/tests/__init__.py (1)
1-5: LGTM!Standard package initialization file with appropriate copyright header and docstring.
components/backends/vllm/src/dynamo/tests/README.md (1)
19-21: Coverage module path is correct—no adjustment neededThe
dynamo.vllm.portsmodule exists atcomponents/backends/vllm/src/dynamo/vllm/ports.pyand is properly packaged (there’s an__init__.pyin thevllmdirectory). The--cov=dynamo.vllm.portsflag accurately targets this module and requires no change.Likely an incorrect or invalid review comment.
components/backends/vllm/src/dynamo/tests/pytest.ini (1)
1-25: LGTM!The pytest configuration is well-structured with appropriate settings for async tests, import paths, and output formatting.
components/backends/vllm/src/dynamo/tests/test_ports.py (9)
1-26: Well-structured test module with comprehensive imports.The imports are properly organized and include all necessary testing utilities and components from the ports module.
27-51: Comprehensive validation tests for DynamoPortRange.The test class thoroughly covers all validation scenarios including valid ranges, out-of-bounds ranges, and invalid min/max relationships.
53-64: Good test coverage for ETCD key generation.The test properly mocks the hostname and validates the expected key format.
83-152: Excellent test coverage for PortBinding context manager.The tests thoroughly validate single and multiple port binding scenarios, proper cleanup on exit, and error handling for partial binding failures. Good use of actual socket operations to ensure real-world behavior.
154-178: Good coverage for hold_ports wrapper function.Tests properly verify both single and multiple port scenarios and the correct return type.
180-200: Complete test coverage for port availability checks.Both positive and negative scenarios are properly tested.
202-230: Well-structured async test for ETCD port reservation.Properly uses AsyncMock for async methods and regular Mock for synchronous methods. Good validation of the serialized JSON value.
260-313: Good test coverage for port block allocation.The tests cover successful allocation and validation for insufficient port range. The assertions properly verify contiguous port allocation.
331-333: Standard pytest execution block.components/backends/vllm/src/dynamo/vllm/args.py (5)
15-26: Clean import refactoring to use the new ports module.All imported components from the ports module are properly utilized in the refactored code.
44-44: Good addition of typed port_range attribute.Using the DynamoPortRange type provides proper validation and encapsulation.
75-86: Well-documented CLI arguments for port range configuration.The arguments have clear help text and use appropriate defaults from the ports module.
133-136: Proper instantiation of DynamoPortRange with validation.The DynamoPortRange class will validate the port range constraints automatically.
149-164: Clean refactoring for KV port allocation.Good use of the new port allocation abstractions with proper metadata and conditional allocation based on prefix caching.
a7b70a0 to
28b51c6
Compare
c47a1fc to
d9c09e6
Compare
|
I've tested this manually for both tp and dp for the NIXL ports as well as the Kv Events ports and it looks good to me. |
ptarasiewiczNV
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pipeline to test: https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo-ci/-/pipelines/32526987
Overview:
A number of users have been noting port conflict errors. This is being caused by a race condition as we reserve ports but then hand them off to vLLM to bind. It tried to fix this by adding a reservation system to ETCD, however there are many ephermeral ports that vLLM uses that we aren't able to capture and reserve unless we directly modify vllm.
Another problem is under the hood, vLLM was allocating more ports for NIXL than we realized. Now we mimic this logic on the frontend to make sure we can register and reserve the correct ports.
Here, we select a range of ports to allocate as a DynamoPortRange. Inside this range we make the assumption that ports are
It is the users responsibility to adapt this range so both 1 and 2 remain true. If something is allocating ports dynamically within this range during startup, race conditions can occur.