-
Notifications
You must be signed in to change notification settings - Fork 674
feat: Port vllm port allocator to Rust in bindings #3125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughPorts allocation is migrated from an ETCD-based mechanism to a DistributedRuntime-based API across vLLM components. Python call sites now pass runtime and namespace. Rust/PyO3 bindings add DistributedRuntime.allocate_port_block with input validation and atomic reservation. Types and signatures in ports/args/main are updated accordingly; publisher comments are adjusted. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor W as Worker
participant A as vLLM args.py
participant P as vLLM ports.py
participant R as DistributedRuntime (bindings)
W->>A: await configure_ports(runtime, config)
A->>P: allocate_and_reserve_port(runtime, namespace, metadata, range)
P->>R: allocate_port_block(namespace, min, max, block_size, context)
Note right of R: Validates inputs<br/>randomizes candidates<br/>binds sockets<br/>atomic reservation
R-->>P: [ports]
P-->>A: port(s)
A-->>W: configured ports
Note over R,P: Errors: release sockets, cleanup, retry (bounded)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/backends/vllm/src/dynamo/vllm/publisher.py (1)
163-171: DEPRECATE factory setters: add deprecation note + type hints; keep behavior for one release.Add DEPRECATED comments and type annotations on the two setters so callers see the deprecation now while retaining behavior for one release; update/flush call sites listed below.
Call sites to update:
- components/backends/vllm/src/dynamo/vllm/publisher.py (setters definitions).
- components/backends/vllm/src/dynamo/vllm/main.py: lines ~205-206 (factory.set_num_gpu_blocks_all / set_request_total_slots_all).
- examples/multimodal/components/publisher.py: lines ~175-181 (publisher setters).
- examples/multimodal/components/worker.py: lines ~150-155 (calls to stats_logger.set_num_gpu_blocks_all / set_request_total_slots_all).
Suggested diff (apply to the publisher methods):
- # TODO Remove once we publish metadata to shared storage - def set_num_gpu_blocks_all(self, num_blocks): + # DEPRECATED: prefer shared metadata; slated for removal. + def set_num_gpu_blocks_all(self, num_blocks: int) -> None: if self.created_logger: self.created_logger.set_num_gpu_block(num_blocks) @@ - def set_request_total_slots_all(self, request_total_slots): + # DEPRECATED: prefer shared metadata; slated for removal. + def set_request_total_slots_all(self, request_total_slots: int) -> None: if self.created_logger: self.created_logger.set_num_request_total_slots(request_total_slots)
🧹 Nitpick comments (14)
components/backends/vllm/src/dynamo/vllm/publisher.py (5)
53-56: Replace ad‑hoc setter with typed, deprecated API and plan metadata move.These fields should come from shared storage/runtime metadata. Until then, add type hints and mark as deprecated to ease removal.
- # TODO: Remove this and pass as metadata through shared storage - def set_num_gpu_block(self, num_blocks): - self.num_gpu_block = num_blocks + # DEPRECATED: prefer metadata from shared storage; slated for removal. + def set_num_gpu_block(self, num_blocks: int) -> None: + self.num_gpu_block = int(num_blocks)
57-60: Same here: add typing and deprecate.Mirror the approach for total slots.
- # TODO: Remove this and pass as metadata through shared storage - def set_num_request_total_slots(self, request_total_slots): - self.request_total_slots = request_total_slots + # DEPRECATED: prefer metadata from shared storage; slated for removal. + def set_num_request_total_slots(self, request_total_slots: int) -> None: + self.request_total_slots = int(request_total_slots)
69-89: Don’t republish static metadata on every record; clamp/round KV blocks and hit‑rate.
- request_total_slots and kv_total_blocks are static per model/GPU; publish once (e.g., init) and omit from hot‑path updates.
- kv_active_blocks uses truncation; prefer clamped round to avoid under‑count and float drift.
- Bound hit_rate to [0, 1].
- # they should be part of some runtime metadata tied to MDC or put in shared storage ? + # they should be part of runtime metadata (MDC/shared storage), not republished each tick. @@ - hit_rate = 0 - if scheduler_stats.prefix_cache_stats.queries > 0: - hit_rate = ( - scheduler_stats.prefix_cache_stats.hits - / scheduler_stats.prefix_cache_stats.queries - ) + hit_rate = 0.0 + if scheduler_stats.prefix_cache_stats.queries > 0: + hit_rate = scheduler_stats.prefix_cache_stats.hits / scheduler_stats.prefix_cache_stats.queries + hit_rate = max(0.0, min(1.0, float(hit_rate))) @@ - kv_stats = KvStats( - kv_active_blocks=int(self.num_gpu_block * scheduler_stats.kv_cache_usage), + usage = max(0.0, min(1.0, float(scheduler_stats.kv_cache_usage))) + active_blocks = int(round(self.num_gpu_block * usage)) + active_blocks = max(0, min(self.num_gpu_block, active_blocks)) + kv_stats = KvStats( + kv_active_blocks=active_blocks, kv_total_blocks=self.num_gpu_block, - gpu_cache_usage_perc=scheduler_stats.kv_cache_usage, + gpu_cache_usage_perc=usage,
150-158: Auto‑initialize metrics endpoint once created.Call init_publish here so first snapshot has static metadata set and zeros elsewhere.
logger = DynamoStatLoggerPublisher( self.component, dp_rank, metrics_labels=self.metrics_labels ) self.created_logger = logger - - return logger + logger.init_publish() + return logger
61-66: iteration_stats and engine_idx are unused.If vLLM doesn’t require them, prefix with underscores to signal intent; otherwise, integrate into metrics.
- def record( - self, - scheduler_stats: SchedulerStats, - iteration_stats: Optional[IterationStats], - engine_idx: int = 0, - ): + def record( + self, + scheduler_stats: SchedulerStats, + _iteration_stats: Optional[IterationStats], + _engine_idx: int = 0, + ) -> None:components/backends/vllm/src/dynamo/vllm/main.py (1)
204-204: Consider improving the TODO comment.The comment could be more specific about what needs to be done.
- # TODO Hack to get data, move this to registering in shared storage somewhere + # TODO: Move GPU blocks and slot configuration to a centralized registry servicecomponents/backends/vllm/src/dynamo/vllm/args.py (1)
276-276: Improve error message clarity.The error message could be more helpful for users.
- "config.kv_port is not set; call configure_ports(...) before overwrite_args " + "config.kv_port is not set; ensure configure_ports(...) is called before overwrite_args "lib/bindings/python/rust/lib.rs (2)
556-566: Consider the user experience for EtcdKvCache instantiation.While redirecting users to the factory method is good, the error message could be clearer.
- Err(PyErr::new::<pyo3::exceptions::PyRuntimeError, _>( - "EtcdKvCache must be created using the 'new' class method", - )) + Err(PyErr::new::<pyo3::exceptions::PyRuntimeError, _>( + "EtcdKvCache cannot be instantiated directly. Use EtcdKvCache.create(...) instead", + ))
528-538: Consider IPv6 support in socket binding.Function forces IPv4 (socket2::Domain::IPV4). Repo already references IPv6 fallbacks (lib/bindings/python/rust/lib.rs:547–548, lib/runtime/src/pipeline/network/tcp/server.rs:130) and metrics accepts IPv4/IPv6 (components/metrics/src/lib.rs:194). Make the domain configurable, attempt IPv6 if IPv4 bind fails, or create an IPv6 dual‑stack socket (clear IPV6_V6ONLY) to cover both.
components/backends/vllm/src/dynamo/vllm/ports.py (5)
13-13: Use type-only import to avoid import-time dependency/cycles.Avoid importing
DistributedRuntimeat module import; gate it under TYPE_CHECKING and quote annotations in signatures.-from dynamo.runtime import DistributedRuntime +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + from dynamo.runtime import DistributedRuntime-async def allocate_and_reserve_port_block( - runtime: DistributedRuntime, namespace: str, request: PortAllocationRequest -) -> list[int]: +async def allocate_and_reserve_port_block( + runtime: "DistributedRuntime", namespace: str, request: PortAllocationRequest +) -> list[int]:-async def allocate_and_reserve_port( - runtime: DistributedRuntime, - namespace: str, +async def allocate_and_reserve_port( + runtime: "DistributedRuntime", + namespace: str,Also applies to: 67-69, 100-101
71-79: Validate inputs early and fix docstring; remove stale comment.Add explicit guards for
block_sizevs range length; documentruntimeandnamespace; drop the obsolete comment.- """ - Allocate a contiguous block of ports from the specified range and atomically reserve them. - Returns a list of all allocated ports in order. - - Args: - request: PortAllocationRequest containing all allocation parameters - - Returns: - list[int]: List of all allocated ports in ascending order - """ - # Create a list of valid starting ports (must have room for the entire block) + """ + Allocate a contiguous block of ports from the specified range and atomically reserve them. + Returns a list of all allocated ports in order. + + Args: + runtime: Distributed runtime used to perform the atomic reservation. + namespace: Logical namespace for the reservation keys. + request: PortAllocationRequest containing all allocation parameters. + + Returns: + list[int]: List of all allocated ports in ascending order. + """ + # Validate request before crossing the FFI boundary + range_len = request.port_range.max - request.port_range.min + 1 + if request.block_size < 1: + raise ValueError("block_size must be >= 1") + if request.block_size > range_len: + raise ValueError( + f"block_size {request.block_size} exceeds range length {range_len} " + f"({request.port_range.min}-{request.port_range.max})" + )Also applies to: 80-81, 90-96
48-55: Add dataclass-level validation for block_size.Backstop invalid
block_sizeat construction time; catches bugs closer to the source.@dataclass class PortAllocationRequest: """Parameters for port allocation""" metadata: PortMetadata port_range: DynamoPortRange block_size: int = 1 + + def __post_init__(self): + if self.block_size < 1: + raise ValueError("block_size must be >= 1") + range_len = self.port_range.max - self.port_range.min + 1 + if self.block_size > range_len: + raise ValueError( + f"block_size {self.block_size} exceeds range length {range_len} " + f"({self.port_range.min}-{self.port_range.max})" + )
106-115: Docstring args incomplete; add defensive check for empty result.Document
runtimeandnamespace; guard against an unexpected empty return.- """ - Allocate a port from the specified range and atomically reserve it. - This is a convenience wrapper around allocate_and_reserve_port_block with block_size=1. - - Args: - metadata: Port metadata / context - port_range: DynamoPortRange object specifying min and max ports to try - - Returns: - int: The allocated port number - """ + """ + Allocate a port from the specified range and atomically reserve it. + Convenience wrapper around allocate_and_reserve_port_block with block_size=1. + + Args: + runtime: Distributed runtime used to perform the atomic reservation. + namespace: Logical namespace for the reservation keys. + metadata: Port metadata / context. + port_range: Port range to search (inclusive). + + Returns: + int: The allocated port number. + """ @@ - allocated_ports = await allocate_and_reserve_port_block(runtime, namespace, request) - return allocated_ports[0] # Return the single allocated port + allocated_ports = await allocate_and_reserve_port_block(runtime, namespace, request) + if not allocated_ports: + raise RuntimeError("allocate_port_block returned no ports") + return allocated_ports[0]Also applies to: 121-122
41-46: Update ETCD wording in metadata docstring.This module no longer exposes ETCD; make the comment runtime-agnostic.
- """Metadata to store with port reservations in ETCD""" + """Metadata attached to port reservations in the distributed runtime"""
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
lib/bindings/python/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
components/backends/vllm/src/dynamo/vllm/args.py(5 hunks)components/backends/vllm/src/dynamo/vllm/main.py(3 hunks)components/backends/vllm/src/dynamo/vllm/ports.py(2 hunks)components/backends/vllm/src/dynamo/vllm/publisher.py(3 hunks)lib/bindings/python/Cargo.toml(1 hunks)lib/bindings/python/rust/lib.rs(3 hunks)lib/bindings/python/src/dynamo/_core.pyi(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-01T13:55:03.940Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
Applied to files:
components/backends/vllm/src/dynamo/vllm/main.py
🧬 Code graph analysis (5)
lib/bindings/python/src/dynamo/_core.pyi (1)
lib/bindings/python/rust/lib.rs (2)
allocate_port_block(383-509)namespace(373-378)
components/backends/vllm/src/dynamo/vllm/main.py (1)
components/backends/vllm/src/dynamo/vllm/args.py (4)
Config(39-64)configure_ports(198-258)overwrite_args(336-363)parse_args(67-195)
components/backends/vllm/src/dynamo/vllm/ports.py (2)
lib/bindings/python/src/dynamo/_core.pyi (5)
DistributedRuntime(31-62)namespace(38-42)block_size(625-629)block_size(648-652)allocate_port_block(51-56)lib/bindings/python/rust/lib.rs (2)
namespace(373-378)allocate_port_block(383-509)
lib/bindings/python/rust/lib.rs (4)
lib/runtime/src/pipeline/network/egress/push_router.rs (2)
rand(169-169)err(245-245)lib/bindings/python/src/dynamo/_core.pyi (5)
namespace(38-42)block_size(625-629)block_size(648-652)allocate_port_block(51-56)new(117-135)lib/runtime/src/distributed.rs (2)
namespace(216-218)etcd_client(269-271)lib/bindings/python/rust/llm/entrypoint.rs (1)
to_pyerr(285-290)
components/backends/vllm/src/dynamo/vllm/args.py (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
DistributedRuntime(31-62)namespace(38-42)components/backends/vllm/src/dynamo/vllm/ports.py (1)
allocate_and_reserve_port_block(67-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (15)
components/backends/vllm/src/dynamo/vllm/publisher.py (1)
101-108: Consider skipping publish until metadata is initialized.StatLoggerFactory.init_publish() is invoked in components/backends/vllm/src/dynamo/vllm/main.py (≈ lines 205–207) but only after AsyncLLM.from_vllm_config(...) creates engine_client — a DynamoStatLoggerPublisher may be constructed and emit metrics with default num_gpu_block/request_total_slots=1 before metadata is set. Move init_publish() to before engine_client creation or gate publishing inside DynamoStatLoggerPublisher until the setters run.
lib/bindings/python/Cargo.toml (1)
40-43: LGTM! Dependencies align with port allocation requirements.The added dependencies are appropriate for the port allocation functionality:
local-ip-addressfor detecting the local IPrandfor randomizing port selection to reduce contentionsocket2for low-level socket binding to verify port availabilitycomponents/backends/vllm/src/dynamo/vllm/main.py (2)
25-25: LGTM! Clean migration from ETCD-based to runtime-based port configuration.The import change properly reflects the new runtime-based approach.
66-66: LGTM! Simplified port configuration call.The change correctly passes the runtime object to the new
configure_portsfunction, removing the dependency on ETCD client.components/backends/vllm/src/dynamo/vllm/args.py (4)
15-15: LGTM! Correct import for runtime-based approach.The import of
DistributedRuntimealigns with the new port allocation strategy.
198-199: LGTM! Clean function signature update.The function signature properly accepts the runtime object and maintains backward compatibility through the Config parameter.
208-212: LGTM! Correct usage of the new runtime-based port allocation.The code properly uses the runtime object for port allocation with appropriate parameters.
236-238: LGTM! Proper usage of block allocation for NIXL ports.The block allocation correctly reserves consecutive ports needed for NIXL side channel communication.
lib/bindings/python/rust/lib.rs (7)
11-11: LGTM! Appropriate use of rand for port randomization.The import is used correctly to reduce contention during port allocation.
382-509: Well-implemented port allocation with proper error handling and resource cleanup.The implementation shows excellent practices:
- Input validation for block size
- Randomized candidate selection to reduce contention
- Proper socket binding to ensure ports are available
- Atomic ETCD reservation with rollback on failure
- Clean resource management with RAII for sockets
393-397: Consider a more specific error type for validation errors.Using
PyValueErrorfor input validation is appropriate.
417-418: Good optimization - limiting candidates to avoid excessive attempts.The code properly caps the number of candidates to
MAX_ALLOCATE_ATTEMPTSto prevent excessive iterations.
487-493: Excellent cleanup logic for partial reservations.The code properly cleans up any partially reserved ports in ETCD when a reservation fails midway. The warning log for cleanup failures is appropriate since these are best-effort operations.
544-552: Good IPv6 fallback implementation.The function properly attempts IPv4 first and falls back to IPv6 if needed, which is a robust approach.
445-456: Verify socket binding behavior across platforms.socket2 is used (lib/bindings/python/rust/lib.rs — bind_tcp_port; lib/runtime/src/pipeline/network/tcp/server.rs); I didn't find cfg(target_os) branches in the socket code. Verify:
- SO_REUSEADDR vs SO_REUSEPORT semantics when binding multiple sockets.
- IPV6_V6ONLY / IPv4-vs-IPv6 (dual-stack) behavior across Linux/macOS/Windows.
- Windows-specific differences (from_raw_fd vs from_raw_socket, FD semantics) and any assumptions about Unix FDs.
- Add platform-gated handling or CI tests if behavior differs.
- Because port allocation code is duplicated in several backend components. - Because the bindings now hide etcd, making it easier to replace. I tried to stay faithful to the original vllm port allocation and reservation code. Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
a17b300 to
0828708
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty happy with this. Have a couple of questions, and a couple of nits which can be ignored.
Signed-off-by: Graham King <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved based on @grahamking's "fixing it" comment.
I tried to stay faithful to the original vllm port allocation and reservation code.
Summary by CodeRabbit