[Core] Replace routing replay with device cache and async D2H pipeline #39917
TomerBN-Nvidia wants to merge 9 commits into vllm-project:main from
Conversation
Code Review
This pull request refactors the expert routing capture mechanism to use a GPU device cache and an asynchronous D2H pipeline, replacing the previous shared-memory implementation. Key changes include the addition of routed_experts to OpenAI response protocols and the integration of routing data flow through the ModelRunnerOutput in the V1 scheduler. Feedback highlights several critical issues: a potential IndexError due to a non-resetting global layer ID counter in FusedMoE, a bug where extracted routing data is not correctly assigned to request objects in the scheduler, and a memory leak in the host cache caused by missing cleanup logic for finished requests.
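The review's first point (a non-resetting layer ID counter causing a potential IndexError) can be illustrated with a minimal sketch. The class and method names below are hypothetical stand-ins, not the actual vLLM `FusedMoE` implementation: if a class-level auto-increment counter is never reset between model builds, layer IDs from a second model exceed the first dimension of the `(L, N, K)` routing buffer.

```python
# Hypothetical sketch of the class-level layer-ID counter pattern the review
# flags (names are illustrative, not the vLLM FusedMoE API).
class FusedMoELike:
    _next_moe_layer_id = 0  # class-level auto-increment counter

    def __init__(self):
        # Each MoE layer grabs the next ID; this indexes into the (L, N, K)
        # routing buffer, so IDs must stay in [0, num_moe_layers).
        self.moe_layer_id = FusedMoELike._next_moe_layer_id
        FusedMoELike._next_moe_layer_id += 1

    @classmethod
    def reset_layer_ids(cls):
        # Without this reset, a second model build would continue counting
        # from the first model's last ID and overrun the buffer -> IndexError.
        cls._next_moe_layer_id = 0


layers_a = [FusedMoELike() for _ in range(3)]  # IDs 0, 1, 2
FusedMoELike.reset_layer_ids()
layers_b = [FusedMoELike() for _ in range(3)]  # IDs 0, 1, 2 again
```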
Documentation preview: https://vllm--39917.org.readthedocs.build/en/39917/
Replace the shared-memory routing replay implementation with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism.

Architecture changes:
- Rewrite routed_experts_capturer.py: device cache (L, N, K) int16 buffer + per-request host cache + async D2H via CUDA events
- Remove SharedMemory, fcntl locking, RoutedExpertsReader
- Route data through ModelRunnerOutput (Ray DAG) instead of shared memory (enables multi-node)
- Per-layer CUDA graph static marking for buffer views
- Add moe_layer_id auto-increment to FusedMoE for buffer binding
- Wire routed_experts to OpenAI API response
- Capture routing in non-monolithic (Triton) path via topk_ids copy
- Unit tests for device cache and host cache

RFC: vllm-project#39701

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
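The device-cache layout the commit describes can be sketched on the CPU with numpy. This is an illustrative model of the buffer shape only, not the vLLM implementation: `L` MoE layers, `N` batched tokens, `K` top-k experts, stored as one contiguous int16 buffer with one view per layer.

```python
import numpy as np

# CPU sketch (assumed layout, not the vLLM code) of the (L, N, K) int16
# device cache: L = num MoE layers, N = max batched tokens, K = top_k.
L, N, K = 4, 8, 2
device_cache = np.full((L, N, K), -1, dtype=np.int16)

# Per-layer views: each FusedMoE layer writes its top-k expert IDs into its
# own slice, so one contiguous buffer serves the whole forward pass. Keeping
# the views stable is what lets CUDA graphs treat the buffer as static.
layer_views = [device_cache[layer_id] for layer_id in range(L)]

# Layer 2 recording routing for 3 tokens in this step (illustrative values):
layer_views[2][:3] = np.array([[0, 5], [3, 1], [7, 2]], dtype=np.int16)
```

Writing through the view mutates the shared buffer in place, which mirrors why the real implementation can mark a single persistent tensor static for CUDA graph capture.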
Add comprehensive documentation for the routing replay feature covering architecture, usage (API server + Python SDK), output format, design decisions, performance, and supported configurations.

Key sections:
- Quickstart with code examples for OpenAI API and Python SDK
- Output format: prompt vs generation routing split
- Architecture: device cache, host cache, async D2H pipeline
- Design decisions: why SharedMemory was replaced, buffer layout, int16 dtype, prompt/gen split, async D2H, symmetric TP buffers
- Performance benchmarks and supported configurations
- API reference for completions and chat completions

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
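On the consumer side, the PR documents `routed_experts` as having shape `[seq_len, num_moe_layers, top_k]`. A hedged sketch of parsing that field, with an illustrative payload rather than the exact OpenAI-protocol JSON:

```python
import numpy as np

# Illustrative response payload (not the exact protocol JSON): the PR
# documents routed_experts as shape [seq_len, num_moe_layers, top_k].
response_field = [  # 2 tokens, 3 MoE layers, top_k = 2
    [[0, 5], [3, 1], [7, 2]],
    [[4, 0], [2, 6], [1, 3]],
]
routed = np.asarray(response_field, dtype=np.int16)
seq_len, num_moe_layers, top_k = routed.shape

# Slice by the middle axis to see which experts one layer chose per token:
layer1_choices = routed[:, 1, :]
```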
Previous cleanup (re-applied on updated base):
- Fix num_scheduled_tokes → num_scheduled_tokens typo in API signatures
- Move FusedMoE._next_moe_layer_id below class docstring
- Remove dead code: _get_routed_experts, pass statement, stale comments
- Convert f-string logging to %-style (vLLM convention)
- Replace Optional[X] with X | None, add from __future__ import annotations
- Remove unused params: max_running_requests, use_shared_memory, num_fused_shared_experts (on DeviceCache), forward_batch
- Replace get_tensor_size_bytes() with .nbytes
- Extract extract_routed_experts_for_current_batch to capturer module

New cleanup for prompt_routed_experts additions:
- Remove string-quoted type hint in output_processor._new_request_output
- Merge duplicated if-final_res_batch guards in completion serving
- Replace getattr with direct attribute access for prompt_routed_experts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Move the prompt/generation routing split + MTP clipping logic into routed_experts_capturer.py as split_routed_experts(). The output processor call site shrinks from 16 lines to 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
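The split-plus-clipping step the commit centralizes can be sketched as follows. The function name and signature here are hypothetical (the PR's version is `split_routed_experts` in `routed_experts_capturer.py`); the idea is to drop trailing speculative (MTP) rows that were captured but never accepted, then split the remaining rows into prompt and generation halves.

```python
import numpy as np

# Hypothetical sketch of the prompt/generation split + MTP clipping; the
# real helper is split_routed_experts() in routed_experts_capturer.py.
def split_prompt_generation(routed, num_prompt_tokens, num_output_tokens):
    """routed: [seq_len, L, K] per-token routing rows.

    Clip trailing speculative (MTP) rows beyond the accepted output length,
    then split into prompt rows and generation rows.
    """
    routed = routed[: num_prompt_tokens + num_output_tokens]  # MTP clipping
    return routed[:num_prompt_tokens], routed[num_prompt_tokens:]


# 5 rows captured, but only 3 prompt + 1 accepted output token -> clip 1 row.
routed = np.arange(5 * 2 * 2, dtype=np.int16).reshape(5, 2, 2)
prompt_part, gen_part = split_prompt_generation(routed, 3, 1)
```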
Move the async D2H copy logic (ordered dict construction, pinned positions copy, sync_fwd_experts_buffer_DtoH call) into routed_experts_capturer.py as issue_routing_d2h_copy(). The inline comment from the call site moved into the function docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
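The gather-and-scatter half of that D2H step can be simulated on the CPU. This is a numpy model of the data movement only: in the real pipeline the copy goes through pinned host memory and is ordered by CUDA events, and all names below are illustrative rather than the vLLM API.

```python
import numpy as np

# CPU-only simulation (assumed names) of the D2H step: for each request,
# gather its rows from the (L, N, K) device buffer at that request's batch
# positions and append them to a per-request host cache, reordered to
# [num_tokens, L, K] so each row is one token's routing across layers.
L, N, K = 2, 6, 2
device_cache = np.arange(L * N * K, dtype=np.int16).reshape(L, N, K)
host_cache: dict[str, list[np.ndarray]] = {}


def issue_d2h_copy(req_positions: dict[str, list[int]]) -> None:
    for req_id, positions in req_positions.items():
        rows = device_cache[:, positions, :].transpose(1, 0, 2).copy()
        host_cache.setdefault(req_id, []).append(rows)


# Request "req-a" occupied batch slots 0 and 1 this step; "req-b" slot 2.
issue_d2h_copy({"req-a": [0, 1], "req-b": [2]})
```

The `.copy()` matters: once the host snapshot is taken, the device buffer slots can be reused by the next forward pass without corrupting earlier requests' data.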
- ruff-check: fix line-too-long (E501), add abstractmethod to RoutedExpertsCapturer (B024), use contextlib.suppress (SIM105), remove unused numpy import (F401)
- ruff-format: apply formatting to 8 files
- mypy: add assert guards for Optional fields in _scatter_to_host
- markdownlint: fix table alignment (MD060), add code fence lang (MD040)
- SPDX: add Apache-2.0 license header to routed_experts_capturer.py

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Add cleanup in _update_states to free routed experts host cache buffers when requests finish or are preempted. Without this, the per-request numpy buffers in _RoutedExpertsHostCache accumulate indefinitely. Ported from the production fork, where this cleanup exists in the same location.

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Move host cache cleanup for finished/preempted requests into routed_experts_capturer.py as free_routing_buffers(). Removes the unnecessary hasattr guard (preempted_req_ids is a dataclass field and always present; None is handled by the truthiness check).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: tbarnatan <tbarnatan@nvidia.com>
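The cleanup contract these two commits describe can be sketched in a few lines. The class and method names are illustrative stand-ins for `_RoutedExpertsHostCache` / `free_routing_buffers()`: the point is that finished and preempted request IDs both trigger buffer release, and unknown IDs are tolerated.

```python
import numpy as np

# Sketch (assumed names, not the vLLM API) of the host-cache cleanup that
# prevents per-request routing buffers from accumulating indefinitely.
class HostCache:
    def __init__(self):
        self._buffers: dict[str, np.ndarray] = {}

    def add(self, req_id: str, rows: np.ndarray) -> None:
        self._buffers[req_id] = rows

    def free_routing_buffers(self, finished_req_ids, preempted_req_ids=None):
        # preempted_req_ids may be None; truthiness check covers that case,
        # so no hasattr guard is needed.
        for req_id in list(finished_req_ids) + list(preempted_req_ids or []):
            self._buffers.pop(req_id, None)  # tolerate already-freed ids


cache = HostCache()
cache.add("a", np.zeros((1, 2, 2), dtype=np.int16))
cache.add("b", np.zeros((1, 2, 2), dtype=np.int16))
cache.free_routing_buffers({"a"}, None)  # "a" freed, "b" still live
```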
Summary
Replace upstream vLLM routing replay with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism. This PR focuses on the core architecture change — monolithic kernel support and prefix caching are in a follow-up PR.
RFC: #39701
What this PR does
Replaces the SharedMemory-based routing replay with:
- (L, N, K) int16 device buffer with per-layer views
- ModelRunnerOutput → Ray DAG → scheduler (enables multi-node)

What this PR removes
- RoutedExpertsReader (shared memory reader)
- multiprocessing.SharedMemory usage
- fcntl file-based locking
- capture() callback mechanism in router
- (N, L, K) buffer layout (replaced with (L, N, K))

Changes
- routed_experts_capturer.py: device cache + async D2H pipeline
- moe_layer_id auto-increment to FusedMoE for buffer binding
- bind_routing_capture_to_model(): persistent tensor attribute + cudagraph_mark_tensor_static
- topk_ids.to(int16) copy
- ModelRunnerOutput instead of shared memory
- routed_experts to OpenAI API response

What is NOT in this PR (follow-up)
- Monolithic kernel support (routing_replay_out) — depends on [feat] Add routing_replay_out support to MoE kernels and Python API flashinfer-ai/flashinfer#3024
- -1 sentinel
- _monolithic_writes_routing_replay flag

Validation
Tested on GB200 GPUs with a 120B MoE model (BF16 Triton path, non-monolithic):
Performance: 2.0% throughput overhead on random data.
Accuracy: GSM8K pass@1 = 95.77% (identical to baseline).
API Compatibility
Fully preserved — same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape [seq_len, num_moe_layers, top_k].

Test Plan