[Core] Replace routing replay with device cache and async D2H pipeline #39917
TomerBN-Nvidia wants to merge 9 commits into vllm-project:main from
Conversation
Code Review
This pull request refactors the expert routing capture mechanism to use a GPU device cache and an asynchronous D2H pipeline, replacing the previous shared-memory implementation. Key changes include the addition of routed_experts to OpenAI response protocols and the integration of routing data flow through the ModelRunnerOutput in the V1 scheduler. Feedback highlights several critical issues: a potential IndexError due to a non-resetting global layer ID counter in FusedMoE, a bug where extracted routing data is not correctly assigned to request objects in the scheduler, and a memory leak in the host cache caused by missing cleanup logic for finished requests.
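The review's first point (a non-resetting layer ID counter causing a potential IndexError) can be illustrated with a minimal sketch. The class and method names below are hypothetical stand-ins, not the actual vLLM `FusedMoE` implementation: if a class-level auto-increment counter is never reset between model builds, layer IDs from a second model exceed the first dimension of the `(L, N, K)` routing buffer.

```python
# Hypothetical sketch of the class-level layer-ID counter pattern the review
# flags (names are illustrative, not the vLLM FusedMoE API).
class FusedMoELike:
    _next_moe_layer_id = 0  # class-level auto-increment counter

    def __init__(self):
        # Each MoE layer grabs the next ID; this indexes into the (L, N, K)
        # routing buffer, so IDs must stay in [0, num_moe_layers).
        self.moe_layer_id = FusedMoELike._next_moe_layer_id
        FusedMoELike._next_moe_layer_id += 1

    @classmethod
    def reset_layer_ids(cls):
        # Without this reset, a second model build would continue counting
        # from the first model's last ID and overrun the buffer -> IndexError.
        cls._next_moe_layer_id = 0


layers_a = [FusedMoELike() for _ in range(3)]  # IDs 0, 1, 2
FusedMoELike.reset_layer_ids()
layers_b = [FusedMoELike() for _ in range(3)]  # IDs 0, 1, 2 again
```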
Documentation preview: https://vllm--39917.org.readthedocs.build/en/39917/
Replace the shared-memory routing replay implementation with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism.

Architecture changes:
- Rewrite routed_experts_capturer.py: device cache (L, N, K) int16 buffer + per-request host cache + async D2H via CUDA events
- Remove SharedMemory, fcntl locking, RoutedExpertsReader
- Route data through ModelRunnerOutput (Ray DAG) instead of shared memory (enables multi-node)
- Per-layer CUDA graph static marking for buffer views
- Add moe_layer_id auto-increment to FusedMoE for buffer binding
- Wire routed_experts to OpenAI API response
- Capture routing in non-monolithic (Triton) path via topk_ids copy
- Unit tests for device cache and host cache

RFC: vllm-project#39701

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
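The device-cache layout the commit describes can be sketched on the CPU with numpy. This is an illustrative model of the buffer shape only, not the vLLM implementation: `L` MoE layers, `N` batched tokens, `K` top-k experts, stored as one contiguous int16 buffer with one view per layer.

```python
import numpy as np

# CPU sketch (assumed layout, not the vLLM code) of the (L, N, K) int16
# device cache: L = num MoE layers, N = max batched tokens, K = top_k.
L, N, K = 4, 8, 2
device_cache = np.full((L, N, K), -1, dtype=np.int16)

# Per-layer views: each FusedMoE layer writes its top-k expert IDs into its
# own slice, so one contiguous buffer serves the whole forward pass. Keeping
# the views stable is what lets CUDA graphs treat the buffer as static.
layer_views = [device_cache[layer_id] for layer_id in range(L)]

# Layer 2 recording routing for 3 tokens in this step (illustrative values):
layer_views[2][:3] = np.array([[0, 5], [3, 1], [7, 2]], dtype=np.int16)
```

Writing through the view mutates the shared buffer in place, which mirrors why the real implementation can mark a single persistent tensor static for CUDA graph capture.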
Add comprehensive documentation for the routing replay feature covering architecture, usage (API server + Python SDK), output format, design decisions, performance, and supported configurations.

Key sections:
- Quickstart with code examples for OpenAI API and Python SDK
- Output format: prompt vs generation routing split
- Architecture: device cache, host cache, async D2H pipeline
- Design decisions: why SharedMemory was replaced, buffer layout, int16 dtype, prompt/gen split, async D2H, symmetric TP buffers
- Performance benchmarks and supported configurations
- API reference for completions and chat completions

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
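On the consumer side, the PR documents `routed_experts` as having shape `[seq_len, num_moe_layers, top_k]`. A hedged sketch of parsing that field, with an illustrative payload rather than the exact OpenAI-protocol JSON:

```python
import numpy as np

# Illustrative response payload (not the exact protocol JSON): the PR
# documents routed_experts as shape [seq_len, num_moe_layers, top_k].
response_field = [  # 2 tokens, 3 MoE layers, top_k = 2
    [[0, 5], [3, 1], [7, 2]],
    [[4, 0], [2, 6], [1, 3]],
]
routed = np.asarray(response_field, dtype=np.int16)
seq_len, num_moe_layers, top_k = routed.shape

# Slice by the middle axis to see which experts one layer chose per token:
layer1_choices = routed[:, 1, :]
```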
Previous cleanup (re-applied on updated base):
- Fix num_scheduled_tokes → num_scheduled_tokens typo in API signatures
- Move FusedMoE._next_moe_layer_id below class docstring
- Remove dead code: _get_routed_experts, pass statement, stale comments
- Convert f-string logging to %-style (vLLM convention)
- Replace Optional[X] with X | None, add from __future__ import annotations
- Remove unused params: max_running_requests, use_shared_memory, num_fused_shared_experts (on DeviceCache), forward_batch
- Replace get_tensor_size_bytes() with .nbytes
- Extract extract_routed_experts_for_current_batch to capturer module

New cleanup for prompt_routed_experts additions:
- Remove string-quoted type hint in output_processor._new_request_output
- Merge duplicated if-final_res_batch guards in completion serving
- Replace getattr with direct attribute access for prompt_routed_experts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Move the prompt/generation routing split + MTP clipping logic into routed_experts_capturer.py as split_routed_experts(). The output processor call site shrinks from 16 lines to 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
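The split-plus-clipping step the commit centralizes can be sketched as follows. The function name and signature here are hypothetical (the PR's version is `split_routed_experts` in `routed_experts_capturer.py`); the idea is to drop trailing speculative (MTP) rows that were captured but never accepted, then split the remaining rows into prompt and generation halves.

```python
import numpy as np

# Hypothetical sketch of the prompt/generation split + MTP clipping; the
# real helper is split_routed_experts() in routed_experts_capturer.py.
def split_prompt_generation(routed, num_prompt_tokens, num_output_tokens):
    """routed: [seq_len, L, K] per-token routing rows.

    Clip trailing speculative (MTP) rows beyond the accepted output length,
    then split into prompt rows and generation rows.
    """
    routed = routed[: num_prompt_tokens + num_output_tokens]  # MTP clipping
    return routed[:num_prompt_tokens], routed[num_prompt_tokens:]


# 5 rows captured, but only 3 prompt + 1 accepted output token -> clip 1 row.
routed = np.arange(5 * 2 * 2, dtype=np.int16).reshape(5, 2, 2)
prompt_part, gen_part = split_prompt_generation(routed, 3, 1)
```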
Move the async D2H copy logic (ordered dict construction, pinned positions copy, sync_fwd_experts_buffer_DtoH call) into routed_experts_capturer.py as issue_routing_d2h_copy(). The inline comment from the call site moved into the function docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
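The gather-and-scatter half of that D2H step can be simulated on the CPU. This is a numpy model of the data movement only: in the real pipeline the copy goes through pinned host memory and is ordered by CUDA events, and all names below are illustrative rather than the vLLM API.

```python
import numpy as np

# CPU-only simulation (assumed names) of the D2H step: for each request,
# gather its rows from the (L, N, K) device buffer at that request's batch
# positions and append them to a per-request host cache, reordered to
# [num_tokens, L, K] so each row is one token's routing across layers.
L, N, K = 2, 6, 2
device_cache = np.arange(L * N * K, dtype=np.int16).reshape(L, N, K)
host_cache: dict[str, list[np.ndarray]] = {}


def issue_d2h_copy(req_positions: dict[str, list[int]]) -> None:
    for req_id, positions in req_positions.items():
        rows = device_cache[:, positions, :].transpose(1, 0, 2).copy()
        host_cache.setdefault(req_id, []).append(rows)


# Request "req-a" occupied batch slots 0 and 1 this step; "req-b" slot 2.
issue_d2h_copy({"req-a": [0, 1], "req-b": [2]})
```

The `.copy()` matters: once the host snapshot is taken, the device buffer slots can be reused by the next forward pass without corrupting earlier requests' data.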
- ruff-check: fix line-too-long (E501), add abstractmethod to RoutedExpertsCapturer (B024), use contextlib.suppress (SIM105), remove unused numpy import (F401)
- ruff-format: apply formatting to 8 files
- mypy: add assert guards for Optional fields in _scatter_to_host
- markdownlint: fix table alignment (MD060), add code fence lang (MD040)
- SPDX: add Apache-2.0 license header to routed_experts_capturer.py

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Add cleanup in _update_states to free routed experts host cache buffers when requests finish or are preempted. Without this, the per-request numpy buffers in _RoutedExpertsHostCache accumulate indefinitely. Ported from the production fork, where this cleanup exists in the same location.

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Move host cache cleanup for finished/preempted requests into routed_experts_capturer.py as free_routing_buffers(). Removes the unnecessary hasattr guard (preempted_req_ids is a dataclass field and always present; None is handled by the truthiness check).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: tbarnatan <tbarnatan@nvidia.com>
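The cleanup contract these two commits describe can be sketched in a few lines. The class and method names are illustrative stand-ins for `_RoutedExpertsHostCache` / `free_routing_buffers()`: the point is that finished and preempted request IDs both trigger buffer release, and unknown IDs are tolerated.

```python
import numpy as np

# Sketch (assumed names, not the vLLM API) of the host-cache cleanup that
# prevents per-request routing buffers from accumulating indefinitely.
class HostCache:
    def __init__(self):
        self._buffers: dict[str, np.ndarray] = {}

    def add(self, req_id: str, rows: np.ndarray) -> None:
        self._buffers[req_id] = rows

    def free_routing_buffers(self, finished_req_ids, preempted_req_ids=None):
        # preempted_req_ids may be None; truthiness check covers that case,
        # so no hasattr guard is needed.
        for req_id in list(finished_req_ids) + list(preempted_req_ids or []):
            self._buffers.pop(req_id, None)  # tolerate already-freed ids


cache = HostCache()
cache.add("a", np.zeros((1, 2, 2), dtype=np.int16))
cache.add("b", np.zeros((1, 2, 2), dtype=np.int16))
cache.free_routing_buffers({"a"}, None)  # "a" freed, "b" still live
```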
Summary
Replace upstream vLLM routing replay with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism. This PR focuses on the core architecture change — monolithic kernel support and prefix caching are in a follow-up PR.
RFC: #39701
What this PR does
Replaces the SharedMemory-based routing replay with:
- (L, N, K) int16 device buffer with per-layer views
- ModelRunnerOutput → Ray DAG → scheduler (enables multi-node)

What this PR removes
- RoutedExpertsReader (shared memory reader)
- multiprocessing.SharedMemory usage
- fcntl file-based locking
- capture() callback mechanism in router
- (N, L, K) buffer layout (replaced with (L, N, K))

Changes
- routed_experts_capturer.py: device cache + async D2H pipeline
- moe_layer_id auto-increment to FusedMoE for buffer binding
- bind_routing_capture_to_model(): persistent tensor attribute + cudagraph_mark_tensor_static
- topk_ids.to(int16) copy
- ModelRunnerOutput instead of shared memory
- routed_experts to OpenAI API response

What is NOT in this PR (follow-up)
- Monolithic kernel support (routing_replay_out) — depends on [feat] Add routing_replay_out support to MoE kernels and Python API flashinfer-ai/flashinfer#3024
- -1 sentinel
- _monolithic_writes_routing_replay flag

Validation
Tested on GB200 GPUs with a 120B MoE model (BF16 Triton path, non-monolithic):
Performance: 2.0% throughput overhead on random data.
Accuracy: GSM8K pass@1 = 95.77% (identical to baseline).
API Compatibility
Fully preserved — same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape [seq_len, num_moe_layers, top_k].

Test Plan