[Core] Replace routing replay with device cache and async D2H pipeline#39917

Open
TomerBN-Nvidia wants to merge 9 commits into vllm-project:main from TomerBN-Nvidia:upstream-routing-replay

Conversation

@TomerBN-Nvidia (Contributor) commented Apr 15, 2026

Summary

Replace upstream vLLM routing replay with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism. This PR focuses on the core architecture change — monolithic kernel support and prefix caching are in a follow-up PR.

RFC: #39701

What this PR does

Replaces the SharedMemory-based routing replay with:

  • Pre-allocated (L, N, K) int16 device buffer with per-layer views
  • Async D2H pipeline via CUDA events + pinned memory
  • Per-request host cache (no shared memory, no file locks)
  • Data flows through ModelRunnerOutput → Ray DAG → scheduler (enables multi-node)
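The pieces above can be sketched roughly as follows, with NumPy standing in for torch device tensors. All class and method names here are illustrative, not the actual vLLM API: one pre-allocated (L, N, K) int16 buffer exposing per-layer views, and a per-request host cache with no shared memory or file locks.

```python
import numpy as np

L, N, K = 4, 8, 2  # num MoE layers, max batch tokens, top-k


class RoutingDeviceCache:
    """Sketch of the pre-allocated device buffer with per-layer views."""

    def __init__(self, num_layers: int, max_tokens: int, top_k: int) -> None:
        # On GPU this would be a torch int16 device tensor; views share storage.
        self.buffer = np.zeros((num_layers, max_tokens, top_k), dtype=np.int16)

    def layer_view(self, layer_id: int) -> np.ndarray:
        # Contiguous per-layer slice each MoE layer writes its topk_ids into.
        return self.buffer[layer_id]


class RoutingHostCache:
    """Sketch of the per-request host cache (no shared memory, no file locks)."""

    def __init__(self) -> None:
        self._per_request: dict[str, list[np.ndarray]] = {}

    def append(self, req_id: str, rows: np.ndarray) -> None:
        # In the real pipeline this copy comes from pinned memory after a
        # CUDA event signals the async D2H transfer finished.
        self._per_request.setdefault(req_id, []).append(rows.copy())

    def pop(self, req_id: str) -> np.ndarray:
        return np.concatenate(self._per_request.pop(req_id), axis=0)


cache = RoutingDeviceCache(L, N, K)
cache.layer_view(0)[:3] = [[1, 5], [2, 7], [0, 3]]  # simulated topk_ids capture
host = RoutingHostCache()
# Re-arrange the captured slice to per-token [tokens, L, K] rows:
host.append("req-0", cache.buffer[:, :3, :].transpose(1, 0, 2))
out = host.pop("req-0")
print(out.shape)  # (3, 4, 2)
```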

What this PR removes

  • RoutedExpertsReader (shared memory reader)
  • multiprocessing.SharedMemory usage
  • fcntl file-based locking
  • capture() callback mechanism in router
  • KV cache slot_mapping retrieval for routing data
  • int32 dtype (replaced with int16)
  • (N, L, K) buffer layout (replaced with (L, N, K))
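The (N, L, K) → (L, N, K) layout change is easy to motivate with a quick NumPy check: with the layer dimension first, each layer's capture is one contiguous write, whereas the old layout forces strided per-layer writes.

```python
import numpy as np

L, N, K = 4, 8, 2
lnk = np.zeros((L, N, K), dtype=np.int16)  # new layout
nlk = np.zeros((N, L, K), dtype=np.int16)  # old layout

# Per-layer slice is contiguous only in the (L, N, K) layout:
print(lnk[1].flags["C_CONTIGUOUS"])     # True
print(nlk[:, 1].flags["C_CONTIGUOUS"])  # False
```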

Changes

  • Rewrite routed_experts_capturer.py: device cache + async D2H pipeline
  • Add moe_layer_id auto-increment to FusedMoE for buffer binding
  • bind_routing_capture_to_model(): persistent tensor attribute + cudagraph_mark_tensor_static
  • Capture routing in non-monolithic (Triton) path via topk_ids.to(int16) copy
  • Route data through ModelRunnerOutput instead of shared memory
  • Wire routed_experts to OpenAI API response
  • Unit tests for device cache and host cache
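The moe_layer_id auto-increment could be as simple as a class-level counter, sketched below with hypothetical names (the review thread notes such a counter must be reset between model constructions):

```python
class FusedMoE:
    _next_moe_layer_id = 0  # class-level counter; must be reset per model build

    def __init__(self) -> None:
        # Each MoE layer claims the next id, which binds it to its slice of
        # the shared (L, N, K) routing buffer.
        self.moe_layer_id = FusedMoE._next_moe_layer_id
        FusedMoE._next_moe_layer_id += 1


layers = [FusedMoE() for _ in range(3)]
print([layer.moe_layer_id for layer in layers])  # [0, 1, 2]
```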

What is NOT in this PR (follow-up)

  • Monolithic kernel support
  • Prefix caching support

Validation

Tested on GB200 GPUs with a 120B MoE model (BF16 Triton path, non-monolithic):

  • Single-node TP=4: PASS, 7,767 tok/s
  • Prefix caching: PASS, 7,136 tok/s
  • DP=2 (2 nodes, TP=4): PASS, 10,170 tok/s

Performance: 2.0% throughput overhead on random data.
Accuracy: GSM8K pass@1 = 95.77% (identical to baseline).

API Compatibility

Fully preserved — same CLI flag (--enable-return-routed-experts), same output field (routed_experts), same shape [seq_len, num_moe_layers, top_k].
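As a minimal illustration of the preserved contract, a shape check on the documented [seq_len, num_moe_layers, top_k] output (the routed_experts field name comes from the PR; the helper and sample data are hypothetical):

```python
def validate_routed_experts(routed, num_moe_layers: int, top_k: int) -> bool:
    """Check a routed_experts payload has shape [seq_len, num_moe_layers, top_k]."""
    return all(
        len(token_layers) == num_moe_layers
        and all(len(experts) == top_k for experts in token_layers)
        for token_layers in routed
    )


sample = [[[3, 7], [1, 4]], [[0, 2], [5, 6]]]  # seq_len=2, layers=2, top_k=2
print(validate_routed_experts(sample, num_moe_layers=2, top_k=2))  # True
```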

Test Plan

  • BF16 Triton path functional tests pass
  • Multi-node DP functional tests pass
  • Performance degradation < 5%
  • Accuracy unchanged
  • Unit tests pass (CI)

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the expert routing capture mechanism to use a GPU device cache and an asynchronous D2H pipeline, replacing the previous shared-memory implementation. Key changes include the addition of routed_experts to OpenAI response protocols and the integration of routing data flow through the ModelRunnerOutput in the V1 scheduler. Feedback highlights several critical issues:

  • a potential IndexError due to a non-resetting global layer ID counter in FusedMoE
  • a bug where extracted routing data is not correctly assigned to request objects in the scheduler
  • a memory leak in the host cache caused by missing cleanup logic for finished requests

Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/routed_experts_capturer.py
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
@mergify mergify bot added the v1 label Apr 15, 2026
@TomerBN-Nvidia TomerBN-Nvidia force-pushed the upstream-routing-replay branch from d348436 to 3eb1894 Compare April 16, 2026 07:56
@TomerBN-Nvidia TomerBN-Nvidia changed the title [Core] Replace routing replay with CUDA-graph-compatible device cache [Core] Replace routing replay with device cache and async D2H pipeline Apr 16, 2026
@TomerBN-Nvidia TomerBN-Nvidia force-pushed the upstream-routing-replay branch from 3eb1894 to 806d962 Compare April 16, 2026 08:54

mergify bot commented Apr 16, 2026

Documentation preview: https://vllm--39917.org.readthedocs.build/en/39917/

@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 16, 2026
TomerBN-Nvidia and others added 6 commits April 16, 2026 03:47
Replace the shared-memory routing replay implementation with a
device-cache approach that works correctly with CUDA graphs,
multi-node TP, and data parallelism.

Architecture changes:
- Rewrite routed_experts_capturer.py: device cache (L,N,K) int16
  buffer + per-request host cache + async D2H via CUDA events
- Remove SharedMemory, fcntl locking, RoutedExpertsReader
- Route data through ModelRunnerOutput (Ray DAG) instead of
  shared memory (enables multi-node)
- Per-layer CUDA graph static marking for buffer views
- Add moe_layer_id auto-increment to FusedMoE for buffer binding
- Wire routed_experts to OpenAI API response
- Capture routing in non-monolithic (Triton) path via topk_ids copy
- Unit tests for device cache and host cache

RFC: vllm-project#39701
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Add comprehensive documentation for the routing replay feature
covering architecture, usage (API server + Python SDK), output
format, design decisions, performance, and supported configurations.

Key sections:
- Quickstart with code examples for OpenAI API and Python SDK
- Output format: prompt vs generation routing split
- Architecture: device cache, host cache, async D2H pipeline
- Design decisions: why SharedMemory was replaced, buffer layout,
  int16 dtype, prompt/gen split, async D2H, symmetric TP buffers
- Performance benchmarks and supported configurations
- API reference for completions and chat completions

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Previous cleanup (re-applied on updated base):
- Fix num_scheduled_tokes → num_scheduled_tokens typo in API signatures
- Move FusedMoE._next_moe_layer_id below class docstring
- Remove dead code: _get_routed_experts, pass statement, stale comments
- Convert f-string logging to %-style (vLLM convention)
- Replace Optional[X] with X | None, add from __future__ import annotations
- Remove unused params: max_running_requests, use_shared_memory,
  num_fused_shared_experts (on DeviceCache), forward_batch
- Replace get_tensor_size_bytes() with .nbytes
- Extract extract_routed_experts_for_current_batch to capturer module

New cleanup for prompt_routed_experts additions:
- Remove string-quoted type hint in output_processor._new_request_output
- Merge duplicated if-final_res_batch guards in completion serving
- Replace getattr with direct attribute access for prompt_routed_experts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Move the prompt/generation routing split + MTP clipping logic into
routed_experts_capturer.py as split_routed_experts(). The output
processor call site shrinks from 16 lines to 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
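The prompt/generation split plus MTP clipping described in this commit could look roughly like the sketch below. The function name follows the commit message; the exact signature and clipping semantics are assumptions, not the real implementation.

```python
import numpy as np


def split_routed_experts(
    routed: np.ndarray, num_prompt_tokens: int, num_generated_tokens: int
) -> tuple[np.ndarray, np.ndarray]:
    """Split captured [seq_len, L, K] routing rows into prompt vs generation,
    clipping extra speculative (MTP) rows beyond the accepted generation length."""
    prompt = routed[:num_prompt_tokens]
    gen = routed[num_prompt_tokens : num_prompt_tokens + num_generated_tokens]
    return prompt, gen


# 7 captured rows: 4 prompt + 2 accepted generation + 1 extra MTP row to clip
routed = np.arange(7 * 2 * 2, dtype=np.int16).reshape(7, 2, 2)
prompt, gen = split_routed_experts(routed, num_prompt_tokens=4, num_generated_tokens=2)
print(prompt.shape, gen.shape)  # (4, 2, 2) (2, 2, 2)
```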
Move the async D2H copy logic (ordered dict construction, pinned
positions copy, sync_fwd_experts_buffer_DtoH call) into
routed_experts_capturer.py as issue_routing_d2h_copy(). Inline comment
from call site moved into the function docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
- ruff-check: fix line-too-long (E501), add abstractmethod to
  RoutedExpertsCapturer (B024), use contextlib.suppress (SIM105),
  remove unused numpy import (F401)
- ruff-format: apply formatting to 8 files
- mypy: add assert guards for Optional fields in _scatter_to_host
- markdownlint: fix table alignment (MD060), add code fence lang (MD040)
- SPDX: add Apache-2.0 license header to routed_experts_capturer.py

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
@TomerBN-Nvidia TomerBN-Nvidia force-pushed the upstream-routing-replay branch from 81278c0 to 391c0b9 Compare April 16, 2026 11:17
TomerBN-Nvidia and others added 2 commits April 16, 2026 04:25
Add cleanup in _update_states to free routed experts host cache
buffers when requests finish or are preempted. Without this, the
per-request numpy buffers in _RoutedExpertsHostCache accumulate
indefinitely.

Ported from the production fork where this cleanup exists in the
same location.

Signed-off-by: Tomer Barnatan <tbarnatan@nvidia.com>
Move host cache cleanup for finished/preempted requests into
routed_experts_capturer.py as free_routing_buffers(). Removes
unnecessary hasattr guard (preempted_req_ids is a dataclass field,
always present; None is handled by truthiness check).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: tbarnatan <tbarnatan@nvidia.com>
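The cleanup this commit factors out could be as small as the sketch below: the function name follows the commit message, while the cache shape and id arguments are illustrative assumptions.

```python
host_cache = {"req-0": [b"..."], "req-1": [b"..."], "req-2": [b"..."]}


def free_routing_buffers(cache: dict, finished_req_ids, preempted_req_ids) -> None:
    """Drop per-request host-cache buffers for finished or preempted requests
    so they do not accumulate indefinitely."""
    for req_id in list(finished_req_ids) + list(preempted_req_ids or []):
        cache.pop(req_id, None)  # ids with no cached buffers are ignored


free_routing_buffers(
    host_cache, finished_req_ids={"req-0"}, preempted_req_ids={"req-2"}
)
print(sorted(host_cache))  # ['req-1']
```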

Labels

documentation, frontend, nvidia, v1
