[Core] Add monolithic kernel routing replay and prefix caching sentinel #12

Draft
TomerBN-Nvidia wants to merge 2 commits into upstream-routing-replay from upstream-routing-replay-features
Conversation

@TomerBN-Nvidia
Owner

Summary

Extend routing replay to support monolithic (FP8/MXFP8) kernel paths and prefix caching. This builds on top of the core device-cache architecture in vllm-project#39917.

Depends on: flashinfer-ai/flashinfer#3024 (routing_replay_out param)

Changes

Monolithic kernel support:

  • Thread routing_replay_out through apply_monolithic() chain (modelopt.py, fp8.py, modular_kernel.py, fused_moe_method_base.py)
  • Add _monolithic_writes_routing_replay flag to FP8/MXFP8 quant methods
  • BF16 monolithic fallback: run select_experts() separately when kernel does not write routing data internally
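The flag-gated fallback described above can be sketched as follows. This is an illustrative sketch, not the vLLM implementation: the class and method bodies are hypothetical stand-ins, and only the names `_monolithic_writes_routing_replay`, `select_experts()`, and `apply_monolithic()` come from the PR.

```python
# Hypothetical sketch: how _monolithic_writes_routing_replay could gate a
# separate select_experts() call when the fused kernel does not write
# routing data itself (the BF16 monolithic case).

class QuantMethodSketch:
    # Default: the kernel does NOT write routing IDs internally.
    _monolithic_writes_routing_replay = False

    def select_experts(self, hidden_states):
        # Hypothetical standalone router: returns top-k expert IDs.
        return [0, 2]

    def run_kernel(self, hidden_states, routing_replay_out):
        # Hypothetical fused kernel; writes routing IDs only when the
        # kernel path supports it.
        if self._monolithic_writes_routing_replay:
            routing_replay_out.extend([1, 3])

    def apply_monolithic(self, hidden_states, routing_replay_out):
        if not self._monolithic_writes_routing_replay:
            # BF16 fallback: capture routing via a separate router call.
            routing_replay_out.extend(self.select_experts(hidden_states))
        self.run_kernel(hidden_states, routing_replay_out)


class Fp8MoEMethodSketch(QuantMethodSketch):
    # FP8/MXFP8 monolithic kernels write routing IDs themselves.
    _monolithic_writes_routing_replay = True


bf16_out, fp8_out = [], []
QuantMethodSketch().apply_monolithic(None, bf16_out)
Fp8MoEMethodSketch().apply_monolithic(None, fp8_out)
print(bf16_out)  # [0, 2] -- captured by the fallback select_experts()
print(fp8_out)   # [1, 3] -- written by the kernel itself
```

Either way the caller receives routing IDs through the same `routing_replay_out` argument, which is why the param can be threaded uniformly through the `apply_monolithic()` chain.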

Prefix caching:

  • Initialize host cache with -1 sentinel instead of 0 (expert ID 0 is valid; -1 marks cache-hit positions)
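The sentinel change can be illustrated in a few lines (shapes and variable names here are simplified, not the actual capturer code). Initializing with 0 is ambiguous because expert ID 0 is a valid routing result; starting from -1 makes any untouched position unambiguously a cache-hit marker.

```python
import numpy as np

# Simplified sketch of the -1 sentinel: positions never written this step
# (prefix-cache hits) stay -1, while expert ID 0 remains a valid value.
num_tokens, top_k = 4, 2

host_cache = np.full((num_tokens, top_k), -1, dtype=np.int16)

# Suppose only tokens 2 and 3 were actually routed (tokens 0 and 1 hit
# the prefix cache); write their routed expert IDs. Note token 2 is
# legitimately routed to expert 0.
host_cache[2:] = [[0, 5], [3, 7]]

cache_hit_positions = np.all(host_cache == -1, axis=1)
print(cache_hit_positions.tolist())  # [True, True, False, False]
```

With a 0-initialized cache, token 2's valid `expert 0` entry would be indistinguishable from an unwritten slot.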

Tests:

  • TestMonolithicWritesFlag tests for Fp8MoEMethod and base class
  • Update host cache sentinel test for -1 initialization

Files changed (7 files, +57/-4)

  • routed_experts_capturer.py — Host cache init changed from np.zeros to np.full(-1)
  • moe_runner_base.py — Monolithic routing_replay_out passing + BF16 fallback
  • fused_moe_method_base.py — routing_replay_out param on apply_monolithic()
  • modular_kernel.py — routing_replay_out param
  • fp8.py — _monolithic_writes_routing_replay = True
  • modelopt.py — routing_replay_out threading + flags (FP8, FP4, MXFP8)
  • test_routed_experts_capture.py — Monolithic flag + sentinel tests

Validation

Tested on GB200 GPUs with FP8/MXFP8 monolithic kernel paths:

  • Ultra MXFP8 (2 nodes, TP=8): PASS, 3,060 tok/s
  • Ultra BF16 Triton multi-node: PASS, 3,558 tok/s
  • MTP + prefix caching: PASS

Test Plan

  • FP8/MXFP8 monolithic functional tests pass
  • Prefix caching with -1 sentinel works
  • Unit tests pass (CI)
  • FlashInfer PR merged

Replace the shared-memory routing replay implementation with a
device-cache approach that works correctly with CUDA graphs,
multi-node TP, and data parallelism.

Architecture changes:
- Rewrite routed_experts_capturer.py: device cache (L,N,K) int16
  buffer + per-request host cache + async D2H via CUDA events
- Remove SharedMemory, fcntl locking, RoutedExpertsReader
- Route data through ModelRunnerOutput (Ray DAG) instead of
  shared memory (enables multi-node)
- Per-layer CUDA graph static marking for buffer views
- Add moe_layer_id auto-increment to FusedMoE for buffer binding
- Wire routed_experts to OpenAI API response
- Capture routing in non-monolithic (Triton) path via topk_ids copy
- Unit tests for device cache and host cache
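The (L, N, K) device-cache layout and per-layer views above can be sketched as follows, using NumPy as a stand-in for a GPU tensor (shapes, names, and the write pattern are illustrative assumptions, not the actual capturer code):

```python
import numpy as np

# Sketch of the device-cache layout: L = MoE layers, N = token slots,
# K = top-k experts per token; int16 matches the described buffer dtype.
L, N, K = 2, 8, 2
device_cache = np.zeros((L, N, K), dtype=np.int16)

# Per-layer views are bound once and reused every step -- analogous to
# marking static buffer views so CUDA graph replay writes land in a
# stable address.
layer_views = [device_cache[layer_id] for layer_id in range(L)]

# Each FusedMoE layer (identified by its auto-incremented moe_layer_id)
# writes its topk_ids into its own view.
layer_views[0][:4] = [[1, 3]] * 4   # layer 0 routed tokens 0-3
layer_views[1][:4] = [[0, 7]] * 4   # layer 1 routed tokens 0-3

# A host-side copy (an async D2H transfer synchronized via CUDA events
# in the real design) then reads the buffer per request.
print(device_cache[1, 0].tolist())  # [0, 7]
```

Because the views alias the single contiguous buffer, one D2H copy of `device_cache` moves every layer's routing data for the step at once.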

RFC: vllm-project#39701

Extend routing replay to support monolithic (FP8/MXFP8) kernel paths
and prefix caching:

Monolithic kernel support:
- Thread routing_replay_out through apply_monolithic() chain
  (modelopt.py, fp8.py, modular_kernel.py, fused_moe_method_base.py)
- Add _monolithic_writes_routing_replay flag to quant methods
- BF16 monolithic fallback: run select_experts() separately when
  kernel does not write routing data internally

Prefix caching:
- Initialize host cache with -1 sentinel instead of 0
  (expert ID 0 is valid; -1 marks cache-hit positions)

Tests:
- Add TestMonolithicWritesFlag tests
- Update host cache sentinel test for -1 initialization

Depends on: flashinfer-ai/flashinfer#3024 (routing_replay_out param)
TomerBN-Nvidia force-pushed the upstream-routing-replay branch 2 times, most recently from 81278c0 to 391c0b9 on April 16, 2026 at 11:17