[Core] Add monolithic kernel routing replay and prefix caching sentinel #12

Draft
TomerBN-Nvidia wants to merge 2 commits into upstream-routing-replay from upstream-routing-replay-features
Conversation

@TomerBN-Nvidia
Owner

Summary

Extend routing replay to support monolithic (FP8/MXFP8) kernel paths and prefix caching. This builds on top of the core device-cache architecture in vllm-project#39917.

Depends on: flashinfer-ai/flashinfer#3024 (routing_replay_out param)

Changes

Monolithic kernel support:

  • Thread routing_replay_out through apply_monolithic() chain (modelopt.py, fp8.py, modular_kernel.py, fused_moe_method_base.py)
  • Add _monolithic_writes_routing_replay flag to FP8/MXFP8 quant methods
  • BF16 monolithic fallback: run select_experts() separately when kernel does not write routing data internally
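The flag-gated fallback described above can be sketched as follows. This is an illustrative sketch, not the vLLM implementation: the class and method bodies are hypothetical stand-ins, and only the names `_monolithic_writes_routing_replay`, `select_experts()`, and `apply_monolithic()` come from the PR.

```python
# Hypothetical sketch: how _monolithic_writes_routing_replay could gate a
# separate select_experts() call when the fused kernel does not write
# routing data itself (the BF16 monolithic case).

class QuantMethodSketch:
    # Default: the kernel does NOT write routing IDs internally.
    _monolithic_writes_routing_replay = False

    def select_experts(self, hidden_states):
        # Hypothetical standalone router: returns top-k expert IDs.
        return [0, 2]

    def run_kernel(self, hidden_states, routing_replay_out):
        # Hypothetical fused kernel; writes routing IDs only when the
        # kernel path supports it.
        if self._monolithic_writes_routing_replay:
            routing_replay_out.extend([1, 3])

    def apply_monolithic(self, hidden_states, routing_replay_out):
        if not self._monolithic_writes_routing_replay:
            # BF16 fallback: capture routing via a separate router call.
            routing_replay_out.extend(self.select_experts(hidden_states))
        self.run_kernel(hidden_states, routing_replay_out)


class Fp8MoEMethodSketch(QuantMethodSketch):
    # FP8/MXFP8 monolithic kernels write routing IDs themselves.
    _monolithic_writes_routing_replay = True


bf16_out, fp8_out = [], []
QuantMethodSketch().apply_monolithic(None, bf16_out)
Fp8MoEMethodSketch().apply_monolithic(None, fp8_out)
print(bf16_out)  # [0, 2] -- captured by the fallback select_experts()
print(fp8_out)   # [1, 3] -- written by the kernel itself
```

Either way the caller receives routing IDs through the same `routing_replay_out` argument, which is why the param can be threaded uniformly through the `apply_monolithic()` chain.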

Prefix caching:

  • Initialize host cache with -1 sentinel instead of 0 (expert ID 0 is valid; -1 marks cache-hit positions)
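The sentinel change can be illustrated in a few lines (shapes and variable names here are simplified, not the actual capturer code). Initializing with 0 is ambiguous because expert ID 0 is a valid routing result; starting from -1 makes any untouched position unambiguously a cache-hit marker.

```python
import numpy as np

# Simplified sketch of the -1 sentinel: positions never written this step
# (prefix-cache hits) stay -1, while expert ID 0 remains a valid value.
num_tokens, top_k = 4, 2

host_cache = np.full((num_tokens, top_k), -1, dtype=np.int16)

# Suppose only tokens 2 and 3 were actually routed (tokens 0 and 1 hit
# the prefix cache); write their routed expert IDs. Note token 2 is
# legitimately routed to expert 0.
host_cache[2:] = [[0, 5], [3, 7]]

cache_hit_positions = np.all(host_cache == -1, axis=1)
print(cache_hit_positions.tolist())  # [True, True, False, False]
```

With a 0-initialized cache, token 2's valid `expert 0` entry would be indistinguishable from an unwritten slot.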

Tests:

  • TestMonolithicWritesFlag tests for Fp8MoEMethod and base class
  • Update host cache sentinel test for -1 initialization

Files changed (7 files, +57/-4)

  • routed_experts_capturer.py — Host cache init changed from np.zeros to np.full(-1)
  • moe_runner_base.py — Monolithic routing_replay_out passing + BF16 fallback
  • fused_moe_method_base.py — routing_replay_out param on apply_monolithic()
  • modular_kernel.py — routing_replay_out param
  • fp8.py — _monolithic_writes_routing_replay = True
  • modelopt.py — routing_replay_out threading + flags (FP8, FP4, MXFP8)
  • test_routed_experts_capture.py — Monolithic flag + sentinel tests

Validation

Tested on GB200 GPUs with FP8/MXFP8 monolithic kernel paths:

  • Ultra MXFP8 (2 nodes, TP=8): PASS, 3,060 tok/s
  • Ultra BF16 Triton multi-node: PASS, 3,558 tok/s
  • MTP + prefix caching: PASS

Test Plan

  • FP8/MXFP8 monolithic functional tests pass
  • Prefix caching with -1 sentinel works
  • Unit tests pass (CI)
  • FlashInfer PR merged

Replace the shared-memory routing replay implementation with a
device-cache approach that works correctly with CUDA graphs,
multi-node TP, and data parallelism.

Architecture changes:
- Rewrite routed_experts_capturer.py: device cache (L,N,K) int16
  buffer + per-request host cache + async D2H via CUDA events
- Remove SharedMemory, fcntl locking, RoutedExpertsReader
- Route data through ModelRunnerOutput (Ray DAG) instead of
  shared memory (enables multi-node)
- Per-layer CUDA graph static marking for buffer views
- Add moe_layer_id auto-increment to FusedMoE for buffer binding
- Wire routed_experts to OpenAI API response
- Capture routing in non-monolithic (Triton) path via topk_ids copy
- Unit tests for device cache and host cache
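The (L, N, K) device-cache layout and per-layer views above can be sketched as follows, using NumPy as a stand-in for a GPU tensor (shapes, names, and the write pattern are illustrative assumptions, not the actual capturer code):

```python
import numpy as np

# Sketch of the device-cache layout: L = MoE layers, N = token slots,
# K = top-k experts per token; int16 matches the described buffer dtype.
L, N, K = 2, 8, 2
device_cache = np.zeros((L, N, K), dtype=np.int16)

# Per-layer views are bound once and reused every step -- analogous to
# marking static buffer views so CUDA graph replay writes land in a
# stable address.
layer_views = [device_cache[layer_id] for layer_id in range(L)]

# Each FusedMoE layer (identified by its auto-incremented moe_layer_id)
# writes its topk_ids into its own view.
layer_views[0][:4] = [[1, 3]] * 4   # layer 0 routed tokens 0-3
layer_views[1][:4] = [[0, 7]] * 4   # layer 1 routed tokens 0-3

# A host-side copy (an async D2H transfer synchronized via CUDA events
# in the real design) then reads the buffer per request.
print(device_cache[1, 0].tolist())  # [0, 7]
```

Because the views alias the single contiguous buffer, one D2H copy of `device_cache` moves every layer's routing data for the step at once.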

RFC: vllm-project#39701

Extend routing replay to support monolithic (FP8/MXFP8) kernel paths
and prefix caching:

Monolithic kernel support:
- Thread routing_replay_out through apply_monolithic() chain
  (modelopt.py, fp8.py, modular_kernel.py, fused_moe_method_base.py)
- Add _monolithic_writes_routing_replay flag to quant methods
- BF16 monolithic fallback: run select_experts() separately when
  kernel does not write routing data internally

Prefix caching:
- Initialize host cache with -1 sentinel instead of 0
  (expert ID 0 is valid; -1 marks cache-hit positions)

Tests:
- Add TestMonolithicWritesFlag tests
- Update host cache sentinel test for -1 initialization

Depends on: flashinfer-ai/flashinfer#3024 (routing_replay_out param)
TomerBN-Nvidia force-pushed the upstream-routing-replay branch 2 times, most recently from 81278c0 to 391c0b9 on April 16, 2026 at 11:17