[Core] Add monolithic kernel routing replay and prefix caching sentinel #12
Draft
TomerBN-Nvidia wants to merge 2 commits into upstream-routing-replay from
Conversation
Replace the shared-memory routing replay implementation with a device-cache approach that works correctly with CUDA graphs, multi-node TP, and data parallelism.

Architecture changes:
- Rewrite routed_experts_capturer.py: device cache (L, N, K) int16 buffer + per-request host cache + async D2H via CUDA events
- Remove SharedMemory, fcntl locking, RoutedExpertsReader
- Route data through ModelRunnerOutput (Ray DAG) instead of shared memory (enables multi-node)
- Per-layer CUDA graph static marking for buffer views
- Add moe_layer_id auto-increment to FusedMoE for buffer binding
- Wire routed_experts to OpenAI API response
- Capture routing in non-monolithic (Triton) path via topk_ids copy
- Unit tests for device cache and host cache

RFC: vllm-project#39701
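The device-cache capture above can be sketched roughly as follows. This is a minimal NumPy model of the (L, N, K) buffer and the per-request host-side copy; all names (`device_cache`, `capture_step`, `host_cache`) are illustrative, not the actual vLLM API, and the real copy is an asynchronous device-to-host transfer gated by a CUDA event rather than a plain array slice:

```python
import numpy as np

# Illustrative dimensions: L MoE layers, N max batch token slots, K experts per token.
L, N, K = 4, 8, 2

# Device cache modeled as a host array for illustration; in the real design this
# is a GPU int16 buffer that the MoE kernels write their top-k expert IDs into.
device_cache = np.zeros((L, N, K), dtype=np.int16)

# Per-request host cache: a growing list of (tokens, L, K) chunks per request.
host_cache: dict[str, list[np.ndarray]] = {}

def capture_step(request_id: str, token_slots: list[int]) -> None:
    """Copy this step's routed-expert IDs for one request to its host cache.

    In the real implementation this copy is D2H and is overlapped with compute
    by recording a CUDA event after the kernels have written the buffer.
    """
    step = device_cache[:, token_slots, :].transpose(1, 0, 2)  # (tokens, L, K)
    host_cache.setdefault(request_id, []).append(step.copy())

# Simulate kernels routing two tokens (slots 0 and 1) to experts 3 and 5.
device_cache[:, 0, :] = 3
device_cache[:, 1, :] = 5
capture_step("req-A", [0, 1])

routed = np.concatenate(host_cache["req-A"])  # (2, L, K)
print(routed.shape)  # (2, 4, 2)
```

The per-request layout is what lets the captured routing travel through ModelRunnerOutput per finished request instead of a process-wide shared-memory segment.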
Extend routing replay to support monolithic (FP8/MXFP8) kernel paths and prefix caching.

Monolithic kernel support:
- Thread routing_replay_out through apply_monolithic() chain (modelopt.py, fp8.py, modular_kernel.py, fused_moe_method_base.py)
- Add _monolithic_writes_routing_replay flag to quant methods
- BF16 monolithic fallback: run select_experts() separately when kernel does not write routing data internally

Prefix caching:
- Initialize host cache with -1 sentinel instead of 0 (expert ID 0 is valid; -1 marks cache-hit positions)

Tests:
- Add TestMonolithicWritesFlag tests
- Update host cache sentinel test for -1 initialization

Depends on: flashinfer-ai/flashinfer#3024 (routing_replay_out param)
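The -1 sentinel change can be illustrated with a small sketch. The variable names and shapes here are hypothetical (the real logic lives in routed_experts_capturer.py); the point is why a zero-initialized cache is ambiguous once prefix caching skips some tokens:

```python
import numpy as np

L, K = 4, 2          # illustrative: MoE layers, experts per token
num_tokens = 6       # total prompt tokens for a request
num_cached = 3       # tokens satisfied by the prefix cache (never recomputed)

# With np.zeros, a 0 could mean "routed to expert 0" (a valid expert ID) or
# "never written because of a prefix-cache hit". np.full(-1) removes the
# ambiguity: -1 is not a valid expert ID, so it can only mean "not captured".
host_cache = np.full((num_tokens, L, K), -1, dtype=np.int16)

# Only the non-cached suffix is actually routed and written back; suppose
# every recomputed token happens to be routed to expert 0.
host_cache[num_cached:] = 0

cache_hit_rows = np.all(host_cache == -1, axis=(1, 2))
print(cache_hit_rows)  # [ True  True  True False False False]
```

A consumer replaying the routing can thus distinguish "token routed to expert 0" from "token served from the prefix cache", which a zero-initialized buffer cannot express.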
Force-pushed from 81278c0 to 391c0b9.
Summary
Extend routing replay to support monolithic (FP8/MXFP8) kernel paths and prefix caching. This builds on the core device-cache architecture in vllm-project#39917.
Depends on:
- flashinfer-ai/flashinfer#3024 (routing_replay_out parameter on MoE kernels)

Changes
Monolithic kernel support:
- Thread routing_replay_out through apply_monolithic() chain (modelopt.py, fp8.py, modular_kernel.py, fused_moe_method_base.py)
- Add _monolithic_writes_routing_replay flag to FP8/MXFP8 quant methods
- BF16 monolithic fallback: run select_experts() separately when kernel does not write routing data internally

Prefix caching:
- Initialize host cache with -1 sentinel instead of 0 (expert ID 0 is valid; -1 marks cache-hit positions)

Tests:
- Add TestMonolithicWritesFlag tests for Fp8MoEMethod and base class
- Update host cache sentinel test for -1 initialization

Files changed (7 files, +57/-4)
- routed_experts_capturer.py — Host cache init np.zeros → np.full(-1)
- moe_runner_base.py — Monolithic routing_replay_out passing + BF16 fallback
- fused_moe_method_base.py — routing_replay_out param on apply_monolithic()
- modular_kernel.py — routing_replay_out param
- fp8.py — _monolithic_writes_routing_replay = True
- modelopt.py — routing_replay_out threading + flags (FP8, FP4, MXFP8)
- test_routed_experts_capture.py — Monolithic flag + sentinel tests

Validation
Tested on GB200 GPUs with FP8/MXFP8 monolithic kernel paths:
Test Plan