[Feature] Support recording expert indices for rollout router replay #28284

22quinn merged 33 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for recording expert routing decisions for MoE models, a feature named Rollout Router Replay (R3). The implementation is comprehensive, touching configuration, engine arguments, the model executor, and the scheduler. My review has identified a few critical issues, primarily concerning race conditions due to missing locks for shared memory access, and a method signature mismatch that will lead to a TypeError. There are also some code quality suggestions to improve maintainability, such as removing a redundant argument and moving a local import.
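The race-condition concern can be illustrated with a minimal sketch. The class name echoes the PR's `RoutedExpertsCapturer`, but the buffer layout and lock handling below are assumptions for illustration, not the PR's implementation: the point is only that writes to a shared-memory buffer should be serialized against reads so a reader never observes a partially written record.

```python
# Minimal sketch (assumed layout, not the PR's code): guarding
# shared-memory writes of routed-expert indices with a lock.
from multiprocessing import Lock, shared_memory

import numpy as np

NUM_SLOTS, NUM_LAYERS, TOPK = 16, 4, 2


class RoutedExpertsCapturerSketch:
    def __init__(self):
        nbytes = NUM_SLOTS * NUM_LAYERS * TOPK * np.int32().nbytes
        self._shm = shared_memory.SharedMemory(create=True, size=nbytes)
        self._buf = np.ndarray(
            (NUM_SLOTS, NUM_LAYERS, TOPK), dtype=np.int32, buffer=self._shm.buf
        )
        self._lock = Lock()  # serializes the writer (TP rank 0) and readers

    def record(self, slot: int, layer: int, topk_ids: np.ndarray) -> None:
        with self._lock:  # without this, a reader may see a torn write
            self._buf[slot, layer] = topk_ids

    def read(self, slot: int) -> np.ndarray:
        with self._lock:
            return self._buf[slot].copy()

    def close(self) -> None:
        self._shm.close()
        self._shm.unlink()
```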
@@ -1283,6 +1294,8 @@ def __init__(
        raise ValueError("Duplicate layer name: {}".format(prefix))
    compilation_config.static_forward_context[prefix] = self
    self.layer_name = prefix
    from vllm.model_executor.models.utils import extract_layer_index
The import `from vllm.model_executor.models.utils import extract_layer_index` is performed inside the `__init__` method. It is best practice to place all imports at the top of the file for readability and performance, and to avoid potential circular-import issues. Please move this import to the top of the file.
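For context, `extract_layer_index` pulls the integer layer index out of a dotted layer prefix such as `model.layers.12.mlp.experts`. A simplified re-implementation (an illustration of the behavior, not vLLM's exact code) looks like:

```python
# Simplified illustration of what extract_layer_index does: find the
# single integer component in a dotted layer prefix. Not vLLM's exact
# implementation.
def extract_layer_index(prefix: str) -> int:
    indices = [int(part) for part in prefix.split(".") if part.isdigit()]
    if len(indices) != 1:
        raise ValueError(f"Expected exactly one layer index in {prefix!r}")
    return indices[0]
```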
💡 Codex Review: https://github.com/vllm-project/vllm/blob/611bc69292546334ddbcc52689ffe86f91da41e1/vllm/v1/worker/gpu_model_runner.py#L2737-L2738
When routed expert recording is enabled,
|
Signed-off-by: xhx1022 <1737006628@qq.com>
f"disable_custom_all_reduce={self.parallel_config.disable_custom_all_reduce}, "  # noqa
f"quantization={self.model_config.quantization}, "
f"enforce_eager={self.model_config.enforce_eager}, "
f"enable_return_routed_experts={self.model_config.enable_return_routed_experts}, "  # noqa
Missing async scheduling disable for routed experts feature
Medium Severity
The PR notes explicitly state that async scheduling should be disabled when `enable_return_routed_experts=True`, but this isn't implemented. The async scheduling logic in `VllmConfig.__post_init__` handles various incompatibility cases (PP > 1, speculative decoding, executor backend) but has no handling for `enable_return_routed_experts`. Users in the PR discussion report significant latency regressions (10x slower), which could be caused by async scheduling interfering with the capture/save operations for routed experts.
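A guard of the kind this review asks for might look like the following sketch. The field names and the surrounding `__post_init__` structure are simplified stand-ins for illustration, not vLLM's actual config classes:

```python
# Sketch of disabling async scheduling when routed-expert recording is
# enabled. Field and class names are simplified assumptions.
from dataclasses import dataclass, field


@dataclass
class SchedulerConfigSketch:
    async_scheduling: bool = True


@dataclass
class VllmConfigSketch:
    enable_return_routed_experts: bool = False
    scheduler_config: SchedulerConfigSketch = field(
        default_factory=SchedulerConfigSketch
    )

    def __post_init__(self):
        if self.enable_return_routed_experts and self.scheduler_config.async_scheduling:
            # Routed-expert capture/save is not compatible with async
            # scheduling, so fall back to synchronous scheduling.
            self.scheduler_config.async_scheduling = False
```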
…llm-project#28284) Signed-off-by: xhx1022 <1737006628@qq.com> Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com> Signed-off-by: arlenxu <arlenxu@tencent.com> Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com> Co-authored-by: arlenxu <arlenxu@tencent.com> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Purpose
This PR introduces Rollout Router Replay (R3) support into the vLLM runtime.
Inspired by recent research on reinforcement learning alignment for MoE-based LLMs (arXiv:2510.11370, arXiv:2507.18071), this implementation records the expert routing decision for every token at every layer during model inference. The recorded routing traces can be used to replay the expert routing process during RL post-training.
Currently, the initial version supports:
Below is a minimal reproducible example for running Qwen3-30B-A3B with tensor_parallel_size = 8 and async concurrent inference.
The example also prints the shape of the returned `routed_experts` tensor.

Reminder
The number of token entries in the output can be 1 less than (prompt_length + response_token_count).
This gap of 1 is expected: the final generated token is sampled but never fed through a forward pass, so no routing decision is recorded for it.
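The off-by-one can be spelled out with a small sketch (the shapes and the `[tokens, layers, topk]` layout here are hypothetical, for illustration only):

```python
# Illustration of the expected off-by-one in the routed-experts trace:
# the last sampled token never goes through a forward pass, so it has
# no routing record. Shapes are hypothetical.
import numpy as np

prompt_len, response_len = 5, 3
num_layers, topk = 4, 2

# One routing record per token actually fed through the model.
recorded_tokens = prompt_len + response_len - 1
routed_experts = np.zeros((recorded_tokens, num_layers, topk), dtype=np.int32)
```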
Acknowledgments
This work is inspired by and builds upon the implementation from SGLang PR #12162.
Special thanks to @ocss884 and the SGLang RL team for their valuable discussions and contributions.
Note
Introduces an opt-in pathway to capture and return MoE routed expert indices per token/layer.

- Adds `enable_return_routed_experts` to `ModelConfig`, plumbed through `EngineArgs`/CLI (`--enable-return-routed-experts`), the `LLM` entrypoint, and config logging
- Adds `RoutedExpertsCapturer`/`RoutedExpertsReader` with shared memory buffers; `fused_moe/layer.py` records gate `topk_ids` per layer
- `GPUModelRunner` initializes/clears/saves captured experts and computes `slot_mapping`; TP rank 0 writes to shared memory
- `Scheduler` derives token slots from KV blocks and attaches `routed_experts` to `EngineCoreOutput`
- Adds `CompletionOutput.routed_experts` (numpy array) and related plumbing in the v1 engine/output processor

Written by Cursor Bugbot for commit b0fb649926346a1a132d8c8dd294a5a95142579f. This will update automatically on new commits. Configure here.
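The scheduler-side step above ("derives token slots from KV blocks") can be sketched roughly as follows. The flat-slot formula `slot = block_id * block_size + offset` is an assumption about the layout for illustration, not the PR's exact code:

```python
# Rough sketch of mapping a request's tokens to flat slot indices from
# its KV-cache block ids. The layout is an illustrative assumption:
# slot = block_id * block_size + offset_within_block.
def token_slot_indices(
    block_ids: list[int], num_tokens: int, block_size: int
) -> list[int]:
    slots = []
    for i in range(num_tokens):
        block = block_ids[i // block_size]
        offset = i % block_size
        slots.append(block * block_size + offset)
    return slots
```

The scheduler would then index the shared-memory buffer at these slots to assemble each finished request's `routed_experts` array.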
enable_return_routed_expertstoModelConfig, surfaced viaEngineArgs/CLI andLLM; included in config logsfused_moe/routed_experts_capturer.py(capturer/reader singletons using shared memory);fused_moe/layer.pyrecords gatetopk_idsperlayer_idGPUModelRunnerinitializes/clears/saves captured experts and computes tokenslot_mapping(TP rank 0 writes)Schedulerreconstructs token slots from KV blocks and attachesrouted_expertsto engine outputs; asserts no context parallelismCompletionOutput.routed_experts(numpy array) and threads through v1 engine/output processor/structuresWritten by Cursor Bugbot for commit aeb469e. This will update automatically on new commits. Configure here.