Route the missing parameter for `trtllm_fp8_per_tensor_scale_moe_op` #3094

aleozlx merged 4 commits into flashinfer-ai:main

Conversation
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Forward the optional `routing_replay_out` argument through the FP8 per-tensor-scale MoE path so the inner `_op`'s call to the C++ binding receives it.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Code Review
This pull request updates the fused MoE core logic by converting positional arguments to keyword arguments and adding support for `routing_replay_out`. Several critical issues were identified in the `trtllm_fp8_per_tensor_scale_moe` implementation: the manual allocation of the output tensor and the `enable_pdl` check are redundant and lead to a positional argument mismatch in the subsequent function call. Additionally, the `tune_max_num_tokens` parameter was incorrectly changed from an integer to a list, which will cause failures in the autotuning logic.
```diff
+    output,
     num_experts,
```

This change introduces a critical bug due to a positional argument mismatch. The target function `trtllm_fp8_per_tensor_scale_moe_op` does not include `output` in its signature (defined at line 1503). By inserting `output` at this position, all subsequent arguments are shifted: `num_experts` will be passed to the `top_k` parameter, `top_k` to `n_group`, and so on. This will cause a `TypeError` or incorrect kernel execution. This line should be removed to restore the correct positional mapping.

Suggested change:

```diff
-    output,
     num_experts,
```
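The shift the reviewer describes is easy to reproduce in isolation. A minimal sketch with a hypothetical stand-in function (not the real FlashInfer signature) shows both failure modes: silent mis-binding when the argument counts happen to match, and an immediate `TypeError` when they do not.

```python
def moe_op(hidden_states, num_experts, top_k, n_group):
    """Stand-in for trtllm_fp8_per_tensor_scale_moe_op (hypothetical signature)."""
    if not isinstance(num_experts, int):
        raise TypeError(f"num_experts: expected int, got {type(num_experts).__name__}")
    return hidden_states


hidden = [[0.0] * 8 for _ in range(4)]   # placeholder for a tensor
output = [[0.0] * 8 for _ in range(4)]   # the erroneously inserted argument

moe_op(hidden, 128, 8, 4)                # correct: every argument binds to its slot

# Same argument count, but `output` smuggled in and the last argument dropped:
# Python binds the call silently, and `num_experts` receives the tensor instead.
try:
    moe_op(hidden, output, 128, 8)
except TypeError as e:
    print(e)

# One extra argument overall: the call fails before the body even runs.
try:
    moe_op(hidden, output, 128, 8, 4)
except TypeError as e:
    print(e)
```

The first failure is the more dangerous one: without the type check it would reach the kernel with garbage parameters rather than raising.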
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/fused_moe/core.py (1)
2628-2657: ⚠️ Potential issue | 🔴 Critical: positional argument misalignment; this PR's fix will `TypeError` at runtime.

The public wrapper (lines 2628–2657) passes 25 positional arguments to `trtllm_fp8_per_tensor_scale_moe_op`, which has 24 parameters. Position 9 inserts `output` (tensor) where `num_experts: int` is expected, and position 22 inserts `[-1, -1]` where `tune_max_num_tokens: int` belongs. The 25th argument (`routing_replay_out`) exceeds the function's arity and will raise `TypeError: too many positional arguments` at runtime.

Meanwhile, the internal C++ binding call inside the `_op` function (lines 1603–1629) correctly uses keyword arguments and successfully passes `output=` and `config_index=` to the C++ binding; those parameters exist at the C++ level but are not exposed as formal parameters in `trtllm_fp8_per_tensor_scale_moe_op`'s Python signature. The `_op` function allocates `output` internally (lines 1538–1540) and infers `config_index` from the AutoTuner (line 1625).

Fix: remove the `output` allocation (lines 2628–2633) and the `[-1, -1]` placeholder (line 2653) from the public wrapper. Call `trtllm_fp8_per_tensor_scale_moe_op` positionally without these arguments, matching the 24-parameter signature:

```python
result = get_trtllm_moe_sm100_module().trtllm_fp8_per_tensor_scale_moe_op(
    routing_logits,
    routing_bias,
    hidden_states,
    gemm1_weights,
    output1_scales_scalar,
    output1_scales_gate_scalar,
    gemm2_weights,
    output2_scales_scalar,
    num_experts,
    top_k,
    n_group,
    topk_group,
    intermediate_size,
    local_expert_offset,
    local_num_experts,
    routed_scaling_factor,
    use_routing_scales_on_input,
    routing_method_type,
    do_finalize,
    enable_pdl,
    tune_max_num_tokens,  # default=8192 in _op, or parameterize the wrapper
    activation_type,
    norm_topk_prob,
    routing_replay_out,
)
```

The `_op` will allocate and return `output` internally; extract it from the returned list as done in lines 2630–2637.

Also add an integration test that invokes this wrapper with a non-None `routing_replay_out` to prevent regression when this code path is exercised in CI.

🤖 Prompt for AI Agents
routing_replay_outto prevent regression when this code path is exercised in CI.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 2628 - 2657, The wrapper is passing 25 positional args to trtllm_fp8_per_tensor_scale_moe_op — misplacing `output` and `[-1, -1]` — which will TypeError at runtime; fix by removing the external `output` allocation (the torch.empty block) and the `[-1, -1]` placeholder, then call get_trtllm_moe_sm100_module().trtllm_fp8_per_tensor_scale_moe_op with the 24 expected positional parameters (use the `_op` default for tune_max_num_tokens or expose it as a wrapper param) and, after the call, extract the internally-allocated output from the returned result as `_op` does; reference trtllm_fp8_per_tensor_scale_moe_op, get_trtllm_moe_sm100_module, and the internal _op allocation/return logic when making the change.
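In addition to the GPU integration test suggested above, a lightweight, GPU-independent guard against this class of bug is to check call-site arity with `inspect.signature(...).bind(...)`, which performs the same binding Python does at call time. A hypothetical sketch (the stand-in signature below is illustrative, not the real 24-parameter op):

```python
import inspect


def trtllm_fp8_per_tensor_scale_moe_op(routing_logits, hidden_states,
                                       num_experts, top_k,
                                       tune_max_num_tokens=8192,
                                       routing_replay_out=None):
    """Hypothetical stand-in: a trailing optional routing_replay_out, as in the PR."""
    return routing_replay_out


def assert_binds(fn, *args, **kwargs):
    # signature().bind raises TypeError on arity or keyword mismatch,
    # catching positional drift without ever launching a kernel.
    inspect.signature(fn).bind(*args, **kwargs)


# A well-formed call binds fine, including the new trailing argument by name.
assert_binds(trtllm_fp8_per_tensor_scale_moe_op,
             "logits", "hidden", 128, 8, routing_replay_out="replay")

# The 25-args-into-24-slots mistake is rejected immediately.
try:
    assert_binds(trtllm_fp8_per_tensor_scale_moe_op,
                 "logits", "output", "hidden", 128, 8, 4096, None, "extra")
except TypeError as e:
    print(e)
```

Such a check can run in plain-CPU CI, so the arity regression would have been caught before reaching the vLLM pipeline.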
🧹 Nitpick comments (1)
flashinfer/fused_moe/core.py (1)
1603-1628: Nice cleanup; keyword-arg invocation is much safer.

Switching `moe_op.trtllm_fp8_per_tensor_scale_moe(...)` to kwargs (including `output=output` and `routing_replay_out=routing_replay_out`) eliminates the exact class of positional-drift bug that caused the vLLM CI regression in the first place. Consider applying the same treatment to the remaining positional C++ calls in this file (`trtllm_fp8_block_scale_moe` at line 1821, `trtllm_fp4_block_scale_moe` at line 2048, `trtllm_mxint4_block_scale_moe` at line 2245) in a follow-up to prevent recurrence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 1603 - 1628, The call-site class of positional-argument bugs was fixed by switching moe_op.trtllm_fp8_per_tensor_scale_moe(...) to keyword args; do the same for the remaining C++-bound calls to avoid positional-drift regressions: update moe_op.trtllm_fp8_block_scale_moe, moe_op.trtllm_fp4_block_scale_moe, and moe_op.trtllm_mxint4_block_scale_moe to use explicit keyword arguments for every parameter (including output=..., routing_replay_out=..., config_index=..., activation_type=..., etc.), preserving the existing argument names and values so the call semantics do not change.
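Beyond converting call sites to kwargs, positional drift can be ruled out structurally by making the tail parameters keyword-only with a bare `*` in the signature. A sketch of the pattern with hypothetical names (not the real wrapper signature):

```python
def trtllm_moe_call(routing_logits, hidden_states, *, num_experts, top_k,
                    enable_pdl=True, routing_replay_out=None):
    # Everything after `*` is keyword-only: adding, removing, or reordering
    # a config parameter can never silently shift its neighbours.
    return {"num_experts": num_experts, "top_k": top_k,
            "routing_replay_out": routing_replay_out}


# Call sites are forced to name the config arguments:
cfg = trtllm_moe_call("logits", "hidden", num_experts=128, top_k=8)
print(cfg["num_experts"])  # 128

# A positional attempt fails loudly instead of mis-binding:
try:
    trtllm_moe_call("logits", "hidden", 128, 8)
except TypeError as e:
    print(e)
```

This pushes the fix from convention ("remember to use kwargs") into the signature itself, so future parameters like `routing_replay_out` cannot be dropped or misrouted by position.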
📒 Files selected for processing (1)
flashinfer/fused_moe/core.py
/bot run
Merge commit: Route the missing parameter for `trtllm_fp8_per_tensor_scale_moe_op` (#3094)

Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Alex Yang <aleyang@nvidia.com>
📌 Description
Fix vLLM CI failure for 0.6.8 - https://buildkite.com/vllm/ci/builds/61703/steps/canvas?jid=019d97e7-d69d-4b1a-a597-95a021d29060&tab=output#019d97e7-d69d-4b1a-a597-95a021d29060
The public `trtllm_fp8_per_tensor_scale_moe` wrapper at line 2559 calls into `_op` via the `SimpleNamespace` returned by `get_trtllm_moe_sm100_module()` (line 2315), so user-facing callers hit the same error.

The `routing_replay_out` trailing argument was added in #3024 (2026-04-15). That PR updated the public wrapper's call list but not the inner `_op`'s call to the C++ binding.
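The dispatch path described above (public wrapper → `SimpleNamespace` of ops → inner `_op` → C++ binding) follows the pattern sketched below. This is a simplified stand-in, not the real `get_trtllm_moe_sm100_module`; it only illustrates why an argument accepted by the wrapper but not forwarded at one hop silently disappears.

```python
from types import SimpleNamespace
from functools import cache


@cache
def get_trtllm_moe_module():
    # Stand-in for get_trtllm_moe_sm100_module(): the real function compiles
    # the kernels once and exposes the registered ops as namespace attributes.
    def _op(hidden_states, num_experts, routing_replay_out=None):
        # Inner op: this is the layer that must forward routing_replay_out
        # on to the C++ binding; dropping it here loses the argument.
        return {"result": hidden_states, "replay": routing_replay_out}

    return SimpleNamespace(trtllm_fp8_per_tensor_scale_moe_op=_op)


def trtllm_fp8_per_tensor_scale_moe(hidden_states, num_experts,
                                    routing_replay_out=None):
    # Public wrapper: callers only ever see this function, so a parameter
    # accepted here but not routed through every hop silently disappears.
    return get_trtllm_moe_module().trtllm_fp8_per_tensor_scale_moe_op(
        hidden_states, num_experts, routing_replay_out=routing_replay_out,
    )


out = trtllm_fp8_per_tensor_scale_moe("h", 128, routing_replay_out="replay_buf")
print(out["replay"])
```

The bug this PR fixes corresponds to the middle hop omitting `routing_replay_out=...`: the wrapper's parameter would be accepted and then thrown away before reaching the kernel.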
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes

Summary by CodeRabbit

Refactor

- Improved MoE routing output handling in TensorRT-LLM FP16/BF16 and FP8 inference paths so optional replay outputs are accepted and processed for more robust Mixture-of-Experts inference.

Chores

- Relaxed version tag validation in the release workflow to accept an additional optional segment after the patch number.