fix: int32 overflow in trtllm_fp4_block_scale_moe causing "Unsupported hidden state scale shape" for EP32+ configs #2853
Code Review
This pull request correctly addresses a critical int32 overflow bug in trtllm_fp4_block_scale_moe by promoting the multiplication to 64-bit. The proactive fix for weight_scale_vec_size is also a good addition for safety. I have one minor suggestion to improve consistency between the two related variable declarations.
|
/bot run |
|
[FAILED] Pipeline #46822943: 13/20 passed |
yzh119
left a comment
LGTM, thanks for the bugfix.
📌 Description
Fix `int32` overflow in `trtllm_fp4_block_scale_moe` that causes a misleading `NotImplementedError: Unsupported hidden state scale shape` when deploying large Expert Parallel configurations (e.g., EP32 with DeepSeek-R1 NVFP4).

Step 1, NVFP4 activation quantization (per EP rank)
Each of the 32 EP ranks quantizes its local activations via `vllm.ops.scaled_fp4_quant` with `is_sf_swizzled_layout=False`. From nvfp4_quant_entry.cu:

```cpp
output_sf = torch::empty(
    {m, n / CVT_FP4_SF_VEC_SIZE},
    torch::TensorOptions().device(device).dtype(torch::kUInt8));
```

For m=10240 (`max_num_batched_tokens`), n=7168 (hidden_size):

- `hidden_states`: `[10240, 3584]` uint8 (FP4 packed, 2 values per byte)
- `hidden_states_scale`: `[10240, 448]` uint8 → viewed as `float8_e4m3fn`

No padding is applied in the non-swizzled layout. Scale numel = 10240 × 448 = 4,587,520.

Step 2, EP allgather via dispatch()
`MoEPrepareAndFinalizeNaiveDPEPModular.prepare()` in naive_dp_ep.py calls `get_ep_group().dispatch()`, which allgathers both `hidden_states` and `hidden_states_scale` (passed as `extra_tensors`) across all 32 EP ranks:

- `hidden_states`: 32 × `[10240, 3584]` → `[327680, 3584]`
- `hidden_states_scale`: 32 × `[10240, 448]` → `[327680, 448]`

Step 3, Scale reshape in vLLM wrapper
In trtllm_nvfp4_moe.py, the scale is reshaped before being passed to FlashInfer. At this point `hidden_states_scale.numel()` = 327680 × 448 = 146,800,640.

Step 4, int32 overflow in FlashInfer C++ kernel
In csrc/trtllm_fused_moe_kernel_launcher.cu, the scale vector size is computed in 32-bit arithmetic. The overflow:

- 327680 × 7168 = 2,348,810,240, but `INT_MAX` = 2,147,483,647
- 2,348,810,240 > `INT_MAX`: signed int32 overflow (undefined behavior in C++; wraps to -1,946,157,056 on two's-complement architectures)
- vec_size = -1,946,157,056 / 146,800,640 = -13
- Since -13 ≠ 16 and -13 ≠ 32, the launcher throws "Unsupported hidden state scale shape"
Step 5, the overflow threshold
Overflow threshold for DeepSeek-R1 (hidden_size=7168):

- Max safe gathered tokens: `INT_MAX` / 7168 = 299,593
- EP32 per-rank limit: 299,593 / 32 ≈ 9,362
- Any max_num_batched_tokens > 9362 with EP32 triggers the overflow
We confirmed the overflow boundary on an 8-node GB200 cluster (32 GPUs, EP32, DP32) with --all2all-backend allgather_reducescatter.

Reproduction
vLLM serve with EP32:
Crashes during engine initialization with:
`NotImplementedError: Unsupported hidden state scale shape` (also found in vllm-project/vllm#36022 (comment)).

Promote the multiplication operands to int64_t before the division to prevent overflow:
- `hidden_states_scale_vec_size`: Cast num_tokens to int64_t so the multiplication chain executes in 64-bit.
- `weight_scale_vec_size`: Apply the same pattern with local_num_experts cast to int64_t, and declare the variable as int64_t for consistency and safety.
Impact
Zero performance impact: these are CPU-side setup computations executed once before GPU kernel launch.
Zero API change: No function signatures are modified.
Unblocks: EP32+ deployments for large-hidden-size models (DeepSeek-R1, etc.) with max_num_batched_tokens above the int32 threshold.
Environment
Model: DeepSeek-R1-0528-FP4 (NVFP4, hidden_size=7168)
Hardware: 8× GB200 nodes (32 GPUs), disaggregated prefill-decode
Configuration: DP=32, EP=32, TP=1, PP=1
vLLM: v0.17.2rc1 (bundled FlashInfer)
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests are passing (unittest, etc.).

Reviewer Notes