[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access#21465
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request addresses a critical "illegal memory access" bug in the NVFP4 MoE kernel. The root cause appears to be an optimization (SWAP_AB) that swaps matrix dimensions for performance, which is not supported on the FP4 path.
The fix is implemented in csrc/quantization/cutlass_w8a8/moe/moe_data.cu and involves:
- Introducing a `may_swap_ab` flag that explicitly disables the `SWAP_AB` optimization when using the FP4 path (identified by the presence of `blockscale_offsets`).
- Refactoring the CUDA kernels (`compute_expert_offsets`, `compute_expert_blockscale_offsets`) to accept this explicit boolean flag instead of inferring the logic from `topk_length`.
The changes are logical, well-contained, and directly address the reported crash. The refactoring also improves code clarity by making the dependency on the SWAP_AB optimization explicit. The fix appears correct and robust.
bool may_swap_ab = (!blockscale_offsets.has_value()) &&
                   (topk_ids.numel() <= SWAP_AB_THRESHOLD);
Would it make sense to instead add a boolean argument to get_cutlass_moe_mm_data() that forces no swap? It looks like disabling the swap will also be needed for the fp8 blockwise CUTLASS path, and that path doesn't pass blockscale_offsets to this function.
run_cutlass_moe_fp8 or run_cutlass_block_scaled_fused_experts: which path are you talking about?
I don't have enough context, so I'm thinking we can do that in a follow-up PR.
I mean the get_cutlass_moe_mm_data() call in run_cutlass_block_scaled_fused_experts() :) But I can add that change in a separate PR.
Can you check if this fails with modelopt fp4 as well, since it should use the same kernel?
…ry access (vllm-project#21465) Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Fixes #21399
Test
lm_eval --model vllm --model_args "pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,max_model_len=32768,enable_expert_parallel=True,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Origin:
Now: