
Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" #34530

Merged

vllm-bot merged 1 commit into main from revert-34279-fix-fused-moe-int64-strides on Feb 13, 2026

Conversation

@mgoin (Member) commented Feb 13, 2026

Reverts #34279 due to the large performance degradations reported. We will pursue a similar fix with more careful performance analysis later.

@mgoin mgoin requested a review from pavanimajety as a code owner February 13, 2026 18:33
@mergify mergify bot added the bug Something isn't working label Feb 13, 2026
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request correctly reverts the explicit typing of stride parameters to tl.int64 in the fused_moe_kernel_gptq_awq and fused_moe_kernel Triton kernels. The motivation for this revert is to address significant performance degradations introduced by the original change. While this action knowingly reintroduces a bug concerning potential integer overflows with very large tensors, it is a pragmatic trade-off to restore performance. The intention to investigate a more performant solution for the overflow issue is acknowledged. The revert is implemented correctly.

@vllm-bot vllm-bot merged commit bfaa559 into main Feb 13, 2026
11 of 12 checks passed
@vllm-bot vllm-bot deleted the revert-34279-fix-fused-moe-int64-strides branch February 13, 2026 18:35
haosdent added a commit to haosdent/vllm that referenced this pull request Feb 14, 2026
…egression

PR vllm-project#34279 annotated all stride parameters as tl.int64 to fix an int32
overflow crash, but this caused ~60x perf regression on small GPUs (e.g.
NVIDIA GB10) due to register pressure. PR vllm-project#34530 reverted that fix.

This patch prevents the overflow with minimal register impact by casting
offs_token to int64 after loading instead of widening all strides. When
chunking is disabled and M is large, stride_cm * offs_token (where
stride_cm = N = w1.size(1) and offs_token up to M*topk) can exceed
int32 max. The cast leverages Triton type promotion (int32 * int64 ->
int64) following the existing pattern used for off_experts and offs_bn.

Adds a regression test that disables chunking with M=100000, n=2048,
topk=6 (product = 4096 * 600000 = 2.46B > int32 max) and validates
correctness against the torch_moe reference.

Fixes vllm-project#34413

Signed-off-by: haosdent <haosdent@gmail.com>
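The overflow described in the commit message above can be sketched numerically. The following is a minimal illustration, not the actual kernel code: it uses NumPy arrays to stand in for Triton's in-kernel int32 arithmetic (both wrap on overflow and both promote int32 * int64 to int64), with the shapes taken from the regression test described in the commit text.

```python
import numpy as np

# Hypothetical values mirroring the regression test above:
# M = 100000 tokens, topk = 6, and an output row stride of 4096.
stride_cm = np.array([4096], dtype=np.int32)            # row stride of the output
offs_token = np.array([100_000 * 6], dtype=np.int32)    # largest token offset, M * topk

# int32 * int32 stays int32 and wraps silently:
# 4096 * 600000 = 2_457_600_000 > 2**31 - 1, so the index goes negative.
wrapped = int((stride_cm * offs_token)[0])

# The patch's approach: widen only offs_token to int64 after loading.
# int32 * int64 promotes to int64, so the address math no longer wraps,
# and no other stride argument needs to be widened.
safe = int((stride_cm * offs_token.astype(np.int64))[0])

print(wrapped)  # a negative, wrapped-around value
print(safe)     # 2457600000
```

Widening one index vector rather than every stride parameter is what keeps the register footprint close to the original kernel, which is the trade-off the reverted PR #34279 got wrong.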
wzhao18 pushed a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
… strides" (vllm-project#34530)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
… strides" (vllm-project#34530)

Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
… strides" (vllm-project#34530)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026


3 participants