fused_moe_kernel - cast accumulator after applying router weights #32002

Merged
DarkLight1337 merged 1 commit into vllm-project:main from gnovack:fused-moe-cast-fix
Jan 10, 2026

Conversation

@gnovack
Contributor

@gnovack gnovack commented Jan 9, 2026

Purpose

The tests/lora/test_olmoe_tp.py test_olmoe_lora_mixed test case has been failing since #31676 was merged. Previously, moe_weight (the router weights) was applied to the accumulator in float32; after #31676, this multiplication is performed after the accumulator has been cast to compute_type.

This change in behavior caused the above test case to begin failing, and introduced a slight accuracy degradation based on lm-eval results on gsm8k and mmlu_pro.
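A minimal NumPy sketch (not the actual Triton kernel; fp16 stands in for compute_type, and the router weight value is made up) of why the cast order matters:

```python
import numpy as np

# Illustrative only: this models just the cast ordering, not the kernel itself.
rng = np.random.default_rng(0)
accumulator = rng.standard_normal(4096).astype(np.float32)  # fp32 accumulator
moe_weight = np.float32(0.3137)                             # hypothetical router weight

exact = accumulator * moe_weight  # fp32 reference

# Before #31676 (and after this PR): multiply in fp32, cast once at the end.
cast_last = exact.astype(np.float16)

# After #31676 (the regression): cast first, then multiply in reduced precision.
cast_first = accumulator.astype(np.float16) * np.float16(moe_weight)

err_cast_last = np.abs(cast_last.astype(np.float32) - exact).mean()
err_cast_first = np.abs(cast_first.astype(np.float32) - exact).mean()
# Casting early rounds twice instead of once, so its mean error is larger.
```

The extra rounding step is tiny per element, but across an MoE layer it is enough to move benchmark scores, which matches the small gsm8k/mmlu_pro shifts reported above.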

Results Before #31676:

gsm8k:  0.680
mmlu_pro: 0.2229

Results on main:

gsm8k:  0.676
mmlu_pro: 0.2218

Test Plan

  • Run MoE Tests
  • Run Failing LoRA test
  • Rerun lm-eval tests

Test Result

tests/lora/test_olmoe_tp.py::test_olmoe_lora_mixed now passes.

lm-eval results after the change in this PR:

gsm8k: 0.680
mmlu_pro: 0.2229

Note

Adjusts compute order in the Triton fused_moe_kernel to improve numerical behavior.

  • Applies b_scale/a_scale dequant multipliers without early casting; defers accumulator.to(compute_type) until after adding bias and multiplying router moe_weight
  • Simplifies dequant branches: for int8_w8a16, multiply by b_scale; for fp8_w8a8/int8_w8a8 tensor-wise or per-channel (non-block-wise), multiply by a_scale * b_scale
  • No API changes; affects only internal accumulation/casting order in vllm/model_executor/layers/fused_moe/fused_moe.py
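The ordering described in the summary can be sketched in plain Python (a hypothetical stand-in for the kernel epilogue, not the Triton code in fused_moe.py; parameter names follow the summary, and fp16 stands in for compute_type):

```python
import numpy as np

def fused_moe_epilogue(accumulator, a_scale, b_scale, bias, moe_weight,
                       compute_type=np.float16):
    # Hypothetical sketch of the epilogue after this PR: every multiplier is
    # applied while the accumulator is still fp32, and the cast to
    # compute_type happens exactly once, at the very end.
    acc = accumulator * (a_scale * b_scale)  # dequant (fp8_w8a8 / int8_w8a8, non-block-wise)
    acc = acc + bias                         # bias added in fp32
    acc = acc * moe_weight                   # router weight applied in fp32
    return acc.astype(compute_type)          # deferred cast

out = fused_moe_epilogue(
    accumulator=np.ones(8, dtype=np.float32),
    a_scale=np.float32(0.5), b_scale=np.float32(2.0),
    bias=np.zeros(8, dtype=np.float32), moe_weight=np.float32(0.25),
)
```

For the int8_w8a16 path the dequant step would multiply by b_scale alone, per the summary; everything else is unchanged.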

Written by Cursor Bugbot for commit e78572a.

Signed-off-by: gnovack <gnovack@amazon.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses the accuracy regression and test failures by ensuring the router weights are applied to the accumulator in float32 precision before casting to the compute_type. The change moves the type casting to after the router weight multiplication, which restores the intended numerical behavior. The code is also simplified by removing redundant type casts. The changes look good and are well-justified by the test results provided in the description.

Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath left a comment


LGTM! Thanks @gnovack

@varun-sundar-rabindranath
Contributor

cc @mgoin PTAL 🙌

Member

@mgoin mgoin left a comment


Seems reasonable to me, thanks for the fix

@mgoin mgoin added the ready and ci-failure labels Jan 10, 2026
@DarkLight1337 DarkLight1337 merged commit d1fd802 into vllm-project:main Jan 10, 2026
58 of 59 checks passed
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ci-failure: Issue about an unexpected test failure in CI
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants