[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling#31931
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request correctly addresses an accuracy regression on ROCm for MoE models by reordering operations in the fused_moe_kernel. The change ensures that router weight multiplication is performed on float32 accumulators before any precision conversion to compute_type (e.g., bf16/fp16). This prevents precision loss that was causing non-deterministic expert routing on ROCm. The logic is sound, well-explained in the PR description, and the change is correctly localized. The corresponding fused_moe_kernel_gptq_awq kernel already has the correct order of operations, so no changes are needed there. The updated comment for bias addition also improves clarity. The changes look good to me.
Thanks @AndreasKaratzas. I have a question: is this PR mostly for LoRA scenarios? In fact, the ordering: Note: platform is ROCm MI355.
@xuebwang-amd Could you help me with testing accuracy? Is there a command you're aware of?
I just tested on MI325 using
Let's prioritize ensuring accuracy. I have added a unit test
If your new PR gets merged, please test it before merging it with
Oh, I'll see if I can reproduce on MI355. If so, maybe we need more robust logic here (covering LoRA, gpt-oss, etc.).
On MI355, both
For MI325, I am looking for an available resource to run the tests. Will update them later.
I'm getting the feeling that this is architectural. I bisected the bug before I commented on your PR and made this bug fix. On MI325, the results were the exact opposite (except the gpt test, which I didn't have in mind). If you make another related PR, please CC me so that I can test on MI325 as well :) Also, for the #29008 PR, PM me so that we can coordinate an effort to run an eval with those changes on MI325 too. It's important that the CI is as green as possible.
Here is a new PR #31962 about the computation flow. I'd welcome your feedback and review. Thanks.
… weight scaling (vllm-project#31931) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… weight scaling (vllm-project#31931) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Fixes MoE accuracy regression on ROCm introduced in #31676.
Problem
PR #31676 reordered post-accumulation operations to add bias after dequantization. However, this also moved `MUL_ROUTED_WEIGHT` after the `.to(compute_type)` precision conversion, causing router weight multiplication to occur in bf16/fp16 instead of float32.

This precision loss causes different outputs on ROCm due to differences in mixed-precision handling between the ROCm and CUDA Triton backends. The issue manifests as non-deterministic expert routing, particularly visible in LoRA mixed-adapter tests.
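To make the precision loss concrete, here is a minimal sketch (not vLLM code) that models the float32 → bf16 conversion as mantissa truncation; real conversions typically round to nearest-even, but the same class of discrepancy arises. The accumulator value and router weight are illustrative only:

```python
import struct

def to_bf16(x: float) -> float:
    # Model float32 -> bfloat16 by truncating the low 16 mantissa bits.
    # (Simplification: hardware conversions usually round-to-nearest-even.)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

acc = 1.0077  # hypothetical float32 accumulator value
w = 100.0     # router weight, exactly representable in bf16

correct = to_bf16(acc * w)          # multiply in float32, then downcast
broken = to_bf16(to_bf16(acc) * w)  # downcast first, then multiply

print(correct, broken)  # 100.5 100.0
```

A single such mismatch is enough to flip a top-k comparison in the router, which is how the reduced-precision ordering shows up as non-deterministic expert selection.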
Root Cause
Before #31676 (correct):
After #31676 (broken on ROCm):
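In the absence of the original diffs, the two orderings can be sketched in plain Python (hypothetical helper names; `trunc_bf16` stands in for the kernel's `.to(compute_type)` step and uses truncation for illustration):

```python
import struct

def trunc_bf16(x: float) -> float:
    # Stand-in for .to(compute_type): drop the low 16 mantissa bits
    # of a float32 value (bf16 via truncation, for illustration only).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def epilogue_broken(acc, routed_weight, bias):
    # After #31676: convert first, so the routed-weight multiply
    # happens in reduced precision.
    return trunc_bf16(acc) * routed_weight + bias

def epilogue_fixed(acc, routed_weight, bias):
    # This PR: multiply on the float32 accumulator, then convert,
    # keeping the bias add after the conversion as #31676 intended.
    return trunc_bf16(acc * routed_weight) + bias

# Illustrative values where the two orderings disagree:
print(epilogue_fixed(1.0077, 100.0, 0.0))   # 100.5
print(epilogue_broken(1.0077, 100.0, 0.0))  # 100.0
```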
Fix
Restore `MUL_ROUTED_WEIGHT` before the precision conversion while preserving the correct bias placement after dequantization.

Testing
`test_olmoe_lora_mixed` passes on ROCm.

Fixes CI regression from #31676.