[ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled#31523
Conversation
Signed-off-by: ganyi <ygan@amd.com>
Code Review
This pull request addresses a critical accuracy issue on ROCm for MoE models when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS is enabled. The fix involves enabling the grouped_topk custom op to ensure correct kernel dispatch. While the approach is correct, the implementation could be more robust. I've suggested an improvement to handle cases where a user might have explicitly disabled this custom op, which would otherwise cause the fix to fail silently.
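The suggested hardening could look something like this minimal sketch. It assumes vLLM's convention that entries in `compilation_config.custom_ops` are op names prefixed with `+` (enable) or `-` (disable); the helper name `ensure_grouped_topk` and the warning text are hypothetical, not vLLM's actual code:

```python
import logging

logger = logging.getLogger(__name__)


def ensure_grouped_topk(custom_ops: list[str]) -> list[str]:
    """Force-enable grouped_topk unless the user explicitly opted out.

    Hypothetical helper: "+op" enables and "-op" disables a custom op,
    mirroring the assumed custom_ops list convention.
    """
    if "-grouped_topk" in custom_ops:
        # User explicitly disabled it; don't silently override, but warn
        # so the accuracy risk is visible instead of failing silently.
        logger.warning(
            "grouped_topk custom op is disabled; AITER fused shared "
            "experts may produce incorrect results."
        )
    elif "+grouped_topk" not in custom_ops:
        custom_ops.append("+grouped_topk")
    return custom_ops
```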
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo can you check whether, when aiter fused moe is enabled, we are using … This is the reason why I am very reluctant to change everything to custom ops. The default behaviour is now controlled through … @ProExpertProg @MengqingCao @xinyu-intel Can we ask developers who are migrating ops to custom ops to write unit tests for the custom op classes on all platforms in their PRs? Regular unit tests still pass with the wrong code path (…).
@tjtanaa We might still fall into the native path when fused shared experts is not enabled, but what you said is exactly what I thought. We should add …
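A platform-dispatch unit test of the kind being suggested here might be sketched as follows. This is a toy stand-in, not vLLM's actual `CustomOp` machinery; all names are hypothetical. The point is that the test pins the *chosen forward path*, so a regression that silently falls back to the native implementation is caught even when both paths compute the same result:

```python
class GroupedTopK:
    """Toy stand-in for a CustomOp-style class that picks a backend
    implementation at construction time (hypothetical, for illustration)."""

    def __init__(self, use_aiter: bool):
        # Dispatch once, the way a custom op would select its forward impl.
        self._forward = self.forward_hip if use_aiter else self.forward_native

    def forward_native(self, x):
        return ("native", x)

    def forward_hip(self, x):
        return ("aiter", x)

    def __call__(self, x):
        return self._forward(x)


def test_dispatch_path():
    # Assert on the path taken, not just the numerical output.
    assert GroupedTopK(use_aiter=True)(1)[0] == "aiter"
    assert GroupedTopK(use_aiter=False)(1)[0] == "native"
```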
Signed-off-by: ganyi <ygan@amd.com>
@tjtanaa take another look please, I just made …
vllm/platforms/rocm.py
Outdated
```python
compilation_config = vllm_config.compilation_config
parallel_config = vllm_config.parallel_config
is_eager_execution = compilation_config == CUDAGraphMode.NONE
use_aiter = rocm_aiter_ops.is_enabled()
```
Let's call `rocm_aiter_ops.is_fused_moe_enabled()` directly here, as described in the docstring of `_aiter_ops` (https://github.com/vllm-project/vllm/blob/51085c2aebe7df2c53dbe9a44d89bc3ee761793f/vllm/_aiter_ops.py#L772C8-L772C42).
Top-k routing functions are part of the MoE-related ops.
vllm/platforms/rocm.py
Outdated
```python
use_aiter_rms_norm = rocm_aiter_ops.is_rmsnorm_enabled()
use_aiter_fp8_linear = rocm_aiter_ops.is_linear_fp8_enabled()
use_aiter_shared_expert = (
    rocm_aiter_ops.is_fused_moe_enabled()
```
We don't need the `and` because `rocm_aiter_ops.is_fusion_moe_shared_experts_enabled()` already checks for both, as shown at Line 887 in 51085c2.
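The layering being suggested — where `is_fusion_moe_shared_experts_enabled()` already subsumes the fused-MoE check, so call sites need no extra `and` — could be sketched like this. The environment-variable names match the flags discussed in this PR, but the implementations and defaults here are simplified assumptions, not vLLM's actual code:

```python
import os


def _env_flag(name: str) -> bool:
    # Assumed convention: flags are "0"/"1" environment variables.
    return os.environ.get(name, "0") == "1"


def is_fused_moe_enabled() -> bool:
    # Assumed: AITER globally enabled AND the MoE flag enabled.
    return _env_flag("VLLM_ROCM_USE_AITER") and _env_flag("VLLM_ROCM_USE_AITER_MOE")


def is_fusion_moe_shared_experts_enabled() -> bool:
    # Already includes the fused-MoE check, so callers don't need
    # an extra "and is_fused_moe_enabled()".
    return is_fused_moe_enabled() and _env_flag(
        "VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS"
    )
```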
Signed-off-by: ganyi <ygan@amd.com>
…USION_SHARED_EXPERTS` enabled (vllm-project#31523) Signed-off-by: ganyi <ygan@amd.com>
Purpose
This PR fixes the accuracy issue introduced in #31221, which caused all models with an MoE structure to have a serious accuracy regression when shared experts fusion is enabled. This PR adds `+grouped_topk` into `custom_ops` so the dispatch actually happens.
Test Plan
Test deepseek-r1 on gsm8k
Test Result
Current vLLM main branch:
With this fix: