[ROCm][Bugfix][Perf] enable shared expert fusion for Qwen3.5 #44434
Open
nholmber wants to merge 1 commit into
Hi @nholmber, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Contributor
Author
Tagging @tjtanaa as the reviewer of the earlier related PR for Qwen3-Next
BowenBao
reviewed
Jun 3, 2026
is_fse = (
    rocm_aiter_ops.is_fusion_moe_shared_experts_enabled()
    and not isinstance(get_current_vllm_config().quant_config, QuarkConfig)
Contributor
Ideally, I'd suggest a helper function that checks whether the shared experts and fused experts share the same quant spec, i.e. whether they can be fused.
Contributor
Author
Adjusted. Does it match what you had in mind?
7689fc5 to 042ba2d
auto-merge was automatically disabled
June 4, 2026 18:13
Head branch was pushed to by a user without write access
62b3a8a to 7b3faff
When VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, the shared expert
weights need to be remapped from their checkpoint names
(shared_expert.gate_proj etc.) to the fused expert slot
(experts.{num_routed}.gate_proj) so they load into FusedMoE's fused
expert tensor at the correct index.
Without this fix, the shared expert weights silently fail to load,
producing garbage output.
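As a rough illustration of the remapping described above, the following sketch rewrites a shared expert checkpoint name so that it targets the extra fused expert slot. The helper name `remap_shared_expert_name`, the example weight paths, and the standalone form are assumptions made for illustration; the actual logic in this PR lives inside `load_weights` and may differ in detail.

```python
def remap_shared_expert_name(name: str, num_routed_experts: int) -> str:
    """Hypothetical helper: map a shared expert weight name to the fused slot.

    Routed experts occupy indices 0..num_routed_experts-1, so the shared
    expert is placed at index num_routed_experts.
    """
    # The trailing dot keeps other names that merely start with
    # "shared_expert" (for example a gate weight, if present) untouched.
    marker = ".shared_expert."
    if marker not in name:
        return name  # routed expert or non-MoE weight: leave as-is
    return name.replace(marker, f".experts.{num_routed_experts}.", 1)


if __name__ == "__main__":
    # Assumed example names; real checkpoint prefixes may differ.
    print(remap_shared_expert_name(
        "model.layers.0.mlp.shared_expert.gate_proj.weight", 512))
    # -> model.layers.0.mlp.experts.512.gate_proj.weight
    print(remap_shared_expert_name(
        "model.layers.0.mlp.experts.3.up_proj.weight", 512))
    # -> model.layers.0.mlp.experts.3.up_proj.weight
```

With 512 routed experts, the shared expert lands at index 512, which is consistent with the E=512 -> E=513 expert count reported in the validation notes.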
Changes:
- Import rocm_aiter_ops for FSE flag check
- Increment num_experts by 1 when FSE enabled (shared expert slot)
- Remap shared_expert.* weight names to experts.{num_routed}.*
- Reset is_fused_expert for shared expert weights (they have separate
gate_proj/up_proj, not fused gate_up_proj like routed experts)
Validated on Qwen3.5-397B-A17B-FP8 TP2 MI355X:
- Accuracy: FSE=0 98%/98%, FSE=1 94%/94% (GSM8K 5-shot limit=100)
- Perf: +8-17% throughput, -7-15% TPOT across conc 4-64
- Traces confirm E=512 K=10 -> E=513 K=11 fusion active
Signed-off-by: Nico Holmberg <nico.holmberg@amd.com>
7b3faff to 1301d85
Purpose
The existing FSE (Fused Shared Expert) support (#39280) works for Qwen3-Next but fails on Qwen3.5 models because qwen3_5.py's load_weights does not remap shared expert checkpoint weights to the fused expert slot. As a result, shared expert weights silently fail to load, producing garbage output when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1.
Root Cause
When FSE is enabled, Qwen3NextSparseMoeBlock.__init__ (which Qwen3.5 inherits) sets self.shared_expert = None and passes n_shared_experts=1 to FusedMoE, which handles the shared expert internally as fused expert slot E+1.
Qwen3-Next's load_weights remaps checkpoint names accordingly. Qwen3.5, however, overrides load_weights with its own expert loading logic (supporting both fused gate_up_proj and separate gate_proj/up_proj checkpoint formats) and was missing this remapping.
Additionally, Qwen3.5's is_fused_expert flag (set by routed expert weights in fused format) persists across weights, causing the remapped shared expert to enter the wrong loading path.
Changes
qwen3_5.py:
- Remap shared expert weight names (shared_expert.* → experts.{num_experts}.*)
- Reset is_fused_expert and expert_params_mapping for shared expert weights (they have separate gate_proj/up_proj, not fused gate_up_proj)
qwen3_next.py:
Test Plan
- Qwen/Qwen3.5-397B-A17B-FP8
- vllm/vllm-openai-rocm:nightly-626fa9bba5663a5cf6a870debf031ee344ddb822
- GSM8K 5-shot (local-completions, full 1319 samples)
- vllm bench serve, random 1k/1k at conc 4/8/16/32/64
Test Results
Accuracy (GSM8K 5-shot, full 1319 samples)
No accuracy degradation.
Throughput — TP2 (1k/1k random)
Throughput — TP4 (1k/1k random)