[ROCm][Bugfix][Perf] enable shared expert fusion for Qwen3.5 by nholmber · Pull Request #44434 · vllm-project/vllm

nholmber · 2026-06-03T16:39:45Z

Purpose

The existing FSE (Fused Shared Expert) support (#39280) works for Qwen3-Next but fails on Qwen3.5 models because qwen3_5.py's load_weights does not remap shared expert checkpoint weights to the fused expert slot. This causes shared expert weights to silently fail to load, producing garbage output when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1.

Root Cause

When FSE is enabled, Qwen3NextSparseMoeBlock.__init__ (which Qwen3.5 inherits) sets self.shared_expert = None and passes n_shared_experts=1 to FusedMoE, which handles the shared expert internally as fused expert slot E+1.

Qwen3-Next's load_weights remaps checkpoint names accordingly:

# qwen3_next.py (already works)
if is_fse and "mlp.shared_expert." in name:
    name = name.replace("mlp.shared_expert.", f"mlp.experts.{num_routed}.")

But Qwen3.5 overrides load_weights with its own expert loading logic (supporting both fused gate_up_proj and separate gate_proj/up_proj checkpoint formats) and was missing this remapping.

Additionally, Qwen3.5's is_fused_expert flag (set by routed expert weights in fused format) persists across weights, causing the remapped shared expert to enter the wrong loading path.
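The two problems described above can be illustrated in isolation. The following is a minimal sketch, not the code in qwen3_5.py: `LoaderState` and `remap_shared_expert_name` are hypothetical names introduced here, and only the `str.replace` pattern is taken from the qwen3_next.py snippet above.

```python
from dataclasses import dataclass


@dataclass
class LoaderState:
    """Hypothetical stand-in for state that persists across weights."""
    is_fused_expert: bool = False


def remap_shared_expert_name(name: str, num_routed: int, is_fse: bool,
                             state: LoaderState) -> str:
    """Return the parameter name a checkpoint weight should load into."""
    marker = "mlp.shared_expert."
    if not (is_fse and marker in name):
        # FSE disabled, or not a shared expert weight: leave untouched.
        return name
    # Shared expert checkpoints keep gate_proj/up_proj separate, so a
    # fused-format flag left over from routed experts must not carry over.
    state.is_fused_expert = False
    # The shared expert takes the slot right after the routed experts.
    return name.replace(marker, f"mlp.experts.{num_routed}.")


state = LoaderState(is_fused_expert=True)  # set earlier by a routed expert
new_name = remap_shared_expert_name(
    "model.layers.0.mlp.shared_expert.gate_proj.weight", 512, True, state)
print(new_name, state.is_fused_expert)
```

Because the marker ends in a dot, names such as `mlp.shared_expert_gate.weight` are not rewritten by this sketch.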

Changes

qwen3_5.py:

- Add FSE weight name remapping (shared_expert.* → experts.{num_experts}.*)
- Reset is_fused_expert and expert_params_mapping for shared expert weights (they have separate gate_proj/up_proj, not fused gate_up_proj)

qwen3_next.py:

- Guard FSE against Quark/MXFP4 quantization, with an explicit warning when falling back
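The guard-with-fallback pattern for qwen3_next.py might look roughly like the sketch below. It is an illustration under assumptions: `QuarkConfig` here is a local placeholder class rather than the real vLLM class, `should_fuse_shared_experts` is a hypothetical helper name, and the warning text is invented for the example.

```python
import logging

logger = logging.getLogger(__name__)


class QuarkConfig:
    """Placeholder for the Quark quantization config (assumption)."""


def should_fuse_shared_experts(fse_requested: bool,
                               quant_config: object) -> bool:
    """Decide whether the shared expert may be fused into FusedMoE."""
    if not fse_requested:
        return False
    if isinstance(quant_config, QuarkConfig):
        # Fall back to the unfused path, but say so explicitly.
        logger.warning(
            "Shared expert fusion was requested but is disabled for "
            "Quark-quantized models; using the unfused shared expert.")
        return False
    return True
```

Keeping the decision in one function means the model's `__init__` and `load_weights` can consult the same answer, so the module layout and the weight name remapping cannot disagree.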

Test Plan

Model: Qwen/Qwen3.5-397B-A17B-FP8
Image: vllm/vllm-openai-rocm:nightly-626fa9bba5663a5cf6a870debf031ee344ddb822
Hardware: MI355X, ROCm 7.2.2
Accuracy: lm_eval GSM8K 5-shot (local-completions, full 1319 samples)
Throughput: vllm bench serve, random 1k/1k at conc 4/8/16/32/64

# Server
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 \
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --attention-backend ROCM_AITER_UNIFIED_ATTN \
  --async-scheduling \
  --language-model-only \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --reasoning-parser qwen3

# lm_eval
lm_eval --model local-completions \
  --model_args 'model=Qwen/Qwen3.5-397B-A17B-FP8,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=50,tokenized_requests=False,max_gen_toks=512' \
  --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 44

Test Results

Accuracy (GSM8K 5-shot, full 1319 samples)

Config	Flexible	Strict
TP2 FSE=0	95.45% ± 0.57	95.15% ± 0.59
TP2 FSE=1	95.75% ± 0.56	95.60% ± 0.56
TP4 FSE=1	95.15% ± 0.59	95.07% ± 0.60

No accuracy degradation.
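As a rough sanity check on that conclusion, the TP2 scores can be compared against their combined standard errors. This is a sketch only: it treats the two runs as independent, which is an assumption (both runs use the same 1319 samples), and `compare_scores` is a hypothetical helper.

```python
import math


def compare_scores(a: float, a_err: float, b: float, b_err: float):
    """Return (absolute difference, combined stderr, within-one-stderr?)."""
    diff = abs(a - b)
    combined = math.sqrt(a_err ** 2 + b_err ** 2)
    return diff, combined, diff <= combined


# TP2 FSE=0 vs TP2 FSE=1, values copied from the table above.
for label, args in {
    "flexible": (95.45, 0.57, 95.75, 0.56),
    "strict": (95.15, 0.59, 95.60, 0.56),
}.items():
    diff, combined, ok = compare_scores(*args)
    print(f"{label}: diff={diff:.2f} combined_stderr={combined:.2f} ok={ok}")
```

Both differences fall inside one combined standard error, which is consistent with the statement above.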

Throughput — TP2 (1k/1k random)

Conc	FSE=0 tok/s	FSE=1 tok/s	Δ Throughput	Δ TPOT
4	227	265	+16.4%	-14.5%
8	399	468	+17.3%	-14.7%
16	684	783	+14.5%	-13.0%
32	1117	1244	+11.4%	-10.6%
64	1137	1226	+7.8%	-7.4%

Throughput — TP4 (1k/1k random)

Conc	FSE=0 tok/s	FSE=1 tok/s	Δ Throughput	Δ TPOT
4	275	335	+21.6%	-18.0%
8	497	626	+25.9%	-20.6%
16	793	901	+13.6%	-11.8%
32	1380	1586	+14.9%	-12.8%
64	1396	1579	+13.1%	-11.7%
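The Δ Throughput column can be recomputed from the tok/s columns as shown in the sketch below. The helper names are hypothetical. Because the tables list rounded tok/s values, a recomputed percentage can differ from the reported one by a few tenths. The second helper shows the TPOT change that would follow if per-token latency scaled exactly inversely with throughput. That is an approximation, since TPOT was measured separately.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def implied_tpot_change(baseline_tps: float, new_tps: float) -> float:
    """TPOT change in percent if TPOT were exactly 1 / throughput."""
    return (baseline_tps / new_tps - 1.0) * 100.0


# (concurrency, FSE=0 tok/s, FSE=1 tok/s) copied from the TP2 table.
tp2 = [(4, 227, 265), (8, 399, 468), (16, 684, 783),
       (32, 1117, 1244), (64, 1137, 1226)]

for conc, base, fused in tp2:
    print(f"conc={conc:>2} throughput={pct_change(base, fused):+.1f}% "
          f"implied_tpot={implied_tpot_change(base, fused):+.1f}%")
```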

mergify · 2026-06-03T16:40:52Z

Hi @nholmber, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

nholmber · 2026-06-03T16:42:59Z

Tagging @tjtanaa as the reviewer of the earlier related PR for Qwen3-Next

BowenBao · 2026-06-03T23:31:45Z

+
+        is_fse = (
+            rocm_aiter_ops.is_fusion_moe_shared_experts_enabled()
+            and not isinstance(get_current_vllm_config().quant_config, QuarkConfig)


Ideally, I'd suggest having a helper function that checks whether the shared experts and the fused experts share the same quant spec, i.e. whether they can be fused.

nholmber · 2026-06-04

Adjusted. Does it match what you had in mind?

tjtanaa

LGTM

When VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, the shared expert weights need to be remapped from their checkpoint names (shared_expert.gate_proj etc.) to the fused expert slot (experts.{num_routed}.gate_proj) so they load into FusedMoE's fused expert tensor at the correct index. Without this fix, the shared expert weights silently fail to load, producing garbage output.

Changes:
- Import rocm_aiter_ops for FSE flag check
- Increment num_experts by 1 when FSE enabled (shared expert slot)
- Remap shared_expert.* weight names to experts.{num_routed}.*
- Reset is_fused_expert for shared expert weights (they have separate gate_proj/up_proj, not fused gate_up_proj like routed experts)

Validated on Qwen3.5-397B-A17B-FP8 TP2 MI355X:
- Accuracy: FSE=0 98%/98%, FSE=1 94%/94% (GSM8K 5-shot limit=100)
- Perf: +8-17% throughput, -7-15% TPOT across conc 4-64
- Traces confirm E=512 K=10 -> E=513 K=11 fusion active

Signed-off-by: Nico Holmberg <nico.holmberg@amd.com>
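The "E=512 K=10 -> E=513 K=11" trace in the commit message reflects the shared expert becoming one more expert that every token selects. The sketch below shows that bookkeeping with plain lists. It is an assumption-laden illustration, not the AITER kernel's logic: `append_shared_expert` is a hypothetical helper, and the `shared_weight` default is a placeholder, since how the shared expert's weight is determined is not covered in this PR description.

```python
def append_shared_expert(topk_ids, topk_weights, num_routed,
                         shared_weight=1.0):
    """Append the shared expert slot to every token's top-k selection.

    topk_ids / topk_weights hold one row per token with K entries each.
    The shared expert lives at index `num_routed`, so the expert count
    grows from E to E + 1 and each row grows from K to K + 1.
    """
    fused_ids = [row + [num_routed] for row in topk_ids]
    fused_weights = [row + [shared_weight] for row in topk_weights]
    return fused_ids, fused_weights


# One token routed to experts 3 and 7 out of 512 routed experts.
ids, weights = append_shared_expert([[3, 7]], [[0.6, 0.4]], num_routed=512)
print(ids, weights)
```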

nholmber requested review from sighingnow and vadiklyutiy as code owners June 3, 2026 16:39

mergify Bot added qwen Related to Qwen models rocm Related to AMD ROCm labels Jun 3, 2026

github-project-automation Bot added this to AMD Jun 3, 2026

github-project-automation Bot moved this to Todo in AMD Jun 3, 2026

nholmber changed the title ~~fix(rocm): enable shared expert fusion for Qwen3.5~~ [ROCm][Bugfix][Perf] enable shared expert fusion for Qwen3.5 Jun 3, 2026

mergify Bot added the bug Something isn't working label Jun 3, 2026

BowenBao reviewed Jun 3, 2026

nholmber force-pushed the fix/qwen35-fse-weight-loading branch from 7689fc5 to 042ba2d Compare June 4, 2026 05:20

tjtanaa approved these changes Jun 4, 2026

tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 4, 2026

tjtanaa enabled auto-merge (squash) June 4, 2026 14:48

auto-merge was automatically disabled June 4, 2026 18:13
Head branch was pushed to by a user without write access

nholmber force-pushed the fix/qwen35-fse-weight-loading branch from 62b3a8a to 7b3faff Compare June 4, 2026 18:13

nholmber force-pushed the fix/qwen35-fse-weight-loading branch from 7b3faff to 1301d85 Compare June 5, 2026 15:11

nholmber wants to merge 1 commit into vllm-project:main from nholmber:fix/qwen35-fse-weight-loading
