
[ROCm][Bugfix][Perf] enable shared expert fusion for Qwen3.5 #44434

Open: nholmber wants to merge 1 commit into vllm-project:main from nholmber:fix/qwen35-fse-weight-loading

nholmber (Contributor) commented Jun 3, 2026

Purpose

The existing FSE (Fused Shared Expert) support (#39280) works for Qwen3-Next but fails on Qwen3.5 models because qwen3_5.py's load_weights does not remap shared expert checkpoint weights to the fused expert slot. This causes shared expert weights to silently fail to load, producing garbage output when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1.

Root Cause

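When FSE is enabled, Qwen3NextSparseMoeBlock.__init__ (which Qwen3.5 inherits) sets self.shared_expert = None and passes n_shared_experts=1 to FusedMoE, which then handles the shared expert internally as one extra fused expert: with E routed experts at indices 0 to E-1, the shared expert takes index E, for E+1 fused experts in total.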
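To illustrate the slot layout, here is a minimal sketch in plain Python. It assumes the fused path treats the shared expert as one extra, always-selected expert appended to each token's top-k list. The function name `append_shared_expert` and the fixed weight of 1.0 are illustrative assumptions, not the vLLM or AITER API; the real kernel may derive the weight differently (for example from a shared-expert gate).

```python
def append_shared_expert(topk_ids, topk_weights, num_routed, shared_weight=1.0):
    """Extend per-token routing so the shared expert is always selected.

    topk_ids / topk_weights hold one list per token, each of length K.
    The shared expert occupies index `num_routed` (the slot right after the
    last routed expert), so E routed experts become E + 1 fused experts and
    every token activates K + 1 of them.
    Hypothetical helper, for illustration only.
    """
    fused_ids = [ids + [num_routed] for ids in topk_ids]
    fused_weights = [w + [shared_weight] for w in topk_weights]
    return fused_ids, fused_weights


# Toy example: E = 4 routed experts, K = 2 experts per token.
ids, weights = append_shared_expert(
    topk_ids=[[0, 3], [2, 1]],
    topk_weights=[[0.7, 0.3], [0.6, 0.4]],
    num_routed=4,
)
print(ids)      # [[0, 3, 4], [2, 1, 4]]
print(weights)  # [[0.7, 0.3, 1.0], [0.6, 0.4, 1.0]]
```

The same extension is what the trace figures in this PR's commit message describe: E=512, K=10 becomes E=513, K=11.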

Qwen3-Next's load_weights remaps checkpoint names accordingly:

```python
# qwen3_next.py (already works)
if is_fse and "mlp.shared_expert." in name:
    name = name.replace("mlp.shared_expert.", f"mlp.experts.{num_routed}.")
```

But Qwen3.5 overrides load_weights with its own expert loading logic (supporting both fused gate_up_proj and separate gate_proj/up_proj checkpoint formats) and was missing this remapping.

Additionally, Qwen3.5's is_fused_expert flag (set by routed expert weights in fused format) persists across weights, causing the remapped shared expert to enter the wrong loading path.
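The interaction between the remapping and the persistent flag can be sketched with a toy loader. This is not the code in qwen3_5.py: `plan_weight_loads`, the `reset_for_shared` switch, and the example checkpoint names are hypothetical, chosen only to show how a flag set by a fused routed-expert tensor leaks into a later shared expert weight unless it is reset.

```python
def plan_weight_loads(names, num_routed, reset_for_shared=True):
    """Return (target_name, path) for each checkpoint weight name.

    `path` is "fused" when the weight would take the fused gate_up_proj
    loading path and "separate" when it would take the gate_proj/up_proj
    path. Hypothetical helper, for illustration only.
    """
    plan = []
    is_fused_expert = False  # persists across loop iterations
    for name in names:
        # A routed-expert tensor in fused format turns the flag on.
        if "experts.gate_up_proj" in name:
            is_fused_expert = True
        if "mlp.shared_expert." in name:
            # Remap the shared expert to the slot after the routed experts.
            name = name.replace("mlp.shared_expert.", f"mlp.experts.{num_routed}.")
            if reset_for_shared:
                # Shared expert weights use separate gate_proj/up_proj.
                is_fused_expert = False
        plan.append((name, "fused" if is_fused_expert else "separate"))
    return plan


names = [
    "model.layers.0.mlp.experts.gate_up_proj",            # assumed fused format
    "model.layers.0.mlp.shared_expert.gate_proj.weight",  # assumed shared expert
]
print(plan_weight_loads(names, num_routed=4, reset_for_shared=False)[1])
print(plan_weight_loads(names, num_routed=4, reset_for_shared=True)[1])
```

Without the reset, the remapped shared expert weight is routed to the "fused" path even though it is stored as separate projections; with the reset it takes the "separate" path.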

Changes

qwen3_5.py:

  • Add FSE weight name remapping (shared_expert.* → experts.{num_experts}.*)
  • Reset is_fused_expert and expert_params_mapping for shared expert weights (they have separate gate_proj/up_proj, not fused gate_up_proj)

qwen3_next.py:

  • Guard FSE against Quark/MXFP4 quantization with explicit warning when falling back
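A guard of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the helper in the PR: `can_fuse_shared_experts`, the `UNSUPPORTED_FSE_QUANT` set, and the use of plain strings for the quantization method are all hypothetical stand-ins for vLLM's config objects, and treating both "quark" and "mxfp4" as unsupported names is itself an assumption.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical names for quantization methods that cannot share a fused
# expert tensor with the shared expert; the real check lives in vLLM.
UNSUPPORTED_FSE_QUANT = {"quark", "mxfp4"}


def can_fuse_shared_experts(fse_requested: bool, quant_method: Optional[str]) -> bool:
    """Decide whether shared expert fusion should be used.

    Returns False, with a warning, when fusion was requested but the
    quantization method is in the unsupported set, so the caller can fall
    back to the unfused shared expert instead of loading garbage.
    """
    if not fse_requested:
        return False
    if quant_method is not None and quant_method.lower() in UNSUPPORTED_FSE_QUANT:
        logger.warning(
            "Shared expert fusion was requested but is not supported with "
            "%s quantization; falling back to the unfused shared expert.",
            quant_method,
        )
        return False
    return True
```

Keeping the decision in one function means the module constructor and load_weights can call the same check and cannot disagree about whether the shared expert slot exists.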

Test Plan

  • Model: Qwen/Qwen3.5-397B-A17B-FP8
  • Image: vllm/vllm-openai-rocm:nightly-626fa9bba5663a5cf6a870debf031ee344ddb822
  • Hardware: MI355X, ROCm 7.2.2
  • Accuracy: lm_eval GSM8K 5-shot (local-completions, full 1319 samples)
  • Throughput: vllm bench serve, random 1k/1k at conc 4/8/16/32/64

```bash
# Server
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 \
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --attention-backend ROCM_AITER_UNIFIED_ATTN \
  --async-scheduling \
  --language-model-only \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --reasoning-parser qwen3

# lm_eval
lm_eval --model local-completions \
  --model_args 'model=Qwen/Qwen3.5-397B-A17B-FP8,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=50,tokenized_requests=False,max_gen_toks=512' \
  --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 44
```

Test Results

Accuracy (GSM8K 5-shot, full 1319 samples)

| Config | Flexible | Strict |
|---|---|---|
| TP2 FSE=0 | 95.45% ± 0.57 | 95.15% ± 0.59 |
| TP2 FSE=1 | 95.75% ± 0.56 | 95.60% ± 0.56 |
| TP4 FSE=1 | 95.15% ± 0.59 | 95.07% ± 0.60 |

No accuracy degradation: both FSE=1 configurations stay within the reported ± margin of the FSE=0 baseline on both metrics.
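As a sanity check on the ± values, the sketch below recomputes them under the assumption that they are binomial standard errors in percentage points over the 1319 GSM8K samples. That interpretation, and the helper name `binomial_stderr`, are assumptions; lm_eval's exact estimator may differ slightly. The final ratio treats the two runs as independent, which is a simplification since both use the same questions.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a proportion measured on n samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


n = 1319  # full GSM8K test set, as stated in the test plan
flexible = {"TP2 FSE=0": 0.9545, "TP2 FSE=1": 0.9575, "TP4 FSE=1": 0.9515}

for label, acc in flexible.items():
    print(f"{label}: {100 * acc:.2f}% ± {100 * binomial_stderr(acc, n):.2f}")

# Difference between FSE=1 and FSE=0 at TP2, in units of the combined
# standard error (independence assumed for simplicity).
se0 = binomial_stderr(flexible["TP2 FSE=0"], n)
se1 = binomial_stderr(flexible["TP2 FSE=1"], n)
z = (flexible["TP2 FSE=1"] - flexible["TP2 FSE=0"]) / math.hypot(se0, se1)
print(f"difference in combined standard errors: {z:.2f}")
```

Under these assumptions the recomputed margins round to the same 0.57, 0.56 and 0.59 shown in the Flexible column, and the TP2 difference is well below one combined standard error.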

Throughput — TP2 (1k/1k random)

| Conc | FSE=0 tok/s | FSE=1 tok/s | Δ Throughput | Δ TPOT |
|---|---|---|---|---|
| 4 | 227 | 265 | +16.4% | -14.5% |
| 8 | 399 | 468 | +17.3% | -14.7% |
| 16 | 684 | 783 | +14.5% | -13.0% |
| 32 | 1117 | 1244 | +11.4% | -10.6% |
| 64 | 1137 | 1226 | +7.8% | -7.4% |

Throughput — TP4 (1k/1k random)

| Conc | FSE=0 tok/s | FSE=1 tok/s | Δ Throughput | Δ TPOT |
|---|---|---|---|---|
| 4 | 275 | 335 | +21.6% | -18.0% |
| 8 | 497 | 626 | +25.9% | -20.6% |
| 16 | 793 | 901 | +13.6% | -11.8% |
| 32 | 1380 | 1586 | +14.9% | -12.8% |
| 64 | 1396 | 1579 | +13.1% | -11.7% |
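For readers who want to relate the two delta columns, the sketch below recomputes the throughput change from the rounded tok/s figures and the TPOT change that would follow if time per output token scaled exactly as the inverse of throughput. Both helper names are hypothetical, and the inverse-scaling relation is an assumption; because the inputs are rounded, results can differ from the reported deltas by a few tenths of a percentage point.

```python
def relative_change(baseline: float, fused: float) -> float:
    """Percentage change of `fused` relative to `baseline`."""
    return 100.0 * (fused - baseline) / baseline


def implied_tpot_change(baseline_tps: float, fused_tps: float) -> float:
    """TPOT change if per-token time scaled exactly as 1 / throughput."""
    return 100.0 * (baseline_tps / fused_tps - 1.0)


# TP2 rows from the results: concurrency -> (FSE=0 tok/s, FSE=1 tok/s)
tp2 = {4: (227, 265), 8: (399, 468), 16: (684, 783), 32: (1117, 1244), 64: (1137, 1226)}

for conc, (base, fused) in tp2.items():
    print(
        f"conc {conc:>2}: throughput {relative_change(base, fused):+.1f}%, "
        f"implied TPOT {implied_tpot_change(base, fused):+.1f}%"
    )
```

At concurrency 8, for instance, this gives +17.3% throughput and an implied TPOT change of -14.7%, in line with the reported row.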

mergify bot added the qwen and rocm labels Jun 3, 2026
github-project-automation bot moved this to Todo in AMD Jun 3, 2026
mergify bot (Contributor) commented Jun 3, 2026

Hi @nholmber, the pre-commit checks have failed. Please run:

```bash
# Quote the requirement so the shell does not treat ">=" as a redirect
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
```bash
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
```

nholmber (Contributor, Author) commented Jun 3, 2026

Tagging @tjtanaa as the reviewer of the earlier related PR for Qwen3-Next

nholmber changed the title from "fix(rocm): enable shared expert fusion for Qwen3.5" to "[ROCm][Bugfix][Perf] enable shared expert fusion for Qwen3.5" Jun 3, 2026
mergify bot added the bug label Jun 3, 2026
Comment thread vllm/model_executor/models/qwen3_5.py Outdated

is_fse = (
rocm_aiter_ops.is_fusion_moe_shared_experts_enabled()
and not isinstance(get_current_vllm_config().quant_config, QuarkConfig)
Contributor commented:

Ideally, I'd suggest a helper function that checks whether the shared experts and the fused experts share the same quant spec, i.e., whether they can be fused.

nholmber (Contributor, Author) replied:

Adjusted. Does it match what you had in mind?

@nholmber nholmber force-pushed the fix/qwen35-fse-weight-loading branch from 7689fc5 to 042ba2d Compare June 4, 2026 05:20
tjtanaa (Member) left a comment:

LGTM

tjtanaa added the ready label Jun 4, 2026
tjtanaa enabled auto-merge (squash) June 4, 2026 14:48
auto-merge was automatically disabled June 4, 2026 18:13 (head branch was pushed to by a user without write access)

@nholmber nholmber force-pushed the fix/qwen35-fse-weight-loading branch from 62b3a8a to 7b3faff Compare June 4, 2026 18:13
When VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, the shared expert
weights need to be remapped from their checkpoint names
(shared_expert.gate_proj etc.) to the fused expert slot
(experts.{num_routed}.gate_proj) so they load into FusedMoE's fused
expert tensor at the correct index.

Without this fix, the shared expert weights silently fail to load,
producing garbage output.

Changes:
- Import rocm_aiter_ops for FSE flag check
- Increment num_experts by 1 when FSE enabled (shared expert slot)
- Remap shared_expert.* weight names to experts.{num_routed}.*
- Reset is_fused_expert for shared expert weights (they have separate
  gate_proj/up_proj, not fused gate_up_proj like routed experts)

Validated on Qwen3.5-397B-A17B-FP8 TP2 MI355X:
- Accuracy: FSE=0 98%/98%, FSE=1 94%/94% (GSM8K 5-shot limit=100)
- Perf: +8-17% throughput, -7-15% TPOT across conc 4-64
- Traces confirm E=512 K=10 -> E=513 K=11 fusion active

Signed-off-by: Nico Holmberg <nico.holmberg@amd.com>
@nholmber nholmber force-pushed the fix/qwen35-fse-weight-loading branch from 7b3faff to 1301d85 Compare June 5, 2026 15:11
Labels

bug, qwen, ready, rocm

Projects

Status: Todo

3 participants