
[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 #20736

Merged
HaiShaw merged 42 commits into sgl-project:main from zhentaocc:fuse_share_expert on Apr 15, 2026.

Conversation

@zhentaocc (Contributor) commented Mar 17, 2026:

Motivation

Qwen2 MoE and Qwen3.5 MoE models use a shared expert in addition to the routed experts. When shared_expert_intermediate_size == moe_intermediate_size, the shared expert can be fused with the routed experts so that each token is dispatched to its top-k routed experts plus one shared expert (top_k + 1) in a single fused MoE dispatch, reducing kernel launches and improving inference efficiency. This PR adds shared expert fusion support for Qwen2 MoE (when using Aiter on ROCm/HIP) and improves Qwen3.5 MoE weight loading to correctly handle the fused shared expert layout.
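Conceptually, the fusion just appends one extra expert slot per token before the fused MoE kernel runs. A minimal PyTorch sketch of that idea follows; the function name, shapes, and arguments are illustrative assumptions, not the PR's actual helpers.

```python
import torch

def append_shared_expert(
    topk_ids: torch.Tensor,        # [num_tokens, top_k] routed expert indices
    topk_weights: torch.Tensor,    # [num_tokens, top_k] routing weights
    shared_expert_id: int,         # expert slot assigned to the shared expert
    shared_gate_out: torch.Tensor, # [num_tokens, 1], e.g. sigmoid(shared_expert_gate(h))
) -> tuple[torch.Tensor, torch.Tensor]:
    """Extend each token's routing to top_k + 1 slots by appending the shared expert."""
    num_tokens = topk_ids.shape[0]
    shared_ids = torch.full(
        (num_tokens, 1), shared_expert_id,
        dtype=topk_ids.dtype, device=topk_ids.device,
    )
    # A single fused MoE dispatch then covers the routed experts plus the shared one.
    return (
        torch.cat([topk_ids, shared_ids], dim=1),
        torch.cat([topk_weights, shared_gate_out], dim=1),
    )
```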

Modifications

python/sglang/srt/models/qwen2_moe.py

  • _determine_num_fused_shared_experts(): New helper that returns 1 when shared expert fusion is enabled (requires shared_expert_intermediate_size == moe_intermediate_size, fusion not disabled via --disable-shared-experts-fusion, and SGLANG_USE_AITER=1 on HIP); a condensed sketch of this gating follows the list.
  • _get_shared_expert_weights(): Returns sigmoid(shared_expert_gate(hidden_states)) for the fused shared expert weights.
  • _append_shared_to_topk_output(): Appends shared expert IDs and weights to the top-k output before the fused MoE forward.
  • _forward_router_experts(): After top-k selection on gate logits, appends shared expert via _append_shared_to_topk_output() when fusion is enabled.
  • Experts and TopK: top_k and num_experts now include num_fused_shared_experts when fusion is active.
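As a rough approximation of that gating (the config attribute names follow the Qwen MoE Hugging Face config; the argument names and exact checks are assumptions based on the description above, not the file's actual signature):

```python
import os

def determine_num_fused_shared_experts(config, disable_shared_experts_fusion: bool,
                                       is_hip: bool) -> int:
    """Return 1 if the shared expert can occupy an extra fused expert slot, else 0."""
    if disable_shared_experts_fusion:  # --disable-shared-experts-fusion
        return 0
    # The shared expert must match the routed experts' intermediate size,
    # otherwise it cannot share the same grouped-GEMM weight layout.
    if config.shared_expert_intermediate_size != config.moe_intermediate_size:
        return 0
    # Only enabled with the Aiter backend on ROCm/HIP.
    if not (is_hip and os.environ.get("SGLANG_USE_AITER") == "1"):
        return 0
    return 1
```

Returning a count rather than a bool keeps the downstream arithmetic uniform: top_k and num_experts are each simply increased by num_fused_shared_experts.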

python/sglang/srt/models/qwen3_5.py

  • _get_num_fused_shared_experts(): New helper used by Qwen3_5MoeForConditionalGeneration to obtain num_fused_shared_experts from the first layer’s MLP.
  • load_weights (Qwen3_5MoeForConditionalGeneration):
    • Applies the same num_experts adjustment to the expert parameter mapping.
    • Remaps mlp.shared_expert.* to mlp.experts.{num_experts_base}.* when fusion is enabled (see the remapping sketch after this list).
    • Adds fused_expert_params_mapping entries for shared expert (gate_proj, up_proj, down_proj, and combined gate_up_proj).
    • Handles both separate (gate_proj/up_proj) and combined (gate_up_proj) checkpoint layouts for the shared expert.
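A hypothetical sketch of that remapping, plus splitting a combined gate_up_proj tensor into the separate shards the fused expert parameter expects (the names and the dim-0 stacking convention are assumptions; the real loader goes through sglang's expert parameter mapping tables):

```python
import torch

def remap_shared_expert_name(name: str, num_routed_experts: int) -> str:
    """Map the shared expert's checkpoint names onto the extra fused expert slot,
    e.g. '...mlp.shared_expert.gate_proj.weight'
      -> '...mlp.experts.{num_routed_experts}.gate_proj.weight'."""
    return name.replace("mlp.shared_expert.", f"mlp.experts.{num_routed_experts}.")

def split_gate_up(gate_up_proj: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a combined gate_up_proj tensor into gate_proj and up_proj halves,
    assuming they are stacked along dim 0 as in common Qwen checkpoints."""
    gate_proj, up_proj = gate_up_proj.chunk(2, dim=0)
    return gate_proj, up_proj
```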

Accuracy Tests

Model: Qwen/Qwen3.5-397B-A17B
With fusion enabled:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9682 | ± 0.0048 |
|       |         | strict-match     | 5      | exact_match | 0.9697 | ± 0.0047 |

With fusion disabled:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9697 | ± 0.0047 |
|       |         | strict-match     | 5      | exact_match | 0.9712 | ± 0.0046 |

Benchmarking

Before

sglang serve \
    --attention-backend triton \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --trust-remote-code \
    --mem-fraction-static 0.8
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  192.69    
Total input tokens:                      147038    
Total input text tokens:                 147038    
Total generated tokens:                  147952    
Total generated tokens (retokenized):    147623    
Request throughput (req/s):              0.83      
Input token throughput (tok/s):          763.08    
Output token throughput (tok/s):         767.82    
Peak output token throughput (tok/s):    864.00    
Peak concurrent requests:                21        
Total token throughput (tok/s):          1530.90   
Concurrency:                             15.39     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18535.68  
Median E2E Latency (ms):                 18662.69  
P90 E2E Latency (ms):                    20331.16  
P99 E2E Latency (ms):                    20857.56  
---------------Time to First Token----------------
Mean TTFT (ms):                          128.35    
Median TTFT (ms):                        97.52     
P99 TTFT (ms):                           409.59    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.92     
Median TPOT (ms):                        20.06     
P99 TPOT (ms):                           20.38     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.93     
Median ITL (ms):                         18.77     
P95 ITL (ms):                            19.03     
P99 ITL (ms):                            99.89     
Max ITL (ms):                            405.48    
==================================================

After


============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  184.30    
Total input tokens:                      147038    
Total input text tokens:                 147038    
Total generated tokens:                  147952    
Total generated tokens (retokenized):    147620    
Request throughput (req/s):              0.87      
Input token throughput (tok/s):          797.83    
Output token throughput (tok/s):         802.79    
Peak output token throughput (tok/s):    896.00    
Peak concurrent requests:                21        
Total token throughput (tok/s):          1600.62   
Concurrency:                             15.39     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17732.18  
Median E2E Latency (ms):                 17863.97  
P90 E2E Latency (ms):                    19450.08  
P99 E2E Latency (ms):                    19969.85  
---------------Time to First Token----------------
Mean TTFT (ms):                          124.12    
Median TTFT (ms):                        93.28     
P99 TTFT (ms):                           406.08    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.06     
Median TPOT (ms):                        19.18     
P99 TPOT (ms):                           19.49     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.06     
Median ITL (ms):                         17.94     
P95 ITL (ms):                            18.25     
P99 ITL (ms):                            95.79     
Max ITL (ms):                            414.58    
==================================================
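Net effect in this configuration: total token throughput improves from 1530.90 to 1600.62 tok/s (about +4.6%), and mean TPOT drops from 19.92 ms to 19.06 ms (about -4.3%), consistent with eliminating the separate shared-expert computation.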

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

zhentaocc force-pushed the fuse_share_expert branch 4 times, most recently from 25fd0a0 to bdf97b2, on March 23, 2026 13:14.
zhentaocc force-pushed the fuse_share_expert branch 2 times, most recently from d144606 to 0daf8a1, on March 24, 2026 06:51.
zhentaocc changed the title from "[AMD] Enable share expert fusion with router experts for Qwen3.5" to "[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16" on Mar 24, 2026.
hubertlu-tw requested a review from yichiche on March 24, 2026 21:29.
zhentaocc force-pushed the fuse_share_expert branch 3 times, most recently from 3a072ef to 3d25f7e, on March 27, 2026 03:05.
@yichiche (Collaborator) commented:

@zhentaocc Please fix the lint issue; I will kick off CI again once it's done.

yichiche self-assigned this Mar 27, 2026.
@zhentaocc (Contributor, Author) commented:

> @zhentaocc Please fix the lint issue; I will kick off CI again once it's done.

Done.

Review thread on python/sglang/srt/models/qwen2_moe.py, lines +267 to +268 (outdated):
```python
or not _use_aiter
or quant_config is not None
```
@zhentaocc (Contributor, Author):

This PR only works on BF16 for now, for easier review. I also need some time to resolve an accuracy issue in FP8 weight-scale loading. I can raise a new PR to address FP8 and remove line 268. @HaiShaw @yichiche

@zhentaocc (Contributor, Author) commented Mar 27, 2026:

Additional work we might collaborate on:

  • FP8/MXFP4 support
  • Qwen3.5 has a separate shared_expert_gate, which I am also trying to fuse with gate_proj.

@yichiche (Collaborator) commented Mar 31, 2026:

FP8 accuracy issue identified; it will need an aiter upgrade to fix the split_k issue.

yichiche and others added 16 commits April 7, 2026 11:45
- Eliminated redundant weight mappings for `gate_proj` and `up_proj` in the fused expert parameters, streamlining the weight loading process.
- This change enhances code clarity and reduces complexity while maintaining existing functionality.
- Consolidated the initialization of the `num_experts` variable to improve clarity and consistency in weight loading processes.
- Updated references to `num_experts` throughout the code to ensure accurate mapping of shared experts when fused, enhancing the overall functionality of the model.
- Added comments to clarify the logic for loading fused expert weights, improving code maintainability.
- Simplified the weight loading process by removing conditional checks for `num_experts` related to fused MoE, ensuring a more straightforward implementation.
- Enhanced code clarity and maintainability by streamlining the parameters passed during weight loading.
- Introduced a new function `can_fuse_shared_expert` to determine if shared experts can be fused based on configuration and server arguments.
- Updated the initialization of `enable_shared_expert_fusion` and `num_fused_shared_experts` to reflect the new fusion logic.
- Refactored related code sections to ensure correct handling of shared experts during weight loading and processing, improving overall model functionality and maintainability.
- Enhanced comments to specify loading behavior for `down_proj`, `gate_proj`, and `up_proj` in the weight loading process.
- Improved code documentation to aid understanding of expert weight handling in the model.
- Updated the logic for determining the number of shared experts based on configuration settings, allowing for more flexible expert handling.
- Defaulted `enable_shared_expert_fusion` to False and adjusted its initialization to depend on the `_use_aiter` flag, improving clarity and maintainability of the code.
- Enhanced comments to clarify the conditions under which shared expert fusion is enabled.
- Adjusted the initialization of `num_shared_experts` to ensure it defaults to 0 when no configuration is provided, enhancing clarity and robustness.
- Improved the handling of shared expert configuration settings, allowing for more flexible expert management in the model.
- Cleaned up the initialization logic for `num_shared_experts` and `enable_shared_expert_fusion`, improving code clarity and maintainability.
- Enhanced comments to clarify the conditions for shared expert configuration, ensuring better understanding of the model's behavior.
- Updated the initialization logic for `num_shared_experts` to use `hasattr` for better attribute checking, enhancing robustness and clarity.
- Improved conditions for determining shared expert settings, ensuring more flexible configuration handling in the model.
- Updated the logic for calculating the total number of experts by directly calling `get_global_server_args().ep_num_redundant_experts`, improving code clarity and maintainability.
- Enhanced the initialization of the `experts` attribute to streamline the configuration process for expert management in the model.
zhentaocc force-pushed the fuse_share_expert branch from d9e2f77 to c2dafde on April 7, 2026 03:45.
yichiche changed the title from "[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16" to "[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8" on Apr 7, 2026.
@HaiShaw (Collaborator) commented Apr 9, 2026:

/tag-and-rerun-ci

hubertlu-tw added a commit to hubertlu-tw/sglang that referenced this pull request Apr 14, 2026
@HaiShaw (Collaborator) commented Apr 14, 2026:

@amd-bot ci-status

@amd-bot commented Apr 14, 2026:

@HaiShaw

CI Status for PR #20736

PR: [AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8
Changed files: python/sglang/srt/models/qwen2_moe.py (+108/-5), python/sglang/srt/models/qwen3_5.py (+110/-3)

AMD: 9 failures (0 likely related) | Others: 15 failures (0 related)

The PR adds shared expert fusion for Qwen3.5 MoE models, gated behind _use_aiter (requires SGLANG_USE_AITER=true + ROCm). The support_shared_expert_fusion=True flag is only set in qwen3_5.py, not in qwen3.py or qwen2_moe.py callers. No CI test exercises Qwen3.5 in this run, so the new code path is never activated by any failing test.

AMD CI Failures

| Job | Error | Related? | Explanation | Log |
|-----|-------|----------|-------------|-----|
| small-amd (4) | Memory access fault by GPU node-2 in store_kvcache (LLaDA2) | 🟢 Unlikely | GPU memory fault on LLaDA2 model, unrelated to MoE fusion | Log |
| small-amd (8) | Memory access fault by GPU node-2 during CUDA graph capture (Qwen3-30B-A3B) → 30-min timeout | 🟢 Unlikely | Qwen3-30B-A3B uses qwen3.py (not qwen3_5.py), so support_shared_expert_fusion is not set; fault is in the store_kvcache JIT kernel, same pattern as other GPU faults | Log |
| small-amd (9) | Health check failed + watchdog timeout (Qwen2.5-VL-3B) → 30-min timeout | 🟢 Unlikely | VLM model (not MoE), server hung in store_kvcache | Log |
| small-amd (10) | test_lora_load_from_tensor — scheduler crashed with exit code -6 | 🟢 Unlikely | LoRA test on Llama-3.1-8B, unrelated to MoE | Log |
| small-amd (11) | AssertionError: 87.015 not less than 86 — TTFT threshold | 🟢 Unlikely | Flaky perf test (~1 ms over threshold), unrelated to MoE | Log |
| small-amd (12) | Memory access fault by GPU node-2 in store_kvcache (LLaDA2) | 🟢 Unlikely | Same GPU memory fault pattern as partition 4 | Log |
| small-amd (13) | test_transformers_models / test_int4fp8_moe → 30-min timeout | 🟢 Unlikely | Tests on transformers models, server hung in store_kvcache | Log |
| nondeterministic | test_reward_models — scheduler hung in store_kvcache (Qwen3 classification) | 🟢 Unlikely | Reward model test, same GPU hang pattern | Log |
| large-amd (1) | Memory access fault by GPU node-2 during score API (Qwen3-30B-A3B) → timeout | 🟢 Unlikely | Same store_kvcache fault pattern; uses qwen3.py, not qwen3_5.py | Log |
| mi35x | test_mxfp4_20b — 1200s timeout | 🟢 Unlikely | Hardware timeout on MI350X, unrelated to MoE fusion | Log |

Other CI Failures

| Job | Error | Related? | Explanation | Log |
|-----|-------|----------|-------------|-----|
| 1-gpu-large (4) | test_vlm_input_format.py killed (SIGKILL) | 🟢 Unlikely | VLM test OOM/infra issue on Nvidia runner | Log |
| 1-gpu-large (6-13), 1-gpu-small (6,7) | Fast-fail: skipping — root cause: stage-b-test-1-gpu-large (4) | 🟢 Unlikely | Cascade from partition 4 failure — not real failures | Log |
| build-test (all) | rmsnorm_cpu error: input must be a 2D tensor (Intel AMX) | 🟢 Unlikely | Intel AMX backend kernel error, unrelated | Log |
| build-and-test | UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY (Intel XPU) | 🟢 Unlikely | XPU OOM during model loading, unrelated | Log |
| multimodal-gen-test-1-npu-a3 | Diffusion latency above threshold (flux + wan2.1) | 🟢 Unlikely | NPU perf test, unrelated to MoE | Log |
| multimodal-gen-test-8-npu-a3 | Diffusion latency 3-4x over threshold (wan2.2 14B 8-NPU) | 🟢 Unlikely | NPU perf test, unrelated to MoE | Log |

Details

No failures are related to this PR. The PR's new code path (enable_shared_expert_fusion) is only activated when:

  1. Running on ROCm with SGLANG_USE_AITER=true
  2. Using a Qwen3.5 model (the only model that passes support_shared_expert_fusion=True)
  3. disable_shared_experts_fusion is not set

No CI test in this run exercises a Qwen3.5 model. The Qwen3-30B-A3B tests (partitions 8, large-1) use qwen3.py, which does not set support_shared_expert_fusion=True.
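For clarity, the activation gate the report describes reduces to a conjunction of the three conditions above (a sketch only; the argument names mirror the report, not the actual source):

```python
def shared_fusion_active(use_aiter: bool,
                         support_shared_expert_fusion: bool,
                         disable_shared_experts_fusion: bool) -> bool:
    # All three conditions must hold: Aiter on ROCm, a model that opts in
    # (per this report, only qwen3_5.py sets the flag), and no explicit
    # --disable-shared-experts-fusion.
    return (use_aiter
            and support_shared_expert_fusion
            and not disable_shared_experts_fusion)
```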

AMD failures: 6 of 10 share the same pattern — Memory access fault by GPU node-2 in the store_kvcache JIT kernel across multiple unrelated models (LLaDA2, Qwen3-30B, Qwen3-classification, Qwen2.5-VL, Llava). This is a pre-existing ROCm infrastructure issue on the MI325 runners, not a code regression. The remaining AMD failures are a flaky perf threshold (partition 11), a LoRA test crash (partition 10), and an MI350X timeout.

Nvidia failures: All 10 failed jobs cascade from a single OOM/SIGKILL in partition 4 (test_vlm_input_format.py). The fast-fail mechanism skipped the rest.

Other failures: Intel AMX kernel bug, Intel XPU OOM, and NPU perf threshold violations — all unrelated to MoE code.

Generated by amd-bot using Claude Code CLI

HaiShaw merged commit ea05ea5 into sgl-project:main on Apr 15, 2026.
95 of 141 checks passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
… & FP8 (sgl-project#20736)

Co-authored-by: Chen, Todd <zhenchen@amd.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
