
[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 #20736

Merged
HaiShaw merged 42 commits into sgl-project:main from zhentaocc:fuse_share_expert on Apr 15, 2026.

Conversation

@zhentaocc (Contributor) commented Mar 17, 2026:

Motivation

Qwen2 MoE and Qwen3.5 MoE models use a shared expert in addition to the routed experts. When shared_expert_intermediate_size == moe_intermediate_size, the shared expert can be fused with the routed experts so that each token is dispatched to its top-k routed experts plus one shared expert (top_k + 1) in a single fused MoE dispatch, reducing kernel launches and improving inference efficiency. This PR adds shared expert fusion support for Qwen2 MoE (when using Aiter on ROCm/HIP) and improves Qwen3.5 MoE weight loading to correctly handle the fused shared expert layout.
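Conceptually, the fusion just appends one extra expert slot per token before the fused MoE kernel runs. A minimal PyTorch sketch of that idea follows; the function name, shapes, and arguments are illustrative assumptions, not the PR's actual helpers.

```python
import torch

def append_shared_expert(
    topk_ids: torch.Tensor,        # [num_tokens, top_k] routed expert indices
    topk_weights: torch.Tensor,    # [num_tokens, top_k] routing weights
    shared_expert_id: int,         # expert slot assigned to the shared expert
    shared_gate_out: torch.Tensor, # [num_tokens, 1], e.g. sigmoid(shared_expert_gate(h))
) -> tuple[torch.Tensor, torch.Tensor]:
    """Extend each token's routing to top_k + 1 slots by appending the shared expert."""
    num_tokens = topk_ids.shape[0]
    shared_ids = torch.full(
        (num_tokens, 1), shared_expert_id,
        dtype=topk_ids.dtype, device=topk_ids.device,
    )
    # A single fused MoE dispatch then covers the routed experts plus the shared one.
    return (
        torch.cat([topk_ids, shared_ids], dim=1),
        torch.cat([topk_weights, shared_gate_out], dim=1),
    )
```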

Modifications

python/sglang/srt/models/qwen2_moe.py

  • _determine_num_fused_shared_experts(): New helper that returns 1 when shared expert fusion is enabled (requires shared_expert_intermediate_size == moe_intermediate_size, fusion not disabled via --disable-shared-experts-fusion, and SGLANG_USE_AITER=1 on HIP); a condensed sketch of this gating follows the list.
  • _get_shared_expert_weights(): Returns sigmoid(shared_expert_gate(hidden_states)) for the fused shared expert weights.
  • _append_shared_to_topk_output(): Appends shared expert IDs and weights to the top-k output before the fused MoE forward.
  • _forward_router_experts(): After top-k selection on gate logits, appends shared expert via _append_shared_to_topk_output() when fusion is enabled.
  • Experts and TopK: top_k and num_experts now include num_fused_shared_experts when fusion is active.
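As a rough approximation of that gating (the config attribute names follow the Qwen MoE Hugging Face config; the argument names and exact checks are assumptions based on the description above, not the file's actual signature):

```python
import os

def determine_num_fused_shared_experts(config, disable_shared_experts_fusion: bool,
                                       is_hip: bool) -> int:
    """Return 1 if the shared expert can occupy an extra fused expert slot, else 0."""
    if disable_shared_experts_fusion:  # --disable-shared-experts-fusion
        return 0
    # The shared expert must match the routed experts' intermediate size,
    # otherwise it cannot share the same grouped-GEMM weight layout.
    if config.shared_expert_intermediate_size != config.moe_intermediate_size:
        return 0
    # Only enabled with the Aiter backend on ROCm/HIP.
    if not (is_hip and os.environ.get("SGLANG_USE_AITER") == "1"):
        return 0
    return 1
```

Returning a count rather than a bool keeps the downstream arithmetic uniform: top_k and num_experts are each simply increased by num_fused_shared_experts.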

python/sglang/srt/models/qwen3_5.py

  • _get_num_fused_shared_experts(): New helper used by Qwen3_5MoeForConditionalGeneration to obtain num_fused_shared_experts from the first layer’s MLP.
  • load_weights (Qwen3_5MoeForConditionalGeneration):
    • Applies the same num_experts adjustment to the expert parameter mapping.
    • Remaps mlp.shared_expert.* to mlp.experts.{num_experts_base}.* when fusion is enabled (see the remapping sketch after this list).
    • Adds fused_expert_params_mapping entries for shared expert (gate_proj, up_proj, down_proj, and combined gate_up_proj).
    • Handles both separate (gate_proj/up_proj) and combined (gate_up_proj) checkpoint layouts for the shared expert.
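A hypothetical sketch of that remapping, plus splitting a combined gate_up_proj tensor into the separate shards the fused expert parameter expects (the names and the dim-0 stacking convention are assumptions; the real loader goes through sglang's expert parameter mapping tables):

```python
import torch

def remap_shared_expert_name(name: str, num_routed_experts: int) -> str:
    """Map the shared expert's checkpoint names onto the extra fused expert slot,
    e.g. '...mlp.shared_expert.gate_proj.weight'
      -> '...mlp.experts.{num_routed_experts}.gate_proj.weight'."""
    return name.replace("mlp.shared_expert.", f"mlp.experts.{num_routed_experts}.")

def split_gate_up(gate_up_proj: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a combined gate_up_proj tensor into gate_proj and up_proj halves,
    assuming they are stacked along dim 0 as in common Qwen checkpoints."""
    gate_proj, up_proj = gate_up_proj.chunk(2, dim=0)
    return gate_proj, up_proj
```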

Accuracy Tests

Model: Qwen/Qwen3.5-397B-A17B
With fusion enabled:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9682 | ± 0.0048 |
|       |         | strict-match     | 5      | exact_match | 0.9697 | ± 0.0047 |

With fusion disabled:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9697 | ± 0.0047 |
|       |         | strict-match     | 5      | exact_match | 0.9712 | ± 0.0046 |

Benchmarking

Before

sglang serve \
    --attention-backend triton \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --trust-remote-code \
    --mem-fraction-static 0.8
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  192.69    
Total input tokens:                      147038    
Total input text tokens:                 147038    
Total generated tokens:                  147952    
Total generated tokens (retokenized):    147623    
Request throughput (req/s):              0.83      
Input token throughput (tok/s):          763.08    
Output token throughput (tok/s):         767.82    
Peak output token throughput (tok/s):    864.00    
Peak concurrent requests:                21        
Total token throughput (tok/s):          1530.90   
Concurrency:                             15.39     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18535.68  
Median E2E Latency (ms):                 18662.69  
P90 E2E Latency (ms):                    20331.16  
P99 E2E Latency (ms):                    20857.56  
---------------Time to First Token----------------
Mean TTFT (ms):                          128.35    
Median TTFT (ms):                        97.52     
P99 TTFT (ms):                           409.59    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.92     
Median TPOT (ms):                        20.06     
P99 TPOT (ms):                           20.38     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.93     
Median ITL (ms):                         18.77     
P95 ITL (ms):                            19.03     
P99 ITL (ms):                            99.89     
Max ITL (ms):                            405.48    
==================================================

After


============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  184.30    
Total input tokens:                      147038    
Total input text tokens:                 147038    
Total generated tokens:                  147952    
Total generated tokens (retokenized):    147620    
Request throughput (req/s):              0.87      
Input token throughput (tok/s):          797.83    
Output token throughput (tok/s):         802.79    
Peak output token throughput (tok/s):    896.00    
Peak concurrent requests:                21        
Total token throughput (tok/s):          1600.62   
Concurrency:                             15.39     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   17732.18  
Median E2E Latency (ms):                 17863.97  
P90 E2E Latency (ms):                    19450.08  
P99 E2E Latency (ms):                    19969.85  
---------------Time to First Token----------------
Mean TTFT (ms):                          124.12    
Median TTFT (ms):                        93.28     
P99 TTFT (ms):                           406.08    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.06     
Median TPOT (ms):                        19.18     
P99 TPOT (ms):                           19.49     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.06     
Median ITL (ms):                         17.94     
P95 ITL (ms):                            18.25     
P99 ITL (ms):                            95.79     
Max ITL (ms):                            414.58    
==================================================
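Net effect in this configuration: total token throughput improves from 1530.90 to 1600.62 tok/s (about +4.6%), and mean TPOT drops from 19.92 ms to 19.06 ms (about -4.3%), consistent with eliminating the separate shared-expert computation.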

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

zhentaocc force-pushed the fuse_share_expert branch 4 times, most recently from 25fd0a0 to bdf97b2, on March 23, 2026 13:14.
zhentaocc force-pushed the fuse_share_expert branch 2 times, most recently from d144606 to 0daf8a1, on March 24, 2026 06:51.
zhentaocc changed the title from "[AMD] Enable share expert fusion with router experts for Qwen3.5" to "[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16" on Mar 24, 2026.
hubertlu-tw requested a review from yichiche on March 24, 2026 21:29.
zhentaocc force-pushed the fuse_share_expert branch 3 times, most recently from 3a072ef to 3d25f7e, on March 27, 2026 03:05.
@yichiche (Collaborator) commented:

@zhentaocc Please fix the lint issue; I will kick off CI again once it's done.

yichiche self-assigned this Mar 27, 2026.
@zhentaocc (Contributor, Author) commented:

> @zhentaocc Please fix the lint issue; I will kick off CI again once it's done.

Done.

Review thread on python/sglang/srt/models/qwen2_moe.py, lines +267 to +268 (outdated):
```python
or not _use_aiter
or quant_config is not None
```
@zhentaocc (Contributor, Author):

This PR only works on BF16 for now, for easier review. I also need some time to resolve an accuracy issue in FP8 weight-scale loading. I can raise a new PR to address FP8 and remove line 268. @HaiShaw @yichiche

@zhentaocc (Contributor, Author) commented Mar 27, 2026:

Additional work we might collaborate on:

  • FP8/MXFP4 support
  • Qwen3.5 has a separate shared_expert_gate, which I am also trying to fuse with gate_proj.

@yichiche (Collaborator) commented Mar 31, 2026:

FP8 accuracy issue identified; it will need an aiter upgrade to fix the split_k issue.

yichiche and others added 16 commits April 7, 2026 11:45
- Eliminated redundant weight mappings for `gate_proj` and `up_proj` in the fused expert parameters, streamlining the weight loading process.
- This change enhances code clarity and reduces complexity while maintaining existing functionality.
- Consolidated the initialization of the `num_experts` variable to improve clarity and consistency in weight loading processes.
- Updated references to `num_experts` throughout the code to ensure accurate mapping of shared experts when fused, enhancing the overall functionality of the model.
- Added comments to clarify the logic for loading fused expert weights, improving code maintainability.
- Simplified the weight loading process by removing conditional checks for `num_experts` related to fused MoE, ensuring a more straightforward implementation.
- Enhanced code clarity and maintainability by streamlining the parameters passed during weight loading.
- Introduced a new function `can_fuse_shared_expert` to determine if shared experts can be fused based on configuration and server arguments.
- Updated the initialization of `enable_shared_expert_fusion` and `num_fused_shared_experts` to reflect the new fusion logic.
- Refactored related code sections to ensure correct handling of shared experts during weight loading and processing, improving overall model functionality and maintainability.
- Enhanced comments to specify loading behavior for `down_proj`, `gate_proj`, and `up_proj` in the weight loading process.
- Improved code documentation to aid understanding of expert weight handling in the model.
- Updated the logic for determining the number of shared experts based on configuration settings, allowing for more flexible expert handling.
- Defaulted `enable_shared_expert_fusion` to False and adjusted its initialization to depend on the `_use_aiter` flag, improving clarity and maintainability of the code.
- Enhanced comments to clarify the conditions under which shared expert fusion is enabled.
- Adjusted the initialization of `num_shared_experts` to ensure it defaults to 0 when no configuration is provided, enhancing clarity and robustness.
- Improved the handling of shared expert configuration settings, allowing for more flexible expert management in the model.
- Cleaned up the initialization logic for `num_shared_experts` and `enable_shared_expert_fusion`, improving code clarity and maintainability.
- Enhanced comments to clarify the conditions for shared expert configuration, ensuring better understanding of the model's behavior.
- Updated the initialization logic for `num_shared_experts` to use `hasattr` for better attribute checking, enhancing robustness and clarity.
- Improved conditions for determining shared expert settings, ensuring more flexible configuration handling in the model.
- Updated the logic for calculating the total number of experts by directly calling `get_global_server_args().ep_num_redundant_experts`, improving code clarity and maintainability.
- Enhanced the initialization of the `experts` attribute to streamline the configuration process for expert management in the model.
zhentaocc force-pushed the fuse_share_expert branch from d9e2f77 to c2dafde on April 7, 2026 03:45.
yichiche changed the title from "[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16" to "[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8" on Apr 7, 2026.
@HaiShaw (Collaborator) commented Apr 9, 2026:

/tag-and-rerun-ci

hubertlu-tw added a commit to hubertlu-tw/sglang that referenced this pull request Apr 14, 2026
@HaiShaw (Collaborator) commented Apr 14, 2026:

@amd-bot ci-status

@amd-bot commented Apr 14, 2026:

@HaiShaw

CI Status for PR #20736

PR: [AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8
Changed files: python/sglang/srt/models/qwen2_moe.py (+108/-5), python/sglang/srt/models/qwen3_5.py (+110/-3)

AMD: 9 failures (0 likely related) | Others: 15 failures (0 related)

The PR adds shared expert fusion for Qwen3.5 MoE models, gated behind _use_aiter (requires SGLANG_USE_AITER=true + ROCm). The support_shared_expert_fusion=True flag is only set in qwen3_5.py, not in qwen3.py or qwen2_moe.py callers. No CI test exercises Qwen3.5 in this run, so the new code path is never activated by any failing test.

AMD CI Failures

| Job | Error | Related? | Explanation | Log |
|-----|-------|----------|-------------|-----|
| small-amd (4) | Memory access fault by GPU node-2 in store_kvcache (LLaDA2) | 🟢 Unlikely | GPU memory fault on LLaDA2 model, unrelated to MoE fusion | Log |
| small-amd (8) | Memory access fault by GPU node-2 during CUDA graph capture (Qwen3-30B-A3B) → 30-min timeout | 🟢 Unlikely | Qwen3-30B-A3B uses qwen3.py (not qwen3_5.py), so support_shared_expert_fusion is not set; fault is in the store_kvcache JIT kernel, same pattern as other GPU faults | Log |
| small-amd (9) | Health check failed + watchdog timeout (Qwen2.5-VL-3B) → 30-min timeout | 🟢 Unlikely | VLM model (not MoE), server hung in store_kvcache | Log |
| small-amd (10) | test_lora_load_from_tensor — scheduler crashed with exit code -6 | 🟢 Unlikely | LoRA test on Llama-3.1-8B, unrelated to MoE | Log |
| small-amd (11) | AssertionError: 87.015 not less than 86 — TTFT threshold | 🟢 Unlikely | Flaky perf test (~1 ms over threshold), unrelated to MoE | Log |
| small-amd (12) | Memory access fault by GPU node-2 in store_kvcache (LLaDA2) | 🟢 Unlikely | Same GPU memory fault pattern as partition 4 | Log |
| small-amd (13) | test_transformers_models / test_int4fp8_moe → 30-min timeout | 🟢 Unlikely | Tests on transformers models, server hung in store_kvcache | Log |
| nondeterministic | test_reward_models — scheduler hung in store_kvcache (Qwen3 classification) | 🟢 Unlikely | Reward model test, same GPU hang pattern | Log |
| large-amd (1) | Memory access fault by GPU node-2 during score API (Qwen3-30B-A3B) → timeout | 🟢 Unlikely | Same store_kvcache fault pattern; uses qwen3.py, not qwen3_5.py | Log |
| mi35x | test_mxfp4_20b — 1200s timeout | 🟢 Unlikely | Hardware timeout on MI350X, unrelated to MoE fusion | Log |

Other CI Failures

| Job | Error | Related? | Explanation | Log |
|-----|-------|----------|-------------|-----|
| 1-gpu-large (4) | test_vlm_input_format.py killed (SIGKILL) | 🟢 Unlikely | VLM test OOM/infra issue on Nvidia runner | Log |
| 1-gpu-large (6-13), 1-gpu-small (6,7) | Fast-fail: skipping — root cause: stage-b-test-1-gpu-large (4) | 🟢 Unlikely | Cascade from partition 4 failure — not real failures | Log |
| build-test (all) | rmsnorm_cpu error: input must be a 2D tensor (Intel AMX) | 🟢 Unlikely | Intel AMX backend kernel error, unrelated | Log |
| build-and-test | UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY (Intel XPU) | 🟢 Unlikely | XPU OOM during model loading, unrelated | Log |
| multimodal-gen-test-1-npu-a3 | Diffusion latency above threshold (flux + wan2.1) | 🟢 Unlikely | NPU perf test, unrelated to MoE | Log |
| multimodal-gen-test-8-npu-a3 | Diffusion latency 3-4x over threshold (wan2.2 14B 8-NPU) | 🟢 Unlikely | NPU perf test, unrelated to MoE | Log |

Details

No failures are related to this PR. The PR's new code path (enable_shared_expert_fusion) is only activated when:

  1. Running on ROCm with SGLANG_USE_AITER=true
  2. Using a Qwen3.5 model (the only model that passes support_shared_expert_fusion=True)
  3. disable_shared_experts_fusion is not set

No CI test in this run exercises a Qwen3.5 model. The Qwen3-30B-A3B tests (partitions 8, large-1) use qwen3.py, which does not set support_shared_expert_fusion=True.
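For clarity, the activation gate the report describes reduces to a conjunction of the three conditions above (a sketch only; the argument names mirror the report, not the actual source):

```python
def shared_fusion_active(use_aiter: bool,
                         support_shared_expert_fusion: bool,
                         disable_shared_experts_fusion: bool) -> bool:
    # All three conditions must hold: Aiter on ROCm, a model that opts in
    # (per this report, only qwen3_5.py sets the flag), and no explicit
    # --disable-shared-experts-fusion.
    return (use_aiter
            and support_shared_expert_fusion
            and not disable_shared_experts_fusion)
```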

AMD failures: 6 of 10 share the same pattern — Memory access fault by GPU node-2 in the store_kvcache JIT kernel across multiple unrelated models (LLaDA2, Qwen3-30B, Qwen3-classification, Qwen2.5-VL, Llava). This is a pre-existing ROCm infrastructure issue on the MI325 runners, not a code regression. The remaining AMD failures are a flaky perf threshold (partition 11), a LoRA test crash (partition 10), and an MI350X timeout.

Nvidia failures: All 10 failed jobs cascade from a single OOM/SIGKILL in partition 4 (test_vlm_input_format.py). The fast-fail mechanism skipped the rest.

Other failures: Intel AMX kernel bug, Intel XPU OOM, and NPU perf threshold violations — all unrelated to MoE code.

Generated by amd-bot using Claude Code CLI

HaiShaw merged commit ea05ea5 into sgl-project:main on Apr 15, 2026.
95 of 141 checks passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
… & FP8 (sgl-project#20736)

Co-authored-by: Chen, Todd <zhenchen@amd.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
