[AMD] Enable shared expert fusion with router experts for Qwen3.5 BF16 & FP8 (#20736)
HaiShaw merged 42 commits into sgl-project:main
Conversation
@zhentaocc Please fix the lint issue; I will kick off CI again once it's done.
Done.
Review context (conditions gating the fused path):
`or not _use_aiter`
`or quant_config is not None`
Additional work we might collaborate on:
- FP8/MXFP4 support
- Qwen3.5 has a separate `shared_gate`, which I am also trying to fuse with `gate_proj`.
An FP8 accuracy issue was identified; an aiter upgrade will be needed to fix the split_k issue.
Commit summaries:
- Eliminated redundant weight mappings for `gate_proj` and `up_proj` in the fused expert parameters, streamlining the weight loading process.
- Consolidated the initialization of the `num_experts` variable and updated references to it throughout the code to ensure accurate mapping of shared experts when fused; added comments clarifying the logic for loading fused expert weights.
- Simplified the weight loading process by removing conditional checks on `num_experts` related to fused MoE and streamlining the parameters passed during weight loading.
- Introduced a new function `can_fuse_shared_expert` to determine whether shared experts can be fused based on configuration and server arguments; updated the initialization of `enable_shared_expert_fusion` and `num_fused_shared_experts` accordingly, and refactored related code to handle shared experts correctly during weight loading and processing (see the sketch after this list).
- Enhanced comments to specify the loading behavior for `down_proj`, `gate_proj`, and `up_proj` in the weight loading process.
- Updated the logic for determining the number of shared experts from configuration settings; defaulted `enable_shared_expert_fusion` to False and made its initialization depend on the `_use_aiter` flag, with comments clarifying when fusion is enabled.
- Ensured `num_shared_experts` defaults to 0 when no configuration is provided.
- Cleaned up the initialization logic for `num_shared_experts` and `enable_shared_expert_fusion`.
- Switched the `num_shared_experts` initialization to use `hasattr` for attribute checking.
- Computed the total number of experts by directly calling `get_global_server_args().ep_num_redundant_experts` and streamlined the initialization of the `experts` attribute.
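One of these commits introduces `can_fuse_shared_expert`. Below is a minimal sketch of what such a gate might look like, assuming the conditions described in this PR (matching intermediate sizes, fusion not disabled via the server flag, and aiter enabled on HIP); the exact signature and attribute names here are assumptions, not the actual helper:

```python
import os

def can_fuse_shared_expert(config, server_args) -> bool:
    """Sketch of the fusion gate described in this PR (assumed attribute names)."""
    # The shared and routed experts must have the same intermediate size,
    # otherwise they cannot share one fused MoE kernel.
    sizes_match = (
        getattr(config, "shared_expert_intermediate_size", None)
        == getattr(config, "moe_intermediate_size", None)
    )
    # Fusion can be turned off explicitly via --disable-shared-experts-fusion.
    not_disabled = not getattr(server_args, "disable_shared_experts_fusion", False)
    # The fused path is only taken on ROCm/HIP with aiter enabled.
    use_aiter = os.environ.get("SGLANG_USE_AITER", "0") == "1"
    return sizes_match and not_disabled and use_aiter
```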
/tag-and-rerun-ci
@amd-bot ci-status
CI Status for PR #20736
PR: [AMD] Enable shared expert fusion with router experts for Qwen3.5 BF16 & FP8
AMD: 9 failures (0 likely related) | Others: 15 failures (0 related)
The PR adds shared expert fusion for Qwen3.5 MoE models, gated behind …

AMD CI Failures
Other CI Failures

Details
No failures are related to this PR. The PR's new code path (…)
No CI test in this run exercises a Qwen3.5 model. The Qwen3-30B-A3B tests (partitions 8, large-1) use …
AMD failures: 6 of 10 share the same pattern — …
Nvidia failures: All 10 failed jobs cascade from a single OOM/SIGKILL in partition 4 (…)
Other failures: Intel AMX kernel bug, Intel XPU OOM, and NPU perf threshold violations — all unrelated to MoE code.
Generated by amd-bot using Claude Code CLI
[AMD] Enable shared expert fusion with router experts for Qwen3.5 BF16 & FP8 (sgl-project#20736)
Co-authored-by: Chen, Todd <zhenchen@amd.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
Motivation
Qwen2 MoE and Qwen3.5 MoE models use a shared expert in addition to routed experts. When `shared_expert_intermediate_size == moe_intermediate_size`, the shared expert can be fused with the routed experts so that each token attends to its top-k routed experts plus one shared expert (top-k + 1) in a single MoE dispatch, reducing kernel launches and improving inference efficiency. This PR adds shared expert fusion support for Qwen2 MoE (when using Aiter on ROCm/HIP) and improves Qwen3.5 MoE weight loading to correctly handle the fused shared expert layout.
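For intuition, here is a minimal sketch (not the PR's actual code) of how a shared expert can be appended to the routed top-k selection so that both go through one fused dispatch; the sizes and the constant gate weight are illustrative assumptions:

```python
import torch

# Illustrative sizes, not taken from the PR.
num_tokens, top_k, num_routed_experts = 4, 2, 8
shared_expert_id = num_routed_experts  # fused shared expert sits after the routed ones

# Pretend router output: top-k expert ids and weights per token.
topk_ids = torch.randint(0, num_routed_experts, (num_tokens, top_k))
topk_weights = torch.rand(num_tokens, top_k)

# Append the shared expert to every token's selection (top-k + 1).
# In the real model the weight comes from sigmoid(shared_expert_gate(hidden_states)).
shared_ids = torch.full((num_tokens, 1), shared_expert_id, dtype=topk_ids.dtype)
shared_weights = torch.ones(num_tokens, 1, dtype=topk_weights.dtype)

fused_ids = torch.cat([topk_ids, shared_ids], dim=1)          # [num_tokens, top_k + 1]
fused_weights = torch.cat([topk_weights, shared_weights], dim=1)
```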
Modifications

python/sglang/srt/models/qwen2_moe.py
- `_determine_num_fused_shared_experts()`: New helper that returns 1 when shared expert fusion is enabled (requires `shared_expert_intermediate_size == moe_intermediate_size`, not disabled via `--disable-shared-experts-fusion`, and `SGLANG_USE_AITER=1` on HIP).
- `_get_shared_expert_weights()`: Returns `sigmoid(shared_expert_gate(hidden_states))` for the fused shared expert weights.
- `_append_shared_to_topk_output()`: Appends shared expert IDs and weights to the top-k output before the fused MoE forward.
- `_forward_router_experts()`: After top-k selection on gate logits, appends the shared expert via `_append_shared_to_topk_output()` when fusion is enabled.
- `top_k` and `num_experts` now include `num_fused_shared_experts` when fusion is active.

python/sglang/srt/models/qwen3_5.py
- `_get_num_fused_shared_experts()`: New helper used by `Qwen3_5MoeForConditionalGeneration` to obtain `num_fused_shared_experts` from the first layer's MLP.
- `num_experts` adjustment for the expert params mapping.
- Maps `mlp.shared_expert.*` to `mlp.experts.{num_experts_base}.*` when fusion is enabled.
- Adds `fused_expert_params_mapping` entries for the shared expert (`gate_proj`, `up_proj`, `down_proj`, and combined `gate_up_proj`).
- Supports both split (`gate_proj`/`up_proj`) and combined (`gate_up_proj`) checkpoint layouts for the shared expert; a weight-remapping sketch follows this list.
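As a rough illustration of the checkpoint-key remapping described above, here is a sketch under assumed key names and an assumed expert count; it is not the actual loader code:

```python
import re

# Assumed value for illustration: 128 routed experts, shared expert fused as index 128.
num_experts_base = 128

def remap_shared_expert_key(name: str) -> str:
    """Map a shared-expert checkpoint key onto the fused expert slot.

    e.g. 'model.layers.0.mlp.shared_expert.gate_proj.weight'
      -> 'model.layers.0.mlp.experts.128.gate_proj.weight'
    """
    # The trailing dot keeps keys like 'mlp.shared_expert_gate.*' untouched.
    return re.sub(r"mlp\.shared_expert\.", f"mlp.experts.{num_experts_base}.", name)

assert (
    remap_shared_expert_key("model.layers.0.mlp.shared_expert.up_proj.weight")
    == "model.layers.0.mlp.experts.128.up_proj.weight"
)
```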
Accuracy Tests

Model: Qwen/Qwen3.5-397B-A17B
enable fusion:
disable fusion:
Benchmarking
Before
After
Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci