
Fuse shared experts into trtllm_gen moe (fp8)#21491

Draft
wenscarl wants to merge 2 commits into sgl-project:main from wenscarl:shared_exp_integration

Conversation

Collaborator

@wenscarl wenscarl commented Mar 26, 2026

Motivation

flashinfer-ai/flashinfer#2625

Modifications

Accuracy Tests

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 4 --dp 4 --enable-dp-attention \
  --kv-cache-dtype fp8_e4m3 --load-format dummy --mem-fraction-static 0.8

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables support for fused shared experts within the FlashInfer TRT-LLM MoE backend, specifically for FP8 and MXFP8 quantization types. The changes include passing the number of fused shared experts through the MoE layers, adjusting top-k and expert counts accordingly, and updating the server configuration to allow shared expert fusion on CUDA when using the FlashInfer backend. Feedback focuses on improving the clarity of warning and error messages to accurately reflect these new support conditions.
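The summary above mentions "adjusting top-k and expert counts accordingly." A minimal sketch of what that adjustment looks like (the function name and signature here are illustrative, not the actual sglang code): when shared experts are fused into the routed-expert kernel, they are appended to the expert list and top-k grows so that every token always passes through them.

```python
def fused_moe_dims(num_routed_experts: int, top_k: int,
                   num_fused_shared_experts: int) -> tuple[int, int]:
    """Illustrative helper: dimensions seen by the fused MoE kernel
    when shared experts are folded into the routed experts."""
    # Shared experts are appended after the routed experts.
    total_experts = num_routed_experts + num_fused_shared_experts
    # top-k is widened so the shared experts are selected for every token.
    effective_top_k = top_k + num_fused_shared_experts
    return total_experts, effective_top_k

# DeepSeek-R1-style configuration: 256 routed experts, top-8 routing,
# 1 shared expert fused in.
print(fused_moe_dims(256, 8, 1))  # -> (257, 9)
```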

) and not (
_is_cuda and get_moe_runner_backend().is_flashinfer_trtllm()
):
disable_reason = "Only Deepseek V3/R1 on AMD-platform with capability >= gfx942(MI30x) can use shared experts fusion optimization under expert parallelism."

Severity: medium

The reason for disabling shared expert fusion is now potentially misleading. With this change, fusion is also enabled for CUDA with the flashinfer_trtllm backend under expert parallelism. The message should be updated to reflect this to avoid confusion for users on other CUDA configurations.

Suggested change
disable_reason = "Only Deepseek V3/R1 on AMD-platform with capability >= gfx942(MI30x) can use shared experts fusion optimization under expert parallelism."
disable_reason = "Shared experts fusion under expert parallelism is only supported on AMD-platform with capability >= gfx942(MI30x) or on CUDA with the flashinfer_trtllm backend."
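For clarity, the combined gating condition the reviewer is describing can be sketched as a standalone predicate (names and the string-valued backend parameter are assumptions for illustration; the real check uses platform helpers and `get_moe_runner_backend()`):

```python
def shared_experts_fusion_allowed(is_cuda: bool,
                                  is_gfx942_or_newer: bool,
                                  moe_backend: str) -> bool:
    """Sketch: when shared-expert fusion stays enabled under expert parallelism."""
    # Existing path: AMD GPUs with capability >= gfx942 (MI30x).
    if is_gfx942_or_newer:
        return True
    # Path added by this PR: CUDA with the flashinfer_trtllm MoE runner backend.
    if is_cuda and moe_backend == "flashinfer_trtllm":
        return True
    return False

print(shared_experts_fusion_allowed(True, False, "flashinfer_trtllm"))  # True
print(shared_experts_fusion_allowed(True, False, "triton"))             # False
```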

if self.quantization not in ["fp8", "mxfp8"]:
self.disable_shared_experts_fusion = True
logger.warning(
"FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set."

Severity: medium

The warning message is correct but could be more informative. It states that shared expert fusion is disabled but doesn't explain why. The code comment explains the reason well; incorporating that into the log message would improve user experience.

Suggested change
"FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set."
"FlashInfer TRTLLM MoE is enabled, but fused shared experts are only supported for fp8/mxfp8 quantization. --disable-shared-experts-fusion is automatically set."
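Putting the suggestion together, the quantization gate could look like the following sketch (function name is hypothetical; the real code sets `self.disable_shared_experts_fusion` on the server-args object):

```python
import logging

logger = logging.getLogger(__name__)

def maybe_disable_fusion(quantization: str,
                         disable_shared_experts_fusion: bool) -> bool:
    """Sketch: auto-disable fused shared experts for unsupported quant types."""
    if quantization not in ("fp8", "mxfp8"):
        # Fused shared experts in the TRT-LLM MoE kernel only cover fp8/mxfp8.
        logger.warning(
            "FlashInfer TRTLLM MoE is enabled, but fused shared experts are "
            "only supported for fp8/mxfp8 quantization. "
            "--disable-shared-experts-fusion is automatically set."
        )
        return True
    return disable_shared_experts_fusion
```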
