
Fuse shared experts into trtllm_gen moe (fp8)#21491

Draft
wenscarl wants to merge 2 commits into sgl-project:main from wenscarl:shared_exp_integration

Conversation

Collaborator

@wenscarl wenscarl commented Mar 26, 2026

Motivation

flashinfer-ai/flashinfer#2625

Modifications

Accuracy Tests

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 4 --dp 4 --enable-dp-attention \
  --kv-cache-dtype fp8_e4m3 --load-format dummy --mem-fraction-static 0.8

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables support for fused shared experts within the FlashInfer TRT-LLM MoE backend, specifically for FP8 and MXFP8 quantization types. The changes include passing the number of fused shared experts through the MoE layers, adjusting top-k and expert counts accordingly, and updating the server configuration to allow shared expert fusion on CUDA when using the FlashInfer backend. Feedback focuses on improving the clarity of warning and error messages to accurately reflect these new support conditions.
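The summary above mentions "adjusting top-k and expert counts accordingly." A minimal sketch of what that adjustment looks like (the function name and signature here are illustrative, not the actual sglang code): when shared experts are fused into the routed-expert kernel, they are appended to the expert list and top-k grows so that every token always passes through them.

```python
def fused_moe_dims(num_routed_experts: int, top_k: int,
                   num_fused_shared_experts: int) -> tuple[int, int]:
    """Illustrative helper: dimensions seen by the fused MoE kernel
    when shared experts are folded into the routed experts."""
    # Shared experts are appended after the routed experts.
    total_experts = num_routed_experts + num_fused_shared_experts
    # top-k is widened so the shared experts are selected for every token.
    effective_top_k = top_k + num_fused_shared_experts
    return total_experts, effective_top_k

# DeepSeek-R1-style configuration: 256 routed experts, top-8 routing,
# 1 shared expert fused in.
print(fused_moe_dims(256, 8, 1))  # -> (257, 9)
```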

) and not (
_is_cuda and get_moe_runner_backend().is_flashinfer_trtllm()
):
disable_reason = "Only Deepseek V3/R1 on AMD-platform with capability >= gfx942(MI30x) can use shared experts fusion optimization under expert parallelism."

Severity: medium

The reason for disabling shared expert fusion is now potentially misleading. With this change, fusion is also enabled for CUDA with the flashinfer_trtllm backend under expert parallelism. The message should be updated to reflect this to avoid confusion for users on other CUDA configurations.

Suggested change
disable_reason = "Only Deepseek V3/R1 on AMD-platform with capability >= gfx942(MI30x) can use shared experts fusion optimization under expert parallelism."
disable_reason = "Shared experts fusion under expert parallelism is only supported on AMD-platform with capability >= gfx942(MI30x) or on CUDA with the flashinfer_trtllm backend."
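For clarity, the combined gating condition the reviewer is describing can be sketched as a standalone predicate (names and the string-valued backend parameter are assumptions for illustration; the real check uses platform helpers and `get_moe_runner_backend()`):

```python
def shared_experts_fusion_allowed(is_cuda: bool,
                                  is_gfx942_or_newer: bool,
                                  moe_backend: str) -> bool:
    """Sketch: when shared-expert fusion stays enabled under expert parallelism."""
    # Existing path: AMD GPUs with capability >= gfx942 (MI30x).
    if is_gfx942_or_newer:
        return True
    # Path added by this PR: CUDA with the flashinfer_trtllm MoE runner backend.
    if is_cuda and moe_backend == "flashinfer_trtllm":
        return True
    return False

print(shared_experts_fusion_allowed(True, False, "flashinfer_trtllm"))  # True
print(shared_experts_fusion_allowed(True, False, "triton"))             # False
```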

if self.quantization not in ["fp8", "mxfp8"]:
self.disable_shared_experts_fusion = True
logger.warning(
"FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set."

Severity: medium

The warning message is correct but could be more informative. It states that shared expert fusion is disabled but doesn't explain why. The code comment explains the reason well; incorporating that into the log message would improve user experience.

Suggested change
"FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set."
"FlashInfer TRTLLM MoE is enabled, but fused shared experts are only supported for fp8/mxfp8 quantization. --disable-shared-experts-fusion is automatically set."
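Putting the suggestion together, the quantization gate could look like the following sketch (function name is hypothetical; the real code sets `self.disable_shared_experts_fusion` on the server-args object):

```python
import logging

logger = logging.getLogger(__name__)

def maybe_disable_fusion(quantization: str,
                         disable_shared_experts_fusion: bool) -> bool:
    """Sketch: auto-disable fused shared experts for unsupported quant types."""
    if quantization not in ("fp8", "mxfp8"):
        # Fused shared experts in the TRT-LLM MoE kernel only cover fp8/mxfp8.
        logger.warning(
            "FlashInfer TRTLLM MoE is enabled, but fused shared experts are "
            "only supported for fp8/mxfp8 quantization. "
            "--disable-shared-experts-fusion is automatically set."
        )
        return True
    return disable_shared_experts_fusion
```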
