Fuse shared experts into trtllm_gen moe (fp8) #21491
wenscarl wants to merge 2 commits into sgl-project:main from
Conversation
Code Review
This pull request enables support for fused shared experts within the FlashInfer TRT-LLM MoE backend, specifically for FP8 and MXFP8 quantization types. The changes include passing the number of fused shared experts through the MoE layers, adjusting top-k and expert counts accordingly, and updating the server configuration to allow shared expert fusion on CUDA when using the FlashInfer backend. Feedback focuses on improving the clarity of warning and error messages to accurately reflect these new support conditions.
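To make the "adjusting top-k and expert counts accordingly" part of the summary concrete, here is a minimal sketch of the dimension bookkeeping when shared experts are fused into the routed expert pool. The function and parameter names (`fused_moe_dims`, `num_routed_experts`, `num_fused_shared_experts`) are illustrative, not SGLang's actual identifiers:

```python
# Hypothetical sketch: fusing shared experts appends them to the routed
# expert pool, and every token selects them in addition to its routed
# top-k, so both dimensions grow by the number of fused shared experts.
def fused_moe_dims(num_routed_experts: int, top_k: int,
                   num_fused_shared_experts: int) -> tuple[int, int]:
    total_experts = num_routed_experts + num_fused_shared_experts
    effective_top_k = top_k + num_fused_shared_experts
    return total_experts, effective_top_k

# Example: a DeepSeek-style MoE with 256 routed experts, top-8 routing,
# and 1 shared expert fused into the kernel.
print(fused_moe_dims(256, 8, 1))  # (257, 9)
```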
```python
) and not (
    _is_cuda and get_moe_runner_backend().is_flashinfer_trtllm()
):
    disable_reason = "Only Deepseek V3/R1 on AMD-platform with capability >= gfx942(MI30x) can use shared experts fusion optimization under expert parallelism."
```
The reason for disabling shared expert fusion is now potentially misleading. With this change, fusion is also enabled for CUDA with the flashinfer_trtllm backend under expert parallelism. The message should be updated to reflect this to avoid confusion for users on other CUDA configurations.
```diff
-disable_reason = "Only Deepseek V3/R1 on AMD-platform with capability >= gfx942(MI30x) can use shared experts fusion optimization under expert parallelism."
+disable_reason = "Shared experts fusion under expert parallelism is only supported on AMD-platform with capability >= gfx942(MI30x) or on CUDA with the flashinfer_trtllm backend."
```
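The gating condition the reviewer describes can be sketched as a small predicate. This is an illustrative simplification under assumed boolean inputs; the real check in the diff uses `_is_hip`, `_is_cuda`, and `get_moe_runner_backend()` with more surrounding context:

```python
# Hypothetical sketch of the fusion-gating logic: shared experts fusion
# under expert parallelism is allowed on AMD (gfx942/MI30x or newer) or,
# after this PR, on CUDA with the flashinfer_trtllm MoE backend.
def shared_experts_fusion_allowed(is_hip: bool, is_gfx942_or_newer: bool,
                                  is_cuda: bool,
                                  backend_is_flashinfer_trtllm: bool) -> bool:
    amd_ok = is_hip and is_gfx942_or_newer
    cuda_ok = is_cuda and backend_is_flashinfer_trtllm
    return amd_ok or cuda_ok
```

On any other configuration the predicate is false, and the server falls back to the `disable_reason` message discussed above.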
```python
if self.quantization not in ["fp8", "mxfp8"]:
    self.disable_shared_experts_fusion = True
    logger.warning(
        "FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set."
```
The warning message is correct but could be more informative: it states that shared expert fusion is disabled without explaining why. The adjacent code comment explains the reason well; incorporating it into the log message would improve the user experience.
```diff
-"FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set."
+"FlashInfer TRTLLM MoE is enabled, but fused shared experts are only supported for fp8/mxfp8 quantization. --disable-shared-experts-fusion is automatically set."
```
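For completeness, the quantization gate and the improved warning can be sketched end to end. The class name, method name, and surrounding structure here are hypothetical simplifications; only the `quantization` / `disable_shared_experts_fusion` attributes and the fp8/mxfp8 check mirror the diff:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch of the server-args gate, not SGLang's actual class.
class ServerArgsSketch:
    def __init__(self, quantization: str):
        self.quantization = quantization
        self.disable_shared_experts_fusion = False

    def apply_flashinfer_trtllm_moe_checks(self) -> None:
        # Fused shared experts are only wired up for fp8/mxfp8 kernels,
        # so any other quantization falls back to unfused shared experts.
        if self.quantization not in ["fp8", "mxfp8"]:
            self.disable_shared_experts_fusion = True
            logger.warning(
                "FlashInfer TRTLLM MoE is enabled, but fused shared experts "
                "are only supported for fp8/mxfp8 quantization. "
                "--disable-shared-experts-fusion is automatically set."
            )

args = ServerArgsSketch("awq")
args.apply_flashinfer_trtllm_moe_checks()
print(args.disable_shared_experts_fusion)  # True
```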
Motivation
flashinfer-ai/flashinfer#2625
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci