[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE #19537

Merged
Fridge003 merged 24 commits into sgl-project:main from zianglih:agent-flashinfer-mxfp8-moe
Mar 10, 2026

@zianglih
Contributor

@zianglih zianglih commented Feb 28, 2026

Motivation

@HumansAnd

Modifications

This PR makes the following changes:

  • Expand existing flashinfer.fused_moe.trtllm_fp8_block_scale_moe with mxfp8
  • Add flashinfer.fused_moe.trtllm_fp8_block_scale_routed_moe which supports mxfp8 and deepseek fp8
  • Add flashinfer.mm_mxfp8
  • Expand test coverage
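
For readers unfamiliar with mxfp8, the block-scaling idea can be sketched in plain Python. This is a minimal illustration assuming the OCP Microscaling layout (32-element blocks, one shared power-of-two scale per block, FP8 E4M3 elements); it is not the FlashInfer or SGLang implementation, and the helper names `mx_block_scale_exponent`, `mx_quantize`, and `mx_dequantize` are hypothetical. The element rounding to FP8 is deliberately left out.

```python
import math

BLOCK_SIZE = 32    # MX formats share one scale per 32 consecutive elements
E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8      # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)


def mx_block_scale_exponent(block):
    """Return the shared power-of-two exponent for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0  # assumption: an all-zero block uses scale 2**0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1
    return (math.frexp(amax)[1] - 1) - E4M3_EMAX


def mx_quantize(values):
    """Split values into blocks and return (exponents, scaled blocks).

    Scaled elements are clamped to the E4M3 range but not rounded to FP8,
    so this shows only the block-scaling step, not the element cast.
    """
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    exponents, scaled = [], []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        exp = mx_block_scale_exponent(block)
        scale = 2.0 ** exp
        exponents.append(exp)
        scaled.append([max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block])
    return exponents, scaled


def mx_dequantize(exponents, scaled):
    """Multiply each block by its shared scale to recover approximate values."""
    out = []
    for exp, block in zip(exponents, scaled):
        out.extend(v * 2.0 ** exp for v in block)
    return out
```

Because the scale is a power of two, scaling and unscaling are exact in floating point; the precision loss in a real kernel comes from the FP8 element cast and from saturation when a scaled value exceeds 448.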

Accuracy Tests

The following expanded tests passed on B200:

  • test_flashinfer_trtllm_gen_moe_backend.py
    • mxfp8 trtllm_fp8_block_scale_moe
    • fp8 & mxfp8 trtllm_fp8_block_scale_routed_moe
  • test_fp8_blockwise_gemm.py
    • mxfp8 triton dense linear
    • mxfp8 flashinfer_trtllm dense linear

Benchmarking and Profiling

cd /sgl-workspace/sglang
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum

Qwen3-30B-A3B, MXFP8

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &
Accuracy: 0.965
Invalid: 0.000
Latency: 8.733 s
Output throughput: 19333.410 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.224 s
Output throughput: 20529.920 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.077 s
Output throughput: 20905.019 token/s

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &
Accuracy: 0.961
Invalid: 0.000
Latency: 10.147 s
Output throughput: 16812.828 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 10.116 s
Output throughput: 16864.250 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 9.672 s
Output throughput: 17637.603 token/s

python -m sglang.launch_server --disable-cuda-graph --disable-piecewise-cuda-graph --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &
Accuracy: 0.961
Invalid: 0.000
Latency: 21.103 s
Output throughput: 8083.506 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 20.921 s
Output throughput: 8153.735 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 20.143 s
Output throughput: 8468.786 token/s

Qwen3-30B-A3B, FP8

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 &
Accuracy: 0.963
Invalid: 0.000
Latency: 10.227 s
Output throughput: 16662.979 token/s
Accuracy: 0.963
Invalid: 0.000
Latency: 9.319 s
Output throughput: 18287.525 token/s
Accuracy: 0.963
Invalid: 0.000
Latency: 9.223 s
Output throughput: 18477.602 token/s

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 &
Accuracy: 0.966
Invalid: 0.000
Latency: 11.228 s
Output throughput: 15189.365 token/s
Accuracy: 0.966
Invalid: 0.000
Latency: 10.201 s
Output throughput: 16719.158 token/s
Accuracy: 0.966
Invalid: 0.000
Latency: 10.561 s
Output throughput: 16148.546 token/s

DeepSeek-V3.2, DeepSeek FP8

python3 -m sglang.launch_server --model-path /data/models/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm &
Accuracy: 0.980
Invalid: 0.000
Latency: 19.572 s
Output throughput: 5749.341 token/s
Accuracy: 0.983
Invalid: 0.000
Latency: 14.888 s
Output throughput: 7610.951 token/s
Accuracy: 0.976
Invalid: 0.000
Latency: 15.195 s
Output throughput: 7454.477 token/s

python3 -m sglang.launch_server --model-path /data/models/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed &
Accuracy: 0.984
Invalid: 0.000
Latency: 15.011 s
Output throughput: 7490.571 token/s
Accuracy: 0.974
Invalid: 0.000
Latency: 18.532 s
Output throughput: 6125.993 token/s
Accuracy: 0.983
Invalid: 0.000
Latency: 14.183 s
Output throughput: 7949.668 token/s
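
Each configuration above was run three times, so comparing backends means averaging the repeated blocks. A small helper along these lines could do that; it is a sketch that assumes the four-line output format shown in these logs, and `parse_runs` and `summarize` are hypothetical names, not part of `bench_sglang.py`. The sample text is the first MXFP8 run set from this description.

```python
import re
from statistics import mean

# Matches one four-line result block as printed in the logs above.
RUN_PATTERN = re.compile(
    r"Accuracy:\s*(?P<accuracy>[\d.]+)\s+"
    r"Invalid:\s*(?P<invalid>[\d.]+)\s+"
    r"Latency:\s*(?P<latency>[\d.]+)\s*s\s+"
    r"Output throughput:\s*(?P<throughput>[\d.]+)\s*token/s"
)

SAMPLE = """\
Accuracy: 0.965
Invalid: 0.000
Latency: 8.733 s
Output throughput: 19333.410 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.224 s
Output throughput: 20529.920 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.077 s
Output throughput: 20905.019 token/s
"""


def parse_runs(log_text):
    """Return one dict of floats per result block found in the text."""
    return [
        {key: float(value) for key, value in match.groupdict().items()}
        for match in RUN_PATTERN.finditer(log_text)
    ]


def summarize(runs):
    """Average accuracy, latency, and throughput over the parsed runs."""
    return {
        "runs": len(runs),
        "mean_accuracy": mean(r["accuracy"] for r in runs),
        "mean_latency_s": mean(r["latency"] for r in runs),
        "mean_throughput_tok_s": mean(r["throughput"] for r in runs),
    }


print(summarize(parse_runs(SAMPLE)))
```

Pasting the output of another configuration into `SAMPLE` gives a like-for-like mean; with only three runs per configuration, differences of a few percent are within run-to-run variation.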

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

zianglih pushed a commit to zianglih/sglang that referenced this pull request Feb 28, 2026
zianglih pushed a commit to zianglih/sglang that referenced this pull request Feb 28, 2026
@github-actions github-actions bot added the documentation label Feb 28, 2026
@ziang-and ziang-and force-pushed the agent-flashinfer-mxfp8-moe branch 2 times, most recently from 12e7d36 to 1d698c0 on March 2, 2026 18:42
@ziang-and ziang-and force-pushed the agent-flashinfer-mxfp8-moe branch from abe9d2b to 1e8a692 on March 3, 2026 00:16
@zianglih
Contributor Author

zianglih commented Mar 3, 2026

Fixing a torch.compile-related failure.

@zianglih
Contributor Author

zianglih commented Mar 3, 2026

The previous failure is fixed by 863f9d5. This PR is now compatible with piecewise CUDA graphs.

@zianglih zianglih marked this pull request as ready for review March 3, 2026 06:17

@zianglih
Contributor Author

zianglih commented Mar 4, 2026

The accuracy discrepancy between the routed and non-routed backends has been eliminated by 6ed9c53.

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &

fused baseline

Accuracy: 0.965
Invalid: 0.000
Latency: 8.733 s
Output throughput: 19333.410 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.224 s
Output throughput: 20529.920 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.077 s
Output throughput: 20905.019 token/s

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &

routed BEFORE

Accuracy: 0.961
Invalid: 0.000
Latency: 10.147 s
Output throughput: 16812.828 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 10.116 s
Output throughput: 16864.250 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 9.672 s
Output throughput: 17637.603 token/s

routed AFTER

Accuracy: 0.965
Invalid: 0.000
Latency: 10.111 s
Output throughput: 16793.321 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 9.996 s
Output throughput: 16986.518 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 9.574 s
Output throughput: 17736.517 token/s
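
As background for why the two backends are expected to agree, here is a toy Python model. It assumes that a "routed" entry point receives precomputed top-k expert ids and weights while a fused one derives them internally from router logits; that reading, the softmax-then-renormalize routing rule, and all function names are assumptions made for illustration. It does not describe what 6ed9c53 changed.

```python
import math


def top_k_route(logits, k):
    """Softmax over expert logits, keep the k largest, renormalize their weights."""
    peak = max(logits)
    probs = [math.exp(value - peak) for value in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Sort by descending probability, breaking ties by lower expert index.
    ids = sorted(range(len(logits)), key=lambda i: (-probs[i], i))[:k]
    norm = sum(probs[i] for i in ids)
    return ids, [probs[i] / norm for i in ids]


def moe_routed(x, experts, ids, weights):
    """Routed entry point: the caller supplies expert ids and weights."""
    return sum(w * experts[i](x) for i, w in zip(ids, weights))


def moe_fused(x, experts, logits, k):
    """Fused entry point: routing is computed internally from the logits."""
    ids, weights = top_k_route(logits, k)
    return moe_routed(x, experts, ids, weights)


# Toy scalar "experts" standing in for expert MLPs.
EXPERTS = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
LOGITS = [0.1, 2.0, -1.0, 1.0]

ids, weights = top_k_route(LOGITS, k=2)
same = moe_routed(3.0, EXPERTS, ids, weights) == moe_fused(3.0, EXPERTS, LOGITS, 2)
print("routing:", ids, "outputs match:", same)
```

In this model the two paths give identical outputs whenever the caller's routing rule matches the internal one, so any gap between them points at a difference in how ids or weights are produced or passed.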

@Fridge003
Collaborator

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit 76ee4bb into sgl-project:main Mar 10, 2026
209 of 222 checks passed
alisonshao pushed a commit that referenced this pull request Mar 12, 2026
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

Labels

documentation, run-ci