[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE by zianglih · Pull Request #19537 · sgl-project/sglang

zianglih · 2026-02-28T02:13:37Z

Motivation

@HumansAnd

Modifications

This PR:

- Expands the existing flashinfer.fused_moe.trtllm_fp8_block_scale_moe integration with mxfp8 support
- Adds flashinfer.fused_moe.trtllm_fp8_block_scale_routed_moe, which supports mxfp8 and DeepSeek fp8
- Adds flashinfer.mm_mxfp8
- Expands test coverage
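
For readers less familiar with mxfp8: in the OCP Microscaling (MX) formats, each block of 32 elements shares one power-of-two scale, and the elements themselves are stored as FP8. The sketch below only illustrates how such a shared scale can be chosen for E4M3 elements. It is a simplified illustration, not the FlashInfer or SGLang implementation: the function names are hypothetical, FP8 rounding is not simulated, and the scale layout the kernels expect is not shown.

```python
import math

BLOCK_SIZE = 32   # elements sharing one scale in the MX formats
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8     # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)


def mx_block_scale_exponent(block):
    """Return the shared power-of-two exponent for one block (hypothetical helper)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0  # assumption: any exponent is acceptable for an all-zero block
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    return (math.frexp(amax)[1] - 1) - E4M3_EMAX


def scale_block(block):
    """Divide a block by its shared scale and saturate to the E4M3 range."""
    exponent = mx_block_scale_exponent(block)
    scale = 2.0 ** exponent
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, x / scale)) for x in block]
    return exponent, scaled


block = [0.01 * i for i in range(BLOCK_SIZE)]  # 0.00, 0.01, ..., 0.31
exp, scaled = scale_block(block)
print("shared exponent:", exp)
print("largest scaled magnitude:", max(abs(v) for v in scaled))
```

Because the scale is a power of two, dividing by it changes only the exponent of each value, so the precision loss comes from the FP8 element rounding that this sketch leaves out.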

Accuracy Tests

The following expanded tests passed on B200:

test_flashinfer_trtllm_gen_moe_backend.py
- mxfp8 trtllm_fp8_block_scale_moe
- fp8 & mxfp8 trtllm_fp8_block_scale_routed_moe
test_fp8_blockwise_gemm.py
- mxfp8 triton dense linear
- mxfp8 flashinfer_trtllm dense linear

Benchmarking and Profiling

cd /sgl-workspace/sglang
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum

Qwen-30B-A3B, MXFP8

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &
Accuracy: 0.965
Invalid: 0.000
Latency: 8.733 s
Output throughput: 19333.410 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.224 s
Output throughput: 20529.920 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.077 s
Output throughput: 20905.019 token/s
python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &
Accuracy: 0.961
Invalid: 0.000
Latency: 10.147 s
Output throughput: 16812.828 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 10.116 s
Output throughput: 16864.250 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 9.672 s
Output throughput: 17637.603 token/s
python -m sglang.launch_server --disable-cuda-graph --disable-piecewise-cuda-graph --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &
Accuracy: 0.961
Invalid: 0.000
Latency: 21.103 s
Output throughput: 8083.506 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 20.921 s
Output throughput: 8153.735 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 20.143 s
Output throughput: 8468.786 token/s

Qwen-30B-A3B, FP8

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 &
Accuracy: 0.963
Invalid: 0.000
Latency: 10.227 s
Output throughput: 16662.979 token/s
Accuracy: 0.963
Invalid: 0.000
Latency: 9.319 s
Output throughput: 18287.525 token/s
Accuracy: 0.963
Invalid: 0.000
Latency: 9.223 s
Output throughput: 18477.602 token/s

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 &
Accuracy: 0.966
Invalid: 0.000
Latency: 11.228 s
Output throughput: 15189.365 token/s
Accuracy: 0.966
Invalid: 0.000
Latency: 10.201 s
Output throughput: 16719.158 token/s
Accuracy: 0.966
Invalid: 0.000
Latency: 10.561 s
Output throughput: 16148.546 token/s

DeepSeek-V3.2, DeepSeek FP8

python3 -m sglang.launch_server --model-path /data/models/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm &
Accuracy: 0.980
Invalid: 0.000
Latency: 19.572 s
Output throughput: 5749.341 token/s
Accuracy: 0.983
Invalid: 0.000
Latency: 14.888 s
Output throughput: 7610.951 token/s
Accuracy: 0.976
Invalid: 0.000
Latency: 15.195 s
Output throughput: 7454.477 token/s

python3 -m sglang.launch_server --model-path /data/models/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed &
Accuracy: 0.984
Invalid: 0.000
Latency: 15.011 s
Output throughput: 7490.571 token/s
Accuracy: 0.974
Invalid: 0.000
Latency: 18.532 s
Output throughput: 6125.993 token/s
Accuracy: 0.983
Invalid: 0.000
Latency: 14.183 s
Output throughput: 7949.668 token/s
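
Each configuration above was run three times, so it can help to average the runs before comparing backends. The following is a minimal sketch that parses the benchmark output format shown above and averages each metric; the `RUNS` dictionary and the `summarize` helper are illustrative names, and the numbers are copied from the Qwen-30B-A3B MXFP8 results in this description.

```python
import re
from statistics import mean

# Three runs per MoE runner backend, copied from the MXFP8 results above.
RUNS = {
    "flashinfer_trtllm": """
Accuracy: 0.965
Invalid: 0.000
Latency: 8.733 s
Output throughput: 19333.410 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.224 s
Output throughput: 20529.920 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.077 s
Output throughput: 20905.019 token/s
""",
    "flashinfer_trtllm_routed": """
Accuracy: 0.961
Invalid: 0.000
Latency: 10.147 s
Output throughput: 16812.828 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 10.116 s
Output throughput: 16864.250 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 9.672 s
Output throughput: 17637.603 token/s
""",
}

# Matches lines such as "Latency: 8.733 s" and captures the name and the number.
METRIC = re.compile(r"^([A-Za-z ]+):\s*([0-9.]+)", re.MULTILINE)


def summarize(log: str) -> dict:
    """Average every metric over all runs found in one log."""
    values = {}
    for name, number in METRIC.findall(log):
        values.setdefault(name, []).append(float(number))
    return {name: mean(samples) for name, samples in values.items()}


summary = {backend: summarize(log) for backend, log in RUNS.items()}
for backend, metrics in summary.items():
    print(backend, {name: round(value, 3) for name, value in metrics.items()})
```

The same helper can be pointed at the FP8 and DeepSeek-V3.2 logs by adding entries to `RUNS`.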

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-02-28T02:13:41Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

zianglih · 2026-03-03T04:53:36Z

Fixing a torch.compile-related failure.

zianglih · 2026-03-03T06:17:09Z

Previous failure is fixed by 863f9d5. This PR is now piecewise CUDA graph compatible.

zianglih · 2026-03-04T00:52:43Z

The accuracy discrepancy between the routed and non-routed backends has been eliminated by 6ed9c53.

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &

fused baseline

Accuracy: 0.965
Invalid: 0.000
Latency: 8.733 s
Output throughput: 19333.410 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.224 s
Output throughput: 20529.920 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 8.077 s
Output throughput: 20905.019 token/s

python -m sglang.launch_server --fp8-gemm-backend flashinfer_trtllm --moe-runner-backend flashinfer_trtllm_routed --model zianglih/Qwen3-30B-A3B-Instruct-2507-MXFP8 &

routed BEFORE

Accuracy: 0.961
Invalid: 0.000
Latency: 10.147 s
Output throughput: 16812.828 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 10.116 s
Output throughput: 16864.250 token/s
Accuracy: 0.961
Invalid: 0.000
Latency: 9.672 s
Output throughput: 17637.603 token/s

routed AFTER

Accuracy: 0.965
Invalid: 0.000
Latency: 10.111 s
Output throughput: 16793.321 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 9.996 s
Output throughput: 16986.518 token/s
Accuracy: 0.965
Invalid: 0.000
Latency: 9.574 s
Output throughput: 17736.517 token/s

docs/advanced_features/server_arguments.md

python/sglang/srt/layers/quantization/fp8.py

python/sglang/srt/layers/quantization/fp8_utils.py

Fridge003 · 2026-03-10T02:21:19Z

/tag-and-rerun-ci

zianglih mentioned this pull request Feb 28, 2026

[Bug] [MXFP8 Online] AssertionError: n=64 must be divisible by 128 #18277

Closed

5 tasks

zianglih marked this pull request as ready for review February 28, 2026 02:16

zianglih requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners February 28, 2026 02:16

zianglih mentioned this pull request Feb 28, 2026

[FlashInfer] Bump FlashInfer version from 0.6.3 to 0.6.4 #19005

Merged

5 tasks

zianglih pushed a commit to zianglih/sglang that referenced this pull request Feb 28, 2026

Cherry-pick PR sgl-project#19537 onto sglang-miles (squashed)

aa24f6c

zianglih pushed a commit to zianglih/sglang that referenced this pull request Feb 28, 2026

Cherry-pick PR sgl-project#19537 onto sglang-miles (squashed)

dad4cca

github-actions bot added the documentation Improvements or additions to documentation label Feb 28, 2026

ziang-and force-pushed the agent-flashinfer-mxfp8-moe branch 2 times, most recently from 12e7d36 to 1d698c0 Compare March 2, 2026 18:42

zianglih added 9 commits March 2, 2026 16:16

Initial flashinfer mxfp8 integration

da4444c

Clean up

acdab35

Refactor with copy_or_rebind_param

3dd4f78

Fix _handle_moe_kernel_config

49af4e7

Clean up

7cee453

Add doc

5194a31

Clean up

8166e6d

Expand test to include mxfp8 and flashinfer_trtllm_routed

cad547b

Fix flashinfer_trtllm_routed EP

1e8a692

ziang-and force-pushed the agent-flashinfer-mxfp8-moe branch from abe9d2b to 1e8a692 Compare March 3, 2026 00:16

Fix piece wise cuda graph

863f9d5

zianglih marked this pull request as ready for review March 3, 2026 06:17

zianglih mentioned this pull request Mar 3, 2026

[mxfp8] [numerics] trtllm_fp8_block_scale_routed_moe accuracy is slightly worse than trtllm_fp8_block_scale_moe flashinfer-ai/flashinfer#2676

Closed

Add raw logits topk

6ed9c53

zianglih mentioned this pull request Mar 4, 2026

Expose a fused_topk_raw_logits API flashinfer-ai/flashinfer#2682

Closed

5 tasks

zianglih added 3 commits March 3, 2026 17:20

Use flashinfer mxfp8 in test for now since triton mxfp8 is not yet PCG compatible

86ded54

Minor fix for DeepSeek

c8cb1e1

Merge branch 'main' into agent-flashinfer-mxfp8-moe

4e3f31e

zianglih mentioned this pull request Mar 9, 2026

[FlashInfer v0.6.6][RL] Support fp8-last-n-bf16 RL for flashinfer_trtllm_routed moe backend #20214

Merged

5 tasks

Fridge003 reviewed Mar 10, 2026

github-actions bot added the run-ci label Mar 10, 2026

zianglih added 6 commits March 9, 2026 19:30

Add docs

ae08f23

Lazy import block_scale_interleave

5f65dc2

Add comments

542775a

Refine docs

1febb3e

Rename for pytest compatibility

9cfe5c9

Merge branch 'main' into agent-flashinfer-mxfp8-moe

181c130

Fridge003 approved these changes Mar 10, 2026

Fridge003 merged commit 76ee4bb into sgl-project:main Mar 10, 2026
209 of 222 checks passed

alisonshao pushed a commit that referenced this pull request Mar 12, 2026

Revert "[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537)"

c8b677d

This reverts commit 76ee4bb.

alisonshao mentioned this pull request Mar 12, 2026

Fix CI failures from driver 570→575 upgrade on SCI H200 #20402

Closed

5 tasks

liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026

[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (sgl-project#19537)

c42de0e

trevor-m mentioned this pull request Mar 13, 2026

[NVIDIA] Enable fp8 flashinfer_trtllm_routed MoE for MiniMax-M2.5 #20394

Open

5 tasks

zianglih mentioned this pull request Mar 15, 2026

[Roadmap] Blackwell MXFP8 and NVFP4 RL training radixark/miles#615

Open

15 tasks

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (sgl-project#19537)

e306e0f

Fridge003 merged 24 commits into sgl-project:main from zianglih:agent-flashinfer-mxfp8-moe
