
[RL] Support FlashInfer per-token NVFP4 MoE#22918

Open
zianglih wants to merge 4 commits into sgl-project:main from zianglih:per-token-fp4

Conversation


@zianglih zianglih commented Apr 16, 2026

Motivation

@HumansAnd

FlashInfer PR:

Modifications

  • When SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE is true, ignore the fp32 input activation scale in the checkpoint and use the online per-token path instead
  • Expand test coverage
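
The per-token path can be sketched roughly as follows. This is a minimal illustration, assuming "per-token" means the global FP32 activation scale is derived online from each token's absolute maximum (using NVFP4's E2M1 value range and E4M3 block scales) rather than read from the checkpoint; the function and constant names are hypothetical, not the actual SGLang implementation:

```python
import numpy as np

FP4_MAX = 6.0          # largest magnitude representable in FP4 (E2M1)
FP8_SCALE_MAX = 448.0  # largest magnitude of an E4M3 block scale

def per_token_global_scale(x: np.ndarray) -> np.ndarray:
    """One global scale per token (row), computed online from that
    token's abs-max instead of a static checkpoint input scale."""
    amax = np.abs(x).max(axis=-1, keepdims=True)  # shape [tokens, 1]
    # Choose the scale so each token's abs-max maps onto the largest
    # value expressible as (block_scale * fp4_value).
    return (FP4_MAX * FP8_SCALE_MAX) / np.maximum(amax, 1e-12)
```

The intuition: with a single static scale, one outlier token forces clipping or wasted dynamic range for every other token; computing the scale per token trades a small online reduction for a tighter fit per token.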

Accuracy Tests

Baseline


python -m sglang.launch_server \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --disable-flashinfer-autotune \
  --kv-cache-dtype bf16 \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.923
Invalid: 0.000
Latency: 10.510 s
Output throughput: 14359.210 token/s
Accuracy: 0.927
Invalid: 0.000
Latency: 9.423 s
Output throughput: 15836.511 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "nvidia/Qwen3-30B-A3B-NVFP4",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.926
Invalid: 0.000
Latency: 9.313 s
Output throughput: 16046.278 token/s
Accuracy: 0.935
Invalid: 0.000
Latency: 9.254 s
Output throughput: 16089.310 token/s

With per-token NVFP4


SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE=1 \
python -m sglang.launch_server \
  --kv-cache-dtype bf16 \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.939
Invalid: 0.000
Latency: 9.873 s
Output throughput: 14751.988 token/s
Accuracy: 0.940
Invalid: 0.000
Latency: 9.291 s
Output throughput: 15633.533 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "nvidia/Qwen3-30B-A3B-NVFP4",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.940
Invalid: 0.000
Latency: 9.305 s
Output throughput: 15661.988 token/s
Accuracy: 0.939
Invalid: 0.000
Latency: 9.198 s
Output throughput: 15764.530 token/s
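
For reference, the environment-variable gate used in the launch command above can be parsed like this. This is a hypothetical sketch of how such a boolean flag is typically read, not the PR's actual code:

```python
import os

def per_token_nvfp4_enabled() -> bool:
    # When the flag is truthy, the checkpoint's fp32 input activation
    # scale is ignored and per-token scales are computed online.
    value = os.environ.get("SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE", "false")
    return value.strip().lower() in ("1", "true", "yes", "on")
```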

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@github-actions github-actions bot added documentation Improvements or additions to documentation quant LLM Quantization labels Apr 16, 2026
@zianglih zianglih changed the title Support FlashInfer per-token NVFP4 MoE [RL] Support FlashInfer per-token NVFP4 MoE Apr 18, 2026
