
[RL] Support FlashInfer per-token NVFP4 MoE#22918

Open
zianglih wants to merge 4 commits into sgl-project:main from zianglih:per-token-fp4

Conversation


@zianglih zianglih commented Apr 16, 2026

Motivation

@HumansAnd

FlashInfer PR:

Modifications

  • When SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE is true, ignore the fp32 input activation scale in the checkpoint and use the online per-token path instead
  • Expand test coverage
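
The per-token path can be sketched roughly as follows. This is a minimal illustration, assuming "per-token" means the global FP32 activation scale is derived online from each token's absolute maximum (using NVFP4's E2M1 value range and E4M3 block scales) rather than read from the checkpoint; the function and constant names are hypothetical, not the actual SGLang implementation:

```python
import numpy as np

FP4_MAX = 6.0          # largest magnitude representable in FP4 (E2M1)
FP8_SCALE_MAX = 448.0  # largest magnitude of an E4M3 block scale

def per_token_global_scale(x: np.ndarray) -> np.ndarray:
    """One global scale per token (row), computed online from that
    token's abs-max instead of a static checkpoint input scale."""
    amax = np.abs(x).max(axis=-1, keepdims=True)  # shape [tokens, 1]
    # Choose the scale so each token's abs-max maps onto the largest
    # value expressible as (block_scale * fp4_value).
    return (FP4_MAX * FP8_SCALE_MAX) / np.maximum(amax, 1e-12)
```

The intuition: with a single static scale, one outlier token forces clipping or wasted dynamic range for every other token; computing the scale per token trades a small online reduction for a tighter fit per token.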

Accuracy Tests

Baseline


python -m sglang.launch_server \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --disable-flashinfer-autotune \
  --kv-cache-dtype bf16 \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.923
Invalid: 0.000
Latency: 10.510 s
Output throughput: 14359.210 token/s
Accuracy: 0.927
Invalid: 0.000
Latency: 9.423 s
Output throughput: 15836.511 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "nvidia/Qwen3-30B-A3B-NVFP4",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.926
Invalid: 0.000
Latency: 9.313 s
Output throughput: 16046.278 token/s
Accuracy: 0.935
Invalid: 0.000
Latency: 9.254 s
Output throughput: 16089.310 token/s

With per-token NVFP4


SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE=1 \
python -m sglang.launch_server \
  --kv-cache-dtype bf16 \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.939
Invalid: 0.000
Latency: 9.873 s
Output throughput: 14751.988 token/s
Accuracy: 0.940
Invalid: 0.000
Latency: 9.291 s
Output throughput: 15633.533 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "nvidia/Qwen3-30B-A3B-NVFP4",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.940
Invalid: 0.000
Latency: 9.305 s
Output throughput: 15661.988 token/s
Accuracy: 0.939
Invalid: 0.000
Latency: 9.198 s
Output throughput: 15764.530 token/s
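
For reference, the environment-variable gate used in the launch command above can be parsed like this. This is a hypothetical sketch of how such a boolean flag is typically read, not the PR's actual code:

```python
import os

def per_token_nvfp4_enabled() -> bool:
    # When the flag is truthy, the checkpoint's fp32 input activation
    # scale is ignored and per-token scales are computed online.
    value = os.environ.get("SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE", "false")
    return value.strip().lower() in ("1", "true", "yes", "on")
```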

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@github-actions github-actions bot added documentation Improvements or additions to documentation quant LLM Quantization labels Apr 16, 2026
@zianglih zianglih changed the title Support FlashInfer per-token NVFP4 MoE [RL] Support FlashInfer per-token NVFP4 MoE Apr 18, 2026
