
Conversation


@wenscarl wenscarl commented Oct 30, 2025

Motivation

FlashInfer introduced (#1927) two nvfp4 quantization APIs,
silu_and_mul_scaled_nvfp4_experts_quantize and scaled_nvfp4_grouped_quantize, which are equivalent in implementation to sglang's silu_and_mul_scaled_fp4_grouped_quant and scaled_fp4_grouped_quant, respectively. This PR replaces the sglang kernels with the FlashInfer ones to simplify future maintenance.
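For context, here is a rough NumPy sketch of what a fused silu_and_mul + block-scaled FP4 (E2M1) quantization computes. This is not the actual FlashInfer or sglang kernel interface; the block size of 16 and the float-valued scale handling are illustrative assumptions (the real kernels emit packed 4-bit codes plus fp8 block scales):

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the element format used by nvfp4.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def silu(x):
    return x / (1.0 + np.exp(-x))


def silu_and_mul_fp4_quant_ref(x, block_size=16):
    """Reference for silu_and_mul followed by block-scaled FP4 quantization.

    x: (..., 2*d) array; silu of the first half gates the second half.
    Returns the dequantized result and per-block scales, kept in float
    for clarity rather than packed 4-bit codes.
    """
    d = x.shape[-1] // 2
    y = silu(x[..., :d]) * x[..., d:]
    blocks = y.reshape(-1, block_size)
    # One scale per block so the block max maps to the FP4 max magnitude (6.0).
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)
    scaled = blocks / scales
    # Round each scaled value to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_VALUES[idx]
    return (q * scales).reshape(y.shape), scales
```

The grouped/experts variants in the PR apply the same quantization per expert group; this sketch only shows the per-block math.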

Modifications

Accuracy Tests

SGLANG_DEEPEP_BF16_DISPATCH=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 \
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH=0 \
python3 -m sglang.launch_server \
  --model-path $model_str \
  --trust-remote-code \
  --disable-radix-cache \
  --max-running-requests 256 \
  --chunked-prefill-size 1024 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 16384 \
  --tp 4 \
  --ep 4 \
  --dp 4 \
  --enable-dp-attention \
  --attention-backend trtllm_mla \
  --moe-dense-tp-size 1 \
  --quantization modelopt_fp4 \
  --moe-a2a-backend deepep \
  --deepep-mode low_latency \
  --moe-runner-backend flashinfer_cutedsl

python3 benchmark/gsm8k/bench_sglang.py \
  --num-questions 256 \
  --parallel 32 \
  --num-shots 8

This PR:
Accuracy: 0.984
Invalid: 0.000
Latency: 92.829 s
Output throughput: 312.521 token/s

Before this PR:

Accuracy: 0.980
Invalid: 0.000
Latency: 93.843 s
Output throughput: 291.104 token/s

Benchmarking and Profiling

perf

python3 -m sglang.bench_serving \
  --model nvidia/DeepSeek-R1-0528-FP4 \
  --dataset-name random \
  --backend sglang-oai \
  --random-range-ratio 1 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 256 \
  --num-prompts 512 \
  --base-url http://127.0.0.1:30000

This PR:


============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 256
Successful requests:                     512
Benchmark duration (s):                  125.52
Total input tokens:                      524288
Total input text tokens:                 524288
Total input vision tokens:               0
Total generated tokens:                  524288
Total generated tokens (retokenized):    522729
Request throughput (req/s):              4.08
Input token throughput (tok/s):          4176.93
Output token throughput (tok/s):         4176.93
Total token throughput (tok/s):          8353.85
Concurrency:                             255.85
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   62722.91
Median E2E Latency (ms):                 62699.70
---------------Time to First Token----------------
Mean TTFT (ms):                          14263.65
Median TTFT (ms):                        14195.89
P99 TTFT (ms):                           27870.53
---------------Inter-Token Latency----------------
Mean ITL (ms):                           47.55
Median ITL (ms):                         33.93
P95 ITL (ms):                            37.15
P99 ITL (ms):                            38.94
Max ITL (ms):                            27735.41
==================================================

Before this PR:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 256
Successful requests:                     512
Benchmark duration (s):                  124.88
Total input tokens:                      524288
Total input text tokens:                 524288
Total input vision tokens:               0
Total generated tokens:                  524288
Total generated tokens (retokenized):    523035
Request throughput (req/s):              4.10
Input token throughput (tok/s):          4198.30
Output token throughput (tok/s):         4198.30
Total token throughput (tok/s):          8396.60
Concurrency:                             255.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   62410.39
Median E2E Latency (ms):                 62400.90
---------------Time to First Token----------------
Mean TTFT (ms):                          13982.02
Median TTFT (ms):                        13977.48
P99 TTFT (ms):                           27522.50
---------------Inter-Token Latency----------------
Mean ITL (ms):                           47.49
Median ITL (ms):                         33.99
P95 ITL (ms):                            36.05
P99 ITL (ms):                            37.72
Max ITL (ms):                            27403.35
==================================================

Checklist

@wenscarl wenscarl marked this pull request as ready for review October 30, 2025 15:47
@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch 3 times, most recently from 3cac7f2 to 26f6bc6 Compare November 5, 2025 15:23
@wenscarl wenscarl requested a review from Fridge003 November 5, 2025 15:35
@Fridge003

@wenscarl Could you please post the results of the bench/test files you modified?

@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch from 26f6bc6 to 5ac39d0 Compare November 10, 2025 15:39
@github-actions github-actions bot added documentation Improvements or additions to documentation performance quant LLM Quantization sgl-kernel labels Nov 10, 2025
@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch from 5ac39d0 to 8193fa3 Compare November 10, 2025 15:42
@wenscarl

@Fridge003 the numbers are posted.

@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch from 56bab86 to f111861 Compare November 12, 2025 16:27
@wenscarl wenscarl requested a review from Fridge003 November 12, 2025 20:33
@Fridge003 Fridge003 merged commit 6664083 into sgl-project:main Nov 13, 2025
193 of 212 checks passed