
Conversation


@wenscarl wenscarl commented Oct 30, 2025

Motivation

FlashInfer introduced (#1927) two nvfp4 quantization APIs,
silu_and_mul_scaled_nvfp4_experts_quantize and scaled_nvfp4_grouped_quantize, which are equivalent in implementation to sglang's silu_and_mul_scaled_fp4_grouped_quant and scaled_fp4_grouped_quant, respectively. This PR replaces the sglang kernels with the FlashInfer ones to simplify future maintenance.
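For context, here is a rough NumPy sketch of what a fused silu_and_mul + block-scaled FP4 (E2M1) quantization computes. This is not the actual FlashInfer or sglang kernel interface; the block size of 16 and the float-valued scale handling are illustrative assumptions (the real kernels emit packed 4-bit codes plus fp8 block scales):

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the element format used by nvfp4.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def silu(x):
    return x / (1.0 + np.exp(-x))


def silu_and_mul_fp4_quant_ref(x, block_size=16):
    """Reference for silu_and_mul followed by block-scaled FP4 quantization.

    x: (..., 2*d) array; silu of the first half gates the second half.
    Returns the dequantized result and per-block scales, kept in float
    for clarity rather than packed 4-bit codes.
    """
    d = x.shape[-1] // 2
    y = silu(x[..., :d]) * x[..., d:]
    blocks = y.reshape(-1, block_size)
    # One scale per block so the block max maps to the FP4 max magnitude (6.0).
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)
    scaled = blocks / scales
    # Round each scaled value to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_VALUES[idx]
    return (q * scales).reshape(y.shape), scales
```

The grouped/experts variants in the PR apply the same quantization per expert group; this sketch only shows the per-block math.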

Modifications

Accuracy Tests

SGLANG_DEEPEP_BF16_DISPATCH=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 \
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH=0 \
python3 -m sglang.launch_server \
  --model-path $model_str \
  --trust-remote-code \
  --disable-radix-cache \
  --max-running-requests 256 \
  --chunked-prefill-size 1024 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 16384 \
  --tp 4 \
  --ep 4 \
  --dp 4 \
  --enable-dp-attention \
  --attention-backend trtllm_mla \
  --moe-dense-tp-size 1 \
  --quantization modelopt_fp4 \
  --moe-a2a-backend deepep \
  --deepep-mode low_latency \
  --moe-runner-backend flashinfer_cutedsl

python3 benchmark/gsm8k/bench_sglang.py \
  --num-questions 256 \
  --parallel 32 \
  --num-shots 8

This PR:
Accuracy: 0.984
Invalid: 0.000
Latency: 92.829 s
Output throughput: 312.521 token/s

Before this PR:

Accuracy: 0.980
Invalid: 0.000
Latency: 93.843 s
Output throughput: 291.104 token/s

Benchmarking and Profiling

perf

python3 -m sglang.bench_serving \
  --model nvidia/DeepSeek-R1-0528-FP4 \
  --dataset-name random \
  --backend sglang-oai \
  --random-range-ratio 1 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 256 \
  --num-prompts 512 \
  --base-url http://127.0.0.1:30000

This PR:


============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 256
Successful requests:                     512
Benchmark duration (s):                  125.52
Total input tokens:                      524288
Total input text tokens:                 524288
Total input vision tokens:               0
Total generated tokens:                  524288
Total generated tokens (retokenized):    522729
Request throughput (req/s):              4.08
Input token throughput (tok/s):          4176.93
Output token throughput (tok/s):         4176.93
Total token throughput (tok/s):          8353.85
Concurrency:                             255.85
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   62722.91
Median E2E Latency (ms):                 62699.70
---------------Time to First Token----------------
Mean TTFT (ms):                          14263.65
Median TTFT (ms):                        14195.89
P99 TTFT (ms):                           27870.53
---------------Inter-Token Latency----------------
Mean ITL (ms):                           47.55
Median ITL (ms):                         33.93
P95 ITL (ms):                            37.15
P99 ITL (ms):                            38.94
Max ITL (ms):                            27735.41
==================================================

Before this PR:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 256
Successful requests:                     512
Benchmark duration (s):                  124.88
Total input tokens:                      524288
Total input text tokens:                 524288
Total input vision tokens:               0
Total generated tokens:                  524288
Total generated tokens (retokenized):    523035
Request throughput (req/s):              4.10
Input token throughput (tok/s):          4198.30
Output token throughput (tok/s):         4198.30
Total token throughput (tok/s):          8396.60
Concurrency:                             255.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   62410.39
Median E2E Latency (ms):                 62400.90
---------------Time to First Token----------------
Mean TTFT (ms):                          13982.02
Median TTFT (ms):                        13977.48
P99 TTFT (ms):                           27522.50
---------------Inter-Token Latency----------------
Mean ITL (ms):                           47.49
Median ITL (ms):                         33.99
P95 ITL (ms):                            36.05
P99 ITL (ms):                            37.72
Max ITL (ms):                            27403.35
==================================================

Checklist

@wenscarl wenscarl marked this pull request as ready for review October 30, 2025 15:47
@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch 3 times, most recently from 3cac7f2 to 26f6bc6 Compare November 5, 2025 15:23
@wenscarl wenscarl requested a review from Fridge003 November 5, 2025 15:35
@Fridge003

@wenscarl Could you please post the results of the bench/test files you modified?

@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch from 26f6bc6 to 5ac39d0 Compare November 10, 2025 15:39
@github-actions github-actions bot added documentation Improvements or additions to documentation performance quant LLM Quantization sgl-kernel labels Nov 10, 2025
@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch from 5ac39d0 to 8193fa3 Compare November 10, 2025 15:42
@wenscarl

@Fridge003 the numbers are posted.

@wenscarl wenscarl force-pushed the scaled_fp4_expert_quant_replace branch from 56bab86 to f111861 Compare November 12, 2025 16:27
@wenscarl wenscarl requested a review from Fridge003 November 12, 2025 20:33
@Fridge003 Fridge003 merged commit 6664083 into sgl-project:main Nov 13, 2025
193 of 212 checks passed