
Use dsv3 optimized routing fused_topk_deepseek instead of moe_fused_gate #15347

Merged
Fridge003 merged 8 commits into sgl-project:main from leejnau:fused_topk_routing
Jan 19, 2026

Conversation

leejnau (Collaborator) commented Dec 18, 2025

Motivation

flashinfer has an optimized routing kernel for DeepSeek V3: flashinfer-ai/flashinfer#2099
The API was renamed to fused_topk_deepseek here: flashinfer-ai/flashinfer#2181

Modifications

Replace the call to moe_fused_gate with fused_topk_deepseek.
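A rough sketch of the shape of this change (the helper name `resolve_routing_kernel` is illustrative, not the actual SGLang code): the optimized FlashInfer kernel is selected when the API is importable, and the caller falls back to `moe_fused_gate` otherwise.

```python
import importlib

# Illustrative only: SGLang's real dispatch lives in its TopK/routing code.
# This shows the "prefer fused_topk_deepseek when available" pattern.
def resolve_routing_kernel(module_name="flashinfer", attr="fused_topk_deepseek"):
    """Return the optimized kernel if importable, else None so the caller
    can fall back to moe_fused_gate."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(mod, attr, None)
```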

Accuracy Tests

Server Command:

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-V2 --tensor-parallel-size=8 --cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --ep-size 8 --enable-symm-mem --moe-runner-backend flashinfer_cutlass --quantization modelopt_fp4

Client Benchmark Command:

python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1316 \
  --parallel 1316

Before:

Accuracy: 0.954
Invalid: 0.000
Latency: 143.606 s
Output throughput: 1009.301 token/s

After:

Accuracy: 0.960
Invalid: 0.000
Latency: 70.387 s
Output throughput: 2060.780 token/s

Benchmarking and Profiling

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 2048 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 2048

Before:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  319.04    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2091930   
Request throughput (req/s):              6.42      
Input token throughput (tok/s):          6573.25   
Output token throughput (tok/s):         6573.25   
Peak output token throughput (tok/s):    9728.00   
Peak concurrent requests:                2048      
Total token throughput (tok/s):          13146.50  
Concurrency:                             1154.13   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   179793.11 
Median E2E Latency (ms):                 179816.14 
---------------Time to First Token----------------
Mean TTFT (ms):                          142579.01 
Median TTFT (ms):                        142477.90 
P99 TTFT (ms):                           282920.44 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.38     
Median TPOT (ms):                        36.24     
P99 TPOT (ms):                           38.56     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           36.38     
Median ITL (ms):                         34.66     
P95 ITL (ms):                            36.35     
P99 ITL (ms):                            39.03     
Max ITL (ms):                            3984.88   
==================================================

After:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  312.63    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2090916   
Request throughput (req/s):              6.55      
Input token throughput (tok/s):          6708.17   
Output token throughput (tok/s):         6708.17   
Peak output token throughput (tok/s):    8449.00   
Peak concurrent requests:                2048      
Total token throughput (tok/s):          13416.34  
Concurrency:                             1153.51   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   176083.48 
Median E2E Latency (ms):                 176110.58 
---------------Time to First Token----------------
Mean TTFT (ms):                          139678.72 
Median TTFT (ms):                        139542.14 
P99 TTFT (ms):                           277228.84 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.59     
Median TPOT (ms):                        35.50     
P99 TPOT (ms):                           37.81     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           35.59     
Median ITL (ms):                         33.88     
P95 ITL (ms):                            36.34     
P99 ITL (ms):                            38.77     
Max ITL (ms):                            3960.43   
==================================================

The kernel itself is approximately 4x faster.

Before:

moe_fused_gate

After:

fused_topk_deepseek


trevor-m (Collaborator) left a comment:

Thanks! I left some comments.

Can you also add a unit test that instantiates the TopK class so this new kernel is called, and verifies the output?
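A minimal shape for the requested check (hypothetical; the real test would instantiate the TopK class so the fused_topk_deepseek path is exercised): a naive reference top-k to compare the kernel's output against.

```python
def reference_topk(scores, k):
    """Naive reference: indices of the k largest scores, plus weights
    normalized over the selected experts (the usual renormalize case)."""
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in idx)
    return idx, [scores[i] / total for i in idx]
```

In the actual test, the kernel's selected expert indices and weights would be asserted close to this reference for random gating scores.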

@Fridge003 mentioned this pull request on Dec 21, 2025
trevor-m (Collaborator) left a comment:

Thanks, looks good!

Can you run pre-commit to fix the linting? https://docs.sglang.io/developer_guide/contribution_guide.html#format-code-with-pre-commit

leejnau (Author) commented Jan 5, 2026

@Fridge003 Is it ok to merge?

Fridge003 (Collaborator):

@leejnau Does this PR depend on a FlashInfer version newer than 0.5.3?

leejnau (Author) commented Jan 7, 2026

> @leejnau Does this PR depend on a FlashInfer version newer than 0.5.3?

The optimized path (fused_topk_deepseek) relies on FlashInfer release v0.6.0rc1:
https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.0rc1

However, the runtime check for this API's availability means we should be able to merge without waiting for that release. If you'd prefer to wait for v0.6.0rc1, that's fine too.
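The version gate being discussed can be sketched like this (the parser and function names are hypothetical; in practice the code checks whether the API is importable at runtime, and `packaging.version` would be the robust way to compare versions):

```python
import re

def _parse(v):
    """Parse a simple version like '0.6.0rc1' (assumes well-formed input).
    Final releases sort after their pre-releases ('z' > 'rc' > 'b' > 'a')."""
    m = re.match(r"(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$", v)
    nums = tuple(int(x) for x in m.group(1).split("."))
    pre = (m.group(2), int(m.group(3))) if m.group(2) else ("z", 0)
    return nums, pre

def flashinfer_supports_fused_topk(installed):
    # 0.6.0rc1 is the first release shipping fused_topk_deepseek.
    return _parse(installed) >= _parse("0.6.0rc1")
```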

@leejnau leejnau force-pushed the fused_topk_routing branch from 3981815 to 21e4828 on January 7, 2026 at 22:57
Fridge003 (Collaborator):

@leejnau We need to wait for the 0.6.0 update; by then this PR can be thoroughly tested.

Fridge003 added a commit that referenced this pull request Jan 12, 2026
Fridge003 added a commit that referenced this pull request Jan 12, 2026
Fridge003 (Collaborator) commented Jan 12, 2026

@leejnau Can you please test this PR on DeepSeek R1/V3.2 with the GPQA dataset, using this command:

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode deepseek-v3

I just tested GPQA on DeepSeek FP4 with GB200 and the accuracy dropped by about 1%. Not sure whether that's just fluctuation.

leejnau (Author) commented Jan 13, 2026

> @leejnau Can you please test this PR on DeepSeek R1/V3.2 with the GPQA dataset, using this command:
>
> python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode deepseek-v3
>
> I just tested GPQA on DeepSeek FP4 with GB200 and the accuracy dropped by about 1%. Not sure whether that's just fluctuation.

I ran with this server command:

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-V2 --tensor-parallel-size=4 --cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --ep-size 4 --enable-symm-mem --moe-runner-backend flashinfer_cutlass --quantization modelopt_fp4

and the client command you requested above. I can reproduce the accuracy drop with fused_topk_deepseek:

commit 21e4828f0eec47eabb6b966bc4050771e6316c37 (HEAD -> fused_topk_routing, origin/fused_topk_routing)
"mean_score": 0.7651515151515151
commit 6037267f5bc14105965d00ecd712e8aa5e7aaff4 (HEAD -> main, origin/main, origin/HEAD)
"mean_score": 0.7916666666666667

Fridge003 added a commit that referenced this pull request Jan 14, 2026
yzh119 (Collaborator) commented Jan 15, 2026

Might be relevant to flashinfer-ai/flashinfer#2325 which is fixed in flashinfer v0.6.1.

Fridge003 (Collaborator):

Just tested again with v0.6.1, and the accuracy was restored to the expected number. @leejnau @yzh119

Fridge003 added a commit that referenced this pull request Jan 16, 2026
Fridge003 (Collaborator):

Waiting for upgrade of flashinfer

Fridge003 (Collaborator) commented Jan 18, 2026

@leejnau The accept length in this test drops; please take a look:
https://github.com/sgl-project/sglang/actions/runs/21086349532/job/60703667670?pr=15347

We can relax the accept-length threshold to 2.8, since prior runs hit around 2.92; 2.9 might be too tight for the CI:
https://github.com/sgl-project/sglang/blob/main/test/srt/test_deepseek_v3_mtp.py#L86
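The relaxation amounts to a one-line threshold change in the test; the sketch below shows the shape of the check and is illustrative, not the actual file contents.

```python
# Illustrative only: the real assertion lives in
# test/srt/test_deepseek_v3_mtp.py. Prior runs measured ~2.92, so 2.8
# leaves headroom against CI noise while still catching regressions.
ACCEPT_LENGTH_THRESHOLD = 2.8  # relaxed from 2.9

def accept_length_ok(measured):
    return measured >= ACCEPT_LENGTH_THRESHOLD
```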

Fridge003 (Collaborator):

/rerun-stage unit-test-backend-8-gpu-h200

github-actions (bot):

✅ Triggered unit-test-backend-8-gpu-h200 to run independently (skipping dependencies).

github-actions (bot):

🔗 View workflow run

@Fridge003 Fridge003 merged commit 84c8390 into sgl-project:main Jan 19, 2026
97 of 105 checks passed
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request Jan 19, 2026
* fix(ci): recover from corrupted MMMU parquet cache (sgl-project#17256)

* [diffusion] feat: support default 4-step inference for Flux2-Klein distilled models (sgl-project#17225)

Signed-off-by: Lancer <maruixiang6688@gmail.com>

* Add runner utilization report workflow (sgl-project#17234)

* cli: support sglang version (sgl-project#17250)

* Use swa radix cache and memory pool for gpt-oss model (sgl-project#17261)

* [VLM][Reland] Refactor load_mm_data to improve performance (sgl-project#16152)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [Tiny] Improve docs (sgl-project#17264)

* [diffusion] fix: set guidance_scale default to None (sgl-project#17182)

* Tiny fix comment typo (sgl-project#17287)

* [SPEC_V2] Enable cudagraph draft_extend for trtllm_mla_backend and Acclen Fix for DP under cudagraph mode (sgl-project#16974)

* Add kl test for swa radix cache (sgl-project#17281)

* fix: Handle multiple named chat templates in HuggingFace tokenizers (sgl-project#17236)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>

* Move radix cache related tests (sgl-project#17295)

* [Refactor] Add `-fp4-gemm-backend` to replace `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (sgl-project#16534)

Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>

* [Bugfix] Fix PD accuracy when MTP is not configured on the prefill node (sgl-project#17212)

Co-authored-by: Shangming Cai <csmthu@gmail.com>

* [Diffusion] Apply jit qk_norm to flux1 (sgl-project#17296)

* [Refactor] Split out deepseek v2 weight loader function into mixin (sgl-project#16649)

* [NPU]Support GPT-OSS for NPU (sgl-project#14197)

* [jit-kernel] Add CuTe DSL GDN Decode Kernel (sgl-project#15631)

Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>

* [GLM 4.7] Add RTX 6000 Pro aka sm120 (sgl-project#17235)

Co-authored-by: root <root@ubuntu-nvidia.localdomain>

* Update CODEOWNERS for multimodal_gen (sgl-project#17308)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [Feature] overlap LoRA weight loading with compute (sgl-project#15512)

* [PD] Optimize MHA models pp util calculation logic (sgl-project#17306)

* [Minor] Correct sglang version when installing from source (sgl-project#17315)

* Use dsv3 optimized routing `fused_topk_deepseek` instead of `moe_fused_gate` (sgl-project#15347)

* [DeepSeek v3.2] Opt MTP decode cuda batch sizes and nsa implementation (sgl-project#16961)

* Update code sync scripts (sgl-project#17319)

* [Auto Sync] Update tokenizer_manager.py (20260119) (sgl-project#17317)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* support new qwen3_coder_detector (sgl-project#16744)

Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>

* Fix kernel selection in biased_grouped_topk_gpu (sgl-project#17325)

* KV Cache Events with Attention DP bug fix (sgl-project#16030) (sgl-project#16412)

* [Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)

Co-authored-by: Minglei Zhu <zminglei@linkedin.com>

* [CI] Add partition to stage-b-test-large-1-gpu (11->12) (sgl-project#17245)

* fix(ci): rate limit and permission errors in trace publishing (sgl-project#17238)

* Revert "[Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)" (sgl-project#17332)

* Migrate performance, accuracy, and quantization tests to CI registry (sgl-project#17177)

Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>

* Inclusion of nvfp4 blockscale in EPLB Rebalance (sgl-project#17158)

* [Refactor] Set `fp4-gemm-backend=auto` on SM100 and rename `fp4-gemm-backend` with `flashinfer_` prefix (sgl-project#17309)

* [Diffusion] Apply qknorm to flux2 and apply lightx2v rms_norm_one_pass kernel(without residual) (sgl-project#17305)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix v32 continue_final_message not work (sgl-project#16567)

* Evict swa kv cache during decoding (sgl-project#17220)

* [RadixTree][1/N Refactor]: Support unified match_prefix params (sgl-project#17142)

Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>

* [AMD CI] Migrate and Add More Testcases (sgl-project#17116)

Co-authored-by: yctseng0211 <yctseng@amd.com>

* [AMD] CI - add partitions for stage-b-test-small-1-gpu-amd (sgl-project#17345)

* Restore deepseek_v2.py to main's code, except the utils

* Ran `pre-commit`

---------

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Hudson Xing <1277646412@qq.com>
Co-authored-by: Lancer <402430575@qq.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
Co-authored-by: Changyi Yang <112288487+ChangyiYang@users.noreply.github.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Ch3ngY1 <91232537+Ch3ngY1@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Jerry Ji <jerryjilol@gmail.com>
Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Koushik Dutta <koush@koushikdutta.com>
Co-authored-by: root <root@ubuntu-nvidia.localdomain>
Co-authored-by: Glen Liu <62917497+glenliu21@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Lee Nau <lnau@nvidia.com>
Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Gaoji Liu <34803073+attack204@users.noreply.github.com>
Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: Kartik Ramesh <kartikx2000@gmail.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Minglei Zhu <zminglei@linkedin.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: zhangheng <hzh0425@apache.org>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>