
Use dsv3 optimized routing fused_topk_deepseek instead of moe_fused_gate #15347

Merged
Fridge003 merged 8 commits into sgl-project:main from leejnau:fused_topk_routing
Jan 19, 2026

Conversation

leejnau (Collaborator) commented Dec 18, 2025

Motivation

flashinfer has an optimized routing kernel for DeepSeek V3: flashinfer-ai/flashinfer#2099
The API was renamed to fused_topk_deepseek here: flashinfer-ai/flashinfer#2181

Modifications

Replace the call to moe_fused_gate with fused_topk_deepseek.
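A rough sketch of the shape of this change (the helper name `resolve_routing_kernel` is illustrative, not the actual SGLang code): the optimized FlashInfer kernel is selected when the API is importable, and the caller falls back to `moe_fused_gate` otherwise.

```python
import importlib

# Illustrative only: SGLang's real dispatch lives in its TopK/routing code.
# This shows the "prefer fused_topk_deepseek when available" pattern.
def resolve_routing_kernel(module_name="flashinfer", attr="fused_topk_deepseek"):
    """Return the optimized kernel if importable, else None so the caller
    can fall back to moe_fused_gate."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(mod, attr, None)
```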

Accuracy Tests

Server Command:

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-V2 --tensor-parallel-size=8 --cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --ep-size 8 --enable-symm-mem --moe-runner-backend flashinfer_cutlass --quantization modelopt_fp4

Client Benchmark Command:

python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1316 \
  --parallel 1316

Before:

Accuracy: 0.954
Invalid: 0.000
Latency: 143.606 s
Output throughput: 1009.301 token/s

After:

Accuracy: 0.960
Invalid: 0.000
Latency: 70.387 s
Output throughput: 2060.780 token/s

Benchmarking and Profiling

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 2048 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 2048

Before:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  319.04    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2091930   
Request throughput (req/s):              6.42      
Input token throughput (tok/s):          6573.25   
Output token throughput (tok/s):         6573.25   
Peak output token throughput (tok/s):    9728.00   
Peak concurrent requests:                2048      
Total token throughput (tok/s):          13146.50  
Concurrency:                             1154.13   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   179793.11 
Median E2E Latency (ms):                 179816.14 
---------------Time to First Token----------------
Mean TTFT (ms):                          142579.01 
Median TTFT (ms):                        142477.90 
P99 TTFT (ms):                           282920.44 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.38     
Median TPOT (ms):                        36.24     
P99 TPOT (ms):                           38.56     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           36.38     
Median ITL (ms):                         34.66     
P95 ITL (ms):                            36.35     
P99 ITL (ms):                            39.03     
Max ITL (ms):                            3984.88   
==================================================

After:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 2048      
Successful requests:                     2048      
Benchmark duration (s):                  312.63    
Total input tokens:                      2097152   
Total input text tokens:                 2097152   
Total input vision tokens:               0         
Total generated tokens:                  2097152   
Total generated tokens (retokenized):    2090916   
Request throughput (req/s):              6.55      
Input token throughput (tok/s):          6708.17   
Output token throughput (tok/s):         6708.17   
Peak output token throughput (tok/s):    8449.00   
Peak concurrent requests:                2048      
Total token throughput (tok/s):          13416.34  
Concurrency:                             1153.51   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   176083.48 
Median E2E Latency (ms):                 176110.58 
---------------Time to First Token----------------
Mean TTFT (ms):                          139678.72 
Median TTFT (ms):                        139542.14 
P99 TTFT (ms):                           277228.84 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.59     
Median TPOT (ms):                        35.50     
P99 TPOT (ms):                           37.81     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           35.59     
Median ITL (ms):                         33.88     
P95 ITL (ms):                            36.34     
P99 ITL (ms):                            38.77     
Max ITL (ms):                            3960.43   
==================================================

The kernel itself is approximately 4x faster.

Before:

moe_fused_gate

After:

fused_topk_deepseek


trevor-m (Collaborator) left a comment:

Thanks! I left some comments.

Can you also add a unit test that instantiates the TopK class so this new kernel is called, and verifies the output?
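A minimal shape for the requested check (hypothetical; the real test would instantiate the TopK class so the fused_topk_deepseek path is exercised): a naive reference top-k to compare the kernel's output against.

```python
def reference_topk(scores, k):
    """Naive reference: indices of the k largest scores, plus weights
    normalized over the selected experts (the usual renormalize case)."""
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in idx)
    return idx, [scores[i] / total for i in idx]
```

In the actual test, the kernel's selected expert indices and weights would be asserted close to this reference for random gating scores.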

@Fridge003 mentioned this pull request on Dec 21, 2025
trevor-m (Collaborator) left a comment:

Thanks, looks good!

Can you run pre-commit to fix the linting? https://docs.sglang.io/developer_guide/contribution_guide.html#format-code-with-pre-commit

leejnau (Author) commented Jan 5, 2026

@Fridge003 Is it ok to merge?

Fridge003 (Collaborator):

@leejnau Does this PR depend on a FlashInfer version newer than 0.5.3?

leejnau (Author) commented Jan 7, 2026

> @leejnau Does this PR depend on a FlashInfer version newer than 0.5.3?

The optimized path (fused_topk_deepseek) relies on FlashInfer release v0.6.0rc1:
https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.0rc1

However, the runtime check for this API's availability means we should be able to merge without waiting for that release. If you'd prefer to wait for v0.6.0rc1, that's fine too.
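The version gate being discussed can be sketched like this (the parser and function names are hypothetical; in practice the code checks whether the API is importable at runtime, and `packaging.version` would be the robust way to compare versions):

```python
import re

def _parse(v):
    """Parse a simple version like '0.6.0rc1' (assumes well-formed input).
    Final releases sort after their pre-releases ('z' > 'rc' > 'b' > 'a')."""
    m = re.match(r"(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$", v)
    nums = tuple(int(x) for x in m.group(1).split("."))
    pre = (m.group(2), int(m.group(3))) if m.group(2) else ("z", 0)
    return nums, pre

def flashinfer_supports_fused_topk(installed):
    # 0.6.0rc1 is the first release shipping fused_topk_deepseek.
    return _parse(installed) >= _parse("0.6.0rc1")
```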

@leejnau leejnau force-pushed the fused_topk_routing branch from 3981815 to 21e4828 on January 7, 2026 at 22:57
Fridge003 (Collaborator):

@leejnau We need to wait for the 0.6.0 update; by then this PR can be thoroughly tested.

Fridge003 added a commit that referenced this pull request Jan 12, 2026
Fridge003 added a commit that referenced this pull request Jan 12, 2026
Fridge003 (Collaborator) commented Jan 12, 2026

@leejnau Can you please test this PR on DeepSeek R1/V3.2 with the GPQA dataset, using this command:

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode deepseek-v3

I just tested GPQA on DeepSeek FP4 with GB200 and the accuracy dropped by about 1%. Not sure whether that's just fluctuation.

leejnau (Author) commented Jan 13, 2026

> @leejnau Can you please test this PR on DeepSeek R1/V3.2 with the GPQA dataset, using this command:
>
> python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode deepseek-v3
>
> I just tested GPQA on DeepSeek FP4 with GB200 and the accuracy dropped by about 1%. Not sure whether that's just fluctuation.

I ran with this server command:

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-V2 --tensor-parallel-size=4 --cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --ep-size 4 --enable-symm-mem --moe-runner-backend flashinfer_cutlass --quantization modelopt_fp4

and the client command you requested above. I can reproduce the accuracy drop with fused_topk_deepseek:

commit 21e4828f0eec47eabb6b966bc4050771e6316c37 (HEAD -> fused_topk_routing, origin/fused_topk_routing)
"mean_score": 0.7651515151515151
commit 6037267f5bc14105965d00ecd712e8aa5e7aaff4 (HEAD -> main, origin/main, origin/HEAD)
"mean_score": 0.7916666666666667

Fridge003 added a commit that referenced this pull request Jan 14, 2026
yzh119 (Collaborator) commented Jan 15, 2026

Might be relevant to flashinfer-ai/flashinfer#2325 which is fixed in flashinfer v0.6.1.

Fridge003 (Collaborator):

Just tested again with v0.6.1, and the accuracy was restored to the expected number. @leejnau @yzh119

Fridge003 added a commit that referenced this pull request Jan 16, 2026
Fridge003 (Collaborator):

Waiting for upgrade of flashinfer

Fridge003 (Collaborator) commented Jan 18, 2026

@leejnau The accept length in this test drops; please take a look:
https://github.com/sgl-project/sglang/actions/runs/21086349532/job/60703667670?pr=15347

We can relax the accept-length threshold to 2.8, since prior runs hit around 2.92; 2.9 might be too tight for the CI:
https://github.com/sgl-project/sglang/blob/main/test/srt/test_deepseek_v3_mtp.py#L86
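The relaxation amounts to a one-line threshold change in the test; the sketch below shows the shape of the check and is illustrative, not the actual file contents.

```python
# Illustrative only: the real assertion lives in
# test/srt/test_deepseek_v3_mtp.py. Prior runs measured ~2.92, so 2.8
# leaves headroom against CI noise while still catching regressions.
ACCEPT_LENGTH_THRESHOLD = 2.8  # relaxed from 2.9

def accept_length_ok(measured):
    return measured >= ACCEPT_LENGTH_THRESHOLD
```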

Fridge003 (Collaborator):

/rerun-stage unit-test-backend-8-gpu-h200

github-actions (bot):

✅ Triggered unit-test-backend-8-gpu-h200 to run independently (skipping dependencies).

github-actions (bot):

🔗 View workflow run

@Fridge003 Fridge003 merged commit 84c8390 into sgl-project:main Jan 19, 2026
97 of 105 checks passed
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request Jan 19, 2026
* fix(ci): recover from corrupted MMMU parquet cache (sgl-project#17256)

* [diffusion] feat: support default 4-step inference for Flux2-Klein distilled models (sgl-project#17225)

Signed-off-by: Lancer <maruixiang6688@gmail.com>

* Add runner utilization report workflow (sgl-project#17234)

* cli: support sglang version (sgl-project#17250)

* Use swa radix cache and memory pool for gpt-oss model (sgl-project#17261)

* [VLM][Reland] Refactor load_mm_data to improve performance (sgl-project#16152)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [Tiny] Improve docs (sgl-project#17264)

* [diffusion] fix: set guidance_scale default to None (sgl-project#17182)

* Tiny fix comment typo (sgl-project#17287)

* [SPEC_V2] Enable cudagraph draft_extend for trtllm_mla_backend and Acclen Fix for DP under cudagraph mode (sgl-project#16974)

* Add kl test for swa radix cache (sgl-project#17281)

* fix: Handle multiple named chat templates in HuggingFace tokenizers (sgl-project#17236)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>

* Move radix cache related tests (sgl-project#17295)

* [Refactor] Add `-fp4-gemm-backend` to replace `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (sgl-project#16534)

Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>

* [Bugfix] Fix PD accuracy when MTP is not configured on the prefill node (sgl-project#17212)

Co-authored-by: Shangming Cai <csmthu@gmail.com>

* [Diffusion] Apply jit qk_norm to flux1 (sgl-project#17296)

* [Refactor] Split out deepseek v2 weight loader function into mixin (sgl-project#16649)

* [NPU]Support GPT-OSS for NPU (sgl-project#14197)

* [jit-kernel] Add CuTe DSL GDN Decode Kernel (sgl-project#15631)

Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>

* [GLM 4.7] Add RTX 6000 Pro aka sm120 (sgl-project#17235)

Co-authored-by: root <root@ubuntu-nvidia.localdomain>

* Update CODEOWNERS for multimodal_gen (sgl-project#17308)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [Feature] overlap LoRA weight loading with compute (sgl-project#15512)

* [PD] Optimize MHA models pp util calculation logic (sgl-project#17306)

* [Minor] Correct sglang version when installing from source (sgl-project#17315)

* Use dsv3 optimized routing `fused_topk_deepseek` instead of `moe_fused_gate` (sgl-project#15347)

* [DeepSeek v3.2] Opt MTP decode cuda batch sizes and nsa implementation (sgl-project#16961)

* Update code sync scripts (sgl-project#17319)

* [Auto Sync] Update tokenizer_manager.py (20260119) (sgl-project#17317)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* support new qwen3_coder_detector (sgl-project#16744)

Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>

* Fix kernel selection in biased_grouped_topk_gpu (sgl-project#17325)

* KV Cache Events with Attention DP bug fix (sgl-project#16030) (sgl-project#16412)

* [Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)

Co-authored-by: Minglei Zhu <zminglei@linkedin.com>

* [CI] Add partition to stage-b-test-large-1-gpu (11->12) (sgl-project#17245)

* fix(ci): rate limit and permission errors in trace publishing (sgl-project#17238)

* Revert "[Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)" (sgl-project#17332)

* Migrate performance, accuracy, and quantization tests to CI registry (sgl-project#17177)

Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>

* Inclusion of nvfp4 blockscale in EPLB Rebalance (sgl-project#17158)

* [Refactor] Set `fp4-gemm-backend=auto` on SM100 and rename `fp4-gemm-backend` with `flashinfer_` prefix (sgl-project#17309)

* [Diffusion] Apply qknorm to flux2 and apply lightx2v rms_norm_one_pass kernel(without residual) (sgl-project#17305)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix v32 continue_final_message not work (sgl-project#16567)

* Evict swa kv cache during decoding (sgl-project#17220)

* [RadixTree][1/N Refactor]: Support unified match_prefix params (sgl-project#17142)

Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>

* [AMD CI] Migrate and Add More Testcases (sgl-project#17116)

Co-authored-by: yctseng0211 <yctseng@amd.com>

* [AMD] CI - add partitions for stage-b-test-small-1-gpu-amd (sgl-project#17345)

* Restore deepseek_v2.py to main's code, except the utils

* Ran `pre-commit`

---------

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Hudson Xing <1277646412@qq.com>
Co-authored-by: Lancer <402430575@qq.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
Co-authored-by: Changyi Yang <112288487+ChangyiYang@users.noreply.github.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Ch3ngY1 <91232537+Ch3ngY1@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Jerry Ji <jerryjilol@gmail.com>
Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Koushik Dutta <koush@koushikdutta.com>
Co-authored-by: root <root@ubuntu-nvidia.localdomain>
Co-authored-by: Glen Liu <62917497+glenliu21@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Lee Nau <lnau@nvidia.com>
Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Gaoji Liu <34803073+attack204@users.noreply.github.com>
Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: Kartik Ramesh <kartikx2000@gmail.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Minglei Zhu <zminglei@linkedin.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: zhangheng <hzh0425@apache.org>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>