[DeepSeek v3.2] Opt MTP decode cuda batch sizes and nsa implementation by xu-yfei · Pull Request #16961 · sgl-project/sglang

xu-yfei · 2026-01-12T13:28:32Z

Motivation

1. draft_extend and target_verify use nsa_decode_backend instead of nsa_prefill_backend as the backend implementation
In the MTP scenario, prefill, draft_extend, and target_verify all use the NSA attention implementation selected by nsa_prefill_backend, which is typically set to flashmla_sparse. draft_extend and target_verify process only a small number of tokens, and for them we observed that fa3 performs better at small batch sizes (batch sizes 1 and 2 in the TPOT table below). However, when flashmla_sparse is chosen for prefill, there is no interface to specify a different NSA attention implementation for draft_extend and target_verify. This PR switches the backend implementation of draft_extend and target_verify to nsa_decode_backend. In practice, their CUDA Graph initialization and replay already use nsa_decode_impl.

num_draft_tokens=4, TPOT:

batch size	flashmla_sparse	fa3
1	11.87	9.44
2	13.34	13.08
8	15.25	15.87
16	21.25	21.50
32	32.94	33.79
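
The backend choice described in item 1 can be sketched as follows. This is a minimal illustration, not the code in nsa_backend.py: the `Mode` enum and `select_nsa_impl` are hypothetical names introduced here, and the backend names are taken from the description above.

```python
from enum import Enum, auto


class Mode(Enum):
    # Hypothetical stand-in for the forward modes named in this PR.
    PREFILL = auto()
    DECODE = auto()
    DRAFT_EXTEND = auto()
    TARGET_VERIFY = auto()


def select_nsa_impl(mode: Mode, prefill_backend: str, decode_backend: str) -> str:
    """Return the NSA implementation name for a forward mode (illustrative only)."""
    if mode is Mode.PREFILL:
        # Prefill keeps the implementation chosen by nsa_prefill_backend.
        return prefill_backend
    # Assumed behavior after this PR: decode, draft_extend, and target_verify
    # all follow nsa_decode_backend.
    return decode_backend


choices = {m.name: select_nsa_impl(m, "flashmla_sparse", "fa3") for m in Mode}
print(choices)
```

With this arrangement, setting the decode backend to fa3 affects draft_extend and target_verify without changing what prefill uses.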

2. Optimize CUDA graph batch sizes for MTP
In the Prefill CP scenario, require_gathered_buffer is set to True. Assuming the MTP num_draft_tokens is 4 and attn_tp_size is 8, the CUDA Graph token counts in the MTP scenario are calculated as follows:
get_batch_sizes_to_capture returns batch sizes that are multiples of attn_tp_size, i.e., [8, 16, 24, 32, ...].
In capture_one_batch_size, num_tokens = bs * self.num_tokens_per_bs, so the actual token counts are [32, 64, 96, 128, ...].
When bs=1, the number of tokens is 4 but must be padded to 32, which adds significant performance overhead. In reality, only num_tokens needs to be a multiple of attn_tp_size, not the batch size itself.
Modification: get_batch_sizes_to_capture is updated to return batch sizes where batch size * num_draft_tokens is divisible by attn_tp_size, i.e., [2, 4, 6, 8, ...].
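
The arithmetic above can be checked with a small sketch. It is a simplification under stated assumptions: the helper names below are hypothetical and enumerate every batch size up to a limit, whereas the real get_batch_sizes_to_capture filters its own candidate list. The values attn_tp_size=8 and num_tokens_per_bs=4 match the example in the text.

```python
def old_capture_sizes(max_bs: int, attn_tp_size: int) -> list[int]:
    # Previous rule: the batch size itself must be a multiple of attn_tp_size.
    return [bs for bs in range(1, max_bs + 1) if bs % attn_tp_size == 0]


def new_capture_sizes(max_bs: int, attn_tp_size: int, num_tokens_per_bs: int) -> list[int]:
    # New rule: only the token count (bs * num_tokens_per_bs) must be a multiple.
    return [
        bs
        for bs in range(1, max_bs + 1)
        if (bs * num_tokens_per_bs) % attn_tp_size == 0
    ]


def padded_tokens(bs: int, capture_sizes: list[int], num_tokens_per_bs: int) -> int:
    # A running batch is padded up to the smallest captured batch size >= bs.
    captured_bs = min(c for c in capture_sizes if c >= bs)
    return captured_bs * num_tokens_per_bs


old = old_capture_sizes(32, attn_tp_size=8)
new = new_capture_sizes(32, attn_tp_size=8, num_tokens_per_bs=4)
print(old[:4], new[:4])
print(padded_tokens(1, old, 4), padded_tokens(1, new, 4))
```

Under these assumptions, a single request carrying 4 draft tokens is padded to 8 tokens with the new rule instead of 32 with the old one.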

Modifications

Accuracy Tests

MODEL_PATH=/home/models/DeepSeek-V3.2/

python3 -m sglang.launch_server --model-path $MODEL_PATH --trust-remote-code \
--port 8000 --host 0.0.0.0 --attention-backend nsa \
--enable-metrics --mem-fraction-static 0.8 --max-running-requests 128 --enable-cache-report --page-size 64 \
--tp-size 8 --moe-dense-tp-size 1 \
--tool-call-parser deepseekv32 \
--reasoning-parser deepseek-v3 \
--chunked-prefill-size 16384 \
--nsa-decode-backend fa3 \
--enable-nsa-prefill-context-parallel \
--nsa-prefill-cp-mode round-robin-split \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

# gsm8k
Accuracy: 0.948
Invalid: 0.000
Latency: 192.326 s
Output throughput: 630.665 token/s

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-12T13:28:51Z

Summary of Changes

Hello @xu-yfei, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance and efficiency improvements for the DeepSeek v3.2 model, particularly in scenarios involving multi-token prediction. It refines the selection of attention backend implementations for specific forward modes like draft_extend and target_verify, allowing for the use of more performant algorithms like fa3 where appropriate. Additionally, it optimizes the CUDA graph batch size determination process to prevent excessive token padding, leading to more efficient resource utilization during model execution.

Highlights

Dynamic NSA Attention Backend Selection: Refactored the NSA attention implementation selection for draft_extend and target_verify operations to dynamically use nsa_decode_backend. This change enables the use of more performant algorithms like fa3 for small token counts in these specific scenarios, improving efficiency.
Optimized CUDA Graph Batch Sizes for MTP: Optimized the calculation of CUDA graph batch sizes in Multi-Token Prediction (MTP) scenarios. The updated logic ensures that the total number of tokens (batch size multiplied by tokens per batch) is divisible by attn_tp_size, thereby reducing unnecessary padding and improving CUDA graph capture efficiency.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

xu-yfei · 2026-01-12T13:30:08Z

@Fridge003 Could you please have a review?

gemini-code-assist

Code Review

This pull request introduces two performance optimizations for Multi-Token Prediction (MTP) scenarios. First, it switches the NSA attention backend to nsa_decode_backend for draft_extend and target_verify modes, which is more efficient for smaller token counts. Second, it optimizes CUDA graph batch sizes to minimize padding overhead. The changes are well-aligned with the motivations described, and the refactoring in CudaGraphRunner is clean. I have one suggestion to ensure the first optimization is consistently applied to all relevant draft extend modes.

python/sglang/srt/layers/attention/nsa_backend.py

Fridge003 · 2026-01-16T10:20:18Z

/tag-and-rerun-ci

python/sglang/srt/model_executor/cuda_graph_runner.py

xu-yfei requested review from Fridge003, Qiaolin-Yu, Ying1123, hebiao064, hnyls2002, ispobock and merrymercy as code owners January 12, 2026 13:28

gemini-code-assist bot reviewed Jan 12, 2026

python/sglang/srt/layers/attention/nsa_backend.py (outdated, resolved)

github-actions bot added the run-ci label Jan 16, 2026

Fridge003 reviewed Jan 16, 2026

python/sglang/srt/model_executor/cuda_graph_runner.py (resolved)

xu-yfei added 4 commits January 16, 2026 22:16

opt ds32 decode with mtp

7b671d9

revert specv2 opt

10e521e

include v2

0eef2ed

add comment

7adb046

xu-yfei force-pushed the xyf/ds32_decode branch from 634ebbb to 7adb046 Compare January 16, 2026 14:30

Fridge003 merged commit d2105d4 into sgl-project:main Jan 19, 2026
286 of 308 checks passed

Fridge003 mentioned this pull request Jan 19, 2026

[Roadmap] DeepSeek v3.2 (GLM 5) Optimization #15025

Open

36 tasks

Fridge003 merged 4 commits into sgl-project:main from antgroup:xyf/ds32_decode

2 participants
