[Deepseek V3.2] Support Overlap Spec + NSA by b8zhong · Pull Request #15307 · sgl-project/sglang

b8zhong · 2025-12-17T05:56:17Z

Motivation

Part of V3.2 Roadmap #15025

Enable overlap speculative decoding (EAGLE with SGLANG_ENABLE_SPEC_V2=1) on the NSA attention backend. Example launch command:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp 8 --speculative-algorithm EAGLE
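Once the server is running, you can send a quick smoke test from Python. This is a minimal sketch and is not part of the PR. It assumes the server listens on SGLang's default port 30000 on localhost and exposes the native /generate endpoint, which takes a JSON body with "text" and "sampling_params". The names BASE_URL, build_generate_request and generate are hypothetical helpers.

```python
import json
import urllib.request

# Assumption: the server launched above listens on SGLang's default port (30000).
BASE_URL = "http://127.0.0.1:30000"


def build_generate_request(prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the native /generate endpoint (greedy decoding)."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_new_tokens: int = 64) -> dict:
    """Send the request to the running server and return the decoded JSON response."""
    req = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, generate("The capital of France is", max_new_tokens=32)["text"] should hold the completion, assuming the usual /generate response shape. The PR's speed numbers come from python3 -m sglang.test.send_one, which also reports acceptance length.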

Modifications

With EAGLE V1, measured with python3 -m sglang.test.send_one --stream --max-new-tokens 1024, we had:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    7.922    |  1024  |   2.960    |     129.26      |
+-------------+--------+------------+-----------------+

After simply adding the guards for include_v2=True, throughput regressed to:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    9.015    |  1024  |   2.960    |     113.59      |
+-------------+--------+------------+-----------------+

Profiling revealed the root cause:

[Profiler trace: EAGLE V1]

[Profiler trace: EAGLE V2, before the code change]

Using extend_seq_lens_cpu triggers an unnecessary sync.

[Profiler trace: EAGLE V2, after the code change]

After the code change, throughput increases to:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    7.306    |  1024  |   2.960    |     140.17      |
+-------------+--------+------------+-----------------+

That is roughly an 8% throughput improvement over EAGLE V1 (140.17 vs. 129.26 token/s).
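The percentages follow directly from the three send_one tables. The short script below recomputes them, using only the numbers reported in those tables. The dictionary keys and function names are illustrative.

```python
# Figures copied from the three send_one tables (1024 output tokens per run).
RUNS = {
    "EAGLE V1": {"latency_s": 7.922, "tokens_per_s": 129.26},
    "EAGLE V2, before fix": {"latency_s": 9.015, "tokens_per_s": 113.59},
    "EAGLE V2, after fix": {"latency_s": 7.306, "tokens_per_s": 140.17},
}


def throughput_gain_pct(new: float, base: float) -> float:
    """Relative throughput change of `new` vs. `base`, in percent."""
    return (new / base - 1.0) * 100.0


def latency_cut_pct(new: float, base: float) -> float:
    """Relative latency reduction of `new` vs. `base`, in percent."""
    return (1.0 - new / base) * 100.0


v1 = RUNS["EAGLE V1"]
v2_before = RUNS["EAGLE V2, before fix"]
v2_after = RUNS["EAGLE V2, after fix"]

gain_vs_v1 = throughput_gain_pct(v2_after["tokens_per_s"], v1["tokens_per_s"])
gain_vs_v2_before = throughput_gain_pct(v2_after["tokens_per_s"], v2_before["tokens_per_s"])
regression = throughput_gain_pct(v2_before["tokens_per_s"], v1["tokens_per_s"])
latency_cut = latency_cut_pct(v2_after["latency_s"], v1["latency_s"])

for label, value in [
    ("V2 (fixed) vs. V1 throughput", gain_vs_v1),
    ("V2 (fixed) vs. V2 (unfixed) throughput", gain_vs_v2_before),
    ("V2 (unfixed) vs. V1 throughput", regression),
    ("V2 (fixed) vs. V1 latency reduction", latency_cut),
]:
    print(f"{label}: {value:+.1f}%")
```

Compared with V1, the fixed V2 path delivers about 8.4% more throughput and about 7.8% lower latency. It also more than recovers the roughly 12% regression that appeared when only the guards were added.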

Accuracy Tests

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

Before:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|██████████| 1319/1319 [02:35<00:00,  8.46it/s]
Accuracy: 0.946
Invalid: 0.000
Latency: 156.114 s
Output throughput: 817.299 token/s

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
100%|██████████| 1319/1319 [02:23<00:00,  9.20it/s]
Accuracy: 0.955
Invalid: 0.000
Latency: 147.938 s
Output throughput: 868.237 token/s

After:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|██████████| 1319/1319 [01:56<00:00, 11.31it/s]
Accuracy: 0.948
Invalid: 0.000
Latency: 116.809 s
Output throughput: 1112.259 token/s

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
100%|██████████| 1319/1319 [01:33<00:00, 14.16it/s]
Accuracy: 0.954
Invalid: 0.000
Latency: 93.615 s
Output throughput: 1390.591 token/s
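The gsm8k before/after runs can be summarized the same way. The sketch below uses only the reported accuracy and output-throughput figures. ACCURACY_TOLERANCE is a hypothetical threshold chosen for illustration and is not a value used by SGLang CI.

```python
from typing import Tuple

# gsm8k results copied from the "Before" and "After" runs above, keyed by --num-shots.
RESULTS = {
    8: {"before": {"accuracy": 0.946, "tokens_per_s": 817.299},
        "after": {"accuracy": 0.948, "tokens_per_s": 1112.259}},
    20: {"before": {"accuracy": 0.955, "tokens_per_s": 868.237},
         "after": {"accuracy": 0.954, "tokens_per_s": 1390.591}},
}

# Hypothetical tolerance for treating accuracy as unchanged.
ACCURACY_TOLERANCE = 0.01


def summarize(num_shots: int) -> Tuple[float, float, bool]:
    """Return (throughput gain in %, accuracy delta, accuracy within tolerance)."""
    before = RESULTS[num_shots]["before"]
    after = RESULTS[num_shots]["after"]
    gain_pct = (after["tokens_per_s"] / before["tokens_per_s"] - 1.0) * 100.0
    acc_delta = after["accuracy"] - before["accuracy"]
    return gain_pct, acc_delta, abs(acc_delta) <= ACCURACY_TOLERANCE


for shots in sorted(RESULTS):
    gain, delta, ok = summarize(shots)
    print(f"{shots}-shot: throughput {gain:+.1f}%, accuracy {delta:+.3f}, within tolerance: {ok}")
```

For both shot counts, accuracy moves by at most 0.002. Output throughput rises by about 36% (8-shot) and about 60% (20-shot).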

Checklist

Format your code according to the "Format code with pre-commit" guide.
Add unit tests according to the "Run and add unit tests" guide.
Update documentation according to the "Write documentations" guide.
Provide accuracy and speed benchmark results according to the "Test the accuracy" and "Benchmark the speed" guides.

gemini-code-assist · 2025-12-17T05:56:20Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

b8zhong · 2025-12-17T05:56:40Z

/tag-and-rerun-ci again?

Fridge003 · 2025-12-17T06:03:12Z

@b8zhong That's a really good result!
Can you also test some accuracy benchmarks, like gsm8k or gpqa?
https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond

Fridge003

Nice job!

hzh0425 · 2025-12-17T07:59:38Z

@b8zhong

Hi, could you please share your launch command?

I get an error when setting export SGLANG_ENABLE_SPEC_V2=1.

b8zhong · 2025-12-17T16:33:56Z

@hzh0425 Sure. It's:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp 8 --speculative-algorithm EAGLE

By the way, I did not use DP attention.


Commit 1d3ed5d: sync script

b8zhong requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners December 17, 2025 05:56

github-actions bot added the run-ci label Dec 17, 2025

Fridge003 mentioned this pull request in [Roadmap] DeepSeek v3.2 (GLM 5) Optimization #15025 (Open, 36 tasks) on Dec 17, 2025

Fridge003 added the high priority label Dec 17, 2025

Fridge003 approved these changes Dec 17, 2025


Commit ae933bd: more

github-actions bot added documentation Improvements or additions to documentation deepseek labels Dec 17, 2025

b8zhong merged commit d20699a into sgl-project:main Dec 17, 2025
267 of 287 checks passed

b8zhong deleted the brayden/sync-oss/nsa-support-spec-v2 branch December 17, 2025 21:35

Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025

[Deepseek V3.2] Support Overlap Spec + NSA (sgl-project#15307)

a622bf3

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025

[Deepseek V3.2] Support Overlap Spec + NSA (sgl-project#15307)

2623aa0

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

[Deepseek V3.2] Support Overlap Spec + NSA (sgl-project#15307)

d1756e8

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

b8zhong merged 2 commits into sgl-project:main from bzhng-development:brayden/sync-oss/nsa-support-spec-v2
