
[Deepseek V3.2] Support Overlap Spec + NSA #15307

Merged
b8zhong merged 2 commits into sgl-project:main from bzhng-development:brayden/sync-oss/nsa-support-spec-v2
Dec 17, 2025

b8zhong (Collaborator) commented Dec 17, 2025

Motivation

Part of V3.2 Roadmap #15025

Enable overlap spec (EAGLE V2) with the NSA backend.

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp 8 --speculative-algorithm EAGLE

Modifications

With EAGLE V1, we measured the following (using python3 -m sglang.test.send_one --stream --max-new-tokens 1024):

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    7.922    |  1024  |   2.960    |     129.26      |
+-------------+--------+------------+-----------------+

After simply adding the guards for include_v2=True, speed dropped to:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    9.015    |  1024  |   2.960    |     113.59      |
+-------------+--------+------------+-----------------+

Profiling revealed the root cause:

EAGLE V1

Screenshot 2025-12-16 at 9 43 56 PM

EAGLE V2 (before code change)

Screenshot 2025-12-16 at 9 45 11 PM

Using extend_seq_lens_cpu in EAGLE V2 causes an unneeded sync.

EAGLE V2 (after code change)

Screenshot 2025-12-16 at 9 52 52 PM

With that sync removed, speed increases to:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    7.306    |  1024  |   2.960    |     140.17      |
+-------------+--------+------------+-----------------+

So EAGLE V2 is now around 8% faster than EAGLE V1 (140.17 vs. 129.26 token/s).
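The relative changes can be checked from the three tables above. The following is a minimal sketch that recomputes token/s from the reported latency and token count and then derives the percentage differences. The dictionary keys and helper names are illustrative assumptions, not SGLang identifiers, and the recomputed speeds can differ from the tables in the last digit because the latencies are rounded.

```python
# Latency and token counts copied from the three send_one tables above.
# The labels and helper names are illustrative, not SGLang identifiers.
RUNS = {
    "v1": {"latency_s": 7.922, "tokens": 1024},
    "v2_before_fix": {"latency_s": 9.015, "tokens": 1024},
    "v2_after_fix": {"latency_s": 7.306, "tokens": 1024},
}


def speed(run: dict) -> float:
    """Decode speed in token/s."""
    return run["tokens"] / run["latency_s"]


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


v1 = speed(RUNS["v1"])
v2_before = speed(RUNS["v2_before_fix"])
v2_after = speed(RUNS["v2_after_fix"])

regression_vs_v1 = pct_change(v2_before, v1)        # V2 with the extra sync
gain_vs_v1 = pct_change(v2_after, v1)               # V2 after the fix
gain_vs_v2_before = pct_change(v2_after, v2_before)

print(f"V2 before fix vs V1: {regression_vs_v1:+.1f}%")
print(f"V2 after fix vs V1:  {gain_vs_v1:+.1f}%")
print(f"V2 after vs before:  {gain_vs_v2_before:+.1f}%")
```

On these numbers the sync costs roughly 12% relative to V1, while the fixed V2 path is roughly 8% faster than V1 and roughly 23% faster than V2 before the fix.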

Accuracy Tests

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

Before:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:35<00:00,  8.46it/s]
Accuracy: 0.946
Invalid: 0.000
Latency: 156.114 s
Output throughput: 817.299 token/s
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:23<00:00,  9.20it/s]
Accuracy: 0.955
Invalid: 0.000
Latency: 147.938 s
Output throughput: 868.237 token/s

After:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:56<00:00, 11.31it/s]
Accuracy: 0.948
Invalid: 0.000
Latency: 116.809 s
Output throughput: 1112.259 token/s
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:33<00:00, 14.16it/s]
Accuracy: 0.954
Invalid: 0.000
Latency: 93.615 s
Output throughput: 1390.591 token/s
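The before and after runs above can be compared directly. This is a small sketch, with the numbers copied from the logs and hypothetical variable names, that computes the accuracy difference and the output throughput gain for each shot count.

```python
# Values copied from the gsm8k logs above, keyed by (label, num_shots).
# "before" and "after" follow the labels used in the logs.
RESULTS = {
    ("before", 8): {"accuracy": 0.946, "throughput": 817.299},
    ("before", 20): {"accuracy": 0.955, "throughput": 868.237},
    ("after", 8): {"accuracy": 0.948, "throughput": 1112.259},
    ("after", 20): {"accuracy": 0.954, "throughput": 1390.591},
}

acc_delta = {}
throughput_gain_pct = {}
for shots in (8, 20):
    before = RESULTS[("before", shots)]
    after = RESULTS[("after", shots)]
    # Absolute accuracy difference, rounded to the precision of the logs.
    acc_delta[shots] = round(after["accuracy"] - before["accuracy"], 3)
    # Relative output throughput change in percent.
    throughput_gain_pct[shots] = (
        (after["throughput"] - before["throughput"]) / before["throughput"] * 100.0
    )
    print(
        f"{shots}-shot: accuracy {acc_delta[shots]:+.3f}, "
        f"throughput {throughput_gain_pct[shots]:+.1f}%"
    )
```

Accuracy moves by at most 0.002 in either direction, while output throughput rises by roughly 36% (8-shot) and 60% (20-shot).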



b8zhong (Collaborator, Author) commented Dec 17, 2025

/tag-and-rerun-ci again?

Fridge003 (Collaborator) commented

@b8zhong That's a really good result!
Can you also run some accuracy benchmarks, such as gsm8k or gpqa?
https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond

Fridge003 (Collaborator) left a comment

Nice job!

github-actions bot added the documentation and deepseek labels Dec 17, 2025
hzh0425 (Collaborator) commented Dec 17, 2025

@b8zhong

Hi, could you please share your launch command?

I get an error when setting export SGLANG_ENABLE_SPEC_V2=1.

b8zhong (Collaborator, Author) commented Dec 17, 2025

@hzh0425 Sure. It's:

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp 8 --speculative-algorithm EAGLE

By the way, I did not use DP attention.

b8zhong merged commit d20699a into sgl-project:main Dec 17, 2025
267 of 287 checks passed
b8zhong deleted the brayden/sync-oss/nsa-support-spec-v2 branch December 17, 2025 21:35