[Feature] Add zero bubble for spec v2#21895

Open
litmei wants to merge 48 commits into sgl-project:main from litmei:mtp_v2_push

Conversation


@litmei litmei commented Apr 2, 2026

Motivation

In the current SGLang framework, the EAGLE3 Spec V2 implementation suffers from a CPU-side scheduling bottleneck: the CPU dispatch work between consecutive decode steps is serialized behind the draft model's overhead, creating execution bubbles that the overlap scheduler cannot hide. The goal of this PR is to refactor the scheduling logic to minimize or completely eliminate these CPU-originated bubbles.

Modifications

  • Asynchronous Data Transfer: Based on profiling results, all identified to("cpu") operations have been made asynchronous using .pin_memory().to("cpu", non_blocking=True) to reduce synchronization stalls. See also this PR: Use pin_memory in forward_batch.init_new to reduce decoding latency#21360
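The asynchronous D-to-H pattern above can be sketched as follows. This is a minimal illustration, not SGLang's actual call sites: `async_to_cpu` is a hypothetical helper, and the staging-buffer variant shown (allocating a pinned destination and copying into it) is one common way to obtain a truly non-blocking device-to-host copy.

```python
import torch

def async_to_cpu(t: torch.Tensor) -> torch.Tensor:
    """Copy a tensor to page-locked (pinned) host memory without blocking
    the CPU. The data is only guaranteed valid after a stream sync."""
    if not t.is_cuda:
        return t  # already on the host, nothing to transfer
    # A D2H copy can only run asynchronously if the destination is pinned.
    out = torch.empty_like(t, device="cpu", pin_memory=True)
    out.copy_(t, non_blocking=True)
    return out

x = torch.arange(8)
y = async_to_cpu(x)  # no-op on a CPU-only build
```

Note that a `non_blocking=True` copy into pageable (unpinned) host memory silently falls back to a synchronous transfer, which is why the pinning matters here.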

  • Scheduling Refactor & Draft Pre-execution:

    • Reorganized the execution sequence of the draft model. The original "current round" draft task is replaced with prepare_for_verify, which handles input construction for verification or output restoration from the previous round.
    • The "next round" draft task is moved forward (pre-executed) to follow the draft_extend phase of the current round. This effectively hides the CPU dispatch latency of the draft model.
    • Removal of CPU Synchronizations: To fully eliminate D-to-H (Device-to-Host) bubbles, we removed the synchronization of ForwardBatch.seq_lens_cpu during the drafting phase.
    • Scope & Impact: This optimization is best suited for models like DeepSeek-V3.2, which do not rely on seq_lens_cpu during the decode stage. For models like Qwen3 that require these lengths, this change may affect the accepted length (causing it to fluctuate higher or lower).
  • Rolled back PR#21507; its native implementation leads to significant performance degradation in MTP scenarios.
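The reordered decode loop described above can be sketched as a toy schedule. All function names here are illustrative placeholders, not SGLang's real APIs; the point is only the ordering: the current round's draft call is replaced by `prepare_for_verify`, and the next round's draft is pre-executed right after `draft_extend`.

```python
# Toy sketch of the reordered decode loop (names are hypothetical).
trace = []

def prepare_for_verify(step):
    # Builds verify inputs / restores last round's draft outputs;
    # replaces the old "current round" draft call.
    trace.append(f"prepare_for_verify[{step}]")

def verify(step):
    trace.append(f"verify[{step}]")

def draft_extend(step):
    trace.append(f"draft_extend[{step}]")

def draft(step):
    trace.append(f"draft[{step}]")

def decode_step(step):
    prepare_for_verify(step)  # old order ran draft(step) here
    verify(step)
    draft_extend(step)
    # Next round's draft, pre-executed so its CPU dispatch latency
    # overlaps with the tail of the current step.
    draft(step + 1)

for s in range(2):
    decode_step(s)
```

Because `draft(step + 1)` is issued while step `s` is still in flight, its CPU dispatch cost is hidden instead of sitting between decode steps as a bubble.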

Accuracy Tests

Theoretically, for DSA models this feature has no impact on accuracy or accepted length, as seen with DeepSeek V3.2:

We need to investigate why DeepSeek-V3.2 didn't yield any gains here.

Before:

[screenshots: accuracy / accepted-length results before this PR]

After:

[screenshots: accuracy / accepted-length results after this PR]

Speed Tests and Profiling

TODO

H20, DeepSeek-V3.2 (layer pruning) profile

Before:

[profiling screenshot]

After:

[profiling screenshot]

Ascend A3, DeepSeek-V3.2 W8A8 profile

Before:

[profiling screenshot]

After:

[profiling screenshot]

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

litmei added 10 commits March 12, 2026 09:22

