
[Feature] Add spec v2 (overlap scheduling) to DFlash speculative decoding support#20547

Open
dcw02 wants to merge 74 commits into sgl-project:main from modal-labs:dflash_v2
Conversation

@dcw02
Contributor

@dcw02 dcw02 commented Mar 13, 2026

Motivation

Add the spec v2 path for DFlash. This should be merged after #16818.

TLDR
B200, GSM8K, Qwen3-8B, TP size 1, concurrency 32, max new tokens 2k, greedy decoding:
9,688.26 tok/s -> 12,360.49 tok/s
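For reference, the quoted numbers work out to roughly a 27.6% throughput gain:

```python
# Sanity check of the throughput gain quoted above (numbers from this PR description).
v1_tok_s = 9688.26   # regular v1
v2_tok_s = 12360.49  # overlap scheduling (spec v2)
speedup = v2_tok_s / v1_tok_s
print(f"{speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster)")  # 1.276x (27.6% faster)
```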

Modifications

Adds the v2 worker and related files.

Accuracy and Benchmarks

Tested on a GCP B200 machine.

Commands:

# regular v1
python benchmark/dflash/bench_dflash_gsm8k_sweep.py --tp-sizes 1 --concurrencies 32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4 --page-size 64 --skip-baseline

# overlap scheduling (spec v2)
SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_DFLASH_SPEC_V2=1 SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 python benchmark/dflash/bench_dflash_gsm8k_sweep.py --tp-sizes 1 --concurrencies 32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4 --page-size 64 --skip-baseline

v1 performance

=== DFLASH GSM8K Sweep Summary ===
target_model=Qwen/Qwen3-8B
draft_model=z-lab/Qwen3-8B-DFlash-b16
max_new_tokens=2048
sampling=temperature:0.0, top_p:1.0, top_k:1
attention_backends=trtllm_mha
speculative_draft_attention_backend=fa4
speculative_dflash_draft_window_size=None
tp_sizes=1
concurrencies=32
questions_per_concurrency_base=128
device_sm=100
skip_baseline=True

=== Backend: trtllm_mha ===

Baseline output tok/s
tp\conc   32
-------  ---
      1  N/A

Baseline accuracy
tp\conc   32
-------  ---
      1  N/A

DFLASH output tok/s
tp\conc        32
-------  --------
      1  9,688.26

DFLASH accuracy
tp\conc     32
-------  -----
      1  0.850

Speedup (DFLASH / baseline)
tp\conc   32
-------  ---
      1  N/A

DFLASH acceptance length (mean spec_accept_length)
tp\conc     32
-------  -----
      1  6.470

overlap scheduling (spec v2) performance

=== DFLASH GSM8K Sweep Summary ===
target_model=Qwen/Qwen3-8B
draft_model=z-lab/Qwen3-8B-DFlash-b16
max_new_tokens=2048
sampling=temperature:0.0, top_p:1.0, top_k:1
attention_backends=trtllm_mha
speculative_draft_attention_backend=fa4
speculative_dflash_draft_window_size=None
tp_sizes=1
concurrencies=32
questions_per_concurrency_base=128
device_sm=100
skip_baseline=True

=== Backend: trtllm_mha ===

Baseline output tok/s
tp\conc   32
-------  ---
      1  N/A

Baseline accuracy
tp\conc   32
-------  ---
      1  N/A

DFLASH output tok/s
tp\conc         32
-------  ---------
      1  12,360.49

DFLASH accuracy
tp\conc     32
-------  -----
      1  0.850

Speedup (DFLASH / baseline)
tp\conc   32
-------  ---
      1  N/A

DFLASH acceptance length (mean spec_accept_length)
tp\conc     32
-------  -----
      1  6.467

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance enhancement for DFlash speculative decoding by integrating a new version of overlap scheduling, referred to as spec v2. The changes involve adding specialized worker logic and data structures for DFlash, optimizing KV cache operations with fused Triton kernels, and enabling auxiliary hidden state capture in various models. This update aims to boost token generation throughput, as demonstrated by the provided benchmarks, while also laying the groundwork for more advanced speculative decoding capabilities.

Highlights

  • DFLASH Speculative Decoding v2: Implemented DFLASH speculative decoding with overlap scheduling (spec v2) to significantly improve token generation throughput.
  • Performance Improvement: Achieved a notable speedup from 9,688.26 tok/s to 12,360.49 tok/s on a B200 machine with Qwen3-8B and concurrency 32, representing a 27.6% increase.
  • New DFlash Worker and Components: Introduced dedicated DFlash worker implementations (DFlashWorker and DFlashWorkerV2) and new data structures (DFlashDraftInput, DFlashVerifyInput, DFlashDraftInputV2) to manage the speculative decoding process.
  • Auxiliary Hidden State Capture: Added support for capturing auxiliary hidden states in target models (e.g., GPT-OSS, Llama, Qwen3, Qwen3.5, Qwen3-MoE, Qwen3-Next, Qwen3-VL) required for DFlash context feature projection.
  • Fused KV Materialization: Integrated a Triton kernel for fused KV materialization, optimizing the process of projecting and storing Key/Value states in the draft model's cache.
  • Configuration and Restrictions: Added new server arguments (--speculative-dflash-block-size, --speculative-dflash-draft-window-size) and enforced restrictions for DFlash spec v2, such as supporting only greedy decoding and disallowing logprobs, hidden states, and grammar constraints in phase 1.
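The overlap-scheduling idea behind spec v2 can be pictured with a toy sketch (hypothetical names, not SGLang's real API): CPU-side preparation of step i+1 runs concurrently with step i's GPU work, instead of strictly alternating schedule → run → schedule → run.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of overlap scheduling (all names here are hypothetical):
# batch preparation for step i+1 runs on a worker thread while step i's
# "GPU" work is in flight.

def schedule_batch(i: int) -> str:
    time.sleep(0.01)  # stand-in for CPU-side scheduling work
    return f"batch-{i}"

def run_on_gpu(batch: str) -> str:
    time.sleep(0.02)  # stand-in for draft + verify kernels
    return f"{batch}-done"

def serial(n: int) -> list[str]:
    # schedule -> run -> schedule -> run ...
    return [run_on_gpu(schedule_batch(i)) for i in range(n)]

def overlapped(n: int) -> list[str]:
    # Schedule step i+1 while step i runs; results match serial() exactly.
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(schedule_batch, 0)
        for i in range(n):
            batch = future.result()
            if i + 1 < n:
                future = pool.submit(schedule_batch, i + 1)
            results.append(run_on_gpu(batch))
    return results

assert serial(3) == overlapped(3)
```

In the real spec v2 path the overlap is between CPU scheduling/planning and GPU execution (note the SGLANG_ENABLE_OVERLAP_PLAN_STREAM flag in the benchmark command), with maybe_wait_verify_done as the synchronization point per the changelog below.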


Changelog
  • benchmark/dflash/bench_dflash_gsm8k_sweep.py
    • Added a new benchmark script to evaluate DFlash performance on the GSM8K dataset, including support for speculative decoding v2.
  • python/sglang/srt/environ.py
    • Added a new environment variable, SGLANG_ENABLE_DFLASH_SPEC_V2, to control the experimental DFlash spec v2 overlap scheduling.
  • python/sglang/srt/layers/attention/flashinfer_backend.py
    • Modified the FlashInfer attention backend to correctly handle custom masks for DFlash, ensuring proper non-causal attention for draft blocks.
  • python/sglang/srt/managers/schedule_batch.py
    • Updated the maybe_wait_verify_done method to support synchronization for DFlash spec v2 overlap scheduling.
  • python/sglang/srt/managers/scheduler.py
    • Implemented new validation checks and restrictions for DFlash speculative decoding requests, particularly for spec v2, to ensure compatibility with supported features like greedy decoding.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Extended CUDA graph capture logic to include DFlash, allowing for efficient execution of DFlash draft and verify steps, and integrated auxiliary hidden state capture.
  • python/sglang/srt/model_executor/model_runner.py
    • Integrated DFlash-specific configuration parsing, enabled auxiliary hidden state capture for DFlash, and adjusted KV cache scaling to accommodate combined target and draft KV pools.
  • python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
    • Modified KV cache profiling to account for the combined memory footprint of target and DFlash draft KV pools, ensuring accurate memory allocation.
  • python/sglang/srt/models/dflash.py
    • Added a new DFlash draft model implementation, including DFlashAttention, DFlashMLP, and DFlashDecoderLayer, designed for efficient speculative decoding.
  • python/sglang/srt/models/gpt_oss.py
    • Added get_input_embeddings and set_dflash_layers_to_capture methods to support DFlash auxiliary hidden state capture in GPT-OSS models.
  • python/sglang/srt/models/llama.py
    • Added the set_dflash_layers_to_capture method to enable DFlash auxiliary hidden state capture in Llama models.
  • python/sglang/srt/models/qwen3.py
    • Added the set_dflash_layers_to_capture method to enable DFlash auxiliary hidden state capture in Qwen3 models.
  • python/sglang/srt/models/qwen3_5.py
    • Modified the Qwen3.5 model to support DFlash auxiliary hidden state capture and added the set_dflash_layers_to_capture method.
  • python/sglang/srt/models/qwen3_moe.py
    • Added the set_dflash_layers_to_capture method to enable DFlash auxiliary hidden state capture in Qwen3-MoE models.
  • python/sglang/srt/models/qwen3_next.py
    • Added get_input_embeddings and set_dflash_layers_to_capture methods to support DFlash auxiliary hidden state capture in Qwen3-Next models.
  • python/sglang/srt/models/qwen3_vl.py
    • Modified the Qwen3-VL model to support DFlash auxiliary hidden state capture and added the set_dflash_layers_to_capture method.
  • python/sglang/srt/server_args.py
    • Added new server arguments, --speculative-dflash-block-size and --speculative-dflash-draft-window-size, and updated DFlash configuration logic.
  • python/sglang/srt/speculative/dflash_info.py
    • Added new data structures, DFlashDraftInput and DFlashVerifyInput, to manage DFlash speculative decoding state for non-overlap scheduling.
  • python/sglang/srt/speculative/dflash_info_v2.py
    • Added new data structures, DFlashDraftInputV2, specifically designed for DFlash speculative decoding with overlap scheduling (spec v2).
  • python/sglang/srt/speculative/dflash_utils.py
    • Added utility functions for DFlash configuration parsing, KV cell size scaling, and accept length computation, including support for non-greedy sampling verification.
  • python/sglang/srt/speculative/dflash_worker.py
    • Added the DFlash speculative decoding worker for non-overlap scheduling, handling draft generation and target verification steps.
  • python/sglang/srt/speculative/dflash_worker_v2.py
    • Added the DFlash speculative decoding worker for overlap scheduling (spec v2), designed for improved performance by overlapping draft and verify phases.
  • python/sglang/srt/speculative/spec_info.py
    • Updated the SpeculativeAlgorithm enum to include DFLASH and extended the supports_spec_v2 logic to conditionally enable DFlash spec v2.
  • python/sglang/srt/speculative/triton_ops/init.py
    • Added FusedKVMaterializeHelper to the Triton operations module for DFlash optimizations.
  • python/sglang/srt/speculative/triton_ops/fused_kv_materialize.py
    • Added a Triton kernel for fused KV materialization, combining KV projection, RMSNorm, and RoPE operations for DFlash performance.
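The "fused KV materialization" entry above combines three steps — KV projection, RMSNorm, and RoPE — before writing into the draft model's KV cache. A minimal numpy sketch of the unfused math (illustrative only; the function names, shapes, and norm placement here are assumptions, and the PR's actual implementation is a single Triton kernel that avoids materializing intermediate K/V tensors):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # Standard RMSNorm: scale by the reciprocal root-mean-square, then weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def rope(x, positions, base=10000.0):
    # Standard RoPE: rotate dim pairs by position-dependent angles.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def materialize_kv(hidden, w_k, w_v, norm_w, positions, k_cache, v_cache, slots):
    # The fused Triton version performs these steps in one kernel launch,
    # writing straight into the cache slots.
    k = rope(rmsnorm(hidden @ w_k, norm_w), positions)
    v = hidden @ w_v
    k_cache[slots] = k
    v_cache[slots] = v
```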
Activity
  • The pull request introduces DFlash speculative decoding with overlap scheduling (spec v2).
  • Benchmarks provided in the description show a significant performance improvement from 9,688.26 tok/s to 12,360.49 tok/s.
  • The author has included a checklist, with items for code formatting, unit tests, documentation, and code style remaining unchecked.
  • No explicit comments or reviews have been made on the pull request yet.

@dcw02
Contributor Author

dcw02 commented Mar 13, 2026

Currently the only thing missing compared to v1 is non-greedy decoding support; it is being worked on.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces DFLASH speculative decoding, adding a specialized DFlashDraftModel and updating core components like ModelRunner, CudaGraphRunner, and Scheduler to support its specific requirements, including auxiliary hidden state capture and handling DFLASH-specific server arguments. New data structures (DFlashDraftInput, DFlashVerifyInput) and worker implementations (DFlashWorker, DFlashWorkerV2) manage the drafting and verification process, with optimizations like fused KV materialization. A new benchmark script is also included. An improvement opportunity exists in the scheduler to refactor duplicated logic for aborting requests with unsupported DFLASH features into a helper method for better maintainability.

Comment on lines +1636 to +1684
if self.spec_algorithm.is_dflash() and req.return_logprob:
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support return_logprob yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if (
    self.spec_algorithm.is_dflash()
    and self.enable_overlap
    and req.return_hidden_states
):
    req.set_finish_with_abort(
        "DFLASH spec-v2 phase 1 does not support return_hidden_states yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if self.spec_algorithm.is_dflash() and (
    req.sampling_params.json_schema is not None
    or req.sampling_params.regex is not None
    or req.sampling_params.ebnf is not None
    or req.sampling_params.structural_tag is not None
):
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support grammar-constrained decoding yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if (
    self.spec_algorithm.is_dflash()
    and self.enable_overlap
    and (
        req.sampling_params.top_k > 1
        or req.sampling_params.frequency_penalty != 0.0
        or req.sampling_params.presence_penalty != 0.0
        or req.sampling_params.repetition_penalty != 1.0
        or req.sampling_params.logit_bias is not None
        or req.custom_logit_processor is not None
    )
):
    req.set_finish_with_abort(
        "DFLASH spec-v2 phase 1 only supports plain greedy decoding yet. "
        "Non-greedy sampling, penalties, logit_bias, and custom logit processors are not enabled."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
Contributor


medium

The logic for aborting requests with unsupported DFLASH features is duplicated across several if blocks. This can be refactored into a helper method to reduce code repetition and improve maintainability.

For example, you could create a helper like this:

def _abort_dflash_request(self, req: Req, message: str):
    req.set_finish_with_abort(message)
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)

Then you can simplify the checks:

if self.spec_algorithm.is_dflash():
    if req.return_logprob:
        self._abort_dflash_request(req, "DFLASH speculative decoding does not support return_logprob yet.")
        return
    if self.enable_overlap and req.return_hidden_states:
        self._abort_dflash_request(req, "DFLASH spec-v2 phase 1 does not support return_hidden_states yet.")
        return
    # ... and so on

harryjing pushed a commit to harryjing/sglang that referenced this pull request Mar 19, 2026
…project#20547)

Cherry-pick from sgl-project#20547, resolved conflicts with PR sgl-project#16818.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
harryjing pushed a commit to harryjing/sglang that referenced this pull request Mar 19, 2026
…project#20547)

Cherry-pick from sgl-project#20547 onto v0.5.9, resolved conflicts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dcw02 dcw02 requested review from ch-wan and fzyzcjy as code owners April 7, 2026 23:31
@ggg-s

ggg-s commented Apr 9, 2026

@dcw02 Does it currently support PCG?

@dcw02
Contributor Author

dcw02 commented Apr 9, 2026

@dcw02 Does it currently support PCG?

I've enabled it without issues with --enforce-piecewise-cuda-graph.

@dcw02
Contributor Author

dcw02 commented Apr 9, 2026

I'm closing this PR and reopening it soon from another branch; I have some extra improvements.
