[Feature] Spec V2 DFlash Support by dcw02 · Pull Request #23000 · sgl-project/sglang

dcw02 · 2026-04-16T22:51:52Z

Motivation

Add spec v2 to DFlash

Benchmarks

Run on a GCP b200:8 node using a gsm8k sweep script, with a qwen3-8b target, the z-lab/Qwen3-8B-DFlash-b16 draft model, trtllm_mha target attention, fa4 draft attention, and piecewise CUDA graphs on.

v1 performance

DFLASH output tok/s
tp\conc       1         32
-------  ------  ---------
      1  845.98  11,405.85

DFLASH accuracy
tp\conc      1     32
-------  -----  -----
      1  0.852  0.844

DFLASH acceptance length (mean spec_accept_length)
tp\conc      1     32
-------  -----  -----
      1  6.345  6.487

v2 performance

DFLASH output tok/s
tp\conc         1         32
-------  --------  ---------
      1  1,161.88  15,326.81

DFLASH accuracy
tp\conc      1     32
-------  -----  -----
      1  0.852  0.844

DFLASH acceptance length (mean spec_accept_length)
tp\conc      1     32
-------  -----  -----
      1  6.352  6.482

Compared to #20547, this spec v2 version also brings in some extra optimizations, which took bs1 performance from 900 -> 1,161 tok/s and bs32 from 12,300 -> 15,326 tok/s.
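For a quick read of the v1 and v2 tables above, the relative speedup can be worked out directly from the reported output tok/s. This is only arithmetic on the numbers already posted; the `speedup` helper is an illustrative name, not part of the PR.

```python
# Output tok/s copied from the v1 and v2 tables above (tp=1).
v1_tok_s = {1: 845.98, 32: 11405.85}
v2_tok_s = {1: 1161.88, 32: 15326.81}


def speedup(before: float, after: float) -> float:
    """Return the throughput ratio after / before."""
    return after / before


for conc in sorted(v1_tok_s):
    ratio = speedup(v1_tok_s[conc], v2_tok_s[conc])
    print(f"conc={conc}: {ratio:.2f}x")
```

This works out to roughly 1.37x at concurrency 1 and 1.34x at concurrency 32, with accuracy and acceptance length essentially unchanged between the two tables.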

Benchmarking is done with this script, using the following command on 1xB200:

SGLANG_ENABLE_SPEC_V2=1 SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 python benchmark/dflash/bench_dflash_gsm8k_sweep.py --skip-baseline --tp-sizes 1 --concurrencies 1,32 --attention-backends trtllm_mha --speculative-draft-attention-backend fa4

I removed the mamba memory calculations; I will add them back later once I figure out the best way to do that.

gemini-code-assist · 2026-04-16T22:51:56Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

dcw02 · 2026-04-16T22:54:13Z

/rerun-test test/registered/spec/dflash/test_dflash.py

github-actions · 2026-04-16T22:54:46Z

✅ 1-gpu-5090 (1 test): View workflow run

cd test/ && python3 registered/spec/dflash/test_dflash.py

ggg-s · 2026-04-16T23:46:52Z

What optimizations were made on top of PR #20547? PCG?

dcw02 · 2026-04-16T23:49:34Z

> What optimizations were made on top of PR #20547? PCG?

I rewrote the fused kv helper, added some new triton ops, removed some syncs, etc. PCG already exists, I did not add it.

dcw02 · 2026-04-24T15:28:26Z

I am investigating accept length degradations that show up on both the v1 and v2 paths in this PR but not in #20547.

dcw02 · 2026-04-25T03:32:14Z

The accept length degradation has been fixed. It was a rope config handling issue that appeared when the transformers version was bumped.

dcw02 · 2026-04-25T08:46:19Z

I realized that we can carry reserved KV allocation metadata through the overlap draft state and let next-step prep use the prepared allocation watermark. This gets rid of a scheduling bubble, which helps low concurrency a lot. For correctness, scheduler output processing applies the request watermark monotonically later.

prior v2 baseline:

DFLASH output tok/s
tp\conc       1         32
-------  ------  ---------
      1  975.04  12,956.91

decoupling:

DFLASH output tok/s
tp\conc         1         32
-------  --------  ---------
      1  1,073.64  12,995.40
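The "applies the request watermark monotonically" part of the comment above can be illustrated with a minimal sketch. Everything here (`ToyRequest`, `kv_watermark`, `apply_watermark`) is a hypothetical stand-in rather than the PR's actual data structures; the only point is that a late, lagging update can never move the watermark backwards.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    # Hypothetical field: highest KV allocation position known to be reserved.
    kv_watermark: int = 0


def apply_watermark(req: ToyRequest, reported: int) -> int:
    """Apply a reported watermark monotonically (never decrease it)."""
    req.kv_watermark = max(req.kv_watermark, reported)
    return req.kv_watermark


req = ToyRequest()
apply_watermark(req, 48)  # next-step prep already planned up to 48
apply_watermark(req, 32)  # lagging host metadata arrives later, is ignored
print(req.kv_watermark)
```

Under this assumption, planning can run ahead using the prepared value, and the later scheduler update is safe to apply in any order.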

dcw02 · 2026-04-27T15:24:08Z

@tugot17 yes, we can merge it after spec v2 dflash is merged. thanks for your contribution!

liusy58 · 2026-04-27T15:30:56Z

@dcw02 Thank you for your reply. Can we chat on slack?

tugot17 · 2026-04-27T16:13:09Z

@dcw02
I added the LFM changes, but if it is easier to add them after DFlash is merged to main first, then let's wait.

https://github.com/sgl-project/sglang/pull/23847/changes

liusy58 · 2026-04-28T09:34:17Z

@dcw02 Could you please resolve these merge conflicts?

dcw02 · 2026-04-28T22:31:04Z

@liusy58 fixed the merge conflicts

dcw02 · 2026-05-01T05:03:21Z

I will put up separate PRs for the draft swa layers and gemma 4 support so they can be merged in first for v1

…roject#23000

Cherry-picked the two files needed for smcsd's DFlash direct-load path:
- python/sglang/srt/models/dflash.py (DFlashDraftModel + DFlashDecoderLayer)
- python/sglang/srt/speculative/dflash_utils.py (helpers used by the model)

Copied from sglang upstream PR refs/pull/23000/head, which is the canonical implementation of DFlash speculative decoding referenced by checkpoints like z-lab/Qwen3.6-27B-DFlash. Adding the model class to our branch lets smcsd's _init_dflash_direct load DFlash drafts directly via sglang's class registry instead of transformers' trust_remote_code (which would 404 on dflash.py).

The other DFlash files in PR sgl-project#23000 (dflash_worker, dflash_info, dflash_accept_bonus, etc.) are sglang-side speculative decoding scaffolding not used by smcsd's SMC-DFlash worker.

ggg-s · 2026-05-07T11:56:36Z

hi @dcw02 Is the current PR compatible with DFLASH + FlashInfer + mixed batches?

dcw02 · 2026-05-07T17:58:07Z

> hi @dcw02 Is the current PR compatible with DFLASH + FlashInfer + mixed batches?

I haven't tested that myself so I'm unsure

ggg-s · 2026-05-08T03:40:59Z

hi @dcw02 Can the current PCG be used?

Qiaolin-Yu · 2026-05-08T20:47:01Z

+class TestDFlashServerSpecV2(TestDFlashServerBase):
+    spec_v2 = True
+
+    @unittest.skip


qq: why do we need to skip this?

I just tested and it passes, so I re-enabled it. I think prior to the merge commit f39d86d4 there was a bug with PCG in flashinfer backend that caused it to fail.

Qiaolin-Yu · 2026-05-08T20:51:47Z

@@ -26,6 +28,8 @@ class TestDFlashServerBase(CustomTestCase, MatchedStopMixin, GSM8KMixin):
    attention_backend = "flashinfer"


qq: Does dflash only support flashinfer?

DFlash supports fa3, fa4, flashinfer, and triton as speculative draft attention backends. For best performance on b200s you can mix the trtllm_mha target attention backend with the fa4 draft attention backend. trtllm_mha isn't supported as the DFlash draft backend, since the draft requires non-causal full_attention/ENCODER_ONLY.
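As a sketch of the constraint described in this reply, a draft-backend check could look like the following. The set contents come from the reply itself; `check_draft_backend` is a hypothetical helper written for illustration and is not the validation code in sglang's server_args.py.

```python
# Draft attention backends listed in the reply above.
SUPPORTED_DRAFT_BACKENDS = frozenset({"fa3", "fa4", "flashinfer", "triton"})


def check_draft_backend(name: str) -> str:
    """Return `name` if it is a listed DFlash draft backend, else raise."""
    if name not in SUPPORTED_DRAFT_BACKENDS:
        supported = ", ".join(sorted(SUPPORTED_DRAFT_BACKENDS))
        raise ValueError(
            f"{name!r} is not a listed DFlash draft attention backend "
            f"(listed: {supported})"
        )
    return name


# The B200 combination suggested above: trtllm_mha target, fa4 draft.
target_backend = "trtllm_mha"
draft_backend = check_draft_backend("fa4")
print(target_backend, draft_backend)
```

Note that the check applies only to the draft side; per the reply, trtllm_mha remains usable for target attention.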

Qiaolin-Yu · 2026-05-08T20:55:12Z

@@ -110,6 +97,23 @@ def _lazy_init_buf(self, draft_input: EagleDraftInput):
            device=self.device,
        )

+        if self.spec_algo.is_dflash():


nit: I prefer adding a more general function (something like need_topk) instead of checking whether it's dflash here. What do you think?

yes agreed, I changed it to SpeculativeAlgorithm.need_topk() instead of special casing it to DFlash
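The shape of this refactor can be sketched with a toy enum: the call site asks about a capability instead of checking for one specific algorithm. `ToySpecAlgo`, `buffers_to_allocate`, the buffer names, and which members return True are all assumptions made for illustration; the thread only confirms that a `need_topk()` method was added to `SpeculativeAlgorithm`.

```python
from enum import Enum, auto


class ToySpecAlgo(Enum):
    NONE = auto()
    EAGLE = auto()
    DFLASH = auto()

    def need_topk(self) -> bool:
        # Assumption for this sketch only: just the EAGLE-style member
        # needs top-k buffers. The real mapping lives in sglang.
        return self is ToySpecAlgo.EAGLE


def buffers_to_allocate(algo: ToySpecAlgo) -> list[str]:
    """Pick buffers by capability rather than by algorithm identity."""
    buffers = ["verified_id"]  # hypothetical buffer name
    if algo.need_topk():
        buffers += ["topk_p", "topk_index"]  # hypothetical buffer names
    return buffers


print(buffers_to_allocate(ToySpecAlgo.DFLASH))
print(buffers_to_allocate(ToySpecAlgo.EAGLE))
```

With this structure, a future algorithm only needs to answer `need_topk()` correctly; the buffer-initialization code does not have to learn its name.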

Qiaolin-Yu · 2026-05-08T22:37:01Z

-            logger.warning(
-                "Overlap scheduler is disabled when using DFLASH speculative decoding (spec v2 is not supported yet)."
-            )
+            if envs.SGLANG_ENABLE_SPEC_V2.get():


spec v2 is now enabled by default, so the logic here may need to be changed

yes let me do a merge from main and I will update the logic here

ok updated the logic now that spec v2 is default

dssugar · 2026-05-11T13:41:08Z

FYI: I tried this PR's gemma4_causal.py / gemma4_mm.py changes against
dense Gemma 4 31B (RedHatAI/gemma-4-31B-it-NVFP4) with z-lab/gemma-4-31B-it-DFlash
as the drafter on a single RTX 5090 (sm120) with attention_backend=triton
(fa4 / trtllm_mha are not available on this device). v1 DFlash path.

Results (warm, temperature=0, short prompts):

code 100w: 158.4 tok/s (vs MTP baseline 83.8 = 1.89x; vs the device's
vLLM main + MTP at 164 = 97%)
haiku: 92.9 tok/s, jp: 64.0 tok/s
server-log accept length (code peak): 4.47, accept rate 0.23
jp accept length: 1.80, rate 0.05 (predictable: JP is harder to draft)

One small thing that wasn't covered by this PR: Gemma 4 ties lm_head to
embed_tokens (a plain nn.Embedding subclass, not VocabParallelEmbedding),
so dflash_worker._prepare_for_speculative_decoding rejects it at
hasattr(lm_head, "shard_indices"). I worked around it by setattr-ing a
trivial VocabParallelEmbeddingShardIndices (tp=1, num_added=0) onto
lm_head, which lets the fast path (tp_size == 1 and num_added == 0) match
without touching the TP / added-vocab branches. Approximate diff:

# in gemma4_causal.py
from sglang.srt.layers.vocab_parallel_embedding import (
    ParallelLMHead, VocabParallelEmbeddingShardIndices,
)

def _ensure_dflash_shard_indices(lm_head, vocab_size: int) -> None:
    if getattr(lm_head, "shard_indices", None) is not None:
        return
    lm_head.shard_indices = VocabParallelEmbeddingShardIndices(
        padded_org_vocab_start_index=0,
        padded_org_vocab_end_index=vocab_size,
        padded_added_vocab_start_index=vocab_size,
        padded_added_vocab_end_index=vocab_size,
        org_vocab_start_index=0,
        org_vocab_end_index=vocab_size,
        added_vocab_start_index=vocab_size,
        added_vocab_end_index=vocab_size,
    )

# call after the lm_head assignment in __init__ of
# Gemma4ForCausalLM and Gemma4ForConditionalGeneration:
_ensure_dflash_shard_indices(self.lm_head, vocab_size)

Not asking for changes to this PR (its scope is V2 / overlap scheduling).
Just sharing in case someone hits the same wall on a tied-embedding model.
I've only verified short/temp=0 generation so far — haven't checked for the
gibberish loop reported on vLLM #41262 (TP=2). Thanks for the work on V2!

dcw02 · 2026-05-11T13:45:16Z

@dssugar feel free to put up a PR to add gemma 4 support to v1!

dcw02 · 2026-05-13T02:50:30Z

I added a DFlash-only prefill refill heuristic for online scheduling; I think it can be removed once mixed prefill/decode is fully implemented. Previously, the scheduler would admit new prefill as soon as a single running request finished. At max concurrency, this produced many one-request prefill batches near full decode occupancy, which massively reduced throughput. The heuristic now waits until a small target number of running-request slots are free before admitting prefill work, so that the slots are refilled together.

I ran extensive sweeps and ablations using many target / draft model combinations and set a good default target of 2/3/4/4 for max-running 8/16/32/64. An env override SGLANG_DFLASH_PREFILL_REFILL_TARGET remains for benchmarking/restoring old immediate-refill behavior.

This could be more generalized to other spec methods but I'm keeping it DFlash specific since all the benchmarking evidence and tuning is for DFlash's refill cadence. Other spec algorithms are unchanged.

Qwen/Qwen3-8B gsm8k concurrency 32 performance: 13,879.59 tok/s -> 15,326.81 tok/s
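A minimal sketch of the refill rule described above, under several assumptions: the function names are hypothetical, sizes between the listed max-running values fall back to the nearest lower entry, sizes below 8 refill immediately, and the env override is read as a slot count where 1 restores immediate refill. The actual scheduler code in the PR may differ on all of these points.

```python
import os

# Defaults quoted above: max-running 8/16/32/64 -> refill target 2/3/4/4.
DEFAULT_TARGETS = ((8, 2), (16, 3), (32, 4), (64, 4))
ENV_OVERRIDE = "SGLANG_DFLASH_PREFILL_REFILL_TARGET"


def refill_target(max_running: int) -> int:
    """Number of free slots to wait for before admitting prefill."""
    override = os.environ.get(ENV_OVERRIDE)
    if override is not None:
        return max(1, int(override))  # assumption: 1 == immediate refill
    target = 1  # assumption: small batch limits refill immediately
    for threshold, value in DEFAULT_TARGETS:
        if max_running >= threshold:
            target = value
    return target


def should_admit_prefill(num_running: int, max_running: int, num_waiting: int) -> bool:
    """Admit waiting requests only once enough decode slots are free."""
    if num_waiting == 0:
        return False
    free_slots = max_running - num_running
    if num_running == 0:
        return free_slots > 0  # nothing is decoding, so there is no reason to wait
    return free_slots >= refill_target(max_running)


print(should_admit_prefill(num_running=31, max_running=32, num_waiting=5))
print(should_admit_prefill(num_running=28, max_running=32, num_waiting=5))
```

In this sketch, with 32 slots a single finished request does not trigger a one-request prefill batch; prefill is admitted once four slots have drained, which matches the batching behavior the comment describes.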

tugot17 · 2026-05-20T17:13:50Z

@dcw02 any progress on this?

dcw02 added 6 commits April 16, 2026 19:19

feat(spec): add dflash spec v2

70bdac0

remove benchmark sweep

9e87ef4

remove dflash spec v2 specific env

c0a329d

clean up

de0372a

remove mamba memory calculations

a722fee

update test for spec v2 and overlap plan streams

ad6f0bb

dcw02 requested review from Qiaolin-Yu, Ying1123, hanming-lu, hnyls2002, hzh0425, ispobock, merrymercy, xiezhq-hermann and yizhang2077 as code owners April 16, 2026 22:51

dcw02 mentioned this pull request Apr 16, 2026

[Feature] Add spec v2 (overlap scheduling) to DFlash speculative decoding support #20547

Closed

5 tasks

dcw02 added the run-ci label Apr 16, 2026

Qiaolin-Yu mentioned this pull request Apr 17, 2026

Speculative Decoding Development Roadmap (2026 Q2) #23005

Open

11 tasks

dcw02 added 2 commits April 25, 2026 00:43

fix dflash rope config for transformers v5

7ea9fbd

small cleanup

6465189

decouple dflash v2 next step planning from lagging host metadata

ce09806

draft swa layer support

89a4a26

Merge branch 'main' of github.com:sgl-project/sglang into dcw02/dflash-spec-v2

f39d86d

dcw02 requested a review from kpham-sgl as a code owner May 1, 2026 04:56

gemma 4 support

9893ef8

dcw02 force-pushed the dcw02/dflash-spec-v2 branch from 8ae7dd3 to 9893ef8 Compare May 1, 2026 05:01

Qiaolin-Yu reviewed May 8, 2026

View reviewed changes

dssugar mentioned this pull request May 11, 2026

[Gemma 4] Add DFLASH speculative decoding support #24985

Open

dcw02 added 10 commits May 11, 2026 22:03

clean up dead methods from gemma 4 dflash support

fe8ceef

re-enable greedy determinism test

bf853ca

spec algo need_topk() for future map instead of special casing dflash

d89e71e

Merge remote-tracking branch 'origin/main' into dcw02/dflash-spec-v2

ebb2526

clean up gemma 4 changes

c07177a

update dflash server_args.py for spec v2 default

35cbfea

simplify dflash server_args

f5ba4cb

enable pcg for dflash

5192b04

remove useless verify prep side stream that was immediately joined before target verify

55a7980

DFlash prefill refill heuristic

93758bb
