[Gemma 4] Add DFLASH speculative decoding support by dssugar · Pull Request #24985 · sgl-project/sglang

dssugar · 2026-05-11T14:14:47Z

Motivation

Adds DFLASH speculative decoding support for Gemma 4 (both Gemma4ForCausalLM
and Gemma4ForConditionalGeneration), mirroring the existing EAGLE3 capture
hook. Picks up @dcw02's invitation in #23000 to "put up a PR to add gemma 4
support to v1" — see #23000 (comment).

Once merged, this unblocks z-lab/gemma-4-31B-it-DFlash and
z-lab/gemma-4-26B-A4B-it-DFlash against any Gemma 4 target.

Changes

- gemma4_causal.py / gemma4_mm.py: add set_dflash_layers_to_capture
  next to the existing set_eagle3_layers_to_capture. Uses the same
  model.layers_to_capture = [val + 1 for val in layer_ids] convention.
  Unlike EAGLE3, DFLASH requires explicit layer_ids (the checkpoint's
  dflash_config.target_layer_ids), so None raises.
- _ensure_dflash_shard_indices helper (in gemma4_causal.py):
  Gemma 4 ties lm_head to embed_tokens (a plain nn.Embedding subclass
  via Gemma4TextScaledWordEmbedding), so
  dflash_worker._prepare_for_speculative_decoding rejects it at
  hasattr(lm_head, "shard_indices"). The helper setattr-s a trivial
  VocabParallelEmbeddingShardIndices (tp=1, num_added=0) onto the tied
  head; the worker's fast path (tp_size == 1 and num_added == 0) then
  proceeds without touching the TP or added-vocab branches.
- Gemma4ForConditionalGeneration.lm_head: previously not exposed.
  Aliased to self.language_model.embed_tokens so the DFLASH worker can
  resolve it via target_model.lm_head. No state-dict impact (it's a
  pure attribute pointer, not a new submodule).
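
The three changes above can be sketched with toy stand-ins. This is not the SGLang code: ToyGemma4, ShardIndices, the ShardIndices field names, and the choice of ValueError are hypothetical and exist only for illustration. What comes from the description is the val + 1 convention, the rejection of None, the hasattr(lm_head, "shard_indices") gate, and the tied-head alias.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List, Optional


@dataclass
class ShardIndices:
    """Hypothetical stand-in for VocabParallelEmbeddingShardIndices.

    Field names are assumptions; only "tp=1, num_added=0" comes from the PR text.
    """
    vocab_start: int
    vocab_end: int
    num_added: int = 0


class ToyGemma4:
    """Toy stand-in for a Gemma 4 model class (not the real implementation)."""

    def __init__(self, vocab_size: int) -> None:
        self.model = SimpleNamespace(layers_to_capture=[])
        self.embed_tokens = SimpleNamespace(num_embeddings=vocab_size)
        # Tied head: lm_head is an alias of the embedding object, not a new module.
        self.lm_head = self.embed_tokens

    def set_dflash_layers_to_capture(self, layer_ids: Optional[List[int]]) -> None:
        # DFLASH needs the checkpoint's explicit target layer ids, so None is rejected.
        if layer_ids is None:
            raise ValueError("DFLASH requires explicit layer_ids")
        # Same +1 offset convention as the EAGLE3 hook described above.
        self.model.layers_to_capture = [val + 1 for val in layer_ids]


def ensure_dflash_shard_indices(lm_head) -> None:
    """Attach trivial single-shard indices if the head does not expose any."""
    if hasattr(lm_head, "shard_indices"):
        return
    lm_head.shard_indices = ShardIndices(
        vocab_start=0, vocab_end=lm_head.num_embeddings, num_added=0
    )


toy = ToyGemma4(vocab_size=1000)
toy.set_dflash_layers_to_capture([1, 5, 9])
ensure_dflash_shard_indices(toy.lm_head)
```

Because lm_head and embed_tokens are the same object in this sketch, the attribute set on one is visible through the other, which is the property the alias in the multimodal class relies on.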

Local verification

Verified on a fresh venv built from this branch
(pip install ./python from the branch tip),
single RTX 5090 (sm120), attention-backend=triton,
kv-cache-dtype=fp8_e4m3, RedHatAI/gemma-4-31B-it-NVFP4 target,
z-lab/gemma-4-31B-it-DFlash drafter, temperature=0 short prompts:

| metric | value |
| --- | --- |
| code (100w) warm tok/s | 150.1 |
| haiku warm tok/s | 79.5 |
| jp warm tok/s | 63.5 |
| code accept length (peak) | 4.45 |
| code accept rate | 0.23 |

Output quality looks healthy — e.g. haiku:

Power in the chip, / Frames fly fast in every scene, / Future of the game.

Also smoke-tested extended scenarios that historically triggered the
vllm-project/vllm#41262 repetitive-token symptom for adjacent stacks:

- 600-token English long-form (completion_tokens=600): full essay
  on speculative decoding, no repetitive substring patterns
  (30+ char × 4 regex match count = 0).
- 3-turn multi-turn conversation: assistant correctly recalls earlier
  user-stated fact across turns.
- 300-token Japanese long-form: structured markdown output with
  per-section labels (**1. アーキテクチャ** etc.), no degradation in
  the lower-accept-rate regime where JP normally sits.
- Streaming endpoint (SSE chat/completions with stream=true):
  DFlash multi-token chunks are delivered correctly through the
  streaming path.
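
For the "30+ char × 4" repetition check in the first item, one possible formulation is sketched below. The exact regex used for the PR is not shown, so the pattern and the function name count_repetitive_spans are assumptions; in particular, this version assumes the four copies appear back to back.

```python
import re

# Assumed pattern: a unit of at least 30 characters that occurs four times
# consecutively (the captured unit followed by three more copies of itself).
REPEAT_PATTERN = re.compile(r"(.{30,}?)\1{3}", re.DOTALL)


def count_repetitive_spans(text: str) -> int:
    """Count non-overlapping spans where a 30+ character unit repeats 4 times."""
    return sum(1 for _ in REPEAT_PATTERN.finditer(text))


healthy = "Speculative decoding drafts several tokens and verifies them in one pass."
looping = "The quick brown fox jumps over the lazy dog. " * 4

healthy_count = count_repetitive_spans(healthy)
looping_count = count_repetitive_spans(looping)
```

A count of zero on a long completion is what the check above reports; a backreference-based pattern like this is slow on very long inputs, so it suits short smoke tests rather than bulk log scanning.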

For reference, on the same box NEXTN MTP gives ~84 tok/s on code
(this PR is ~1.79× that on the same workload), and vLLM main + MTP gives
~164 tok/s (this PR closes most of that gap on the SGLang side).
I cross-checked the same patch shape on an earlier bcf8d100 base
(gemma4-mtp-fin branch), where I first found the missing
set_dflash_layers_to_capture and the shard_indices gate, and saw the same
behavior (158.4 / 92.9 / 64.0 tok/s for code / haiku / jp, accept length peak 4.47).
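
The ratios quoted above follow directly from the throughput figures. A small sketch of the arithmetic, using the rounded baselines from the text (~84 and ~164 tok/s), so the results are approximate; the helper name speedup is illustrative only.

```python
def speedup(candidate_tok_s: float, baseline_tok_s: float) -> float:
    """Return candidate throughput as a multiple of the baseline throughput."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tok_s / baseline_tok_s


# DFLASH (this PR) vs NEXTN MTP on the code prompt.
print(f"{speedup(150.1, 84.0):.2f}x")   # 1.79x
# DFLASH (this PR) relative to vLLM main + MTP on the same box.
print(f"{speedup(150.1, 164.0):.2f}x")  # 0.92x
```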

Scope not yet exercised

- TP > 1 (single-GPU test box only)
- Long context (verified only at --context-length 8192, default
  mem-fraction-static 0.75)
- Non-greedy sampling (all of the above is temperature=0)
- The /C-loop / repetitive-token failure mode reported in
  vllm-project/vllm#41262 ("[Bug]: Gemma-4 31B with DFlash speculator
  produces gibberish/repetitive token loop") for
  RedHatAI/gemma-4-31B-it-speculator.dflash with TP=2: not retested
  (this box is single-GPU)

Happy to extend testing if reviewers flag a specific gap.

Adds set_dflash_layers_to_capture to Gemma4ForCausalLM and Gemma4ForConditionalGeneration, mirroring the existing EAGLE3 capture path. This unblocks DFLASH v1 with z-lab/gemma-4-{31B-it,26B-A4B-it}-DFlash on top of any Gemma 4 target.

Gemma 4 ties lm_head to embed_tokens (a plain nn.Embedding subclass, not a VocabParallelEmbedding), so dflash_worker._prepare_for_speculative_decoding rejects it at hasattr(lm_head, "shard_indices"). Adds a tiny helper that injects a trivial tp=1 VocabParallelEmbeddingShardIndices onto the tied lm_head; the worker's fast path (tp_size == 1 and num_added == 0) handles greedy verification without touching TP / added-vocab branches.

The MM class previously didn't expose self.lm_head; this commit makes it an alias to language_model.embed_tokens so the DFLASH worker can find it via target_model.lm_head.

Verified locally on a single RTX 5090 (sm120, triton attention backend) with RedHatAI/gemma-4-31B-it-NVFP4 as target and z-lab/gemma-4-31B-it-DFlash as drafter:

- code 100w warm: 158.4 tok/s (vs MTP baseline 83.8, 1.89x)
- haiku warm: 92.9 tok/s
- jp warm: 64.0 tok/s
- server-log accept length (code peak): 4.47, accept rate 0.23

Scope verified is short prompts with temperature=0; longer / multi-turn / streaming and the gibberish symptom reported on vllm-project/vllm#41262 (TP=2) have not been retested.

Refs sgl-project#23000 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dcw02 · 2026-05-11T22:02:15Z

Took a look, and this seems fine to me, though I’ll defer to @hnyls2002 or @kpham-sgl on the preferred style. This PR adds a Gemma 4 compatibility shim for the current DFLASH worker assumptions, while SGLang generally expects models to opt in by exposing the speculative-decoding interface explicitly.

kpham-sgl

Hi @dssugar, thanks for the contribution! Can you test with the same set of benchmarks used in PR #24436?

dssugar requested a review from kpham-sgl as a code owner May 11, 2026 14:14

kpham-sgl self-assigned this May 15, 2026

kpham-sgl reviewed May 17, 2026

dssugar wants to merge 1 commit into sgl-project:main from dssugar:add-gemma4-v1-dflash-support
