[Gemma 4] Add DFLASH speculative decoding support #24985

Open: dssugar wants to merge 1 commit into sgl-project:main from dssugar:add-gemma4-v1-dflash-support

@dssugar dssugar commented May 11, 2026

Motivation

Adds DFLASH speculative decoding support for Gemma 4 (both Gemma4ForCausalLM
and Gemma4ForConditionalGeneration), mirroring the existing EAGLE3 capture
hook. Picks up @dcw02's invitation in #23000 to "put up a PR to add gemma 4
support to v1" — see #23000 (comment).

Once merged, this unblocks z-lab/gemma-4-31B-it-DFlash and
z-lab/gemma-4-26B-A4B-it-DFlash against any Gemma 4 target.

Changes

  • gemma4_causal.py / gemma4_mm.py: add set_dflash_layers_to_capture
    next to the existing set_eagle3_layers_to_capture. Uses the same
    model.layers_to_capture = [val + 1 for val in layer_ids] convention.
    Unlike EAGLE3, DFLASH requires explicit layer_ids (the checkpoint's
    dflash_config.target_layer_ids), so None raises.

  • _ensure_dflash_shard_indices helper (in gemma4_causal.py):
    Gemma 4 ties lm_head to embed_tokens (a plain nn.Embedding subclass
    via Gemma4TextScaledWordEmbedding), so
    dflash_worker._prepare_for_speculative_decoding rejects it at
    hasattr(lm_head, "shard_indices"). The helper setattr-s a trivial
    VocabParallelEmbeddingShardIndices (tp=1, num_added=0) onto the tied
    head; the worker's fast path (tp_size == 1 and num_added == 0) then
    proceeds without touching the TP or added-vocab branches.

  • Gemma4ForConditionalGeneration.lm_head: previously not exposed.
    Aliased to self.language_model.embed_tokens so the DFLASH worker can
    resolve it via target_model.lm_head. No state-dict impact (it's a
    pure attribute pointer, not a new submodule).
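
The capture-hook contract described in the first bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: `ToyInnerModel` and `ToyGemma4ForCausalLM` are hypothetical stand-ins, and the layer ids are made-up values rather than a real `dflash_config.target_layer_ids`. Only the `val + 1` convention and the "`None` raises" rule come from the description above.

```python
from typing import List, Optional


class ToyInnerModel:
    """Hypothetical stand-in for the decoder stack; not the SGLang class."""

    def __init__(self) -> None:
        self.layers_to_capture: List[int] = []


class ToyGemma4ForCausalLM:
    """Hypothetical stand-in used only to illustrate the hook's contract."""

    def __init__(self) -> None:
        self.model = ToyInnerModel()

    def set_dflash_layers_to_capture(
        self, layer_ids: Optional[List[int]] = None
    ) -> None:
        # DFLASH has no default layer set: the ids must come from the
        # drafter checkpoint, so a missing list is treated as an error.
        if layer_ids is None:
            raise ValueError("DFLASH requires explicit layer_ids")
        # Same offset convention as the EAGLE3 hook quoted in the PR text.
        self.model.layers_to_capture = [val + 1 for val in layer_ids]


target = ToyGemma4ForCausalLM()
target.set_dflash_layers_to_capture([1, 9, 17])  # illustrative ids only
print(target.model.layers_to_capture)  # [2, 10, 18]
```
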

Local verification

Verified on a fresh venv built from this branch
(pip install ./python from the branch tip),
single RTX 5090 (sm120), attention-backend=triton,
kv-cache-dtype=fp8_e4m3, RedHatAI/gemma-4-31B-it-NVFP4 target,
z-lab/gemma-4-31B-it-DFlash drafter, temperature=0 short prompts:

| metric | value |
|---|---|
| code (100w) warm tok/s | 150.1 |
| haiku warm tok/s | 79.5 |
| jp warm tok/s | 63.5 |
| code accept length (peak) | 4.45 |
| code accept rate | 0.23 |

Output quality looks healthy — e.g. haiku:

Power in the chip, / Frames fly fast in every scene, / Future of the game.

Also smoke-tested extended scenarios that historically triggered the
vllm-project/vllm#41262 repetitive-token symptom for adjacent stacks:

  • 600-token English long-form (completion_tokens=600): full essay
    on speculative decoding, no repetitive substring patterns
    (30+ char × 4 regex match count = 0).
  • 3-turn multi-turn conversation: assistant correctly recalls earlier
    user-stated fact across turns.
  • 300-token Japanese long-form: structured markdown output with
    per-section labels (**1. アーキテクチャ** etc.), no degradation in
    the lower-accept-rate regime where JP normally sits.
  • Streaming endpoint (SSE chat/completions with stream=true):
    DFlash multi-token chunks are delivered correctly through the
    streaming path.
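
For readers who want to reproduce the repetition check from the first bullet, here is one possible reading of "30+ char × 4": count places where a substring of at least 30 characters occurs four times back to back. That interpretation, the regex, and the helper name `count_repetitive_runs` are assumptions for illustration; the exact pattern used for the numbers above is not given in this PR.

```python
import re

# Assumed interpretation: a chunk of >= 30 characters immediately
# followed by three more copies of itself (four occurrences in a row).
REPEAT_RE = re.compile(r"(.{30,}?)\1{3}", re.DOTALL)


def count_repetitive_runs(text: str) -> int:
    """Return the number of non-overlapping degenerate repeat runs."""
    return sum(1 for _ in REPEAT_RE.finditer(text))


healthy = "Speculative decoding drafts several tokens and verifies them in one pass."
degenerate = "the model keeps repeating this phrase. " * 5

print(count_repetitive_runs(healthy))
print(count_repetitive_runs(degenerate))
```

A count of zero on the generated text corresponds to the "match count = 0" result reported above, under this reading of the check.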

For reference, on the same box NEXTN MTP gives ~84 tok/s on code
(this PR is ~1.79× faster on that workload), and vLLM main + MTP gives
~164 tok/s (this PR closes most of that gap on the SGLang side).
Cross-checked the same patch shape on an earlier bcf8d100 base
(gemma4-mtp-fin branch) where I first found the missing
set_dflash_layers_to_capture and the shard_indices gate — same
behavior (158.4 / 92.9 / 64.0 tok/s, accept length peak 4.47).
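
The ratios quoted above follow directly from the reported throughputs. The short calculation below only re-derives them; the "gap closed" figure is one possible way to quantify "closes most of that gap" and is an assumption, not a number stated in the PR. The baselines are the approximate values given in the text.

```python
# Throughputs in tok/s on the code workload, as reported above.
dflash_tok_s = 150.1   # this PR (DFLASH)
nextn_tok_s = 84.0     # NEXTN MTP on SGLang (approximate)
vllm_mtp_tok_s = 164.0  # vLLM main + MTP (approximate)

# Speedup of DFLASH over the NEXTN MTP baseline.
speedup_vs_nextn = dflash_tok_s / nextn_tok_s

# Assumed definition: share of the NEXTN-to-vLLM gap covered by this PR.
gap_closed = (dflash_tok_s - nextn_tok_s) / (vllm_mtp_tok_s - nextn_tok_s)

print(f"speedup vs NEXTN MTP: {speedup_vs_nextn:.2f}x")
print(f"share of gap to vLLM closed: {gap_closed:.0%}")
```
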

Scope not yet exercised

Happy to extend testing if reviewers flag a specific gap.

Related

  • #23000 ([Feature] Spec V2 DFlash Support): orthogonal; this PR only
    touches the v1 hook surface and does not touch V2 / overlap scheduling.
    V2 should compose cleanly with this since the target-side hook is shared.

Adds set_dflash_layers_to_capture to Gemma4ForCausalLM and
Gemma4ForConditionalGeneration, mirroring the existing EAGLE3 capture path.
This unblocks DFLASH v1 with z-lab/gemma-4-{31B-it,26B-A4B-it}-DFlash on
top of any Gemma 4 target.

Gemma 4 ties lm_head to embed_tokens (a plain nn.Embedding subclass, not a
VocabParallelEmbedding), so dflash_worker._prepare_for_speculative_decoding
rejects it at hasattr(lm_head, "shard_indices"). Adds a tiny helper that
injects a trivial tp=1 VocabParallelEmbeddingShardIndices onto the tied
lm_head; the worker's fast path (tp_size == 1 and num_added == 0) handles
greedy verification without touching TP / added-vocab branches.
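
The shim idea can be illustrated with a small, self-contained sketch. `ToyShardIndices` is a simplified, hypothetical stand-in for `VocabParallelEmbeddingShardIndices` (the real class has more fields), and `ToyTiedHead` and `ensure_shard_indices` are illustrative names, not the helper added in this commit. The sketch only shows the shape of the approach: attach trivial single-shard metadata so that a `hasattr(lm_head, "shard_indices")` check passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyShardIndices:
    """Simplified stand-in; field names here are assumptions."""

    org_vocab_start_index: int
    org_vocab_end_index: int
    added_vocab_start_index: int
    added_vocab_end_index: int

    @property
    def num_added_elements(self) -> int:
        return self.added_vocab_end_index - self.added_vocab_start_index


class ToyTiedHead:
    """Stand-in for a tied embedding/lm_head without shard metadata."""

    def __init__(self, vocab_size: int) -> None:
        self.num_embeddings = vocab_size


def ensure_shard_indices(lm_head) -> None:
    """Attach trivial tp=1 shard metadata if the head has none."""
    if hasattr(lm_head, "shard_indices"):
        return  # already sharded or already patched; leave untouched
    vocab = lm_head.num_embeddings
    # One shard covering the whole vocabulary, with an empty added-vocab range.
    setattr(lm_head, "shard_indices", ToyShardIndices(0, vocab, vocab, vocab))


head = ToyTiedHead(vocab_size=1000)  # illustrative size only
ensure_shard_indices(head)
print(head.shard_indices.num_added_elements)  # 0
```

Because the helper returns early when `shard_indices` already exists, calling it on a genuinely vocab-parallel head would leave that head's real metadata in place.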

The MM class previously didn't expose self.lm_head; this commit makes it an
alias to language_model.embed_tokens so the DFLASH worker can find it via
target_model.lm_head.

Verified locally on a single RTX 5090 (sm120, triton attention backend) with
RedHatAI/gemma-4-31B-it-NVFP4 as target and z-lab/gemma-4-31B-it-DFlash as
drafter:
  - code 100w warm: 158.4 tok/s (vs MTP baseline 83.8, 1.89x)
  - haiku warm:      92.9 tok/s
  - jp warm:         64.0 tok/s
  - server-log accept length (code peak): 4.47, accept rate 0.23

Scope verified is short prompts with temperature=0; longer / multi-turn /
streaming and the gibberish symptom reported on vllm-project/vllm#41262
(TP=2) have not been retested.

Refs sgl-project#23000 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dssugar dssugar requested a review from kpham-sgl as a code owner May 11, 2026 14:14
@dcw02
Collaborator

dcw02 commented May 11, 2026

Took a look, and this seems fine to me, though I’ll defer to @hnyls2002 or @kpham-sgl on the preferred style. This PR adds a Gemma 4 compatibility shim for the current DFLASH worker assumptions, while SGLang generally expects models to opt in by exposing the speculative-decoding interface explicitly.

@kpham-sgl kpham-sgl self-assigned this May 15, 2026
Collaborator

@kpham-sgl kpham-sgl left a comment

Hi @dssugar, thanks for the contribution! Can you test with the same set of benchmarks used in PR #24436?
