Skip to content

[Spec Decode] Allow DFlash drafter to autoselect non-causal-capable backend on Gemma 4#42069

Open
mikeumus wants to merge 2 commits into
vllm-project:mainfrom
Divinci-AI:gemma4-dflash-decouple
Open

[Spec Decode] Allow DFlash drafter to autoselect non-causal-capable backend on Gemma 4#42069
mikeumus wants to merge 2 commits into
vllm-project:mainfrom
Divinci-AI:gemma4-dflash-decouple

Conversation

@mikeumus
Copy link
Copy Markdown

@mikeumus mikeumus commented May 8, 2026

Summary

Fixes #42068.

When the target is Gemma 4 (heterogeneous head dimensions: head_dim=256, global_head_dim=512), Gemma4Config.verify_and_update_config force-locks attention_config.backend to TRITON_ATTN to prevent mixed-backend numerical divergence within the target's own forward (sliding vs global attention).

That lock is correct for the target's intra-forward consistency, but it propagates to the drafter via DFlashProposer._create_draft_vllm_config. DFlash drafters use non-causal (bidirectional) attention to draft a whole block in one pass — TRITON_ATTN rejects this:

ValueError: Selected backend AttentionBackendEnum.TRITON_ATTN is not valid for
this configuration. Reason: ['non-causal attention not supported']

Result: Gemma 4 + DFlash speculative decoding is structurally impossible upstream today.

Why the lock doesn't apply to spec-decode

The "mixed-backend numerical divergence" risk is legitimate inside one forward pass (Gemma 4's own sliding-vs-global layers). It is not legitimate across spec-decode where target and drafter are separate nn.Modules with separate KV caches and separate forwards — rejection sampling tolerates numerical drift by design. The MTP case (#41745) is the exception (KV-shared with target — must inherit backend); DFlash and any other independent-KV drafter is the general case where the drafter should be free to autoselect.

The change

One method override in vllm/v1/spec_decode/dflash.py:

@override
def _create_draft_vllm_config(self) -> VllmConfig:
    base = super()._create_draft_vllm_config()
    return replace(
        base,
        attention_config=replace(
            base.attention_config,
            use_non_causal=True,
            backend=None,    # <-- bypass the MTP-style propagation
        ),
    )

backend=None lets the standard autoselect pick a backend that supports non-causal attention (FLEX_ATTENTION / FLASHINFER) on the drafter, while leaving the target's TRITON_ATTN lock untouched. use_non_causal=True is preserved.

Precedent

vllm-ascend (sibling project for Ascend NPUs) already merged the equivalent decoupling: vllm-project/vllm-ascend#7342"Separate attention backend for target and draft model" by @SidaoY.

Measured impact

Modal H100, 10 mixed prompts (5 math + 5 conversational), temperature=0.0, max_new_tokens=256, vLLM nightly + this patch overlay:

Phase Target Avg speedup Math-reasoning peak
1 google/gemma-4-31B-it (stock) 1.28× 4.4×
2 google/gemma-4-31B-it + Divinci QLoRA-DFO (merged bf16) 1.18× 4.0×

Phase 2's QLoRA-fine-tuned target retains 92% of the stock-target speedup despite the drafter being conditioned on stock Gemma 4 hidden states — confirming the fix is broadly useful for the fine-tune ecosystem, not just stock targets. Output text was bit-identical between with-DFlash and without-DFlash runs (verifier's lossless guarantee held).

Test plan

  • Engine init succeeds end-to-end on Modal H100 with target=google/gemma-4-31B-it, drafter=z-lab/gemma-4-31B-it-DFlash
  • Drafter picks FLEX_ATTENTION via autoselect (vs the failing TRITON_ATTN on main)
  • Both models load; torch.compile completes for backbone + eagle_head
  • A/B speedup measured (table above)
  • Output bit-identical with vs without DFlash on the same prompt set
  • CI

Related

DCO sign-off: yes (Signed-off-by: Mike Mooring <mike@divinci.ai>)

🤖 Generated with Claude Code

mikeumus added 2 commits May 7, 2026 17:17
…mma4 target

When the target is Gemma4 (heterogeneous head dimensions:
head_dim=256, global_head_dim=512), Gemma4Config.verify_and_update_config
force-locks attention_config.backend to TRITON_ATTN. The base proposer
at llm_base_proposer.py:1320-1326 should reset that to None for the
drafter (spec_cfg.attention_backend default) but in our reproducer the
drafter still ends up on Triton, which DFlash's non-causal drafter
attention then rejects at engine init.

This patch:
  1. Defensively forces backend=None in DFlashProposer._create_draft_vllm_config
     so the drafter goes through autoselect (likely FlashInfer for non-causal)
  2. Adds a [DIVINCI-FORK] diagnostic log showing the backend values
     across the base->fix transition, so we can confirm where the
     unexpected Triton override originates

If (1) is sufficient, this is the proper fix and we'll clean up the
diagnostic before the upstream PR. If (1) is NOT sufficient (drafter
still ends up on Triton), the diagnostic will tell us where the
override comes from after our reset, narrowing the next iteration.

Signed-off-by: Mike Mooring <mike@divinci.ai>
…ackend on Gemma 4

When the target is Gemma 4 (heterogeneous head dimensions:
head_dim=256, global_head_dim=512), Gemma4Config.verify_and_update_config
in vllm/model_executor/models/config.py force-locks
attention_config.backend to TRITON_ATTN to prevent mixed-backend
numerical divergence within the target's forward (sliding vs global
attention layers).

This is correct for the target's own forward pass. But in spec-decode,
target and drafter are separate models with separate KV caches and
separate forwards — they're algorithmically independent and rejection
sampling tolerates numerical drift by design. The base
SpecDecodeBaseProposer._create_draft_vllm_config explicitly says
"Never inherit the attention backend from base" and resets to
spec_cfg.attention_backend (default None) for exactly this reason.

For DFlash specifically, the drafter requires non-causal (bidirectional)
attention — TRITON_ATTN doesn't support this and rejects the drafter
at engine init with:

  ValueError: Selected backend AttentionBackendEnum.TRITON_ATTN is not
  valid for this configuration. Reason: ['non-causal attention not supported']

This patch defensively forces backend=None on the drafter's
attention_config, letting the standard autoselect pick a backend
that supports non-causal (e.g. FLEX_ATTENTION). use_non_causal=True
is preserved.

Verified end-to-end on Modal H100 with target=google/gemma-4-31B-it,
drafter=z-lab/gemma-4-31B-it-DFlash:
  - Engine init succeeds (drafter picks FLEX_ATTENTION via autoselect)
  - Both models load, torch.compile completes for backbone + eagle_head
  - 10-prompt A/B (with vs without DFlash) shows 1.28x average,
    4.4x peak speedup on math reasoning prompts

Related issue: filed as Divinci-AI/vllm fork demonstration; full vLLM
issue + PR will reference this commit's e2e validation.

Signed-off-by: Mike Mooring <mike@divinci.ai>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the _create_draft_vllm_config method in vllm/v1/spec_decode/dflash.py to explicitly set the attention backend to None. This change ensures that the drafter can utilize the standard autoselect path to choose a backend supporting non-causal attention (such as FLEX_ATTENTION), preventing it from being locked into a backend like TRITON_ATTN which is often forced by models like Gemma 4. I have no feedback to provide as there were no review comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters

1 participant