Skip to content

Add Gemma 4 + DFlash compose (vLLM PR #41703 Codex-rebased overlay)#81

Merged
noonghunna merged 1 commit into
masterfrom
gemma-dflash
May 6, 2026
Merged

Add Gemma 4 + DFlash compose (vLLM PR #41703 Codex-rebased overlay)#81
noonghunna merged 1 commit into
masterfrom
gemma-dflash

Conversation

@noonghunna

Copy link
Copy Markdown
Owner

Summary

  • Adds vllm/gemma-dflash variant — Gemma 4 31B + z-lab DFlash block-diffusion drafter (vLLM PR #41703, Codex-rebased onto upstream/main)
  • First Ampere consumer cross-rig data on DFlash + Gemma 4 (Google benched on RTX PRO 6000 Blackwell; this is 2× RTX 3090 sm_86, PCIe-only)
  • 12-file RO-mount overlay vendored under models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ — drop the dir when PR #41703 merges and a nightly rebuilds against it

Bench at shipped n=7 (TP=2, 2× 3090, 230W cap)

Metric Value
Narr wall TPS 95.16 (1.56× over no-spec baseline)
Code wall TPS 167.55 (2.74× over baseline)
AL code 5.23
Avg accept code ~60%
VRAM ~22.7 GB/card (TP=2 split)

Soak PASS (100 turns continuous, 0 errors / 0 silent-empty / 0 MiB growth / 98.6% retention / p50 55.78 TPS).

n-sweep highlights

n Narr Code Verdict
5 109 141 best narrative (override hint for chat)
6 99 161 knee
7 95 168 shipped — code-optimal
8 91 167 strictly dominated by n=7
15 82 172 past the knee, accept tail 8%

Narrative monotonically degrades; code TPS saturates at n=7. Override num_speculative_tokens to 5 for narrative-heavy / chat workloads (compose comment documents this).

DFlash vs MTP on Gemma 4

Method Narr Code
MTP n=4 (gemma-mtp.yml) 109 142
DFlash n=7 (this PR) 95 168

DFlash wins code (+18%), MTP wins narrative (+15%) — different operating regimes; users should pick based on workload.

What this PR adds

  • models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml — TP=2, BF16 KV, 32K ctx, vision tower preserved, n=7
  • models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ — 12 vendored vLLM Python files + README with provenance + drop conditions
  • scripts/switch.sh entry: vllm/gemma-dflash → port 8032
  • BENCHMARKS.md row under Gemma 4 31B section

Codex rebase note

PR #41703 was authored against an older base than current main (pre-SpecDecodeBaseProposer refactor). ChatGPT/Codex cherry-picked the 6 PR commits onto upstream/main 5d0fd87038b cleanly. One manual fix preserved on top in this PR's overlay:

  • _warn_if_multimodal (PR's name) → _raise_if_multimodal (post-2026-04 main rename) in v1/spec_decode/dflash.py — without this, the override doesn't take effect and DFlash rejects multimodal inputs with NotImplementedError.

Both findings are flagged in a comprehensive upstream report being prepared for PR #41703.

Test plan

  • bash scripts/switch.sh vllm/gemma-dflash boots cleanly
  • verify-full.sh 8/8 passes against the new endpoint at port 8032
  • bench.sh reproduces ~95 narr / 168 code TPS within CV
  • SOAK_MODE=continuous bash scripts/soak-test.sh passes (no growth, no silent-empty)
  • No regression on existing vllm/gemma-mtp (composes can coexist; only one runs at a time via switch.sh)

🤖 Generated with Claude Code

Cross-rig data on z-lab/gemma-4-31B-it-DFlash block-diffusion drafter — first
Ampere consumer benchmark of DFlash on Gemma 4. PR #41703 was needs-rebase
against pre-SpecDecodeBaseProposer-refactor main; ChatGPT/Codex cherry-picked
the 6 PR commits onto upstream/main 5d0fd87038b cleanly with one manual fix
on top (_warn_if_multimodal → _raise_if_multimodal rename, otherwise
multimodal inputs throw NotImplementedError).

Bench at shipped n=7 (TP=2, 2× 3090 PCIe, 230W cap):
  narrative:  95 wall TPS  (1.56× over no-spec-decode baseline)
  code:      168 wall TPS  (2.74× over baseline)
  Avg accept code: ~60%, AL 5.23

n-sweep summary (n=4..15): code TPS saturates at n=7; n=8 strictly dominated
by n=7 (worse on both narr+code); n=15 past the knee. Narrative monotonically
degrades with bigger n — n=5 is best for prose at 109/141, override hint
documented in compose comment for chat workloads.

Soak PASS: 100 turns, 0 errors, 0 silent-empty, 0 MiB growth, 98.6% TPS
retention, p50 decode 55.78 TPS (vs 52.71 at n=5 — n=7 is strictly better
under soak conditions too, with 2.2 GB lower peak VRAM).

DFlash vs MTP on Gemma 4: DFlash wins code (+18%), MTP wins narrative (+15%).
Different operating regimes — block-diffusion's larger draft horizon helps
deterministic code more than prose.

Adds:
- models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml
- models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ (12 RO-mounted Python
  files + README documenting provenance + drop conditions)
- scripts/switch.sh entry: vllm/gemma-dflash → port 8032
- BENCHMARKS.md row under Gemma 4 31B section

Drop the entire patches dir + overlay block when PR #41703 merges and a
vLLM :nightly tag rebuilds against it.
@noonghunna noonghunna merged commit 89b65dd into master May 6, 2026
@noonghunna noonghunna deleted the gemma-dflash branch May 6, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant