Add Gemma 4 + DFlash compose (vLLM PR #41703 Codex-rebased overlay) by noonghunna · Pull Request #81 · noonghunna/club-3090

noonghunna · 2026-05-06T14:15:17Z

Summary

Adds vllm/gemma-dflash variant — Gemma 4 31B + z-lab DFlash block-diffusion drafter (vLLM PR #41703, Codex-rebased onto upstream/main)
First Ampere consumer cross-rig data on DFlash + Gemma 4 (Google benched on RTX PRO 6000 Blackwell; this is 2× RTX 3090 sm_86, PCIe-only)
12-file RO-mount overlay vendored under models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ — drop the dir when PR #41703 merges and a nightly rebuilds against it

Bench at shipped n=7 (TP=2, 2× 3090, 230W cap)

Metric	Value
Narr wall TPS	95.16 (1.56× over no-spec baseline)
Code wall TPS	167.55 (2.74× over baseline)
AL code	5.23
Avg accept code	~60%
VRAM	~22.7 GB/card (TP=2 split)

Soak PASS (100 turns continuous, 0 errors / 0 silent-empty / 0 MiB growth / 98.6% retention / p50 55.78 TPS).

n-sweep highlights

n	Narr	Code	Verdict
5	109	141	best narrative (override hint for chat)
6	99	161	knee
7	95	168	shipped — code-optimal
8	91	167	strictly dominated by n=7
15	82	172	past the knee, accept tail 8%

Narrative monotonically degrades; code TPS saturates at n=7. Override num_speculative_tokens to 5 for narrative-heavy / chat workloads (compose comment documents this).

DFlash vs MTP on Gemma 4

Method	Narr	Code
MTP n=4 (gemma-mtp.yml)	109	142
DFlash n=7 (this PR)	95	168

DFlash wins code (+18%), MTP wins narrative (+15%) — different operating regimes; users should pick based on workload.

What this PR adds

models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml — TP=2, BF16 KV, 32K ctx, vision tower preserved, n=7
models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ — 12 vendored vLLM Python files + README with provenance + drop conditions
scripts/switch.sh entry: vllm/gemma-dflash → port 8032
BENCHMARKS.md row under Gemma 4 31B section

Codex rebase note

PR #41703 was authored against an older base than current main (pre-SpecDecodeBaseProposer refactor). ChatGPT/Codex cherry-picked the 6 PR commits onto upstream/main 5d0fd87038b cleanly. One manual fix preserved on top in this PR's overlay:

_warn_if_multimodal (PR's name) → _raise_if_multimodal (post-2026-04 main rename) in v1/spec_decode/dflash.py — without this, the override doesn't take effect and DFlash rejects multimodal inputs with NotImplementedError.

Both findings are flagged in a comprehensive upstream report being prepared for PR #41703.

Test plan

bash scripts/switch.sh vllm/gemma-dflash boots cleanly
verify-full.sh 8/8 passes against the new endpoint at port 8032
bench.sh reproduces ~95 narr / 168 code TPS within CV
SOAK_MODE=continuous bash scripts/soak-test.sh passes (no growth, no silent-empty)
No regression on existing vllm/gemma-mtp (composes can coexist; only one runs at a time via switch.sh)

🤖 Generated with Claude Code

Cross-rig data on z-lab/gemma-4-31B-it-DFlash block-diffusion drafter — first Ampere consumer benchmark of DFlash on Gemma 4. PR #41703 was needs-rebase against pre-SpecDecodeBaseProposer-refactor main; ChatGPT/Codex cherry-picked the 6 PR commits onto upstream/main 5d0fd87038b cleanly with one manual fix on top (_warn_if_multimodal → _raise_if_multimodal rename, otherwise multimodal inputs throw NotImplementedError). Bench at shipped n=7 (TP=2, 2× 3090 PCIe, 230W cap): narrative: 95 wall TPS (1.56× over no-spec-decode baseline) code: 168 wall TPS (2.74× over baseline) Avg accept code: ~60%, AL 5.23 n-sweep summary (n=4..15): code TPS saturates at n=7; n=8 strictly dominated by n=7 (worse on both narr+code); n=15 past the knee. Narrative monotonically degrades with bigger n — n=5 is best for prose at 109/141, override hint documented in compose comment for chat workloads. Soak PASS: 100 turns, 0 errors, 0 silent-empty, 0 MiB growth, 98.6% TPS retention, p50 decode 55.78 TPS (vs 52.71 at n=5 — n=7 is strictly better under soak conditions too, with 2.2 GB lower peak VRAM). DFlash vs MTP on Gemma 4: DFlash wins code (+18%), MTP wins narrative (+15%). Different operating regimes — block-diffusion's larger draft horizon helps deterministic code more than prose. Adds: - models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml - models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ (12 RO-mounted Python files + README documenting provenance + drop conditions) - scripts/switch.sh entry: vllm/gemma-dflash → port 8032 - BENCHMARKS.md row under Gemma 4 31B section Drop the entire patches dir + overlay block when PR #41703 merges and a vLLM :nightly tag rebuilds against it.

noonghunna merged commit 89b65dd into master May 6, 2026

noonghunna deleted the gemma-dflash branch May 6, 2026 14:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Gemma 4 + DFlash compose (vLLM PR #41703 Codex-rebased overlay)#81

Add Gemma 4 + DFlash compose (vLLM PR #41703 Codex-rebased overlay)#81
noonghunna merged 1 commit into
masterfrom
gemma-dflash

noonghunna commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noonghunna commented May 6, 2026

Summary

Bench at shipped n=7 (TP=2, 2× 3090, 230W cap)

n-sweep highlights

DFlash vs MTP on Gemma 4

What this PR adds

Codex rebase note

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant