Add Gemma 4 + DFlash compose (vLLM PR #41703 Codex-rebased overlay) #81
Merged
Cross-rig data on z-lab/gemma-4-31B-it-DFlash block-diffusion drafter: first Ampere consumer benchmark of DFlash on Gemma 4.

PR #41703 was needs-rebase against pre-SpecDecodeBaseProposer-refactor main; ChatGPT/Codex cherry-picked the 6 PR commits onto upstream/main 5d0fd87038b cleanly with one manual fix on top (_warn_if_multimodal → _raise_if_multimodal rename, otherwise multimodal inputs throw NotImplementedError).

Bench at shipped n=7 (TP=2, 2× 3090 PCIe, 230W cap):
- narrative: 95 wall TPS (1.56× over no-spec-decode baseline)
- code: 168 wall TPS (2.74× over baseline)
- Avg accept code: ~60%, AL 5.23

n-sweep summary (n=4..15): code TPS saturates at n=7; n=8 strictly dominated by n=7 (worse on both narr+code); n=15 past the knee. Narrative monotonically degrades with bigger n; n=5 is best for prose at 109/141, override hint documented in compose comment for chat workloads.

Soak PASS: 100 turns, 0 errors, 0 silent-empty, 0 MiB growth, 98.6% TPS retention, p50 decode 55.78 TPS (vs 52.71 at n=5; n=7 is strictly better under soak conditions too, with 2.2 GB lower peak VRAM).

DFlash vs MTP on Gemma 4: DFlash wins code (+18%), MTP wins narrative (+15%). Different operating regimes: block-diffusion's larger draft horizon helps deterministic code more than prose.

Adds:
- models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml
- models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/ (12 RO-mounted Python files + README documenting provenance + drop conditions)
- scripts/switch.sh entry: vllm/gemma-dflash → port 8032
- BENCHMARKS.md row under Gemma 4 31B section

Drop the entire patches dir + overlay block when PR #41703 merges and a vLLM :nightly tag rebuilds against it.
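As a quick sanity check on the bench figures in the commit message, the sketch below back-computes the no-spec-decode baseline implied by each speedup and relates AL 5.23 to the ~60% accept figure. The function names are illustrative, and the accept-rate formula assumes that AL counts the one token the target model always emits per step; the commit message does not say how AL is defined.

```python
def implied_baseline(tps: float, speedup: float) -> float:
    """No-spec-decode TPS implied by a measured TPS and its reported speedup."""
    return tps / speedup


def per_token_accept_rate(acceptance_length: float, n: int) -> float:
    """Share of the n drafted tokens accepted per verification step.

    ASSUMPTION: acceptance_length includes the one token the target model
    always emits, so accepted draft tokens = acceptance_length - 1.
    """
    return (acceptance_length - 1.0) / n


if __name__ == "__main__":
    # Figures quoted in the commit message, at the shipped n=7.
    print(f"narrative baseline: {implied_baseline(95, 1.56):.1f} TPS")
    print(f"code baseline:      {implied_baseline(168, 2.74):.1f} TPS")
    print(f"accept rate:        {per_token_accept_rate(5.23, 7):.1%}")
```

Under that assumption both workloads imply a baseline near 61 TPS, and AL 5.23 at n=7 works out to about 60% per-token acceptance, which lines up with the reported figure.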
Summary
- vllm/gemma-dflash variant: Gemma 4 31B + z-lab DFlash block-diffusion drafter (vLLM PR #41703, Codex-rebased onto upstream/main)
- models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/: drop the dir when PR #41703 merges and a nightly rebuilds against it

Bench at shipped n=7 (TP=2, 2× 3090, 230W cap)
Soak PASS (100 turns continuous, 0 errors / 0 silent-empty / 0 MiB growth / 98.6% retention / p50 55.78 TPS).
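The soak criteria listed above can be read as a simple pass/fail predicate. The sketch below is one possible encoding, not the logic of scripts/soak-test.sh: the class, field names, and the `min_retention` threshold of 0.95 are all hypothetical, since the PR reports the measured values but not the thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoakResult:
    turns: int
    errors: int
    silent_empty: int      # turns that returned an empty completion without an error
    growth_mib: float      # memory growth over the run, in MiB
    tps_retention: float   # late-run TPS divided by early-run TPS


def soak_passes(
    result: SoakResult,
    min_retention: float = 0.95,   # ASSUMPTION: threshold is not stated in the PR
    max_growth_mib: float = 0.0,
) -> bool:
    """Return True when the run shows no errors, no empties, no growth, and stable TPS."""
    return (
        result.turns > 0
        and result.errors == 0
        and result.silent_empty == 0
        and result.growth_mib <= max_growth_mib
        and result.tps_retention >= min_retention
    )


# Values reported for the n=7 soak run in this PR.
reported = SoakResult(turns=100, errors=0, silent_empty=0, growth_mib=0.0, tps_retention=0.986)

if __name__ == "__main__":
    print("PASS" if soak_passes(reported) else "FAIL")
```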
n-sweep highlights
Narrative TPS degrades monotonically as n grows; code TPS saturates at n=7. Override num_speculative_tokens to 5 for narrative-heavy / chat workloads (compose comment documents this).

DFlash vs MTP on Gemma 4
DFlash wins code (+18%), MTP wins narrative (+15%) — different operating regimes; users should pick based on workload.
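The same pick-by-workload reasoning applies to the choice between n=5 and n=7. The sketch below estimates a blended TPS for a given share of code tokens and picks the better setting. It rests on two assumptions that are not in the PR: that "109/141" at n=5 means 109 narrative / 141 code TPS, and that a mixed workload behaves like a token-weighted blend of the two measured rates. The helper names are illustrative.

```python
# (narrative TPS, code TPS) per num_speculative_tokens, taken from this PR.
# ASSUMPTION: "109/141" at n=5 is read as 109 narrative / 141 code TPS.
MEASURED_TPS = {5: (109.0, 141.0), 7: (95.0, 168.0)}


def blended_tps(n: int, code_fraction: float) -> float:
    """Throughput when code_fraction of generated tokens are code.

    Time per token is the token-weighted sum of the per-workload times,
    so the blend is a weighted harmonic mean of the two rates.
    """
    narr_tps, code_tps = MEASURED_TPS[n]
    seconds_per_token = (1.0 - code_fraction) / narr_tps + code_fraction / code_tps
    return 1.0 / seconds_per_token


def best_n(code_fraction: float) -> int:
    """Pick the measured n with the highest blended TPS for this workload mix."""
    if not 0.0 <= code_fraction <= 1.0:
        raise ValueError("code_fraction must be between 0 and 1")
    return max(MEASURED_TPS, key=lambda n: blended_tps(n, code_fraction))


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"code share {fraction:.0%}: n={best_n(fraction)}")
```

Under this simplified model the crossover sits at roughly 54% code tokens; real mixed workloads may not blend this cleanly, so a rerun of bench.sh on representative prompts is the more reliable guide.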
What this PR adds
- models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml: TP=2, BF16 KV, 32K ctx, vision tower preserved, n=7
- models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/: 12 vendored vLLM Python files + README with provenance + drop conditions
- scripts/switch.sh entry: vllm/gemma-dflash → port 8032
- BENCHMARKS.md row under Gemma 4 31B section

Codex rebase note
PR #41703 was authored against an older base than current main (pre-SpecDecodeBaseProposer refactor). ChatGPT/Codex cherry-picked the 6 PR commits onto upstream/main 5d0fd87038b cleanly. One manual fix preserved on top in this PR's overlay:

- _warn_if_multimodal (PR's name) → _raise_if_multimodal (post-2026-04 main rename) in v1/spec_decode/dflash.py. Without this, the override doesn't take effect and DFlash rejects multimodal inputs with NotImplementedError.

Both findings are flagged in a comprehensive upstream report being prepared for PR #41703.
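To illustrate why the rename matters, here is a minimal sketch of the failure mode using stand-in classes. None of these classes are vLLM code: `BaseProposer`, `StaleDrafter`, and `FixedDrafter` are hypothetical, and only the two hook names come from this PR. The point is that a subclass method only overrides a base hook when the names match.

```python
class BaseProposer:
    """Hypothetical stand-in for a base class whose hook was renamed on main."""

    def _raise_if_multimodal(self, has_mm_inputs: bool) -> None:
        # Default behaviour: refuse multimodal inputs.
        if has_mm_inputs:
            raise NotImplementedError("multimodal inputs are not supported")

    def propose(self, has_mm_inputs: bool) -> str:
        # The base class only ever calls the new hook name.
        self._raise_if_multimodal(has_mm_inputs)
        return "draft tokens"


class StaleDrafter(BaseProposer):
    """Uses the old hook name, so the base default still runs."""

    def _warn_if_multimodal(self, has_mm_inputs: bool) -> None:
        pass  # never called by BaseProposer.propose


class FixedDrafter(BaseProposer):
    """Uses the new hook name, so the override takes effect."""

    def _raise_if_multimodal(self, has_mm_inputs: bool) -> None:
        pass  # accept multimodal inputs


if __name__ == "__main__":
    print(FixedDrafter().propose(has_mm_inputs=True))
    try:
        StaleDrafter().propose(has_mm_inputs=True)
    except NotImplementedError as exc:
        print(f"stale override ignored: {exc}")
```

Python raises no error for the unused `_warn_if_multimodal` method, which is why a stale hook name after a rebase can go unnoticed until a multimodal request arrives.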
Test plan
- bash scripts/switch.sh vllm/gemma-dflash boots cleanly
- verify-full.sh 8/8 passes against the new endpoint at port 8032
- bench.sh reproduces ~95 narr / 168 code TPS within CV
- SOAK_MODE=continuous bash scripts/soak-test.sh passes (no growth, no silent-empty)
- vllm/gemma-mtp (composes can coexist; only one runs at a time via switch.sh)

🤖 Generated with Claude Code