Merged
1 change: 1 addition & 0 deletions BENCHMARKS.md
@@ -207,4 +207,5 @@ Cross-rig data on Google's official Gemma 4 MTP "assistant" drafter (released 20
| Compose | Rig | KV | Max ctx | Narr / Code TPS | AL | Per-pos accept (code) | Peak VRAM | Date | Notes |
|---|---|---|---:|---:|---:|---|---:|---|---|
| `gemma-mtp.yml` (TP=2) | @noonghunna (2× 3090 PCIe, no NVLink, 230W cap) | bf16 | 32K | **108.87 / 142.25** | **3.94-4.04** | 92 / 79 / 68 / 59 % | 22.5 GB/card | 2026-05-05 | First Ampere consumer cross-rig data on Google MTP drafters. **+1.79× narr / +2.31× code** over baseline (61 TPS no-spec-decode same TP). **PASSES continuous soak** (100 turns, 0 errors / 0 silent-empty / 0 MiB growth, 98.3% TPS retention). bf16 KV (fp8 blocked on Ampere — see TP=1 row). PR [#41745](https://github.com/vllm-project/vllm/pull/41745) overlay + transformers 5.8.0 entrypoint. |
| `gemma-dflash.yml` (TP=2, n=7) | @noonghunna (2× 3090 PCIe, no NVLink, 230W cap) | bf16 | 32K | **95.16 / 167.55** | **~3.0 narr / 5.23 code** | 89 / 78 / 66 / 57 / 50 / 43 / 39 % | 22.7 GB/card | 2026-05-06 | First Ampere consumer cross-rig data on **z-lab Gemma 4 DFlash** block-diffusion drafter (vLLM PR [#41703](https://github.com/vllm-project/vllm/pull/41703) — Codex-rebased onto upstream/main). **+2.74× code / +1.56× narr** over baseline. **PASSES continuous soak** (100 turns, 0 errors / 0 silent-empty / 0 MiB growth, 98.6% TPS retention, p50 55.8 TPS — 5.8% higher than n=5). DFlash dominates MTP on **code (+18%)**; MTP wins on narrative (+15%). n-sweep: n=5 109/141 (best narr) → n=6 99/161 (knee) → **n=7 95/168 (code-optimal default)** → n=8 91/167 (dominated) → n=15 82/172 (past knee). PR #41703 overlay (12 RO-mounted files) + transformers 5.8.0 + nightly `e47c98ef`. |
| `gemma-mtp-tp1.yml` (TP=1) | @noonghunna (1× 3090) | bf16 / fp8 | — | **boot OOM** | — | — | — | 2026-05-05 | **Upstream-blocked on Ampere consumer.** bf16 KV: weights+drafter+profiling at 8K ctx + mem-util 0.95 leaves zero KV pool ("No available memory for the cache blocks"). fp8 KV: Triton `fp8e4nv not supported in this architecture` on sm_86 (Ampere supports `fp8e4b15`/`fp8e5` only); but `fp8_e5m2` is rejected by `gemma4_mm.py:1336` allowlist. Compose preserved for re-test when (a) vLLM adds Ampere-aware fp8 dispatch OR (b) PR #41745 relaxes the assert. Gemma 4 26B-A4B MoE single-card is the obvious follow-up. |
156 changes: 156 additions & 0 deletions models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml
@@ -0,0 +1,156 @@
# ===========================================================================
# Gemma-4-31B-it (Intel AutoRound INT4) + z-lab DFlash block-diffusion drafter
# 2× RTX 3090 TP=2 + bfloat16 KV + 32K ctx + 4 streams.
#
# Status: COMMUNITY-CONTRIBUTED, EXPERIMENTAL.
# First Ampere consumer cross-rig data on the z-lab Gemma 4 DFlash
# block-diffusion drafter (vLLM PR #41703), measured on 2× RTX 3090
# (Ampere sm_86, PCIe-only).
#
# Bench (2× 3090 PCIe, no NVLink, 230W cap, 2026-05-06 — at shipped n=7):
# narrative (canonical 800-word essay): 95.16 wall TPS (CV 3.0%)
# code (canonical quicksort): 167.55 wall TPS (CV 2.3%)
# AL: ~3.0 narr / 5.23 code (n=7 max; code AL keeps rising at larger n)
# Avg draft acceptance: ~60% (code)
# VRAM: ~22.7 GB/card (TP=2 split)
# Speedup vs no-spec-decode baseline (~61 TPS): 1.56× narr / 2.74× code.
#
# At n=5 (narrative-optimal): 109/141 wall TPS — see n-sweep table below.
# Choose n=5 for chat/narrative workloads, n=7 for IDE-agent / code workloads.
#
# Models:
# target: Intel/gemma-4-31B-it-int4-AutoRound (21.2 GB, vision preserved)
# draft : z-lab/gemma-4-31B-it-DFlash (2.9 GB, BF16)
#
# Pre-merge dependencies (drop when both land in nightly):
# 1. vLLM PR #41703: DFlash speculative-decoding support for Gemma 4,
#    including the explicit "dflash" method dispatch.
#    Vendored at ../patches/vllm-gemma4-dflash/.
# 2. transformers ≥ 5.8.0 (released 2026-05-05): the drafter ships with a
#    model_type that only transformers ≥ 5.8.0 recognizes. Image ships 5.7.0;
#    entrypoint upgrades at boot.
# See ../patches/vllm-gemma4-dflash/README.md for upgrade triggers.
#
# KV format pinned in a corner on Ampere:
# - fp8_e5m2 → assert fail in gemma4_mm.py:1336 (vLLM allowlist excludes it)
# - fp8_e4m3 → Triton "fp8e4nv not supported in this architecture" on sm_86
# (Ampere supports only fp8e4b15 / fp8e5; not fp8e4nv)
# - default (auto = bfloat16) → sidesteps all fp8 kernel paths
# Smaller KV pool than fp8 but at 32K test ctx that's not the bottleneck.
# ===========================================================================
services:
  vllm-gemma-4-31b-dflash:
    # Nightly pinned to the 2026-05-06 build: Codex rebased the PR onto
    # upstream/main as of 2026-05-06, and this nightly is the closest published
    # image to that base (it shares the SpecDecodeBaseProposer refactor).
    image: vllm/vllm-openai:nightly-e47c98ef7a38792996e452ef53914e21e41928e9
    container_name: vllm-gemma-4-31b-dflash
    restart: "no"
    ports:
      - "${PORT:-8032}:8000"
    volumes:
      - ${MODEL_DIR:-../../../../models-cache}:/root/.cache/huggingface
      # torch.compile + Triton kernel caches: first boot warms (~5-7 min);
      # subsequent boots reuse cached graphs (~3 min).
      - ../cache/torch_compile:/root/.cache/vllm/torch_compile_cache
      - ../cache/triton:/root/.triton/cache
      # ---- vLLM PR #41703 overlay (jianc99/dflash-gemma4-fix branch) ----
      # 12 modified files. RO-mount each over installed vllm paths.
      # Drop this whole block when PR merges + propagates. Source: ../patches/vllm-gemma4-dflash/
      # NOTE: this overlay CONFLICTS with the gemma-mtp.yml overlay (#41745) on
      # speculative.py, eagle.py, and gpu_model_runner.py (see the patch README);
      # run only one variant at a time.
      - ../patches/vllm-gemma4-dflash/config/attention.py:/usr/local/lib/python3.12/dist-packages/vllm/config/attention.py:ro
      - ../patches/vllm-gemma4-dflash/config/speculative.py:/usr/local/lib/python3.12/dist-packages/vllm/config/speculative.py:ro
      - ../patches/vllm-gemma4-dflash/model_executor/models/qwen3_dflash.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_dflash.py:ro
      - ../patches/vllm-gemma4-dflash/transformers_utils/configs/speculators/algos.py:/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/speculators/algos.py:ro
      - ../patches/vllm-gemma4-dflash/v1/attention/backends/triton_attn.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/triton_attn.py:ro
      - ../patches/vllm-gemma4-dflash/v1/attention/selector.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/selector.py:ro
      # kv_cache_utils.py: REBASED via Codex (2026-05-06). Now retains
      # main's `resolve_kv_cache_block_sizes` helper while applying PR's
      # KV-sharing changes for DFlash. Codex's merge was checked with
      # py_compile + diff inspection.
      - ../patches/vllm-gemma4-dflash/v1/core/kv_cache_utils.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py:ro
      - ../patches/vllm-gemma4-dflash/v1/core/sched/scheduler.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/core/sched/scheduler.py:ro
      - ../patches/vllm-gemma4-dflash/v1/spec_decode/dflash.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/dflash.py:ro
      - ../patches/vllm-gemma4-dflash/v1/spec_decode/eagle.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py:ro
      - ../patches/vllm-gemma4-dflash/v1/spec_decode/utils.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/utils.py:ro
      - ../patches/vllm-gemma4-dflash/v1/worker/gpu_model_runner.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py:ro
      # --------------------------------------------------------------------
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - NCCL_CUMEM_ENABLE=0
      - NCCL_P2P_DISABLE=1
      - VLLM_NO_USAGE_STATS=1
      - OMP_NUM_THREADS=1
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TRITON_CACHE_DIR=/root/.triton/cache
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # transformers 5.8.0 (released 2026-05-05) is the first version that
    # recognizes the drafter's model_type. Image ships 5.7.0; bump in entrypoint
    # until vLLM nightly rebuilds against ≥5.8.0.
    entrypoint:
      - /bin/bash
      - -c
      - |
        set -e
        pip install --quiet --upgrade transformers==5.8.0
        exec vllm serve "$@"
      - --
    command:
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --model
      - /root/.cache/huggingface/gemma-4-31b-autoround-int4
      - --served-model-name
      - gemma-4-31b-autoround
      - --tensor-parallel-size
      - "2"
      # DFlash drafter is BF16; --dtype bfloat16 matches its training dtype
      # (same as our existing dual-dflash.yml on Qwen3.6; works around the
      # vllm#40334 dtype mismatch until that lands).
      - --dtype
      - bfloat16
      - --disable-custom-all-reduce
      - --max-model-len
      - "${MAX_MODEL_LEN:-32768}"
      - --gpu-memory-utilization
      - "${GPU_MEMORY_UTILIZATION:-0.92}"
      - --max-num-seqs
      - "4"
      # Vision tower: max_tokens_per_mm_item=2496 must fit in batched tokens.
      - --max-num-batched-tokens
      - "4096"
      - --trust-remote-code
      # Tool-call support per vLLM Gemma 4 recipe: required for any soak /
      # agent traffic that includes a `tools: [...]` array. Without these,
      # vLLM rejects tool-bearing requests with HTTP 400.
      - --enable-auto-tool-choice
      - --tool-call-parser
      - gemma4
      - --chat-template
      - /vllm-workspace/examples/tool_chat_template_gemma4.jinja
      # DFlash drafter: z-lab/gemma-4-31B-it-DFlash (2.9 GB, BF16).
      # Method "dflash" is the explicit dispatch added by PR #41703.
      #
      # n-sweep on 2× 3090 TP=2 (2026-05-06):
      #   n=4  →  97 narr / 138 code TPS (narr-saturated at AL ~4.0)
      #   n=5  → 109 narr / 141 code TPS (best narrative; balanced)
      #   n=6  →  99 narr / 161 code TPS (knee: +14% code, -9% narr vs n=5)
      #   n=7  →  95 narr / 168 code TPS ← code-optimal default (shipped)
      #   n=8  →  91 narr / 167 code TPS (DOMINATED by n=7: both narr+code worse)
      #   n=15 →  82 narr / 172 code TPS (past the knee: avg accept 34%, tail 8%)
      # Narrative TPS peaks at n=5 and degrades monotonically beyond it; code TPS
      # plateaus from n=7 (n=15 adds only ~2%).
      # n=8 is strictly dominated by n=7 (one extra wasted draft slot).
      # For narrative-heavy / chat workloads, override num_speculative_tokens to 5.
      - --speculative-config
      - '{"method":"dflash","model":"/root/.cache/huggingface/gemma-4-31b-it-dflash","num_speculative_tokens":7}'
55 changes: 55 additions & 0 deletions models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/README.md
@@ -0,0 +1,55 @@
# vLLM Gemma 4 DFlash overlay (PR #41703)

Vendored Python files from [vllm-project/vllm#41703](https://github.com/vllm-project/vllm/pull/41703) — adds first-party **DFlash** (block-diffusion) speculative-decoding support for Gemma 4 + Qwen3.5 models, with the [`z-lab/gemma-4-31B-it-DFlash`](https://huggingface.co/z-lab/gemma-4-31B-it-dflash) drafter (2.9 GB BF16) as the canonical companion.

The compose `docker-compose.gemma-dflash.yml` mounts these files RO over the stock nightly image's vLLM package paths, same pattern as `vllm-gemma4-mtp/` and `models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/`.

## Why this exists

PR #41703 is open and flagged needs-rebase as of the morning of 2026-05-06. It was authored against a base older than current `main` (before the `SpecDecodeBaseProposer` refactor), so it was rebased cleanly onto `upstream/main` at `5d0fd87038b` via Codex/ChatGPT delegation, with one manual fix on top:

- `_warn_if_multimodal` (PR's name) → `_raise_if_multimodal` (post-2026-04 main rename) — without this, the override doesn't take effect and DFlash rejects multimodal inputs with `NotImplementedError: Speculative Decoding does not support multimodal models`.

That fix is preserved in `v1/spec_decode/dflash.py` here as a self-contained marker until the PR is itself rebased upstream.
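
Why a stale method name fails silently can be shown with a small Python sketch. The class names below are hypothetical stand-ins, not vLLM's real classes; only the two method names and the error message come from the description above, and the sketch assumes the base class calls the hook under its current name.

```python
class BaseProposer:
    """Hypothetical stand-in for the refactored base class on main."""

    def __init__(self, multimodal: bool):
        # The base constructor calls the hook under its current name.
        self._raise_if_multimodal(multimodal)

    def _raise_if_multimodal(self, multimodal: bool) -> None:
        if multimodal:
            raise NotImplementedError(
                "Speculative Decoding does not support multimodal models"
            )


class StaleProposer(BaseProposer):
    # Old hook name: nothing calls it any more, so the base check still runs.
    def _warn_if_multimodal(self, multimodal: bool) -> None:
        pass


class FixedProposer(BaseProposer):
    # Current hook name: this really replaces the base check.
    def _raise_if_multimodal(self, multimodal: bool) -> None:
        pass


def boots(cls) -> bool:
    """Return True if the class can be constructed for a multimodal model."""
    try:
        cls(multimodal=True)
        return True
    except NotImplementedError:
        return False


print(boots(StaleProposer), boots(FixedProposer))
```

Python raises no error for a method that overrides nothing, which is why the mismatch only surfaces at runtime as the multimodal rejection.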

Vendoring keeps this self-contained inside club-3090 so the compose works without external dependencies. Cross-rig users who want DFlash on Gemma 4 don't need to clone the PR fork themselves.

## Provenance

- Upstream branch: `jianc99/dflash-gemma4-fix` (original PR head)
- Local rebase: `/opt/ai/github/jianc99-vllm-dflash-gemma4/` branch `dflash-rebased` — 6 PR commits cherry-picked onto upstream/main `5d0fd87038b`
- File set: 12 modified files (config, model, attention, scheduler, kv cache, spec decode, worker)
- Tracked: [PR #41703](https://github.com/vllm-project/vllm/pull/41703) + [docs/UPSTREAM.md](../../../../docs/UPSTREAM.md)

## DFlash vs MTP on Gemma 4 (TP=2, 2× 3090)

n-sweep on `num_speculative_tokens` (2026-05-06):

| n | Narr wall TPS | Code wall TPS | AL code |
|---|---:|---:|---:|
| 5 | 109 | 141 | 3.99 |
| 6 | 99 | 161 | 4.73 |
| **7 (shipped)** | **95** | **168** | **5.23** |
| 8 | 91 | 167 | 5.36 |
| 15 | 82 | 172 | 6.17 |

vs MTP at n=4: **109 narr / 142 code TPS**.

DFlash dominates MTP on **code (+18%)**; MTP wins on **narrative (+15%)**. They land in genuinely different operating regimes — block-diffusion's larger draft horizon helps deterministic code more than prose. Soak: PASS at n=7 (100 turns, 0 errors / 0 silent-empty / 0 MiB growth, 98.6% TPS retention, p50 55.78 TPS).
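
The choice of n can be re-derived from the sweep table with a short Python sketch. It flags settings that another setting matches or beats on both workloads, and estimates average per-slot draft acceptance under the assumption that AL = 1 + n × (average acceptance). That relation is an assumption about how AL was measured, not something the benchmark states; the function names are illustrative only.

```python
# n -> (narrative wall TPS, code wall TPS, code AL), copied from the sweep table
sweep = {
    5: (109, 141, 3.99),
    6: (99, 161, 4.73),
    7: (95, 168, 5.23),
    8: (91, 167, 5.36),
    15: (82, 172, 6.17),
}


def dominated(sweep):
    """Return the n values that some other n matches or beats on both workloads."""
    out = []
    for n, (narr, code, _) in sweep.items():
        for m, (narr_m, code_m, _) in sweep.items():
            no_worse = narr_m >= narr and code_m >= code
            better = narr_m > narr or code_m > code
            if m != n and no_worse and better:
                out.append(n)
                break
    return out


def avg_draft_acceptance(n, al):
    """Assumed relation: AL = 1 + n * (average per-slot acceptance)."""
    return (al - 1.0) / n


print("dominated:", dominated(sweep))
for n, (_, _, al) in sweep.items():
    print(f"n={n:>2}: avg code acceptance ~ {avg_draft_acceptance(n, al):.0%}")
```

On this data only n=8 is dominated (by n=7). The estimate comes out near 60% at n=7 and 34% at n=15, in line with the figures quoted in the compose comments.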

## When to drop this

Drop the overlay once PR #41703 merges to vLLM main AND a vLLM `:nightly` tag is rebuilt from a commit that includes it. At that point:

1. Bump the `image:` line in `docker-compose.gemma-dflash.yml` to the new nightly with a SHA dated AFTER the merge
2. Remove the entire `# vLLM PR #41703 overlay` volume block from the compose
3. Delete this entire patch directory (`rm -rf models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/`)
4. Update the [docs/UPSTREAM.md](../../../../docs/UPSTREAM.md) row from "🟡 Open" to "🟢 Landed"
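
Step 2 can be double-checked mechanically. This is a minimal Python sketch, assuming the volume entries keep the `- host:container:ro` shape used in the compose; `overlay_mounts` is a hypothetical helper, not a script that ships in this repo.

```python
def overlay_mounts(compose_text: str, patch_dir: str = "vllm-gemma4-dflash") -> list[str]:
    """Return the container-side paths that are overlaid from patch_dir."""
    targets = []
    for line in compose_text.splitlines():
        entry = line.strip()
        # Volume entries look like "- <host>:<container>:ro"; comments are skipped.
        if not entry.startswith("- ") or f"/patches/{patch_dir}/" not in entry:
            continue
        host, container, *_mode = entry[2:].split(":")
        targets.append(container)
    return targets


sample = """\
    volumes:
      - ../cache/triton:/root/.triton/cache
      # Drop this whole block when PR merges. Source: ../patches/vllm-gemma4-dflash/
      - ../patches/vllm-gemma4-dflash/config/attention.py:/usr/local/lib/python3.12/dist-packages/vllm/config/attention.py:ro
      - ../patches/vllm-gemma4-dflash/v1/spec_decode/dflash.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/dflash.py:ro
"""

remaining = overlay_mounts(sample)
print(len(remaining), "overlay mounts still present")
```

Run against the real compose text, an empty result after the cleanup suggests the whole overlay block is gone.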

## Companion: transformers 5.8.0 upgrade

Same path as gemma-mtp: the drafter ships with a `model_type` that only `transformers ≥ 5.8.0` recognizes, the nightly image ships 5.7.0, and the compose entrypoint upgrades it at boot. Drop the `pip install` line from the entrypoint when the vLLM nightly rebuilds against transformers ≥ 5.8.0.
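
Whether the entrypoint upgrade is still needed comes down to a version comparison. The Python sketch below assumes plain dotted release numbers; `needs_upgrade` is a hypothetical helper, not part of the compose or of transformers.

```python
def needs_upgrade(installed: str, minimum: str = "5.8.0") -> bool:
    """True when the installed version is older than the minimum.

    Compares only the leading numeric release segments; pre-release tags such
    as "5.8.0.dev0" are an edge case this sketch does not handle precisely.
    """
    def release(version: str) -> tuple[int, ...]:
        parts = []
        for piece in version.split("."):
            if not piece.isdigit():
                break
            parts.append(int(piece))
        return tuple(parts)

    return release(installed) < release(minimum)


print(needs_upgrade("5.7.0"), needs_upgrade("5.8.0"))
```

Inside the container, the installed version string can be read with `importlib.metadata.version("transformers")` and passed in as `installed`.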

## Conflict with `vllm-gemma4-mtp/`

This overlay and `vllm-gemma4-mtp/` (PR #41745) modify overlapping files (`v1/spec_decode/eagle.py`, `v1/worker/gpu_model_runner.py`, `config/speculative.py`). Run only one variant at a time — `bash scripts/switch.sh` cleanly tears down the previous container before booting the next, so this is enforced operationally.
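
The overlap between two overlay directories can be listed with a few lines of Python. This is a sketch under assumptions: `overlapping_files` is a hypothetical helper, and the demo builds throwaway directories with a made-up subset of files instead of reading the real patch trees.

```python
import tempfile
from pathlib import Path


def overlapping_files(overlay_a: Path, overlay_b: Path) -> list[str]:
    """Relative paths of .py files that exist in both overlay directories."""
    def py_files(root: Path) -> set[str]:
        return {p.relative_to(root).as_posix() for p in root.rglob("*.py")}

    return sorted(py_files(overlay_a) & py_files(overlay_b))


# Demo layout: illustrative file subsets only, not the full vendored sets.
with tempfile.TemporaryDirectory() as tmp:
    dflash = Path(tmp, "vllm-gemma4-dflash")
    mtp = Path(tmp, "vllm-gemma4-mtp")
    layout = {
        dflash: ["config/speculative.py", "v1/spec_decode/eagle.py", "v1/spec_decode/dflash.py"],
        mtp: ["config/speculative.py", "v1/spec_decode/eagle.py"],
    }
    for root, files in layout.items():
        for rel in files:
            target = root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
    shared = overlapping_files(dflash, mtp)

print(shared)
```

Any non-empty result means the two overlays would mount different versions over the same installed file, so only one variant should run at a time.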