noonghunna · May 6, 2026
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
@@ -207,4 +207,5 @@ Cross-rig data on Google's official Gemma 4 MTP "assistant" drafter (released 20
 | Compose | Rig | KV | Max ctx | Narr / Code TPS | AL | Per-pos accept (code) | Peak VRAM | Date | Notes |
 |---|---|---|---:|---:|---:|---|---:|---|---|
 | `gemma-mtp.yml` (TP=2) | @noonghunna (2× 3090 PCIe, no NVLink, 230W cap) | bf16 | 32K | **108.87 / 142.25** | **3.94-4.04** | 92 / 79 / 68 / 59 % | 22.5 GB/card | 2026-05-05 | First Ampere consumer cross-rig data on Google MTP drafters. **+1.79× narr / +2.31× code** over baseline (61 TPS no-spec-decode same TP). **PASSES continuous soak** (100 turns, 0 errors / 0 silent-empty / 0 MiB growth, 98.3% TPS retention). bf16 KV (fp8 blocked on Ampere — see TP=1 row). PR [#41745](https://github.com/vllm-project/vllm/pull/41745) overlay + transformers 5.8.0 entrypoint. |
+| `gemma-dflash.yml` (TP=2, n=7) | @noonghunna (2× 3090 PCIe, no NVLink, 230W cap) | bf16 | 32K | **95.16 / 167.55** | **~3.0 narr / 5.23 code** | 89 / 78 / 66 / 57 / 50 / 43 / 39 % | 22.7 GB/card | 2026-05-06 | First Ampere consumer cross-rig data on **z-lab Gemma 4 DFlash** block-diffusion drafter (vLLM PR [#41703](https://github.com/vllm-project/vllm/pull/41703) — Codex-rebased onto upstream/main). **+2.74× code / +1.56× narr** over baseline. **PASSES continuous soak** (100 turns, 0 errors / 0 silent-empty / 0 MiB growth, 98.6% TPS retention, p50 55.8 TPS — 5.8% higher than n=5). DFlash dominates MTP on **code (+18%)**; MTP wins on narrative (+15%). n-sweep: n=5 109/141 (best narr) → n=6 99/161 (knee) → **n=7 95/168 (code-optimal default)** → n=8 91/167 (dominated) → n=15 82/172 (past knee). PR #41703 overlay (12 RO-mounted files) + transformers 5.8.0 + nightly `e47c98ef`. |
 | `gemma-mtp-tp1.yml` (TP=1) | @noonghunna (1× 3090) | bf16 / fp8 | — | **boot OOM** | — | — | — | 2026-05-05 | **Upstream-blocked on Ampere consumer.** bf16 KV: weights+drafter+profiling at 8K ctx + mem-util 0.95 leaves zero KV pool ("No available memory for the cache blocks"). fp8 KV: Triton `fp8e4nv not supported in this architecture` on sm_86 (Ampere supports `fp8e4b15`/`fp8e5` only); but `fp8_e5m2` is rejected by `gemma4_mm.py:1336` allowlist. Compose preserved for re-test when (a) vLLM adds Ampere-aware fp8 dispatch OR (b) PR #41745 relaxes the assert. Gemma 4 26B-A4B MoE single-card is the obvious follow-up. |
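Reviewer note: the AL column in the two TP=2 rows above can be cross-checked against the per-position acceptance column. The sketch below is a minimal check, not the bench harness. It assumes the per-position figures are unconditional acceptance rates (the share of verify steps in which draft position k was accepted) and that AL counts the one token the target model emits on every step; neither is stated in the table. The function name is hypothetical.

```python
def expected_acceptance_length(per_position_accept_pct):
    """Estimate tokens emitted per verify step.

    Assumption: each entry is the unconditional acceptance rate (in percent)
    of one draft position, and the target model always contributes 1 token.
    """
    return 1.0 + sum(p / 100.0 for p in per_position_accept_pct)


# Per-position acceptance (code workload), copied from the table rows.
dflash_code = [89, 78, 66, 57, 50, 43, 39]  # gemma-dflash.yml, n=7
mtp_code = [92, 79, 68, 59]                 # gemma-mtp.yml, 4 draft positions

dflash_al = expected_acceptance_length(dflash_code)
mtp_al = expected_acceptance_length(mtp_code)
dflash_avg_accept = sum(dflash_code) / len(dflash_code)

print(f"DFlash code AL estimate: {dflash_al:.2f}")
print(f"MTP code AL estimate: {mtp_al:.2f}")
print(f"DFlash mean draft acceptance: {dflash_avg_accept:.1f}%")
```

Under those assumptions the estimates come out at about 5.22 and 3.98, in line with the reported 5.23 and 3.94-4.04, and the mean acceptance of about 60% matches the figure quoted in the compose header.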
diff --git a/models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml b/models/gemma-4-31b/vllm/compose/docker-compose.gemma-dflash.yml
@@ -0,0 +1,156 @@
+# ===========================================================================
+# Gemma-4-31B-it (Intel AutoRound INT4) + z-lab DFlash block-diffusion drafter
+# 2× RTX 3090 TP=2 + bfloat16 KV + 32K ctx + 4 streams.
+#
+# Status: COMMUNITY-CONTRIBUTED, EXPERIMENTAL.
+#   First Ampere consumer cross-rig data on the z-lab Gemma 4 DFlash
+#   block-diffusion drafter (vLLM PR #41703), run on 2× RTX 3090
+#   (Ampere sm_86, PCIe-only). For the Google MTP variant see gemma-mtp.yml.
+#
+# Bench (2× 3090 PCIe, no NVLink, 230W cap, 2026-05-06 — at shipped n=7):
+#   narrative (canonical 800-word essay):  95.16 wall TPS (CV 3.0%)
+#   code (canonical quicksort):           167.55 wall TPS (CV 2.3%)
+#   AL: ~3.0 narr / 5.23 code (n=7 draft cap; code AL still has headroom)
+#   Avg draft acceptance: ~60% (code)
+#   VRAM: ~22.7 GB/card (TP=2 split)
+#   Speedup vs no-spec-decode baseline (~61 TPS): 1.56× narr / 2.74× code.
+#
+# At n=5 (narrative-optimal): 109/141 wall TPS — see n-sweep table below.
+# Choose n=5 for chat/narrative workloads, n=7 for IDE-agent / code workloads.
+#
+# Models:
+#   target: Intel/gemma-4-31B-it-int4-AutoRound  (21.2 GB, vision preserved)
+#   draft : z-lab/gemma-4-31B-it-DFlash           (2.9 GB BF16)
+#
+# Pre-merge dependencies (drop when both land in nightly):
+#   1. vLLM PR #41703: DFlash (block-diffusion) speculative decoding for Gemma 4.
+#      Vendored at ../patches/vllm-gemma4-dflash/.
+#   2. transformers ≥ 5.8.0 — released 2026-05-05 with native gemma4_assistant
+#      support. Image ships 5.7.0; entrypoint upgrades at boot.
+# See ../patches/vllm-gemma4-dflash/README.md for upgrade triggers.
+#
+# KV cache dtype options on Ampere (only bfloat16 works today):
+#   - fp8_e5m2 → assert fail in gemma4_mm.py:1336 (vLLM allowlist excludes it)
+#   - fp8_e4m3 → Triton "fp8e4nv not supported in this architecture" on sm_86
+#     (Ampere supports only fp8e4b15 / fp8e5; not fp8e4nv)
+#   - default (auto = bfloat16) → sidesteps all fp8 kernel paths
+# Smaller KV pool than fp8 but at 32K test ctx that's not the bottleneck.
+# ===========================================================================
+services:
+  vllm-gemma-4-31b-dflash:
+    # Nightly bumped to 2026-05-06 to match rebase target proximity (Codex
+    # rebased onto upstream/main as of 2026-05-06; this nightly is the closest
+    # published image to that base, sharing the SpecDecodeBaseProposer refactor).
+    image: vllm/vllm-openai:nightly-e47c98ef7a38792996e452ef53914e21e41928e9
+    container_name: vllm-gemma-4-31b-dflash
+    restart: "no"
+    ports:
+      - "${PORT:-8032}:8000"
+    volumes:
+      - ${MODEL_DIR:-../../../../models-cache}:/root/.cache/huggingface
+      # torch.compile + Triton kernel caches — first boot warms (~5-7 min);
+      # subsequent boots reuse cached graphs (~3 min).
+      - ../cache/torch_compile:/root/.cache/vllm/torch_compile_cache
+      - ../cache/triton:/root/.triton/cache
+      # ---- vLLM PR #41703 overlay (jianc99/dflash-gemma4-fix branch) ----
+      # 12 modified files. RO-mount each over installed vllm paths.
+      # Drop this whole block when PR merges + propagates. Source: ../patches/vllm-gemma4-dflash/
+      # NOTE: this overlay CONFLICTS with the gemma-mtp.yml overlay (#41745) on
+      # speculative.py, eagle.py + gpu_model_runner.py; run only one variant at a time.
+      - ../patches/vllm-gemma4-dflash/config/attention.py:/usr/local/lib/python3.12/dist-packages/vllm/config/attention.py:ro
+      - ../patches/vllm-gemma4-dflash/config/speculative.py:/usr/local/lib/python3.12/dist-packages/vllm/config/speculative.py:ro
+      - ../patches/vllm-gemma4-dflash/model_executor/models/qwen3_dflash.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_dflash.py:ro
+      - ../patches/vllm-gemma4-dflash/transformers_utils/configs/speculators/algos.py:/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/speculators/algos.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/attention/backends/triton_attn.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/triton_attn.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/attention/selector.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/selector.py:ro
+      # kv_cache_utils.py — REBASED via Codex (2026-05-06). Now retains
+      # main's `resolve_kv_cache_block_sizes` helper while applying PR's
+      # KV-sharing changes for DFlash. Confirmed Codex's judgment is correct
+      # via py_compile + diff inspection.
+      - ../patches/vllm-gemma4-dflash/v1/core/kv_cache_utils.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/core/sched/scheduler.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/core/sched/scheduler.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/spec_decode/dflash.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/dflash.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/spec_decode/eagle.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/spec_decode/utils.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/utils.py:ro
+      - ../patches/vllm-gemma4-dflash/v1/worker/gpu_model_runner.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py:ro
+      # --------------------------------------------------------------------
+    environment:
+      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
+      - VLLM_WORKER_MULTIPROC_METHOD=spawn
+      - NCCL_CUMEM_ENABLE=0
+      - NCCL_P2P_DISABLE=1
+      - VLLM_NO_USAGE_STATS=1
+      - OMP_NUM_THREADS=1
+      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
+      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
+      - TRITON_CACHE_DIR=/root/.triton/cache
+    shm_size: "16gb"
+    ipc: host
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    # transformers 5.8.0 (released 2026-05-05) is the first version with native
+    # gemma4_assistant support. Image ships 5.7.0; bump in entrypoint until
+    # vLLM nightly rebuilds against ≥5.8.0.
+    entrypoint:
+      - /bin/bash
+      - -c
+      - |
+        set -e
+        pip install --quiet --upgrade transformers==5.8.0
+        exec vllm serve "$@"
+      - --
+    command:
+      - --host
+      - 0.0.0.0
+      - --port
+      - "8000"
+      - --model
+      - /root/.cache/huggingface/gemma-4-31b-autoround-int4
+      - --served-model-name
+      - gemma-4-31b-autoround
+      - --tensor-parallel-size
+      - "2"
+      # DFlash drafter is BF16; --dtype bfloat16 matches its training dtype
+      # (same as our existing dual-dflash.yml on Qwen3.6 — vllm#40334 dtype-mismatch
+      # workaround until that lands).
+      - --dtype
+      - bfloat16
+      - --disable-custom-all-reduce
+      - --max-model-len
+      - "${MAX_MODEL_LEN:-32768}"
+      - --gpu-memory-utilization
+      - "${GPU_MEMORY_UTILIZATION:-0.92}"
+      - --max-num-seqs
+      - "4"
+      # Vision tower: max_tokens_per_mm_item=2496 must fit in batched tokens.
+      - --max-num-batched-tokens
+      - "4096"
+      - --trust-remote-code
+      # Tool-call support per vLLM Gemma 4 recipe — required for any soak /
+      # agent traffic that includes a `tools: [...]` array. Without these,
+      # vLLM rejects tool-bearing requests with HTTP 400.
+      - --enable-auto-tool-choice
+      - --tool-call-parser
+      - gemma4
+      - --chat-template
+      - /vllm-workspace/examples/tool_chat_template_gemma4.jinja
+      # DFlash drafter — z-lab/gemma-4-31B-it-DFlash (2.9 GB, BF16).
+      # Method "dflash" is the explicit dispatch added by PR #41703.
+      #
+      # n-sweep on 2× 3090 TP=2 (2026-05-06):
+      #   n=4  →  97 narr / 138 code TPS  (narr-saturated at AL ~4.0)
+      #   n=5  → 109 narr / 141 code TPS  (best narrative; balanced)
+      #   n=6  →  99 narr / 161 code TPS  (knee: +14% code, -9% narr vs n=5)
+      #   n=7  →  95 narr / 168 code TPS  ← code-optimal default (shipped)
+      #   n=8  →  91 narr / 167 code TPS  (DOMINATED by n=7 — both narr+code worse)
+      #   n=15 →  82 narr / 172 code TPS  (past the knee — avg accept 34%, tail 8%)
+      # Narrative peaks at n=5 and degrades monotonically beyond it; code TPS plateaus from n=7 on.
+      # n=8 is strictly dominated by n=7 (one extra wasted draft slot).
+      # For narrative-heavy / chat workloads, override num_speculative_tokens to 5.
+      - --speculative-config
+      - '{"method":"dflash","model":"/root/.cache/huggingface/gemma-4-31b-it-dflash","num_speculative_tokens":7}'
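Reviewer note: the "dominated" labels in the n-sweep comment can be reproduced with a plain Pareto check over the two workloads. This is a sketch under the assumption that narrative and code wall TPS are the only two criteria (it ignores VRAM, soak stability and latency), and the helper name is hypothetical. The numbers are the rounded TPS values from the comment.

```python
# (n, narrative TPS, code TPS) copied from the n-sweep comment in the compose file.
SWEEP = [
    (4, 97, 138),
    (5, 109, 141),
    (6, 99, 161),
    (7, 95, 168),
    (8, 91, 167),
    (15, 82, 172),
]


def dominated_by(point, candidates):
    """Return the n values that are no slower on both workloads and faster on one."""
    _, narr, code = point
    return [
        n
        for n, other_narr, other_code in candidates
        if other_narr >= narr
        and other_code >= code
        and (other_narr > narr or other_code > code)
    ]


pareto_front = []
for point in SWEEP:
    winners = dominated_by(point, SWEEP)
    if winners:
        print(f"n={point[0]} is dominated by n={winners}")
    else:
        pareto_front.append(point[0])

print("Pareto-optimal n:", pareto_front)
```

On these figures n=8 is dominated by n=7, as the comment says, and n=4 is also dominated (by n=5 and n=6). The remaining settings (5, 6, 7, 15) are trade-offs between narrative and code throughput, which is why the choice between n=5 and n=7 depends on the workload.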
diff --git a/models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/README.md b/models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/README.md
@@ -0,0 +1,55 @@
+# vLLM Gemma 4 DFlash overlay (PR #41703)
+
+Vendored Python files from [vllm-project/vllm#41703](https://github.com/vllm-project/vllm/pull/41703) — adds first-party **DFlash** (block-diffusion) speculative-decoding support for Gemma 4 + Qwen3.5 models, with the [`z-lab/gemma-4-31B-it-DFlash`](https://huggingface.co/z-lab/gemma-4-31B-it-dflash) drafter (2.9 GB BF16) as the canonical companion.
+
+The compose `docker-compose.gemma-dflash.yml` mounts these files RO over the stock nightly image's vLLM package paths, same pattern as `vllm-gemma4-mtp/` and `models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/`.
+
+## Why this exists
+
+PR #41703 is open + needs-rebase as of 2026-05-06 morning. The PR was authored against an older base than current `main` (pre-`SpecDecodeBaseProposer` refactor); a clean rebase onto `upstream/main` 5d0fd87038b was performed via Codex/ChatGPT delegation, with one manual fix on top:
+
+- `_warn_if_multimodal` (PR's name) → `_raise_if_multimodal` (post-2026-04 main rename) — without this, the override doesn't take effect and DFlash rejects multimodal inputs with `NotImplementedError: Speculative Decoding does not support multimodal models`.
+
+That fix is preserved in `v1/spec_decode/dflash.py` here as a self-contained marker until the PR is itself rebased upstream.
+
+Vendoring keeps this self-contained inside club-3090 so the compose works without external dependencies. Cross-rig users who want DFlash on Gemma 4 don't need to clone the PR fork themselves.
+
+## Provenance
+
+- Upstream branch: `jianc99/dflash-gemma4-fix` (original PR head)
+- Local rebase: `/opt/ai/github/jianc99-vllm-dflash-gemma4/` branch `dflash-rebased` — 6 PR commits cherry-picked onto upstream/main `5d0fd87038b`
+- File set: 12 modified files (config, model, attention, scheduler, kv cache, spec decode, worker)
+- Tracked: [PR #41703](https://github.com/vllm-project/vllm/pull/41703) + [docs/UPSTREAM.md](../../../../docs/UPSTREAM.md)
+
+## DFlash vs MTP on Gemma 4 (TP=2, 2× 3090)
+
+n-sweep on `num_speculative_tokens` (2026-05-06):
+
+| n | Narr wall TPS | Code wall TPS | AL code |
+|---|---:|---:|---:|
+| 5 | 109 | 141 | 3.99 |
+| 6 |  99 | 161 | 4.73 |
+| **7 (shipped)** | **95** | **168** | **5.23** |
+| 8 |  91 | 167 | 5.36 |
+| 15 | 82 | 172 | 6.17 |
+
+vs MTP at n=4: **109 narr / 142 code TPS**.
+
+DFlash dominates MTP on **code (+18%)**; MTP wins on **narrative (+15%)**. They land in genuinely different operating regimes — block-diffusion's larger draft horizon helps deterministic code more than prose. Soak: PASS at n=7 (100 turns, 0 errors / 0 silent-empty / 0 MiB growth, 98.6% TPS retention, p50 55.78 TPS).
+
+## When to drop this
+
+When PR #41703 merges to vLLM main AND a vLLM `:nightly` tag rebuilds against that change. At that point:
+
+1. Bump the `image:` line in `docker-compose.gemma-dflash.yml` to the new nightly with a SHA dated AFTER the merge
+2. Remove the entire `# vLLM PR #41703 overlay` volume block from the compose
+3. Delete this entire patch directory (`rm -rf models/gemma-4-31b/vllm/patches/vllm-gemma4-dflash/`)
+4. Update the [docs/UPSTREAM.md](../../../../docs/UPSTREAM.md) row from "🟡 Open" to "🟢 Landed"
+
+## Companion: transformers 5.8.0 upgrade
+
+Same path as gemma-mtp: the drafter ships with a `model_type` that only `transformers ≥ 5.8.0` recognizes, the nightly image ships 5.7.0, and the compose entrypoint upgrades it at boot. Drop that line when the vLLM nightly rebuilds against transformers ≥ 5.8.0.
+
+## Conflict with `vllm-gemma4-mtp/`
+
+This overlay and `vllm-gemma4-mtp/` (PR #41745) modify overlapping files (`v1/spec_decode/eagle.py`, `v1/worker/gpu_model_runner.py`, `config/speculative.py`). Run only one variant at a time — `bash scripts/switch.sh` cleanly tears down the previous container before booting the next, so this is enforced operationally.
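Reviewer note: the overlap between the two overlays could be listed mechanically rather than maintained by hand. The sketch below assumes both patch directories mirror the `vllm/` package layout (the DFlash one does, judging by the compose mounts; the MTP one is assumed to) and that a shared relative path means a conflicting mount. The function name is hypothetical.

```python
from pathlib import Path


def overlapping_overlay_files(overlay_a, overlay_b):
    """Return the relative .py paths present in both overlay directories, sorted."""

    def relative_py_files(root):
        root = Path(root)
        return {p.relative_to(root).as_posix() for p in root.rglob("*.py")}

    return sorted(relative_py_files(overlay_a) & relative_py_files(overlay_b))


if __name__ == "__main__":
    # Paths as used in this PR; run from the repository root.
    patches = Path("models/gemma-4-31b/vllm/patches")
    dflash_dir = patches / "vllm-gemma4-dflash"
    mtp_dir = patches / "vllm-gemma4-mtp"
    if dflash_dir.is_dir() and mtp_dir.is_dir():
        for rel in overlapping_overlay_files(dflash_dir, mtp_dir):
            print(rel)
```

Any path it prints is a file that both composes would mount over the same location in the image, so the two variants cannot run from a shared set of mounts.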