From a3c10530ee64bda6e8a9aba389baa3a5fa9a56d4 Mon Sep 17 00:00:00 2001
From: AbhiOnGithub <abhiOnGithub@users.noreply.github.com>
Date: Sat, 9 May 2026 11:56:47 -0700
Subject: [PATCH] docs: clarify Gemma 4 assistant speculative decoding

Signed-off-by: AbhiOnGithub <abhiOnGithub@users.noreply.github.com>
---
 docs/features/speculative_decoding/README.md | 11 +++++++++
 docs/features/speculative_decoding/mtp.md    | 25 ++++++++++++++++++++
 docs/models/supported_models.md              |  3 +++
 3 files changed, 39 insertions(+)

diff --git a/docs/features/speculative_decoding/README.md b/docs/features/speculative_decoding/README.md
index bef71a4f5a37..ad01ee3d74bc 100644
--- a/docs/features/speculative_decoding/README.md
+++ b/docs/features/speculative_decoding/README.md
@@ -72,6 +72,17 @@ only apply to model-based methods such as `draft_model`, `mtp`, `eagle3`, and
 | `rejection_sample_method` | `string` | `strict` | `strict`, `probabilistic`, or `synthetic`. |
 | `synthetic_acceptance_rate` | `float` | `None` | Average acceptance rate to target when `rejection_sample_method` is `synthetic`. Valid range is `[0, 1]`. |
 
+!!! note
+    Gemma 4 assistant checkpoints are handled as Gemma 4 MTP speculators, not
+    as generic draft models. Use `"method": "mtp"` with the assistant
+    checkpoint in `model`, as shown in the [MTP guide](mtp.md#gemma-4-assistant-models).
+
+    If startup logs show `SpeculativeConfig(method='draft_model', ...)` for a
+    Gemma 4 assistant checkpoint, the installed vLLM version does not include
+    Gemma 4 MTP support for that path. Upgrade to a version that includes
+    Gemma 4 MTP support instead of forcing the assistant checkpoint through
+    generic draft-model speculative decoding.
+
 ### Method-specific keys
 
 #### N-gram
diff --git a/docs/features/speculative_decoding/mtp.md b/docs/features/speculative_decoding/mtp.md
index 7e1d1ec7038b..d60f8ff27ba2 100644
--- a/docs/features/speculative_decoding/mtp.md
+++ b/docs/features/speculative_decoding/mtp.md
@@ -9,6 +9,31 @@ MTP is useful when:
 - Your model natively supports MTP.
 - You want model-based speculative decoding with minimal extra configuration.
 
+## Gemma 4 Assistant Models
+
+Gemma 4 assistant checkpoints use vLLM's Gemma 4 MTP path. They are not generic
+draft models, even though they are passed through the `model` field in
+`--speculative-config`.
+
+Use `"method": "mtp"` when serving Gemma 4 with an assistant checkpoint:
+
+```bash
+vllm serve google/gemma-4-E2B-it \
+    --tensor-parallel-size 1 \
+    --max-model-len 8192 \
+    --speculative-config '{"method":"mtp","model":"gg-hf-am/gemma-4-E2B-it-assistant","num_speculative_tokens":1}'
+```
+
+The E2B, E4B, 26B-A4B, and 31B Gemma 4 IT assistant checkpoints are supported
+when their configuration uses `model_type: gemma4_assistant`. vLLM maps those
+checkpoints to `Gemma4MTPModel` internally and wires the assistant layers to
+share KV cache with the target model.
+
+If an older vLLM release logs `SpeculativeConfig(method='draft_model', ...)`
+for a Gemma 4 assistant checkpoint, that release is treating the assistant as a
+generic draft model and may fail during initialization for multimodal Gemma 4
+targets. Upgrade to a version with Gemma 4 MTP support instead.
+
 ## Offline Example
 
 ```python
diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md
index 2b078ca772e2..9ee916d871bf 100644
--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -659,6 +659,9 @@ Some models are supported only via the [Transformers modeling backend](#transfor
     For `Gemma4ForConditionalGeneration`:
     - audio input is only supported by the `gemma-4-E2B` and `gemma-4-E4B` variants.
     - The model does not ingest videos directly. However, vLLM’s Gemma 4 implementation supports video inputs by handling video processing internally. Users can send videos directly in the message structure to vLLM, where they are converted into text and image frames before being passed to the model.
+    - Gemma 4 assistant checkpoints for speculative decoding use vLLM's Gemma
+      4 MTP path, not generic draft-model speculative decoding. See the
+      [Gemma 4 assistant model MTP example](../features/speculative_decoding/mtp.md#gemma-4-assistant-models).
 
 !!! note
     For `InternVLChatModel`, only InternVL2.5 with Qwen2.5 text backbone (`OpenGVLab/InternVL2.5-1B` etc.), InternVL3 and InternVL3.5 have video inputs support currently.