vllm-project · DarkLight1337 · May 10, 2026 · May 9, 2026 · May 9, 2026
diff --git a/docs/features/speculative_decoding/README.md b/docs/features/speculative_decoding/README.md
@@ -72,6 +72,17 @@ only apply to model-based methods such as `draft_model`, `mtp`, `eagle3`, and
 | `rejection_sample_method` | `string` | `strict` | `strict`, `probabilistic`, or `synthetic`. |
 | `synthetic_acceptance_rate` | `float` | `None` | Average acceptance rate to target when `rejection_sample_method` is `synthetic`. Valid range is `[0, 1]`. |
 
+!!! note
+    Gemma 4 assistant checkpoints are handled as Gemma 4 MTP speculators, not
+    as generic draft models. Use `"method": "mtp"` with the assistant
+    checkpoint in `model`, as shown in the [MTP guide](mtp.md#gemma-4-assistant-models).
+
+    If the startup logs show `SpeculativeConfig(method='draft_model', ...)` for
+    a Gemma 4 assistant checkpoint, the installed vLLM version does not include
+    Gemma 4 MTP support and is treating the checkpoint as a generic draft
+    model. Upgrade vLLM rather than forcing the assistant checkpoint through
+    generic draft-model speculative decoding.
+
 ### Method-specific keys
 
 #### N-gram

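The note added to `README.md` above can be sketched as a small pre-flight check, assuming the startup log is available as plain text. The helper names `gemma4_mtp_config` and `fell_back_to_draft_model` are hypothetical and not part of vLLM; only the JSON keys and the `SpeculativeConfig(method='draft_model', ...)` log fragment are taken from the documentation in this change.

```python
import json
import re

# Log fragment quoted in the docs; the exact surrounding log format is an assumption.
_DRAFT_FALLBACK = re.compile(r"SpeculativeConfig\(method='draft_model'")


def gemma4_mtp_config(assistant_model: str, num_speculative_tokens: int = 1) -> str:
    """Build the JSON string passed to --speculative-config (hypothetical helper)."""
    config = {
        "method": "mtp",  # Gemma 4 assistants use the MTP path, not draft_model
        "model": assistant_model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Compact separators match the one-line form used on the command line.
    return json.dumps(config, separators=(",", ":"))


def fell_back_to_draft_model(startup_log: str) -> bool:
    """Return True if the log shows the generic draft-model path (hypothetical helper)."""
    return _DRAFT_FALLBACK.search(startup_log) is not None
```

If `fell_back_to_draft_model` returns `True` for a Gemma 4 assistant checkpoint, the guidance in the note applies: upgrade vLLM instead of changing the method to `draft_model`.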
diff --git a/docs/features/speculative_decoding/mtp.md b/docs/features/speculative_decoding/mtp.md
@@ -9,6 +9,31 @@ MTP is useful when:
 - Your model natively supports MTP.
 - You want model-based speculative decoding with minimal extra configuration.
 
+## Gemma 4 Assistant Models
+
+Gemma 4 assistant checkpoints use vLLM's Gemma 4 MTP path. They are not generic
+draft models, even though they are passed through the `model` field in
+`--speculative-config`.
+
+Use `"method": "mtp"` when serving Gemma 4 with an assistant checkpoint:
+
+```bash
+vllm serve google/gemma-4-E2B-it \
+    --tensor-parallel-size 1 \
+    --max-model-len 8192 \
+    --speculative-config '{"method":"mtp","model":"gg-hf-am/gemma-4-E2B-it-assistant","num_speculative_tokens":1}'
+```
+
+The E2B, E4B, 26B-A4B, and 31B Gemma 4 IT assistant checkpoints are supported
+when their configuration uses `model_type: gemma4_assistant`. vLLM maps those
+checkpoints to `Gemma4MTPModel` internally and wires the assistant layers to
+share KV cache with the target model.
+
+If an older vLLM release logs `SpeculativeConfig(method='draft_model', ...)`
+for a Gemma 4 assistant checkpoint, that release is treating the assistant as a
+generic draft model and may fail during initialization for multimodal Gemma 4
+targets. Upgrade to a version with Gemma 4 MTP support instead.
+
 ## Offline Example
 
 ```python

diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md
@@ -659,6 +659,9 @@ Some models are supported only via the [Transformers modeling backend](#transfor
     For `Gemma4ForConditionalGeneration`:
     - audio input is only supported by the `gemma-4-E2B` and `gemma-4-E4B` variants.
     - The model does not ingest videos directly. However, vLLM’s Gemma 4 implementation supports video inputs by handling video processing internally. Users can send videos directly in the message structure to vLLM, where they are converted into text and image frames before being passed to the model.
+    - Gemma 4 assistant checkpoints for speculative decoding use vLLM's
+      Gemma 4 MTP path, not generic draft-model speculative decoding. See the
+      [Gemma 4 assistant model MTP example](../features/speculative_decoding/mtp.md#gemma-4-assistant-models).
 
 !!! note
     For `InternVLChatModel`, only InternVL2.5 with Qwen2.5 text backbone (`OpenGVLab/InternVL2.5-1B` etc.), InternVL3 and InternVL3.5 have video inputs support currently.