Merged
11 changes: 11 additions & 0 deletions docs/features/speculative_decoding/README.md
@@ -72,6 +72,17 @@ only apply to model-based methods such as `draft_model`, `mtp`, `eagle3`, and
| `rejection_sample_method` | `string` | `strict` | `strict`, `probabilistic`, or `synthetic`. |
| `synthetic_acceptance_rate` | `float` | `None` | Average acceptance rate to target when `rejection_sample_method` is `synthetic`. Valid range is `[0, 1]`. |

!!! note
    Gemma 4 assistant checkpoints are handled as Gemma 4 MTP speculators, not
    as generic draft models. Use `"method": "mtp"` with the assistant
    checkpoint in `model`, as shown in the [MTP guide](mtp.md#gemma-4-assistant-models).

    If startup logs show `SpeculativeConfig(method='draft_model', ...)` for a
    Gemma 4 assistant checkpoint, the installed vLLM version does not include
    Gemma 4 MTP support for that path. Upgrade to a version that includes
    Gemma 4 MTP support instead of forcing the assistant checkpoint through
    generic draft-model speculative decoding.

### Method-specific keys

#### N-gram
25 changes: 25 additions & 0 deletions docs/features/speculative_decoding/mtp.md
@@ -9,6 +9,31 @@ MTP is useful when:
- Your model natively supports MTP.
- You want model-based speculative decoding with minimal extra configuration.

## Gemma 4 Assistant Models

Gemma 4 assistant checkpoints use vLLM's Gemma 4 MTP path. They are not generic
draft models, even though they are passed through the `model` field in
`--speculative-config`.

Use `"method": "mtp"` when serving Gemma 4 with an assistant checkpoint:

```bash
vllm serve google/gemma-4-E2B-it \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--speculative-config '{"method":"mtp","model":"gg-hf-am/gemma-4-E2B-it-assistant","num_speculative_tokens":1}'
```
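
The JSON passed to `--speculative-config` is easy to mis-quote by hand. The following is a minimal sketch, not part of vLLM, that builds the same command from a Python dict. The helper name `build_serve_command` is hypothetical, and only the flags shown in the example above are used.

```python
import json
import shlex


def build_serve_command(target: str, assistant: str,
                        num_speculative_tokens: int = 1) -> str:
    """Return a shell-safe `vllm serve` command (hypothetical helper)."""
    # Same keys as the hand-written JSON in the bash example.
    spec = {
        "method": "mtp",
        "model": assistant,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Compact separators keep the JSON free of spaces, as in the bash example.
    spec_json = json.dumps(spec, separators=(",", ":"))
    args = ["vllm", "serve", target, "--speculative-config", spec_json]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


command = build_serve_command(
    "google/gemma-4-E2B-it",
    "gg-hf-am/gemma-4-E2B-it-assistant",
)
print(command)
```

Serializing the dict with `json.dumps` avoids stray quotes or trailing commas; other flags such as `--max-model-len` can be appended to `args` in the same way.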

The E2B, E4B, 26B-A4B, and 31B Gemma 4 IT assistant checkpoints are supported
when their configuration uses `model_type: gemma4_assistant`. vLLM maps those
checkpoints to `Gemma4MTPModel` internally and wires the assistant layers to
share KV cache with the target model.
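
To check ahead of time whether a checkpoint meets the `model_type` condition above, one option is to inspect its configuration directly. This is a sketch under two assumptions: the checkpoint is available as a local directory, and it contains a Hugging Face style `config.json`. The helper names are hypothetical and are not vLLM APIs.

```python
import json
from pathlib import Path

# Value quoted in the paragraph above.
GEMMA4_ASSISTANT_MODEL_TYPE = "gemma4_assistant"


def is_gemma4_assistant(config: dict) -> bool:
    """True if the parsed config declares the Gemma 4 assistant model type."""
    return config.get("model_type") == GEMMA4_ASSISTANT_MODEL_TYPE


def load_config(checkpoint_dir: str) -> dict:
    """Read config.json from a local checkpoint directory (assumed layout)."""
    config_path = Path(checkpoint_dir) / "config.json"
    return json.loads(config_path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Inline dict used for illustration instead of a real checkpoint.
    print(is_gemma4_assistant({"model_type": "gemma4_assistant"}))
```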

If an older vLLM release logs `SpeculativeConfig(method='draft_model', ...)`
for a Gemma 4 assistant checkpoint, that release is treating the assistant as a
generic draft model and may fail during initialization for multimodal Gemma 4
targets. Upgrade to a version with Gemma 4 MTP support instead.
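
A captured startup log can be scanned for this symptom. The sketch below assumes only the `SpeculativeConfig(method='...'` fragment quoted above; the rest of the log line format is unknown, and the function names are hypothetical.

```python
import re
from typing import Optional

# Matches the fragment quoted above and captures the method name.
_METHOD_RE = re.compile(r"SpeculativeConfig\(method='([^']+)'")


def speculative_method(log_text: str) -> Optional[str]:
    """Return the first speculative method named in the log, or None."""
    match = _METHOD_RE.search(log_text)
    return match.group(1) if match else None


def used_generic_draft_path(log_text: str) -> bool:
    """True if the log shows the generic draft-model path was selected."""
    return speculative_method(log_text) == "draft_model"


if __name__ == "__main__":
    # Hypothetical log line shaped after the fragment quoted above.
    sample = "INFO SpeculativeConfig(method='draft_model', ...)"
    if used_generic_draft_path(sample):
        print("Assistant was treated as a generic draft model; upgrade vLLM.")
```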

## Offline Example

```python
3 changes: 3 additions & 0 deletions docs/models/supported_models.md
@@ -659,6 +659,9 @@ Some models are supported only via the [Transformers modeling backend](#transfor
For `Gemma4ForConditionalGeneration`:
- Audio input is supported only by the `gemma-4-E2B` and `gemma-4-E4B` variants.
- The model does not ingest videos directly. However, vLLM’s Gemma 4 implementation supports video inputs by handling video processing internally. Users can send videos directly in the message structure to vLLM, where they are converted into text and image frames before being passed to the model.
- Gemma 4 assistant checkpoints for speculative decoding use vLLM's Gemma
4 MTP path, not generic draft-model speculative decoding. See the
[Gemma 4 assistant model MTP example](../features/speculative_decoding/mtp.md#gemma-4-assistant-models).

!!! note
    For `InternVLChatModel`, only InternVL2.5 with the Qwen2.5 text backbone (`OpenGVLab/InternVL2.5-1B` etc.), InternVL3, and InternVL3.5 currently support video inputs.