7 changes: 2 additions & 5 deletions docs/design/cuda_graphs_multimodal.md
@@ -52,14 +52,14 @@ For each graph replay:

When `mm_encoder_tp_mode="data"`, the manager distributes images across TP ranks using load-balanced assignment via `get_load_balance_assignment`, executes locally on each rank, then gathers results back in the original order via `tensor_model_parallel_all_gather`.
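
The overall shape of that dispatch can be pictured with a self-contained sketch. This is an illustration only, not vLLM's actual code: the greedy cost model and the simulated gather below are assumptions, while the real implementation uses `get_load_balance_assignment` and `tensor_model_parallel_all_gather`.

```python
# Minimal stand-alone sketch of load-balanced data-parallel dispatch
# (cost model and gather are illustrative assumptions, not vLLM's code).

def load_balance_assignment(costs: list[int], num_ranks: int) -> list[list[int]]:
    """Greedily assign each item to the currently least-loaded rank."""
    loads = [0] * num_ranks
    per_rank: list[list[int]] = [[] for _ in range(num_ranks)]
    # Place the heaviest items first so large images spread evenly.
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        rank = min(range(num_ranks), key=loads.__getitem__)
        per_rank[rank].append(idx)
        loads[rank] += costs[idx]
    return per_rank

def run_data_parallel(images, encode, num_ranks=2):
    costs = [img["tokens"] for img in images]   # stand-in for encoder cost
    per_rank = load_balance_assignment(costs, num_ranks)
    results = {}
    for rank in range(num_ranks):               # each rank executes locally
        for idx in per_rank[rank]:
            results[idx] = encode(images[idx])
    # The "all-gather" step: restore the original submission order.
    return [results[i] for i in range(len(images))]

out = run_data_parallel(
    [{"tokens": 1024}, {"tokens": 64}, {"tokens": 512}],
    encode=lambda img: img["tokens"] * 2,
)
assert out == [2048, 128, 1024]
```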

-### Video inference support (experimental)
+### Video inference support

Following <https://github.com/vllm-project/vllm/pull/35963> (ViT full CUDA graph support for image inference), <https://github.com/vllm-project/vllm/pull/38061> extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path handled only image inputs (`pixel_values` + `image_grid_thw`). Video inputs use different keys (`pixel_values_videos` + `video_grid_thw`) and require larger `cu_seqlens` buffers because each video item contributes multiple frames, i.e. `T` attention sequences. This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.
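
To make the buffer-sizing point concrete, here is a hedged sketch of how `cu_seqlens` grows with the grid, assuming a Qwen-style `(T, H, W)` patch-grid layout; the exact buffer construction inside vLLM may differ.

```python
import torch

# Illustrative sketch, not vLLM's exact buffer code: each vision item is
# described by (T, H, W) patch counts. An image has T == 1 and contributes
# one attention sequence of H*W tokens; a T-frame video contributes T
# sequences, so its cu_seqlens buffer needs T entries per item.

def build_cu_seqlens(grid_thw: torch.Tensor) -> torch.Tensor:
    seqlens: list[int] = []
    for t, h, w in grid_thw.tolist():
        seqlens.extend([h * w] * t)  # one sequence per temporal slice
    return torch.cumsum(torch.tensor([0] + seqlens), dim=0)

image_grid = torch.tensor([[1, 16, 16]])  # one image    -> 2 cu_seqlens entries
video_grid = torch.tensor([[8, 16, 16]])  # 8-frame clip -> 9 cu_seqlens entries
assert build_cu_seqlens(image_grid).numel() == 2
assert build_cu_seqlens(video_grid).numel() == 9
```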

!!! note
Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.

-Currently, we only support image-only or video-only inputs when enabling CUDA graph, mixed inputs (image + video) are not supported yet (we will work on it in the near future). Thus, it's recommended to turn off the image modality by `--limit-mm-per-prompt '{"image": 0}'` for video-only inputs.
+Mixed inputs (image + video) within a single prompt are now supported as well; see the sketch below.
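
As a hedged illustration of a mixed request through vLLM's offline API: the placeholder tokens and dummy assets below are assumptions for illustration only; consult the model's chat template for the exact prompt format.

```python
# Hedged sketch of a mixed image+video request; placeholder tokens and
# dummy assets are assumptions, not the model's verified prompt format.
import numpy as np
from PIL import Image
import vllm

llm = vllm.LLM(
    model="Qwen/Qwen3-VL-32B",
    compilation_config={"cudagraph_mm_encoder": True},
)

image = Image.new("RGB", (448, 448))                # stand-in image
video = np.zeros((8, 448, 448, 3), dtype=np.uint8)  # stand-in 8-frame clip

outputs = llm.generate({
    "prompt": (
        "<|vision_start|><|image_pad|><|vision_end|>"
        "<|vision_start|><|video_pad|><|vision_end|>"
        "Describe the image and the video."
    ),
    "multi_modal_data": {"image": image, "video": video},
})
print(outputs[0].outputs[0].text)
```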

## Model integration via `SupportsEncoderCudaGraph`

@@ -142,15 +142,13 @@ Enable encoder CUDA Graphs via `compilation_config`:

```bash
vllm serve Qwen/Qwen3-VL-32B \
-    --limit-mm-per-prompt '{"image": 0}' \
    --compilation-config '{"cudagraph_mm_encoder": true}'
```

With explicit budgets:

```bash
vllm serve Qwen/Qwen3-VL-32B \
-    --limit-mm-per-prompt '{"image": 0}' \
    --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [2048, 4096, 8192, 13824], "encoder_cudagraph_max_vision_items_per_batch": 8, "encoder_cudagraph_max_frames_per_batch": 64}'
```

@@ -169,7 +167,6 @@ compilation_config = {

model = vllm.LLM(
    model="Qwen/Qwen3-VL-32B",
-    limit_mm_per_prompt='{"image": 0}',
    compilation_config=compilation_config,
)
```