[MM][Misc] Support image+video mixed inputs (per prompt) for VLM examples #40335
Isotr0py merged 5 commits into vllm-project:main
Conversation
Signed-off-by: shen-shanshan <467638484@qq.com>
Documentation preview: https://vllm--40335.org.readthedocs.build/en/40335/
Code Review
This pull request adds support for a combined 'image+video' modality to the offline inference example script, updating configurations and prompt templates for numerous vision-language models. It also introduces helper functions to manage multi-modal data and UUIDs for the new modality. A review comment identifies a hardcoded local file path for the Qwen3-VL model name and suggests using a public Hugging Face ID to maintain community accessibility.
Signed-off-by: shen-shanshan <467638484@qq.com>
```diff
     placeholder = "<|vision_start|><|image_pad|><|vision_end|>"
 elif modality == "video":
-    placeholder = "<|video_pad|>"
+    placeholder = "<|vision_start|><|video_pad|><|vision_end|>"
+elif modality == "image+video":
+    placeholder = (
+        "<|vision_start|><|image_pad|><|vision_end|>"
+        "<|vision_start|><|video_pad|><|vision_end|>"
+    )
```
Can also concat placeholder here.
Sorry for missing this, I will update it now.
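For context, here is a minimal sketch of how the concatenated placeholder above ends up in the final prompt, assuming the Qwen-style chat template used by the example (the `question` text is illustrative, not from the PR):

```python
# Minimal sketch: how the image+video placeholder is consumed by a
# Qwen-style chat template (an assumption; not the exact example code).
placeholder = (
    "<|vision_start|><|image_pad|><|vision_end|>"
    "<|vision_start|><|video_pad|><|vision_end|>"
)
question = "Describe the image, then summarize the video."  # illustrative

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n{placeholder}{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```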
/gemini review
Code Review
This pull request introduces support for a combined 'image+video' modality across various vision-language models in the offline inference example. Key changes include updating model-specific functions to configure multi-modal limits and placeholders, modifying prompt construction to include both image and video tags, and implementing helper functions in the main loop to handle the complex data structures required for multiple concurrent modalities. The argument parser was also updated to accept the new modality. I have no feedback to provide.
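The helper functions the review mentions for the mixed-modality data structures are not shown in this excerpt; the following is a hedged sketch of the idea (the name `get_multi_modal_data` and its signature are assumptions, not the PR's actual code):

```python
# Hypothetical helper (name and signature assumed) for building the
# multi_modal_data dict that vLLM expects per prompt.
def get_multi_modal_data(modality: str, image, video) -> dict:
    if modality == "image+video":
        # Both assets in one prompt; keys match limit_mm_per_prompt.
        return {"image": image, "video": video}
    if modality == "image":
        return {"image": image}
    return {"video": video}


# Usage: vLLM's generate() accepts a dict with "prompt" and
# "multi_modal_data" keys.
# outputs = llm.generate(
#     {
#         "prompt": prompt,
#         "multi_modal_data": get_multi_modal_data(modality, image, video),
#     },
#     sampling_params,
# )
```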
Purpose
The VLM offline inference example (`examples/offline_inference/vision_language.py`) previously only accepted a single modality per run: either `image` or `video`. Users who wanted to test vision-language models on prompts that combine both an image and a video in the same context had no way to do so through the example script.

This PR adds a new `--modality "image+video"` option to the example, enabling per-prompt mixed image+video inputs. When this modality is selected, `limit_mm_per_prompt` is set to `{"image": 1, "video": 1}` and the appropriate combined placeholder token string is constructed for each model's chat template. The change covers ~20 VLM model runner functions and also fixes several pre-existing placeholder wrapping bugs (e.g., Qwen-series models had placeholder tokens double-wrapped inside vision boundary markers).

Key Changes:

- `mm_limit` pattern (all updated model runners): replaces the hard-coded `{modality: 1}` with `{"image": 1, "video": 1} if modality == "image+video" else {modality: 1}` so `EngineArgs.limit_mm_per_prompt` correctly allows one image and one video per prompt (see the sketch after the model list below).
- `elif modality == "image+video":` placeholder branches (all updated model runners): concatenates the model-specific image placeholder and video placeholder tokens into a single prompt prefix string.
- `run_hyperclovax_seed_vision`: adds a message-content-list branch that inserts both `{"type": "image", ...}` and `{"type": "video"}` content blocks; also widens `max_model_len` to `16384` for `image+video` (matching video-only behavior).
- `run_minicpmv_base`: replaces the `modality_placeholder` dict lookup with an explicit `content_prefix` variable that concatenates the image and video template strings for `image+video`.
- `run_llava_onevision`: adds an `image+video` prompt template branch before the `EngineArgs` construction block.
- Vision boundary markers (`<|vision_start|>...<|vision_end|>`, `<|vision_bos|>...<|vision_eos|>`): moves them from the outer prompt template string into the per-modality placeholder variable, so all three branches (image / video / image+video) are consistent.

Models with added `image+video` support: ERNIE-4.5-VL, EXAONE-4.5, GLM-4.1V, GLM-4.5V, GLM-4.5V-FP8, GLM-OCR, HyperCLOVAX-SEED-Vision, Intern-S1, Intern-S1-Pro, InternVL3, Keye-VL, Keye-VL-1.5, LLaVA-OneVision, MiniCPM-V series, Molmo2, openPangu-VL, Ovis2.5, Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni, Qwen3-VL, Qwen3-VL-MoE, Tarsier2.
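As referenced in the `mm_limit` bullet above, here is a condensed sketch of the per-runner pattern (the model ID below is an illustrative assumption; each runner updated by this PR follows the same shape):

```python
# Sketch of the mm_limit pattern applied across the updated runners.
from vllm import EngineArgs


def build_engine_args(modality: str) -> EngineArgs:
    # The mixed modality allows one image AND one video per prompt;
    # single modalities keep the original one-item limit.
    mm_limit = (
        {"image": 1, "video": 1}
        if modality == "image+video"
        else {modality: 1}
    )
    return EngineArgs(
        model="Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative model ID
        max_model_len=16384,
        limit_mm_per_prompt=mm_limit,
    )
```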
Test Plan
```bash
python examples/offline_inference/vision_language.py -m qwen3_vl --modality "image+video"
```

Test Result
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.