[MM][Misc] Support image+video mixed inputs (per prompt) for VLM examples #40335

Merged
Isotr0py merged 5 commits into vllm-project:main from shen-shanshan:vlm-test
Apr 21, 2026
Conversation

shen-shanshan (Contributor) commented Apr 20, 2026

Purpose

The VLM offline inference example (examples/offline_inference/vision_language.py) previously only accepted a single modality per run — either image or video. Users who wanted to test vision-language models on prompts that combine both an image and a video in the same context had no way to do so through the example script.

This PR adds a new --modality "image+video" option to the example, enabling per-prompt mixed image+video inputs. When this modality is selected, limit_mm_per_prompt is set to {"image": 1, "video": 1} and the appropriate combined placeholder token string is constructed for each model's chat template. The change covers ~20 VLM model runner functions and also fixes several pre-existing placeholder wrapping bugs (e.g., Qwen-series models had placeholder tokens double-wrapped inside vision boundary markers).

Key Changes:

  • mm_limit pattern (all updated model runners): Replaces the hard-coded {modality: 1} with {"image": 1, "video": 1} if modality == "image+video" else {modality: 1} so EngineArgs.limit_mm_per_prompt correctly allows one image and one video per prompt.
  • elif modality == "image+video": placeholder branches (all updated model runners): Concatenates the model-specific image placeholder and video placeholder tokens into a single prompt prefix string.
  • run_hyperclovax_seed_vision: Adds a message-content-list branch that inserts both {"type": "image", ...} and {"type": "video"} content blocks; also widens max_model_len to 16384 for image+video (matching video-only behavior).
  • run_minicpmv_base: Replaces the modality_placeholder dict lookup with an explicit content_prefix variable that concatenates image and video template strings for image+video.
  • run_llava_onevision: Adds image+video prompt template branch before the EngineArgs construction block.
  • Placeholder correctness fixes (EXAONE-4.5, Keye-VL/1.5, Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni, Qwen3-VL, Qwen3-VL-MoE): Moves vision boundary markers (<|vision_start|>...<|vision_end|>, <|vision_bos|>...<|vision_eos|>) from the outer prompt template string into the per-modality placeholder variable, so all three branches (image / video / image+video) are consistent.
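The per-runner pattern described in the bullets above can be sketched as follows. This is a minimal illustration only: `build_mm_limit` and `build_placeholder` are hypothetical helper names (the actual example script inlines this logic in each model's runner function), and the Qwen-style vision boundary tokens are used as the example placeholders.

```python
def build_mm_limit(modality: str) -> dict[str, int]:
    # For the combined modality, allow one image AND one video per prompt;
    # otherwise keep the original single-modality limit. This dict is what
    # gets passed to EngineArgs as limit_mm_per_prompt.
    if modality == "image+video":
        return {"image": 1, "video": 1}
    return {modality: 1}


def build_placeholder(modality: str) -> str:
    # Qwen-style boundary markers wrap each modality's pad token; for the
    # image+video case the two fully wrapped placeholders are concatenated
    # into a single prompt prefix.
    image_ph = "<|vision_start|><|image_pad|><|vision_end|>"
    video_ph = "<|vision_start|><|video_pad|><|vision_end|>"
    if modality == "image":
        return image_ph
    if modality == "video":
        return video_ph
    if modality == "image+video":
        return image_ph + video_ph
    raise ValueError(f"unsupported modality: {modality}")
```

Keeping the boundary markers inside the per-modality placeholder variable (rather than in the outer prompt template) is what makes all three branches consistent and avoids the double-wrapping bug the PR fixes.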

Models with added image+video support:

ERNIE-4.5-VL, EXAONE-4.5, GLM-4.1V, GLM-4.5V, GLM-4.5V-FP8, GLM-OCR, HyperCLOVAX-SEED-Vision, Intern-S1, Intern-S1-Pro, InternVL3, Keye-VL, Keye-VL-1.5, LLaVA-OneVision, MiniCPM-V series, Molmo2, openPangu-VL, Ovis2.5, Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni, Qwen3-VL, Qwen3-VL-MoE, Tarsier2.

Test Plan

python examples/offline_inference/vision_language.py -m qwen3_vl --modality "image+video"

Test Result

--------------------------------------------------
The image shows a baby sitting on a bed, wearing glasses, and reading a book. The baby is focused on the book, turning the pages with curiosity and interest. The baby's glasses add a charming and adorable touch to the scene, making it seem like the baby is taking the reading activity seriously. The baby's movements are gentle and deliberate as they turn the pages, indicating a sense of engagement and enjoyment in the activity. The overall atmosphere of the image is cozy and heartwarming, capturing a sweet moment of a baby exploring the world of books.

In the video, the baby continues to read the book, occasionally looking up and smiling, showing their enjoyment and fascination with the story. The baby's glasses remain on, adding to the charm of the scene. The baby's movements are gentle and deliberate as they turn the pages, indicating a sense of engagement and enjoyment in the activity. The overall atmosphere of the video is cozy and heartwarming, capturing a sweet moment of a baby exploring
--------------------------------------------------
The image shows a baby wearing glasses, sitting on a bed and reading a book. The baby is dressed in a light blue shirt and pink pants. The baby is holding the book with both hands and appears to be focused on reading. The background shows a bedroom with a crib and some clothes on the bed.

In the video, the baby continues to read the book, turning the pages and occasionally looking up. The baby seems to be enjoying the activity and is fully engaged in reading. The video captures the baby's concentration and curiosity as they explore the book.
--------------------------------------------------
The image shows a baby girl sitting on a bed, wearing glasses and reading a book. She is surrounded by a cozy bedroom setting with a crib and clothes in the background.

In the video, the baby girl continues to read the book, turning the pages and occasionally looking up. She seems to be enjoying her reading time, fully immersed in the story. The video captures a sweet and adorable moment of a young child engaging in a quiet, intellectual activity.
--------------------------------------------------
The image shows a baby girl sitting on a bed, wearing glasses and reading a book. She is focused on the book, turning the pages and occasionally looking up. The background includes a crib and some clothes on the bed.

In the video, the baby girl continues to read the book, turning the pages and looking at the text. She occasionally looks up and smiles, showing her enjoyment of the activity. The video captures the baby's concentration and curiosity as she engages with the book.
--------------------------------------------------


Signed-off-by: shen-shanshan <467638484@qq.com>
claude Bot commented:

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify Bot (Contributor) commented Apr 20, 2026

Documentation preview: https://vllm--40335.org.readthedocs.build/en/40335/

mergify Bot added the "documentation" (Improvements or additions to documentation) label Apr 20, 2026
gemini-code-assist Bot (Contributor) commented:

Code Review

This pull request adds support for a combined 'image+video' modality to the offline inference example script, updating configurations and prompt templates for numerous vision-language models. It also introduces helper functions to manage multi-modal data and UUIDs for the new modality. A review comment identifies a hardcoded local file path for the Qwen3-VL model name and suggests using a public Hugging Face ID to maintain community accessibility.
Comment thread: examples/offline_inference/vision_language.py (outdated)
Signed-off-by: shen-shanshan <467638484@qq.com>
Comment thread: examples/offline_inference/vision_language.py
Signed-off-by: shen-shanshan <467638484@qq.com>
Comment on lines +2114 to +2121:

         placeholder = "<|vision_start|><|image_pad|><|vision_end|>"
     elif modality == "video":
-        placeholder = "<|video_pad|>"
+        placeholder = "<|vision_start|><|video_pad|><|vision_end|>"
+    elif modality == "image+video":
+        placeholder = (
+            "<|vision_start|><|image_pad|><|vision_end|>"
+            "<|vision_start|><|video_pad|><|vision_end|>"
+        )
Member commented:

Can also concat placeholder here.
shen-shanshan (Contributor, Author) replied:

Sorry for missing this, I will update it now.

Signed-off-by: shen-shanshan <467638484@qq.com>
@Isotr0py Isotr0py enabled auto-merge (squash) April 21, 2026 02:58
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 21, 2026
DarkLight1337 (Member) commented:

/gemini review

gemini-code-assist Bot (Contributor) commented:
Code Review

This pull request introduces support for a combined 'image+video' modality across various vision-language models in the offline inference example. Key changes include updating model-specific functions to configure multi-modal limits and placeholders, modifying prompt construction to include both image and video tags, and implementing helper functions in the main loop to handle the complex data structures required for multiple concurrent modalities. The argument parser was also updated to accept the new modality. I have no feedback to provide.

@Isotr0py Isotr0py merged commit b478400 into vllm-project:main Apr 21, 2026
15 checks passed
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request Apr 22, 2026
…ples (vllm-project#40335)

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Apr 23, 2026
yzong-rh pushed a commit to yzong-rh/vllm that referenced this pull request Apr 23, 2026
…ples (vllm-project#40335)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Yifan <yzong@redhat.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
…ples (vllm-project#40335)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Lafunamor pushed a commit to Lafunamor/vllm that referenced this pull request May 1, 2026
…ples (vllm-project#40335)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Adrian <info@zzit.ch>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…ples (vllm-project#40335)

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

Labels

documentation Improvements or additions to documentation ready ONLY add when PR is ready to merge/full CI is needed
