[MM][Misc] Support image+video mixed inputs (per prompt) for VLM examples #40335
Isotr0py merged 5 commits into vllm-project:main
Conversation
Signed-off-by: shen-shanshan <467638484@qq.com>
Documentation preview: https://vllm--40335.org.readthedocs.build/en/40335/
Code Review
This pull request adds support for a combined 'image+video' modality to the offline inference example script, updating configurations and prompt templates for numerous vision-language models. It also introduces helper functions to manage multi-modal data and UUIDs for the new modality. A review comment identifies a hardcoded local file path for the Qwen3-VL model name and suggests using a public Hugging Face ID to maintain community accessibility.
Signed-off-by: shen-shanshan <467638484@qq.com>
```diff
     placeholder = "<|vision_start|><|image_pad|><|vision_end|>"
 elif modality == "video":
-    placeholder = "<|video_pad|>"
+    placeholder = "<|vision_start|><|video_pad|><|vision_end|>"
+elif modality == "image+video":
+    placeholder = (
+        "<|vision_start|><|image_pad|><|vision_end|>"
+        "<|vision_start|><|video_pad|><|vision_end|>"
+    )
```
Can also concat placeholder here.
Sorry for missing this, I will update it now.
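For context, here is a minimal sketch of how the concatenated placeholder above ends up in the final prompt, assuming the Qwen-style chat template used by the example (the `question` text is illustrative, not from the PR):

```python
# Minimal sketch: how the image+video placeholder is consumed by a
# Qwen-style chat template (an assumption; not the exact example code).
placeholder = (
    "<|vision_start|><|image_pad|><|vision_end|>"
    "<|vision_start|><|video_pad|><|vision_end|>"
)
question = "Describe the image, then summarize the video."  # illustrative

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n{placeholder}{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```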
/gemini review
Code Review
This pull request introduces support for a combined 'image+video' modality across various vision-language models in the offline inference example. Key changes include updating model-specific functions to configure multi-modal limits and placeholders, modifying prompt construction to include both image and video tags, and implementing helper functions in the main loop to handle the complex data structures required for multiple concurrent modalities. The argument parser was also updated to accept the new modality. I have no feedback to provide.
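The helper functions the review mentions for the mixed-modality data structures are not shown in this excerpt; the following is a hedged sketch of the idea (the name `get_multi_modal_data` and its signature are assumptions, not the PR's actual code):

```python
# Hypothetical helper (name and signature assumed) for building the
# multi_modal_data dict that vLLM expects per prompt.
def get_multi_modal_data(modality: str, image, video) -> dict:
    if modality == "image+video":
        # Both assets in one prompt; keys match limit_mm_per_prompt.
        return {"image": image, "video": video}
    if modality == "image":
        return {"image": image}
    return {"video": video}


# Usage: vLLM's generate() accepts a dict with "prompt" and
# "multi_modal_data" keys.
# outputs = llm.generate(
#     {
#         "prompt": prompt,
#         "multi_modal_data": get_multi_modal_data(modality, image, video),
#     },
#     sampling_params,
# )
```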
Purpose
The VLM offline inference example (`examples/offline_inference/vision_language.py`) previously only accepted a single modality per run: either `image` or `video`. Users who wanted to test vision-language models on prompts that combine both an image and a video in the same context had no way to do so through the example script.

This PR adds a new `--modality "image+video"` option to the example, enabling per-prompt mixed image+video inputs. When this modality is selected, `limit_mm_per_prompt` is set to `{"image": 1, "video": 1}` and the appropriate combined placeholder token string is constructed for each model's chat template. The change covers ~20 VLM model runner functions and also fixes several pre-existing placeholder wrapping bugs (e.g., Qwen-series models had placeholder tokens double-wrapped inside vision boundary markers).

Key Changes:

- `mm_limit` pattern (all updated model runners): replaces the hard-coded `{modality: 1}` with `{"image": 1, "video": 1} if modality == "image+video" else {modality: 1}` so `EngineArgs.limit_mm_per_prompt` correctly allows one image and one video per prompt (see the sketch after the model list below).
- `elif modality == "image+video":` placeholder branches (all updated model runners): concatenates the model-specific image placeholder and video placeholder tokens into a single prompt prefix string.
- `run_hyperclovax_seed_vision`: adds a message-content-list branch that inserts both `{"type": "image", ...}` and `{"type": "video"}` content blocks; also widens `max_model_len` to `16384` for `image+video` (matching video-only behavior).
- `run_minicpmv_base`: replaces the `modality_placeholder` dict lookup with an explicit `content_prefix` variable that concatenates the image and video template strings for `image+video`.
- `run_llava_onevision`: adds an `image+video` prompt template branch before the `EngineArgs` construction block.
- Vision boundary markers (`<|vision_start|>...<|vision_end|>`, `<|vision_bos|>...<|vision_eos|>`): moves them from the outer prompt template string into the per-modality placeholder variable, so all three branches (image / video / image+video) are consistent.

Models with added `image+video` support: ERNIE-4.5-VL, EXAONE-4.5, GLM-4.1V, GLM-4.5V, GLM-4.5V-FP8, GLM-OCR, HyperCLOVAX-SEED-Vision, Intern-S1, Intern-S1-Pro, InternVL3, Keye-VL, Keye-VL-1.5, LLaVA-OneVision, MiniCPM-V series, Molmo2, openPangu-VL, Ovis2.5, Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni, Qwen3-VL, Qwen3-VL-MoE, Tarsier2.
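As referenced in the `mm_limit` bullet above, here is a condensed sketch of the per-runner pattern (the model ID below is an illustrative assumption; each runner updated by this PR follows the same shape):

```python
# Sketch of the mm_limit pattern applied across the updated runners.
from vllm import EngineArgs


def build_engine_args(modality: str) -> EngineArgs:
    # The mixed modality allows one image AND one video per prompt;
    # single modalities keep the original one-item limit.
    mm_limit = (
        {"image": 1, "video": 1}
        if modality == "image+video"
        else {modality: 1}
    )
    return EngineArgs(
        model="Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative model ID
        max_model_len=16384,
        limit_mm_per_prompt=mm_limit,
    )
```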
Test Plan
```bash
python examples/offline_inference/vision_language.py -m qwen3_vl --modality "image+video"
```

Test Result
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.