
fix(vllm): support mixed multimodal payloads#10225

Merged
hiyouga merged 1 commit into hiyouga:main from pjo256:vllm-mixed-modalities
Feb 28, 2026

Conversation

@pjo256
Contributor

@pjo256 pjo256 commented Feb 28, 2026

What does this PR do?

While looking into inference, I noticed that multimodal inputs are treated as mutually exclusive when building multi_modal_data, e.g. in vllm_engine.

This PR adds support for mixed modalities in multi_modal_data in vllm_engine and vllm_infer.
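The core of the change can be sketched as follows (a minimal illustration with a hypothetical helper name, not the PR's actual code): replacing the exclusive if/elif chain with independent if statements lets a single request accumulate several modalities in one multi_modal_data dictionary.

```python
def build_multi_modal_data(images=None, videos=None, audios=None):
    """Accumulate every provided modality instead of picking only the first one.

    Before the fix, an if/elif chain meant that when images were present,
    videos and audios in the same request were silently dropped.
    """
    multi_modal_data = {}
    if images:  # previously: if images ... elif videos ... elif audios
        multi_modal_data["image"] = images
    if videos:
        multi_modal_data["video"] = videos
    if audios:
        multi_modal_data["audio"] = audios
    return multi_modal_data

# A request carrying both an image and an audio clip now keeps both entries:
mixed = build_multi_modal_data(images=["img.png"], audios=["clip.wav"])
```

With the old if/elif logic, `mixed` would have contained only the `image` key.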

References

Before submitting

…_infer

Signed-off-by: Philip Ottesen <phiott256@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the VLLM inference pipeline to support inputs containing a combination of different modalities, such as images, videos, and audio, within a single request. Previously, the system was designed to handle only one type of modality at a time. The changes allow for more flexible and comprehensive multimodal processing, improving the utility of the VLLM integration for complex input scenarios.

Highlights

  • Mixed Modality Support: Enabled the processing of mixed multimodal inputs (images, videos, and audios) simultaneously within vllm_engine and vllm_infer by changing exclusive conditional logic (if/elif) to inclusive independent if statements.
  • Multi-modal Data Initialization: Initialized multi_modal_data as an empty dictionary at the beginning of the processing loop to allow accumulation of different modality data.
  • Simplified Video Metadata Handling: Streamlined the check for video_metadata_kwargs in scripts/vllm_infer.py.
Changelog
  • scripts/vllm_infer.py
    • Modified conditional logic for image, video, and audio processing from if/elif to independent if statements to allow mixed modalities.
    • Initialized multi_modal_data as an empty dictionary before processing modalities.
    • Updated the multi_modal_data parameter in vllm_input_data to explicitly handle cases where no multimodal data is present.
    • Simplified the check for video_metadata_kwargs.
  • src/llamafactory/chat/vllm_engine.py
    • Refactored conditional logic for image, video, and audio processing from if/elif to independent if statements.
    • Initialized multi_modal_data as an empty dictionary to support accumulating data from multiple modalities.
    • Adjusted the multi_modal_data parameter passed to self.model.generate to correctly handle empty multimodal data.
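The last changelog point — handling the case where no multimodal data is present — plausibly corresponds to passing None rather than an empty dict for text-only prompts, so the engine call omits multimodal data entirely. A sketch under that assumption (the function name is invented for illustration):

```python
from typing import Optional


def to_engine_kwarg(multi_modal_data: dict) -> Optional[dict]:
    """Return the accumulated modality dict, or None if nothing was collected.

    Passing None for a text-only prompt avoids handing the engine an
    empty {} where it expects either real multimodal data or no entry.
    """
    return multi_modal_data if multi_modal_data else None
```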

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates vllm_infer.py and src/llamafactory/chat/vllm_engine.py to support mixed multimodal payloads for vLLM inference. The previous implementation processed multimodal inputs like images, videos, and audios in a mutually exclusive manner using if/elif statements. This has been changed to a series of if statements, allowing multiple modalities to be included in the multi_modal_data dictionary for a single request. This change aligns with vLLM's support for mixed-modality inputs. The implementation appears correct and addresses the described issue.
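For illustration of what the review describes (names and placeholder values are assumed, not taken from the diff), a single vLLM request after this change can carry several modalities at once in its multi_modal_data dictionary:

```python
# Illustrative shape of one mixed-modality request entry. The concrete media
# objects (PIL images, audio arrays, etc.) depend on the model; simple string
# placeholders stand in for them here.
vllm_input_data = {
    "prompt": "<image>\n<audio>\nDescribe what you see and hear.",
    "multi_modal_data": {
        "image": ["frame.png"],   # placeholder for an image object
        "audio": ["speech.wav"],  # placeholder for an audio object
    },
}
```

Under the old if/elif logic, only one of the two keys could have been populated per request.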

Owner

@hiyouga hiyouga left a comment


LGTM

@hiyouga hiyouga merged commit 0779846 into hiyouga:main Feb 28, 2026
17 checks passed
@hiyouga hiyouga added the solved This problem has been already solved label Feb 28, 2026
Rheane116 pushed a commit to Rheane116/LlamaFactory that referenced this pull request Mar 30, 2026
Signed-off-by: Philip Ottesen <phiott256@gmail.com>