
fix(vllm): support mixed multimodal payloads#10225

Merged
hiyouga merged 1 commit into hiyouga:main from pjo256:vllm-mixed-modalities
Feb 28, 2026

Conversation

@pjo256
Contributor

@pjo256 pjo256 commented Feb 28, 2026

What does this PR do?

While looking into inference, I noticed that multimodal inputs are treated as mutually exclusive when building multi_modal_data, e.g. in vllm_engine.

This PR adds support for mixed modalities in multi_modal_data in vllm_engine and vllm_infer.
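The core of the change can be sketched as follows (a minimal illustration with a hypothetical helper name, not the PR's actual code): replacing the exclusive if/elif chain with independent if statements lets a single request accumulate several modalities in one multi_modal_data dictionary.

```python
def build_multi_modal_data(images=None, videos=None, audios=None):
    """Accumulate every provided modality instead of picking only the first one.

    Before the fix, an if/elif chain meant that when images were present,
    videos and audios in the same request were silently dropped.
    """
    multi_modal_data = {}
    if images:  # previously: if images ... elif videos ... elif audios
        multi_modal_data["image"] = images
    if videos:
        multi_modal_data["video"] = videos
    if audios:
        multi_modal_data["audio"] = audios
    return multi_modal_data

# A request carrying both an image and an audio clip now keeps both entries:
mixed = build_multi_modal_data(images=["img.png"], audios=["clip.wav"])
```

With the old if/elif logic, `mixed` would have contained only the `image` key.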

References

Before submitting

…_infer

Signed-off-by: Philip Ottesen <phiott256@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the VLLM inference pipeline to support inputs containing a combination of different modalities, such as images, videos, and audio, within a single request. Previously, the system was designed to handle only one type of modality at a time. The changes allow for more flexible and comprehensive multimodal processing, improving the utility of the VLLM integration for complex input scenarios.

Highlights

  • Mixed Modality Support: Enabled the processing of mixed multimodal inputs (images, videos, and audios) simultaneously within vllm_engine and vllm_infer by changing exclusive conditional logic (if/elif) to inclusive independent if statements.
  • Multi-modal Data Initialization: Initialized multi_modal_data as an empty dictionary at the beginning of the processing loop to allow accumulation of different modality data.
  • Simplified Video Metadata Handling: Streamlined the check for video_metadata_kwargs in scripts/vllm_infer.py.
Changelog
  • scripts/vllm_infer.py
    • Modified conditional logic for image, video, and audio processing from if/elif to independent if statements to allow mixed modalities.
    • Initialized multi_modal_data as an empty dictionary before processing modalities.
    • Updated the multi_modal_data parameter in vllm_input_data to explicitly handle cases where no multimodal data is present.
    • Simplified the check for video_metadata_kwargs.
  • src/llamafactory/chat/vllm_engine.py
    • Refactored conditional logic for image, video, and audio processing from if/elif to independent if statements.
    • Initialized multi_modal_data as an empty dictionary to support accumulating data from multiple modalities.
    • Adjusted the multi_modal_data parameter passed to self.model.generate to correctly handle empty multimodal data.
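The last changelog point — handling the case where no multimodal data is present — plausibly corresponds to passing None rather than an empty dict for text-only prompts, so the engine call omits multimodal data entirely. A sketch under that assumption (the function name is invented for illustration):

```python
from typing import Optional


def to_engine_kwarg(multi_modal_data: dict) -> Optional[dict]:
    """Return the accumulated modality dict, or None if nothing was collected.

    Passing None for a text-only prompt avoids handing the engine an
    empty {} where it expects either real multimodal data or no entry.
    """
    return multi_modal_data if multi_modal_data else None
```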

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates vllm_infer.py and src/llamafactory/chat/vllm_engine.py to support mixed multimodal payloads for vLLM inference. The previous implementation processed multimodal inputs like images, videos, and audios in a mutually exclusive manner using if/elif statements. This has been changed to a series of if statements, allowing multiple modalities to be included in the multi_modal_data dictionary for a single request. This change aligns with vLLM's support for mixed-modality inputs. The implementation appears correct and addresses the described issue.
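For illustration of what the review describes (names and placeholder values are assumed, not taken from the diff), a single vLLM request after this change can carry several modalities at once in its multi_modal_data dictionary:

```python
# Illustrative shape of one mixed-modality request entry. The concrete media
# objects (PIL images, audio arrays, etc.) depend on the model; simple string
# placeholders stand in for them here.
vllm_input_data = {
    "prompt": "<image>\n<audio>\nDescribe what you see and hear.",
    "multi_modal_data": {
        "image": ["frame.png"],   # placeholder for an image object
        "audio": ["speech.wav"],  # placeholder for an audio object
    },
}
```

Under the old if/elif logic, only one of the two keys could have been populated per request.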

Owner

@hiyouga hiyouga left a comment


LGTM

@hiyouga hiyouga merged commit 0779846 into hiyouga:main Feb 28, 2026
17 checks passed
@hiyouga hiyouga added the solved This problem has been already solved label Feb 28, 2026
Rheane116 pushed a commit to Rheane116/LlamaFactory that referenced this pull request Mar 30, 2026
Signed-off-by: Philip Ottesen <phiott256@gmail.com>