fix(vllm): support mixed multimodal payloads (#10225)
Conversation
Summary of Changes: This pull request enhances the vLLM inference pipeline to support inputs that combine different modalities, such as images, videos, and audio, within a single request. Previously, the system handled only one modality type at a time. The changes allow more flexible and comprehensive multimodal processing, improving the utility of the vLLM integration for complex input scenarios.
Code Review
This pull request updates `vllm_infer.py` and `src/llamafactory/chat/vllm_engine.py` to support mixed multimodal payloads for vLLM inference. The previous implementation processed multimodal inputs such as images, videos, and audio in a mutually exclusive manner using `if/elif` statements. These have been changed to a series of independent `if` statements, allowing multiple modalities to be included in the `multi_modal_data` dictionary for a single request. This aligns with vLLM's support for mixed-modality inputs. The implementation appears correct and addresses the described issue.
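To illustrate the shape of the change described above, here is a minimal, hypothetical sketch (the function name and argument names are illustrative, not the actual code from this PR): independent `if` checks let several modalities coexist in the `multi_modal_data` dictionary, where the old `if/elif` chain would keep only the first match.

```python
def build_multi_modal_data(images=None, videos=None, audios=None):
    """Assemble the multi_modal_data dict for a single vLLM request.

    Hypothetical sketch: independent `if` statements (not `if/elif`)
    so that images, videos, and audio can all appear together.
    """
    multi_modal_data = {}
    if images:
        multi_modal_data["image"] = images
    if videos:
        multi_modal_data["video"] = videos
    if audios:
        multi_modal_data["audio"] = audios
    # Return None when no modality is present, mirroring text-only requests.
    return multi_modal_data or None
```

With the previous `if/elif` structure, a request carrying both an image and an audio clip would silently drop the audio; with independent checks, both keys are populated.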
Signed-off-by: Philip Ottesen <phiott256@gmail.com>
What does this PR do?
While looking into inference, I noticed that multimodal inputs are mutually exclusive in `multi_modal_data`, e.g. in `vllm_engine`. This PR adds support for mixed modalities in `multi_modal_data` in `vllm_engine` and `vllm_infer`.
References
Before submitting