Support multiple image/audio embeddings per request #29988
DarkLight1337 merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request successfully enables support for multiple image and audio embeddings per request by updating the MultiModalItemTracker to handle lists of embeddings. The changes in vllm/entrypoints/chat_utils.py are correct and effectively remove the previous limitation. The new unit tests are also well-designed to cover these new capabilities. However, this change to the data structure for embeddings (from a single item to a list) breaks several existing unit tests that were written with the single-embedding assumption. I've identified these failing tests and provided suggestions to update them. Addressing these test failures is critical for merging this PR.
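The core of the change — accumulating embeddings per modality in a list instead of rejecting a second item — can be sketched as follows. This is a minimal, hypothetical stand-in for the actual `MultiModalItemTracker` in `vllm/entrypoints/chat_utils.py`; the class shape and method names here are simplified assumptions, not the real API:

```python
from collections import defaultdict


class MultiModalItemTracker:
    """Simplified sketch: track multi-modal items per modality as lists."""

    def __init__(self, limit_per_prompt: dict[str, int]):
        self._limit_per_prompt = limit_per_prompt
        self._items: dict[str, list] = defaultdict(list)

    def add(self, modality: str, item) -> None:
        # Before this PR (conceptually): a second embedding for the same
        # modality was rejected. Now: append, then validate against the
        # per-prompt limit, just like regular images/audios.
        self._items[modality].append(item)
        limit = self._limit_per_prompt.get(modality, 1)
        if len(self._items[modality]) > limit:
            raise ValueError(
                f"At most {limit} {modality} item(s) allowed per prompt, "
                f"got {len(self._items[modality])}")

    def all_items(self) -> dict[str, list]:
        return dict(self._items)


tracker = MultiModalItemTracker({"image": 2})
tracker.add("image", "emb_a")
tracker.add("image", "emb_b")
assert tracker.all_items() == {"image": ["emb_a", "emb_b"]}
```

A third `tracker.add("image", ...)` here would raise, mirroring how the per-prompt limit bounds the new list-based behavior.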
Force-pushed from c4e242c to fa53ab7
💡 Codex Review:
- https://github.com/vllm-project/vllm/blob/c4e242c86de0203e5a2a137cce1cacda6e226705/vllm/entrypoints/chat_utils.py#L725-L727 (Mapping …)
- https://github.com/vllm-project/vllm/blob/c4e242c86de0203e5a2a137cce1cacda6e226705/vllm/entrypoints/chat_utils.py#L730-L732 (Setting …)
Force-pushed from fa53ab7 to f4e251f
DarkLight1337 left a comment:
Thanks, can you update the Multimodal Inputs documentation page accordingly as well?
Force-pushed from f4e251f to 0306c96
Documentation preview: https://vllm--29988.org.readthedocs.build/en/29988/
Here is the doc!
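For context on what the updated docs describe, here is a sketch of a Chat Completions request body carrying two pre-computed image embeddings in a single message — the behavior this PR enables. The `image_embeds` content-part shape and the placeholder model name are illustrative assumptions; the exact schema is in the Multimodal Inputs docs page linked above:

```python
import json

# Hypothetical request body: two pre-computed image embeddings in one chat
# message. Field names are illustrative; see the vLLM docs for the schema.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_embeds", "image_embeds": "<base64 tensor 1>"},
            {"type": "image_embeds", "image_embeds": "<base64 tensor 2>"},
            {"type": "text", "text": "Compare these two images."},
        ],
    }
]

body = json.dumps({"model": "<multimodal-model>", "messages": messages})
num_embeds = sum(
    part["type"] == "image_embeds" for part in messages[0]["content"])
assert num_embeds == 2
```

Before this PR, a request like this would have been rejected once a second embedding part appeared; now the items are tracked as a list and checked against the per-modality limit.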
|
#29970 just got merged, can you update your code to use the
docs/features/multimodal_inputs.md (outdated)

@@ -473,6 +473,9 @@ You can pass pre-computed audio embeddings similar to image embeddings:
    print(generated_text)
+
+!!! note
This section is under offline inference, so I think it's not related?
Force-pushed from ee06b3c to dbc789a
I have fixed the DCO for you; next time, please sign off your commits.
Head branch was pushed to by a user without write access
Force-pushed from e28f687 to 2285558
This enables the Chat Completions API to leverage the model's existing capability for multiple embeddings.

- Remove limitation that only allowed one message with image_embeds/audio_embeds
- Update MultiModalItemTracker and AsyncMultiModalItemTracker to treat embeddings as lists
- Add unit tests for multiple image embeddings support:
  * test_parse_chat_messages_multiple_image_embeds
  * test_parse_chat_messages_multiple_image_embeds_with_uuids
  * test_parse_chat_messages_multiple_image_embeds_async
- Embeddings now behave consistently with regular images/audios
- Validation via existing validate_num_items() against --limit-mm-per-prompt

Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Force-pushed from b0a69f7 to 174fd06
This enables the Chat Completions API to leverage the model's existing capability for multiple embeddings, previously only accessible through the direct LLM inference API.
- Remove limitation that only allowed one message with image_embeds/audio_embeds
- Update MultiModalItemTracker and AsyncMultiModalItemTracker to treat embeddings as lists
- Embeddings now behave consistently with regular images/audios
- Validation via existing validate_num_items() against --limit-mm-per-prompt
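The limit check mentioned in the last bullet can be sketched as follows — a simplified, free-standing stand-in for the existing `validate_num_items()`; the real function's signature and location in vLLM's multi-modal code may differ:

```python
def validate_num_items(modality: str, num_items: int,
                       limit_mm_per_prompt: dict[str, int]) -> None:
    # Simplified sketch of the per-prompt limit check driven by
    # --limit-mm-per-prompt; assumes a default of 1 item per modality.
    limit = limit_mm_per_prompt.get(modality, 1)
    if num_items > limit:
        raise ValueError(
            f"Got {num_items} {modality} item(s), but --limit-mm-per-prompt "
            f"allows at most {limit} for this modality.")


validate_num_items("image", 2, {"image": 4})  # within the limit: no error

exceeded = False
try:
    validate_num_items("audio", 3, {"audio": 1})
except ValueError:
    exceeded = True
assert exceeded
```

Because embeddings now flow through the same list-based tracking as regular images and audios, this one check is enough to bound them; no embedding-specific limit logic is needed.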
Test Plan
Test Result: passed