Support multiple image/audio embeddings per request #29988

Merged
DarkLight1337 merged 3 commits into vllm-project:main from jeremyteboul:multi_image_enbeddings
Dec 7, 2025

Conversation

@jeremyteboul
Contributor

@jeremyteboul jeremyteboul commented Dec 3, 2025

This enables the Chat Completions API to leverage the model's existing capability for multiple embeddings, previously only accessible through the direct LLM inference API.

  • Remove the limitation that allowed only one message with image_embeds/audio_embeds

  • Update MultiModalItemTracker and AsyncMultiModalItemTracker to treat embeddings as lists

  • Embeddings now behave consistently with regular images and audio

  • Validation via existing validate_num_items() against --limit-mm-per-prompt
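With the limitation removed, a single Chat Completions request can carry several embedding parts. The sketch below builds such a payload; the `image_embeds` content-part shape follows the convention described in this PR, and the base64 strings are placeholders rather than real encoder output:

```python
import base64

# Hypothetical payloads: in a real request each string would be a
# base64-encoded embedding tensor, not these placeholder bytes.
def make_embed_part(raw: bytes) -> dict:
    return {
        "type": "image_embeds",
        "image_embeds": base64.b64encode(raw).decode("ascii"),
    }

messages = [{
    "role": "user",
    "content": [
        make_embed_part(b"embedding-0"),
        make_embed_part(b"embedding-1"),  # a second embed in the same request
        {"type": "text", "text": "Describe both images."},
    ],
}]

# Count the embedding parts; previously more than one was rejected.
n_embeds = sum(part["type"] == "image_embeds"
               for part in messages[0]["content"])
print(n_embeds)  # 2
```

The tracker then validates `n_embeds` against `--limit-mm-per-prompt`, the same path regular images go through.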

Test Plan

  • Add unit tests for multiple image embeddings support:
    • test_parse_chat_messages_multiple_image_embeds
    • test_parse_chat_messages_multiple_image_embeds_with_uuids
    • test_parse_chat_messages_multiple_image_embeds_async

Test Result

passed

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully enables support for multiple image and audio embeddings per request by updating the MultiModalItemTracker to handle lists of embeddings. The changes in vllm/entrypoints/chat_utils.py are correct and effectively remove the previous limitation. The new unit tests are also well-designed to cover these new capabilities. However, this change to the data structure for embeddings (from a single item to a list) breaks several existing unit tests that were written with the single-embedding assumption. I've identified these failing tests and provided suggestions to update them. Addressing these test failures is critical for merging this PR.

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from c4e242c to fa53ab7 on December 3, 2025 at 18:49
@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/c4e242c86de0203e5a2a137cce1cacda6e226705/vllm/entrypoints/chat_utils.py#L725-L727
P1 Badge Keep single image_embeds output shape backward compatible

Mapping image_embeds directly to mm_inputs["image"] now returns a list even when only one embedding is provided. Existing callers/tests (e.g., test_parse_chat_messages_empty_image_embeds_with_uuid, lines 829–858) relied on mm_data["image"] being a lone tensor/None for a single embed; after this change they receive [tensor] or [None], breaking those assertions and changing the public return contract despite the commit claiming backward compatibility.
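The contract change this comment describes can be sketched in miniature (function names and the string stand-in for a tensor are illustrative, not vLLM's actual tracker API):

```python
def track_image_embeds_old(embed):
    # Pre-PR behavior: the lone payload is stored as-is.
    return {"image": embed}

def track_image_embeds_new(embeds):
    # Post-PR behavior: payloads are always collected into a list,
    # even when only one was supplied.
    return {"image": list(embeds)}

single = "tensor-0"                     # stands in for a torch.Tensor
old = track_image_embeds_old(single)    # {"image": "tensor-0"}
new = track_image_embeds_new([single])  # {"image": ["tensor-0"]}

# Callers that asserted on a bare tensor now see a one-element list.
print(old["image"] == new["image"])  # False
```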


https://github.com/vllm-project/vllm/blob/c4e242c86de0203e5a2a137cce1cacda6e226705/vllm/entrypoints/chat_utils.py#L730-L732
P1 Badge Audio embeddings wrapped in lists stop being parsed as embeddings

Setting mm_inputs["audio"] = audio_embeds_lst means a single audio embedding now surfaces as a list. Downstream parsing treats embeddings only when it receives a tensor or a list of 2D tensors (MultiModalDataParser.is_embeddings, vllm/multimodal/parse.py:383-390); a list containing the previous 3D tensor (see test_parse_chat_messages_audio_embeds_with_string, lines 893–939) is no longer recognized as an embedding and is processed as raw audio instead, breaking existing tests and single-embed requests.
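The dimensionality rule the comment cites can be restated as a toy predicate (the real check lives in `MultiModalDataParser.is_embeddings`; `SimpleNamespace` objects stand in for tensors here):

```python
from types import SimpleNamespace

def looks_like_embeddings(value) -> bool:
    # Illustrative restatement of the cited rule: accept a bare tensor,
    # or a list whose elements are all 2-D tensors.
    if hasattr(value, "ndim"):
        return True
    if isinstance(value, list):
        return all(hasattr(v, "ndim") and v.ndim == 2 for v in value)
    return False

t3 = SimpleNamespace(ndim=3)  # a single batched (3-D) audio embedding
t2 = SimpleNamespace(ndim=2)

print(looks_like_embeddings(t3))        # True: bare tensor
print(looks_like_embeddings([t2, t2]))  # True: list of 2-D tensors
print(looks_like_embeddings([t3]))      # False: 3-D tensor inside a list
```

The last case is the regression path: wrapping the previous 3-D tensor in a list makes it fall through to raw-audio processing.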

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from fa53ab7 to f4e251f on December 3, 2025 at 22:09
Member

@DarkLight1337 DarkLight1337 left a comment

Thanks, can you update the Multimodal Inputs documentation page accordingly as well?

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from f4e251f to 0306c96 on December 4, 2025 at 05:26
@mergify

mergify bot commented Dec 4, 2025

Documentation preview: https://vllm--29988.org.readthedocs.build/en/29988/

@mergify mergify bot added the documentation label on Dec 4, 2025
@jeremyteboul
Contributor Author

> Thanks, can you update the Multimodal Inputs documentation page accordingly as well?

Here is the doc!

@DarkLight1337
Member

#29970 just got merged, can you update your code to use the tensor2base64 convenience function?
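The `tensor2base64` helper from #29970 is referenced but not shown here; a plausible stdlib-only sketch of what such a conversion does is below. The function names and the raw-float packing scheme are assumptions for illustration; vLLM's actual helper may serialize tensors differently.

```python
import base64
import struct

def tensor2base64_sketch(values: list[float]) -> str:
    # Hypothetical stand-in for a tensor-to-base64 helper: pack the
    # flattened float32 values and base64-encode them for JSON transport.
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")

def base64_to_floats(payload: str) -> list[float]:
    # Inverse direction, as a server-side parser might do.
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

encoded = tensor2base64_sketch([0.5, -1.0, 2.0])
print(base64_to_floats(encoded))  # [0.5, -1.0, 2.0]
```

The round trip is exact here because 0.5, -1.0, and 2.0 are representable in float32.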

@@ -473,6 +473,9 @@ You can pass pre-computed audio embeddings similar to image embeddings:
print(generated_text)
```

!!! note
Member

This section is under offline inference so I think it's not related?

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch 2 times, most recently from ee06b3c to dbc789a on December 4, 2025 at 19:30
Member

@DarkLight1337 DarkLight1337 left a comment

Thanks, LGTM then

@DarkLight1337
Member

I have fixed the DCO for you; next time, please sign off your commits.

@DarkLight1337 DarkLight1337 added the ready label on Dec 5, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) December 5, 2025 05:47
auto-merge was automatically disabled December 6, 2025 20:20

Head branch was pushed to by a user without write access

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from e28f687 to 2285558 on December 6, 2025 at 20:20
This enables the Chat Completions API to leverage the model's existing
capability for multiple embeddings

- Remove limitation that only allowed one message with image_embeds/audio_embeds
- Update MultiModalItemTracker and AsyncMultiModalItemTracker to treat embeddings as lists
- Add unit tests for multiple image embeddings support:
  * test_parse_chat_messages_multiple_image_embeds
  * test_parse_chat_messages_multiple_image_embeds_with_uuids
  * test_parse_chat_messages_multiple_image_embeds_async
- Embeddings now behave consistently with regular images/audios
- Validation via existing validate_num_items() against --limit-mm-per-prompt

Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from b0a69f7 to 174fd06 on December 7, 2025 at 02:20
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) December 7, 2025 03:07
@DarkLight1337 DarkLight1337 merged commit dce6d22 into vllm-project:main Dec 7, 2025
46 of 47 checks passed
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026

Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Labels

documentation (Improvements or additions to documentation) · frontend · ready (ONLY add when PR is ready to merge/full CI is needed)
