Support multiple image/audio embeddings per request #29988

Merged
DarkLight1337 merged 3 commits into vllm-project:main from jeremyteboul:multi_image_enbeddings
Dec 7, 2025

Conversation

@jeremyteboul
Contributor

@jeremyteboul jeremyteboul commented Dec 3, 2025

This enables the Chat Completions API to leverage the model's existing capability for multiple embeddings, previously only accessible through the direct LLM inference API.

  • Remove the limitation that allowed only one message with image_embeds/audio_embeds

  • Update MultiModalItemTracker and AsyncMultiModalItemTracker to treat embeddings as lists

  • Embeddings now behave consistently with regular images and audio

  • Validation via existing validate_num_items() against --limit-mm-per-prompt
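With the limitation removed, a single Chat Completions request can carry several embedding parts. The sketch below builds such a payload; the `image_embeds` content-part shape follows the convention described in this PR, and the base64 strings are placeholders rather than real encoder output:

```python
import base64

# Hypothetical payloads: in a real request each string would be a
# base64-encoded embedding tensor, not these placeholder bytes.
def make_embed_part(raw: bytes) -> dict:
    return {
        "type": "image_embeds",
        "image_embeds": base64.b64encode(raw).decode("ascii"),
    }

messages = [{
    "role": "user",
    "content": [
        make_embed_part(b"embedding-0"),
        make_embed_part(b"embedding-1"),  # a second embed in the same request
        {"type": "text", "text": "Describe both images."},
    ],
}]

# Count the embedding parts; previously more than one was rejected.
n_embeds = sum(part["type"] == "image_embeds"
               for part in messages[0]["content"])
print(n_embeds)  # 2
```

The tracker then validates `n_embeds` against `--limit-mm-per-prompt`, the same path regular images go through.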

Test Plan

  • Add unit tests for multiple image embeddings support:
    • test_parse_chat_messages_multiple_image_embeds
    • test_parse_chat_messages_multiple_image_embeds_with_uuids
    • test_parse_chat_messages_multiple_image_embeds_async

Test Result

passed

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully enables support for multiple image and audio embeddings per request by updating the MultiModalItemTracker to handle lists of embeddings. The changes in vllm/entrypoints/chat_utils.py are correct and effectively remove the previous limitation. The new unit tests are also well-designed to cover these new capabilities. However, this change to the data structure for embeddings (from a single item to a list) breaks several existing unit tests that were written with the single-embedding assumption. I've identified these failing tests and provided suggestions to update them. Addressing these test failures is critical for merging this PR.

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from c4e242c to fa53ab7 on December 3, 2025 at 18:49
@chatgpt-codex-connector

💡 Codex Review

https://github.com/vllm-project/vllm/blob/c4e242c86de0203e5a2a137cce1cacda6e226705/vllm/entrypoints/chat_utils.py#L725-L727
P1 Badge Keep single image_embeds output shape backward compatible

Mapping image_embeds directly to mm_inputs["image"] now returns a list even when only one embedding is provided. Existing callers/tests (e.g., test_parse_chat_messages_empty_image_embeds_with_uuid, lines 829–858) relied on mm_data["image"] being a lone tensor/None for a single embed; after this change they receive [tensor] or [None], breaking those assertions and changing the public return contract despite the commit claiming backward compatibility.
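The contract change this comment describes can be sketched in miniature (function names and the string stand-in for a tensor are illustrative, not vLLM's actual tracker API):

```python
def track_image_embeds_old(embed):
    # Pre-PR behavior: the lone payload is stored as-is.
    return {"image": embed}

def track_image_embeds_new(embeds):
    # Post-PR behavior: payloads are always collected into a list,
    # even when only one was supplied.
    return {"image": list(embeds)}

single = "tensor-0"                     # stands in for a torch.Tensor
old = track_image_embeds_old(single)    # {"image": "tensor-0"}
new = track_image_embeds_new([single])  # {"image": ["tensor-0"]}

# Callers that asserted on a bare tensor now see a one-element list.
print(old["image"] == new["image"])  # False
```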


https://github.com/vllm-project/vllm/blob/c4e242c86de0203e5a2a137cce1cacda6e226705/vllm/entrypoints/chat_utils.py#L730-L732
P1 Badge Audio embeddings wrapped in lists stop being parsed as embeddings

Setting mm_inputs["audio"] = audio_embeds_lst means a single audio embedding now surfaces as a list. Downstream parsing treats embeddings only when it receives a tensor or a list of 2D tensors (MultiModalDataParser.is_embeddings, vllm/multimodal/parse.py:383-390); a list containing the previous 3D tensor (see test_parse_chat_messages_audio_embeds_with_string, lines 893–939) is no longer recognized as an embedding and is processed as raw audio instead, breaking existing tests and single-embed requests.
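The dimensionality rule the comment cites can be restated as a toy predicate (the real check lives in `MultiModalDataParser.is_embeddings`; `SimpleNamespace` objects stand in for tensors here):

```python
from types import SimpleNamespace

def looks_like_embeddings(value) -> bool:
    # Illustrative restatement of the cited rule: accept a bare tensor,
    # or a list whose elements are all 2-D tensors.
    if hasattr(value, "ndim"):
        return True
    if isinstance(value, list):
        return all(hasattr(v, "ndim") and v.ndim == 2 for v in value)
    return False

t3 = SimpleNamespace(ndim=3)  # a single batched (3-D) audio embedding
t2 = SimpleNamespace(ndim=2)

print(looks_like_embeddings(t3))        # True: bare tensor
print(looks_like_embeddings([t2, t2]))  # True: list of 2-D tensors
print(looks_like_embeddings([t3]))      # False: 3-D tensor inside a list
```

The last case is the regression path: wrapping the previous 3-D tensor in a list makes it fall through to raw-audio processing.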

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from fa53ab7 to f4e251f on December 3, 2025 at 22:09
Member

@DarkLight1337 DarkLight1337 left a comment

Thanks, can you update the Multimodal Inputs documentation page accordingly as well?

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from f4e251f to 0306c96 on December 4, 2025 at 05:26
@mergify

mergify bot commented Dec 4, 2025

Documentation preview: https://vllm--29988.org.readthedocs.build/en/29988/

@mergify mergify bot added the documentation label on Dec 4, 2025
@jeremyteboul
Contributor Author

> Thanks, can you update the Multimodal Inputs documentation page accordingly as well?

Here is the doc!

@DarkLight1337
Member

#29970 just got merged, can you update your code to use the tensor2base64 convenience function?
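The `tensor2base64` helper from #29970 is referenced but not shown here; a plausible stdlib-only sketch of what such a conversion does is below. The function names and the raw-float packing scheme are assumptions for illustration; vLLM's actual helper may serialize tensors differently.

```python
import base64
import struct

def tensor2base64_sketch(values: list[float]) -> str:
    # Hypothetical stand-in for a tensor-to-base64 helper: pack the
    # flattened float32 values and base64-encode them for JSON transport.
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")

def base64_to_floats(payload: str) -> list[float]:
    # Inverse direction, as a server-side parser might do.
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

encoded = tensor2base64_sketch([0.5, -1.0, 2.0])
print(base64_to_floats(encoded))  # [0.5, -1.0, 2.0]
```

The round trip is exact here because 0.5, -1.0, and 2.0 are representable in float32.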

@@ -473,6 +473,9 @@ You can pass pre-computed audio embeddings similar to image embeddings:
print(generated_text)
```

!!! note
Member

This section is under offline inference so I think it's not related?

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch 2 times, most recently from ee06b3c to dbc789a on December 4, 2025 at 19:30
Member

@DarkLight1337 DarkLight1337 left a comment

Thanks, LGTM then

@DarkLight1337
Member

I have fixed the DCO for you; next time, please sign off your commits.

@DarkLight1337 DarkLight1337 added the ready label on Dec 5, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) December 5, 2025 05:47
auto-merge was automatically disabled December 6, 2025 20:20

Head branch was pushed to by a user without write access

@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from e28f687 to 2285558 on December 6, 2025 at 20:20
This enables the Chat Completions API to leverage the model's existing
capability for multiple embeddings

- Remove limitation that only allowed one message with image_embeds/audio_embeds
- Update MultiModalItemTracker and AsyncMultiModalItemTracker to treat embeddings as lists
- Add unit tests for multiple image embeddings support:
  * test_parse_chat_messages_multiple_image_embeds
  * test_parse_chat_messages_multiple_image_embeds_with_uuids
  * test_parse_chat_messages_multiple_image_embeds_async
- Embeddings now behave consistently with regular images/audios
- Validation via existing validate_num_items() against --limit-mm-per-prompt

Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
@jeremyteboul jeremyteboul force-pushed the multi_image_enbeddings branch from b0a69f7 to 174fd06 on December 7, 2025 at 02:20
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) December 7, 2025 03:07
@DarkLight1337 DarkLight1337 merged commit dce6d22 into vllm-project:main Dec 7, 2025
46 of 47 checks passed
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026

Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Labels

documentation (Improvements or additions to documentation) · frontend · ready (ONLY add when PR is ready to merge/full CI is needed)
