[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec #23779

Merged
DarkLight1337 merged 9 commits into vllm-project:main from sfeng33:renderer
Aug 29, 2025

sfeng33 (Contributor) commented Aug 27, 2025

Purpose

This PR refactors multimodal input handling in the V1 engine to use a unified MultiModalFeatureSpec data structure, as part of the broader effort to abstract out vLLM's input processing pipeline (#22880).
Partially fixes #23872.

Why This Change:

This refactor addresses the fragmented input processing issue where multimodal data (images, audio, video) was passed through multiple separate fields (mm_kwargs, mm_hashes, mm_placeholders).

Changes

  • Introduced MultiModalFeatureSpec: a unified dataclass that encapsulates all multimodal-related data (data, modality, identifier, position) in a single structure.
  • Simplified EngineCoreRequest: replaced the three separate multimodal fields with a single mm_features field and updated its references.
    Note: To keep the PR small, only the core engine and processor references are updated. TODO comments mark the scheduler and model runner for migration in follow-up PRs.
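
For readers skimming the description, here is a rough sketch of what the consolidation looks like. It is not the code in this PR: the field names come from the list above, but the field types, the PlaceholderRange stand-in, the consolidate helper, and the separate modalities argument are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PlaceholderRange:
    """Hypothetical stand-in for where one item sits in the prompt."""
    offset: int  # index of the first placeholder token
    length: int  # number of placeholder tokens


@dataclass
class MultiModalFeatureSpec:
    """Illustrative shape only; the field types are assumptions."""
    data: Optional[dict[str, Any]]  # processed inputs for one item
    modality: str                   # e.g. "image", "audio", "video"
    identifier: str                 # hash or other unique id for the item
    position: PlaceholderRange      # placeholder span in the token sequence


def consolidate(modalities, mm_kwargs, mm_hashes, mm_placeholders):
    """Hypothetical helper: zip the old parallel fields into one list."""
    sizes = {len(modalities), len(mm_kwargs), len(mm_hashes), len(mm_placeholders)}
    if len(sizes) != 1:
        raise ValueError("parallel multimodal fields must have equal length")
    return [
        MultiModalFeatureSpec(data=kw, modality=m, identifier=h, position=p)
        for m, kw, h, p in zip(modalities, mm_kwargs, mm_hashes, mm_placeholders)
    ]


# One image item, described by a single object instead of three parallel fields.
mm_features = consolidate(
    modalities=["image"],
    mm_kwargs=[{"pixel_values": [[0.0]]}],
    mm_hashes=["abc123"],
    mm_placeholders=[PlaceholderRange(offset=5, length=576)],
)
```

Keeping everything about one item in one object means a consumer cannot accidentally index the hashes and placeholders with mismatched offsets, which is the fragmentation problem described under "Why This Change".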

Test Plan

python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf 

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What do you see in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 100,
    "temperature": 0
  }'
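
The same request can also be sent from Python, which may be handier when re-running the check in a script. This is a minimal sketch using only the standard library; the helper names (build_payload, send_chat_request) and the timeout value are assumptions, and the server is assumed to be running locally as started above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"
MODEL = "llava-hf/llava-1.5-7b-hf"
IMAGE_URL = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/transformers/tasks/car.jpg"
)


def build_payload(model=MODEL, image_url=IMAGE_URL):
    """Build the same JSON body as the curl command above."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What do you see in this image?"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 100,
        "temperature": 0,
    }


def send_chat_request(payload, base_url=BASE_URL, timeout=120.0):
    """POST the payload to the chat completions endpoint and parse the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server up, `send_chat_request(build_payload())` should return an OpenAI-style chat completion object; the generated text is expected under `choices[0].message.content`.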

Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
mergify bot added the multi-modality and v1 labels Aug 27, 2025
Signed-off-by: sfeng33 <4florafeng@gmail.com>
sfeng33 changed the title from "[Multimodal] Consolidate mm inputs in MultiModalFeatureSpec in engines" to "[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec" Aug 27, 2025
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
sfeng33 (Contributor, Author) commented Aug 28, 2025

PTAL @ywang96 @DarkLight1337

DarkLight1337 (Member) left a comment

Sounds good, thanks for the cleanup!

DarkLight1337 enabled auto-merge (squash) August 29, 2025 03:23
sfeng33 (Contributor, Author) commented Aug 29, 2025

Thanks for the review!

github-actions bot added the ready label Aug 29, 2025
DarkLight1337 (Member) commented

The test fails, PTAL

Signed-off-by: sfeng33 <4florafeng@gmail.com>
auto-merge was automatically disabled August 29, 2025 07:05

Head branch was pushed to by a user without write access

Signed-off-by: sfeng33 <4florafeng@gmail.com>
sfeng33 (Contributor, Author) commented Aug 29, 2025

> The test fails, PTAL

Should be fixed now, will monitor the CI.

Signed-off-by: sfeng33 <4florafeng@gmail.com>
DarkLight1337 merged commit 69f4635 into vllm-project:main Aug 29, 2025
37 of 38 checks passed
sfeng33 deleted the renderer branch September 1, 2025 02:30
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Labels

multi-modality, ready, v1

Development

Successfully merging this pull request may close these issues.

[Renderer]: Consolidate MM classes to MultiModalFeatureSpec

2 participants