
[Core][MM] Add mechanism to configure multimodal fields which should stay on CPU#28168

Merged
DarkLight1337 merged 4 commits into vllm-project:main from lgeiger:mm-cpu-fields
Nov 7, 2025

Conversation


@lgeiger lgeiger commented Nov 5, 2025

Purpose

This is a follow-up to #28141 and adds a multimodal_cpu_fields attribute to the SupportsMultiModal protocol. This allows model implementers to define multimodal fields which shouldn't be transferred to the GPU.
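As a rough sketch of the mechanism (class and helper names here are illustrative, not vLLM's actual implementation): a per-model set of field names is consulted by the step that moves multimodal kwargs to the device, and any field in the set is left on the host.

```python
from typing import ClassVar

import torch


class SupportsMultiModal:
    # Sketch of the protocol attribute described above; defaults to empty,
    # so existing models keep transferring every field (backwards compatible).
    multimodal_cpu_fields: ClassVar[frozenset] = frozenset()


class QwenVLSketch(SupportsMultiModal):
    # Grid metadata is only ever read on the host, so skip the GPU transfer.
    multimodal_cpu_fields = frozenset({"image_grid_thw", "video_grid_thw"})


def move_mm_kwargs(model_cls: type, mm_kwargs: dict, device: torch.device) -> dict:
    """Move multimodal kwargs to `device`, leaving CPU-only fields alone."""
    return {
        key: val if key in model_cls.multimodal_cpu_fields else val.to(device)
        for key, val in mm_kwargs.items()
    }
```

In the actual PR the check lives inside group_mm_kwargs_by_modality; the sketch only shows the shape of the lookup.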

I updated the Qwen VL model definitions to mark image_grid_thw and video_grid_thw as CPU-only fields. Previously these fields were transferred to the GPU by group_mm_kwargs_by_modality and then copied back to the CPU in the model forward pass via .tolist(). The latter causes a cudaStreamSynchronize which can hurt performance. Keeping them on the CPU also allows cleaning up the model code a bit and removes the need to convert the lists back to tensors.
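To illustrate the synchronization described above (shapes below are hypothetical): calling .tolist() on a GPU tensor forces a device-to-host copy and a blocking stream synchronization, whereas on a CPU tensor it is just a cheap local read.

```python
import torch


def grid_thw_to_list(grid_thw: torch.Tensor) -> list:
    # If grid_thw lived on the GPU, .tolist() would first copy it back to
    # the host, stalling the CUDA stream with a cudaStreamSynchronize.
    # On a CPU tensor the same call reads local memory and never syncs.
    return grid_thw.tolist()


# Hypothetical grid metadata: one image with a 1 x 4 x 6 (t, h, w) patch grid.
image_grid_thw = torch.tensor([[1, 4, 6]])  # stays on CPU: no sync needed
print(grid_thw_to_list(image_grid_thw))  # [[1, 4, 6]]
```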

I'm not sure whether SupportsMultiModal is the best place to define this, but it didn't require large code changes and is still backwards compatible, so I think it's sensible to keep it close to merge_by_field_config.

Test Plan

vllm serve Qwen/Qwen3-VL-2B-Instruct-FP8 --limit-mm-per-prompt.video 0

vllm bench serve --backend openai-chat --model Qwen/Qwen3-VL-2B-Instruct-FP8 --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000

Test Result

I generated a torch profile and verified that no cudaStreamSynchronize ops remain.

This also seems to slightly improve throughput by allowing more computation to be overlapped. Measured on a single L40S GPU, though the benefit might be larger on faster GPUs.

Before:

============ Serving Benchmark Result ============
Successful requests:                     998
Failed requests:                         2
Benchmark duration (s):                  49.89
Total input tokens:                      94306
Total generated tokens:                  121122
Request throughput (req/s):              20.00
Output token throughput (tok/s):         2427.71
Peak output token throughput (tok/s):    14299.00
Peak concurrent requests:                998.00
Total Token throughput (tok/s):          4317.94
---------------Time to First Token----------------
Mean TTFT (ms):                          22076.45
Median TTFT (ms):                        20806.25
P99 TTFT (ms):                           45564.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          84.44
Median TPOT (ms):                        84.52
P99 TPOT (ms):                           126.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           84.63
Median ITL (ms):                         73.15
P99 ITL (ms):                            514.58
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     998
Failed requests:                         2
Benchmark duration (s):                  49.28
Total input tokens:                      94302
Total generated tokens:                  121141
Request throughput (req/s):              20.25
Output token throughput (tok/s):         2458.19
Peak output token throughput (tok/s):    15255.00
Peak concurrent requests:                998.00
Total Token throughput (tok/s):          4371.77
---------------Time to First Token----------------
Mean TTFT (ms):                          21838.22
Median TTFT (ms):                        20195.33
P99 TTFT (ms):                           44981.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.06
Median TPOT (ms):                        84.76
P99 TPOT (ms):                           128.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.61
Median ITL (ms):                         73.28
P99 ITL (ms):                            511.00
==================================================

@mergify mergify bot added multi-modality Related to multi-modality (#4194) qwen Related to Qwen models v1 tpu Related to Google TPUs labels Nov 5, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization by adding a multimodal_cpu_fields mechanism to prevent unnecessary GPU transfers of certain multimodal data. This is implemented cleanly by extending the SupportsMultiModal protocol and updating group_mm_kwargs_by_modality. The Qwen VL models are updated to use this feature, which simplifies the code and avoids performance-hurting CUDA synchronizations. However, there is a critical regression in the data-parallel path for qwen2_5_vl that needs to be addressed.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-reasoned optimization to keep certain multimodal fields on the CPU, avoiding unnecessary GPU transfers and potential performance bottlenecks from CUDA synchronization. The changes are consistently applied across the relevant Qwen models and supporting utilities. The implementation is clean and the performance improvement, though modest in the provided benchmark, is a welcome enhancement. I've found one critical issue in qwen2_vl.py that needs to be addressed, but otherwise, the changes look solid.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to keep specified multimodal fields on the CPU, aiming to prevent unnecessary GPU-CPU data transfers and CUDA stream synchronizations for performance improvement. The changes are well-motivated and have been applied to the Qwen VL models. The implementation is largely correct, but I've identified a critical issue in qwen2_vl.py where a tensor is computed on the CPU but not moved to the GPU before being used in an operation with GPU tensors, which would cause a runtime error. I have provided a code suggestion to fix this. Apart from this issue, the changes are solid and align with the goal of enhancing performance and code clarity.

@DarkLight1337 (Member)

I am ok with putting this in the model definition.

stay on CPU

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

mergify bot commented Nov 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lgeiger.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 6, 2025 23:48
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 6, 2025
@DarkLight1337 DarkLight1337 merged commit e0919f3 into vllm-project:main Nov 7, 2025
55 checks passed
@lgeiger lgeiger deleted the mm-cpu-fields branch November 7, 2025 12:26
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
…stay on CPU (vllm-project#28168)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
lgeiger added a commit to lgeiger/vllm that referenced this pull request Nov 19, 2025
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…stay on CPU (vllm-project#28168)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 15, 2025
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
1. fix vllm-project/vllm#27938
2. fix vllm-project/vllm#27145
pooling models now support chunked prefill and prefix caching,
3. fix vllm-project/vllm#30181
define the CPU fields in the field config where they really belong.
4. fix vllm-project/vllm#28168
define the CPU fields in the field config where they really belong.
5. fix vllm-project/vllm#30201
some module renames
6. fix vllm-project/vllm#29067
fusedmoe module refactor
7. fix vllm-project/vllm#29066
fusedmoe module refactor
8. fix vllm-project/vllm#29624
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026