
[Core][MM] Add mechanism to configure multimodal fields which should stay on CPU#28168

Merged
DarkLight1337 merged 4 commits into vllm-project:main from lgeiger:mm-cpu-fields
Nov 7, 2025

Conversation


@lgeiger lgeiger commented Nov 5, 2025

Purpose

This is a follow-up to #28141 and adds a multimodal_cpu_fields attribute to the SupportsMultiModal protocol. This allows model implementers to define multimodal fields which shouldn't be transferred to the GPU.
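As a rough sketch of the mechanism (class and helper names here are illustrative, not vLLM's actual implementation): a per-model set of field names is consulted by the step that moves multimodal kwargs to the device, and any field in the set is left on the host.

```python
from typing import ClassVar

import torch


class SupportsMultiModal:
    # Sketch of the protocol attribute described above; defaults to empty,
    # so existing models keep transferring every field (backwards compatible).
    multimodal_cpu_fields: ClassVar[frozenset] = frozenset()


class QwenVLSketch(SupportsMultiModal):
    # Grid metadata is only ever read on the host, so skip the GPU transfer.
    multimodal_cpu_fields = frozenset({"image_grid_thw", "video_grid_thw"})


def move_mm_kwargs(model_cls: type, mm_kwargs: dict, device: torch.device) -> dict:
    """Move multimodal kwargs to `device`, leaving CPU-only fields alone."""
    return {
        key: val if key in model_cls.multimodal_cpu_fields else val.to(device)
        for key, val in mm_kwargs.items()
    }
```

In the actual PR the check lives inside group_mm_kwargs_by_modality; the sketch only shows the shape of the lookup.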

I updated the Qwen VL model definitions to mark image_grid_thw and video_grid_thw as CPU-only fields. Previously these fields were transferred to the GPU by group_mm_kwargs_by_modality and then copied back to the CPU in the model forward pass via .tolist(). The latter causes a cudaStreamSynchronize which can hurt performance. Keeping them on the CPU also allows cleaning up the model code a bit and removes the need to convert the lists back to tensors.
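To illustrate the synchronization described above (shapes below are hypothetical): calling .tolist() on a GPU tensor forces a device-to-host copy and a blocking stream synchronization, whereas on a CPU tensor it is just a cheap local read.

```python
import torch


def grid_thw_to_list(grid_thw: torch.Tensor) -> list:
    # If grid_thw lived on the GPU, .tolist() would first copy it back to
    # the host, stalling the CUDA stream with a cudaStreamSynchronize.
    # On a CPU tensor the same call reads local memory and never syncs.
    return grid_thw.tolist()


# Hypothetical grid metadata: one image with a 1 x 4 x 6 (t, h, w) patch grid.
image_grid_thw = torch.tensor([[1, 4, 6]])  # stays on CPU: no sync needed
print(grid_thw_to_list(image_grid_thw))  # [[1, 4, 6]]
```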

I'm not sure whether SupportsMultiModal is the best place to define this, but it didn't require large code changes and is still backwards compatible, so I think it's sensible to keep it close to merge_by_field_config.

Test Plan

vllm serve Qwen/Qwen3-VL-2B-Instruct-FP8 --limit-mm-per-prompt.video 0

vllm bench serve --backend openai-chat --model Qwen/Qwen3-VL-2B-Instruct-FP8 --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000

Test Result

I generated a torch profile and verified that no cudaStreamSynchronize ops remain.

This also seems to slightly improve throughput by allowing more computation to be overlapped. Measured on a single L40S GPU, though the benefit might be larger on faster GPUs.

Before:

============ Serving Benchmark Result ============
Successful requests:                     998
Failed requests:                         2
Benchmark duration (s):                  49.89
Total input tokens:                      94306
Total generated tokens:                  121122
Request throughput (req/s):              20.00
Output token throughput (tok/s):         2427.71
Peak output token throughput (tok/s):    14299.00
Peak concurrent requests:                998.00
Total Token throughput (tok/s):          4317.94
---------------Time to First Token----------------
Mean TTFT (ms):                          22076.45
Median TTFT (ms):                        20806.25
P99 TTFT (ms):                           45564.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          84.44
Median TPOT (ms):                        84.52
P99 TPOT (ms):                           126.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           84.63
Median ITL (ms):                         73.15
P99 ITL (ms):                            514.58
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     998
Failed requests:                         2
Benchmark duration (s):                  49.28
Total input tokens:                      94302
Total generated tokens:                  121141
Request throughput (req/s):              20.25
Output token throughput (tok/s):         2458.19
Peak output token throughput (tok/s):    15255.00
Peak concurrent requests:                998.00
Total Token throughput (tok/s):          4371.77
---------------Time to First Token----------------
Mean TTFT (ms):                          21838.22
Median TTFT (ms):                        20195.33
P99 TTFT (ms):                           44981.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.06
Median TPOT (ms):                        84.76
P99 TPOT (ms):                           128.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.61
Median ITL (ms):                         73.28
P99 ITL (ms):                            511.00
==================================================

@mergify mergify bot added multi-modality Related to multi-modality (#4194) qwen Related to Qwen models v1 tpu Related to Google TPUs labels Nov 5, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization by adding a multimodal_cpu_fields mechanism to prevent unnecessary GPU transfers of certain multimodal data. This is implemented cleanly by extending the SupportsMultiModal protocol and updating group_mm_kwargs_by_modality. The Qwen VL models are updated to use this feature, which simplifies the code and avoids performance-hurting CUDA synchronizations. However, there is a critical regression in the data-parallel path for qwen2_5_vl that needs to be addressed.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-reasoned optimization to keep certain multimodal fields on the CPU, avoiding unnecessary GPU transfers and potential performance bottlenecks from CUDA synchronization. The changes are consistently applied across the relevant Qwen models and supporting utilities. The implementation is clean and the performance improvement, though modest in the provided benchmark, is a welcome enhancement. I've found one critical issue in qwen2_vl.py that needs to be addressed, but otherwise, the changes look solid.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to keep specified multimodal fields on the CPU, aiming to prevent unnecessary GPU-CPU data transfers and CUDA stream synchronizations for performance improvement. The changes are well-motivated and have been applied to the Qwen VL models. The implementation is largely correct, but I've identified a critical issue in qwen2_vl.py where a tensor is computed on the CPU but not moved to the GPU before being used in an operation with GPU tensors, which would cause a runtime error. I have provided a code suggestion to fix this. Apart from this issue, the changes are solid and align with the goal of enhancing performance and code clarity.

@DarkLight1337 (Member)

I am ok with putting this in the model definition.

stay on CPU

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

mergify bot commented Nov 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lgeiger.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 6, 2025 23:48
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 6, 2025
@DarkLight1337 DarkLight1337 merged commit e0919f3 into vllm-project:main Nov 7, 2025
55 checks passed
@lgeiger lgeiger deleted the mm-cpu-fields branch November 7, 2025 12:26
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
…stay on CPU (vllm-project#28168)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
lgeiger added a commit to lgeiger/vllm that referenced this pull request Nov 19, 2025
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…stay on CPU (vllm-project#28168)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 15, 2025
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?
1. fix vllm-project/vllm#27938
2. fix vllm-project/vllm#27145
pooling models now support chunked prefill and prefix caching,
3. fix vllm-project/vllm#30181
define the CPU fields in the field config where they really belong.
4. fix vllm-project/vllm#28168
define the CPU fields in the field config where they really belong.
5. fix vllm-project/vllm#30201
some module renames
6. fix vllm-project/vllm#29067
fusedmoe module refactor
7. fix vllm-project/vllm#29066
fusedmoe module refactor
8. fix vllm-project/vllm#29624
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026