[Core][MM] Add mechanism to configure multimodal fields which should stay on CPU #28168

DarkLight1337 merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a valuable optimization by adding a multimodal_cpu_fields mechanism to prevent unnecessary GPU transfers of certain multimodal data. This is implemented cleanly by extending the SupportsMultiModal protocol and updating group_mm_kwargs_by_modality. The Qwen VL models are updated to use this feature, which simplifies the code and avoids performance-hurting CUDA synchronizations. However, there is a critical regression in the data-parallel path for qwen2_5_vl that needs to be addressed.
Code Review
This pull request introduces a well-reasoned optimization to keep certain multimodal fields on the CPU, avoiding unnecessary GPU transfers and potential performance bottlenecks from CUDA synchronization. The changes are consistently applied across the relevant Qwen models and supporting utilities. The implementation is clean and the performance improvement, though modest in the provided benchmark, is a welcome enhancement. I've found one critical issue in qwen2_vl.py that needs to be addressed, but otherwise, the changes look solid.
Code Review
This pull request introduces a mechanism to keep specified multimodal fields on the CPU, aiming to prevent unnecessary GPU-CPU data transfers and CUDA stream synchronizations for performance improvement. The changes are well-motivated and have been applied to the Qwen VL models. The implementation is largely correct, but I've identified a critical issue in qwen2_vl.py where a tensor is computed on the CPU but not moved to the GPU before being used in an operation with GPU tensors, which would cause a runtime error. I have provided a code suggestion to fix this. Apart from this issue, the changes are solid and align with the goal of enhancing performance and code clarity.
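To make the mechanism the reviews are describing concrete, here is a standalone toy sketch of the intended semantics (not the actual vLLM implementation; `move_mm_kwargs` and `CPU_FIELDS` are hypothetical names, and PyTorch's `meta` device stands in for CUDA so the snippet runs without a GPU): when moving a dict of multimodal kwargs to the target device, fields declared CPU-only are skipped, so a later `.tolist()` on them never forces a stream synchronization.

```python
import torch

# Hypothetical CPU-only field names, taken from the PR description.
CPU_FIELDS = {"image_grid_thw", "video_grid_thw"}

def move_mm_kwargs(mm_kwargs, device, cpu_fields=CPU_FIELDS):
    """Move multimodal kwargs to `device`, leaving CPU-only fields in place."""
    out = {}
    for name, value in mm_kwargs.items():
        if isinstance(value, torch.Tensor) and name not in cpu_fields:
            out[name] = value.to(device)
        else:
            # Stays on CPU: calling .tolist() on it later is free,
            # whereas on a GPU tensor it would trigger a cudaStreamSync.
            out[name] = value
    return out

mm = {
    "pixel_values": torch.zeros(2, 3),
    "image_grid_thw": torch.tensor([[1, 2, 2]]),
}
moved = move_mm_kwargs(mm, "meta")  # "meta" stands in for a GPU device
```

After the call, `moved["pixel_values"]` lives on the target device while `moved["image_grid_thw"]` remains a CPU tensor. The flip side, flagged in the reviews above, is that any op mixing a CPU-only field with GPU tensors now needs an explicit `.to(device)`.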
I am ok with putting this in the model definition.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 0d7ef23 to dd48342
…stay on CPU (vllm-project#28168)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
### What this PR does / why we need it?

1. fix vllm-project/vllm#27938
2. fix vllm-project/vllm#27145: pooling models now support chunked prefill and prefix caching
3. fix vllm-project/vllm#30181: define the CPU fields in the field config where they really belong
4. fix vllm-project/vllm#28168: define the CPU fields in the field config where they really belong
5. fix vllm-project/vllm#30201: some module renames
6. fix vllm-project/vllm#29067: FusedMoE module refactor
7. fix vllm-project/vllm#29066: FusedMoE module refactor
8. fix vllm-project/vllm#29624

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Purpose
This is a follow-up to #28141 and adds `multimodal_cpu_fields` to the `SupportsMultiModal` protocol. This allows model implementers to define multimodal fields which shouldn't be transferred to the GPU.

I updated the Qwen VL model definitions to mark `image_grid_thw` and `video_grid_thw` as CPU-only fields. These fields are transferred to the GPU by `group_mm_kwargs_by_modality` and then copied back to the CPU in the model forward pass via `.tolist()`. The latter causes a `cudaStreamSync`, which can hurt performance. Keeping them on the CPU also allows cleaning up the model code a bit and removes the need to convert the lists back into tensors.

I'm not sure whether `SupportsMultiModal` is the best place to define this, but it didn't require large code changes and is still backwards compatible, so I think it's sensible to keep it close to `merge_by_field_config`.

Test Plan
Test Result
I generated a torch profile and verified that no more
cudaStreamSyncops are present.This also seems to slightly improve throughput by allowing more computation to be overlapped. Measured on a single L40s GPU though the benefit might be larger for faster GPUs.
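For reference, the profiling check described above can be run with `torch.profiler`; the sketch below counts `cudaStreamSynchronize` events over one step. The toy lambda stands in for a real vLLM generation step, which would need a loaded model and GPU to reproduce the PR's measurement.

```python
import torch
from torch.profiler import ProfilerActivity, profile

def count_stream_syncs(step_fn) -> int:
    """Profile one call to step_fn and count cudaStreamSynchronize events."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        step_fn()
    return sum(
        evt.count
        for evt in prof.key_averages()
        if "cudaStreamSynchronize" in evt.key
    )

# Toy CPU workload standing in for a real generation step.
num_syncs = count_stream_syncs(lambda: (torch.ones(8) * 2).sum())
```

With the CPU-only fields in place, running this around a generation step should report zero `cudaStreamSynchronize` events where the `.tolist()` calls previously forced one per batch.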
Before:
After: