[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL #22742
DarkLight1337 merged 22 commits into vllm-project:main
Conversation
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces data parallelism for the Vision Transformer in the Qwen2.5VL model, which results in significant performance improvements as shown by the benchmarks. The implementation includes a load balancing mechanism to distribute images across GPUs. The code is well-structured, but I've identified a potential issue in the load balancing logic that could lead to imbalanced workloads under certain conditions, even though it doesn't affect the current usage in this PR. I've provided a detailed comment with a suggested fix for this.
```python
# Assign minimum samples to each GPU
# (round-robin with smallest samples first)
small_to_large_indices = torch.argsort(sizes, descending=False)

for gpu_id in range(num_gpus):
    samples_assigned = 0
    for idx in small_to_large_indices:
        if idx.item(
        ) not in used_indices and samples_assigned < min_samples_per_gpu:
            gpu_assignments[gpu_id].append(idx.item())
            gpu_loads[gpu_id] += sizes[idx]
            used_indices.add(idx.item())
            samples_assigned += 1

        if samples_assigned >= min_samples_per_gpu:
            break
```
The current implementation for Phase 1 of load balancing does not perform a round-robin assignment as the comment suggests. Instead, it assigns blocks of the smallest samples to each GPU sequentially. This can lead to significant load imbalance if min_samples_per_gpu > 1.
For example, with sizes = [1, 2, 100, 101], num_gpus=2, and min_samples_per_gpu=2, GPU 0 would get samples of size 1 and 2 (total load 3), while GPU 1 would get samples of size 100 and 101 (total load 201).
While the current usage in this PR sets min_samples_per_gpu=0 (making this a non-issue for now), the function's default is 1, and it's a latent bug for other potential uses.
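The imbalance is easy to demonstrate with a small standalone simulation of the two Phase 1 strategies. The helper names `sequential_assign` and `round_robin_assign` below are hypothetical; this is a sketch of the two assignment orders, not the vLLM code itself:

```python
def sequential_assign(sizes, num_gpus, min_samples_per_gpu):
    # Current behavior: each GPU grabs a block of the smallest
    # remaining samples before the next GPU gets any.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    loads = [0] * num_gpus
    used = set()
    for gpu in range(num_gpus):
        assigned = 0
        for i in order:
            if i not in used and assigned < min_samples_per_gpu:
                loads[gpu] += sizes[i]
                used.add(i)
                assigned += 1
            if assigned >= min_samples_per_gpu:
                break
    return loads

def round_robin_assign(sizes, num_gpus, min_samples_per_gpu):
    # Proposed behavior: hand out one sample per GPU per pass,
    # smallest samples first.
    order = iter(sorted(range(len(sizes)), key=lambda i: sizes[i]))
    loads = [0] * num_gpus
    for _ in range(min_samples_per_gpu):
        for gpu in range(num_gpus):
            try:
                loads[gpu] += sizes[next(order)]
            except StopIteration:
                return loads
    return loads

sizes = [1, 2, 100, 101]
print(sequential_assign(sizes, 2, 2))   # [3, 201] -- heavily imbalanced
print(round_robin_assign(sizes, 2, 2))  # [101, 103] -- balanced
```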
Here is a suggested fix that implements a proper round-robin assignment for Phase 1.
```python
# Assign minimum samples to each GPU
# (round-robin with smallest samples first)
if min_samples_per_gpu > 0:
    small_to_large_indices = torch.argsort(sizes, descending=False)
    unassigned_indices_iter = iter(
        idx.item() for idx in small_to_large_indices)
    for _ in range(min_samples_per_gpu):
        for gpu_id in range(num_gpus):
            try:
                # Find the next available sample
                idx = next(unassigned_indices_iter)
                gpu_assignments[gpu_id].append(idx)
                gpu_loads[gpu_id] += sizes[idx]
                used_indices.add(idx)
            except StopIteration:
                # Not enough samples to satisfy
                # min_samples_per_gpu for all GPUs
                break
        else:
            continue
        break
```

```diff
  shard_size = self.output_sizes[loaded_shard_id]

- param[shard_offset:shard_offset + shard_size] = loaded_weight
+ param.data[shard_offset:shard_offset + shard_size] = loaded_weight
```
Is this change necessary?
I am getting this error if it does not assign to `param.data` directly:

```
File "/app/tritonmrope/dp-qwen2vl/vllm/model_executor/layers/linear.py", line 446, in weight_loader
    param[shard_offset:shard_offset + shard_size] = loaded_weight
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
```
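The error is reproducible outside vLLM with plain PyTorch. The snippet below is a minimal sketch of the failure mode, not the vLLM weight loader: `nn.Parameter` is a leaf tensor with `requires_grad=True`, so autograd forbids in-place writes through a view of it, while writing through `.data` bypasses autograd tracking:

```python
import torch

# A leaf tensor with requires_grad=True, as nn.Parameter creates by default
param = torch.nn.Parameter(torch.zeros(4))
loaded = torch.ones(2)

try:
    # Slicing creates a view of the leaf; the in-place write is rejected
    param[0:2] = loaded
except RuntimeError as err:
    print(type(err).__name__)  # RuntimeError

# Writing through .data bypasses autograd tracking, as weight loaders do
param.data[0:2] = loaded
print(param.detach().tolist())  # [1.0, 1.0, 0.0, 0.0]
```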
There was a problem hiding this comment.
Looks like some params are incorrectly created with requires_grad=True, but since the weight_loaders of other Linear layers all slice param.data, I think this change is fine for this PR:

vllm/model_executor/layers/linear.py, lines 353 to 356 at 653124b
CC @wuhuikx
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Head branch was pushed to by a user without write access
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
This PR enables DP for the ViT (the LLM remains in TP).
Load-balancing logic has also been implemented.
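As a rough single-process sketch of the DP idea (the actual implementation is `run_dp_sharded_mrope_vision_model` and uses `torch.distributed`; the `simulate_dp_vit` helper and its round-robin partitioning are illustrative assumptions, not the PR's load-balanced scheme): images are partitioned across DP ranks, each rank encodes only its own shard, and the shard outputs are gathered back into the original order.

```python
from typing import Callable, List

def simulate_dp_vit(images: List[str], world_size: int,
                    encode: Callable[[str], str]) -> List[str]:
    # Partition images round-robin across DP ranks
    shards = [images[r::world_size] for r in range(world_size)]
    # Each "rank" encodes only its local shard
    encoded_shards = [[encode(img) for img in shard] for shard in shards]
    # "All-gather": reinterleave shard outputs into the original order
    out: List[str] = [""] * len(images)
    for r, shard_out in enumerate(encoded_shards):
        out[r::world_size] = shard_out  # rank r held indices r, r+W, ...
    return out

print(simulate_dp_vit(["a", "b", "c", "d", "e"], 2, str.upper))
# ['A', 'B', 'C', 'D', 'E']
```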
Test Plan
Run lm_eval on ChartQA dataset
Evaluate the Performance Gain
Add unit test cases to evaluate the functions get_load_balance_assignment and run_dp_sharded_mrope_vision_model.

Test Result
lm_eval on the ChartQA dataset with model Qwen/Qwen2.5-VL-72B-Instruct
Performance Gain:
Server command:
Client command:
Most of the improvement comes from DP-ing Conv3d.
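A back-of-envelope way to see why, assuming the Conv3d patch embedding is replicated (not sharded) under TP so every rank processes all images, while under DP each rank processes roughly 1/world_size of them. The numbers below are illustrative, not measured figures:

```python
# Illustrative work-per-rank comparison for the patch-embed Conv3d
num_images = 16
conv_flops_per_image = 1.0  # normalized unit of work
world_size = 8

# TP: Conv3d replicated, every rank embeds all images
tp_work_per_rank = num_images * conv_flops_per_image

# DP: each rank embeds only its ~1/world_size share of images
dp_work_per_rank = (num_images / world_size) * conv_flops_per_image

print(tp_work_per_rank / dp_work_per_rank)  # 8.0x less patch-embed work per rank
```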
(Optional) Documentation Update
Trace of Qwen2.5VL-72B-Instruct with 16 concurrency prompts
Before enabling DP (ViT in TP mode)
After enabling DP (ViT in DP mode)
