Signed-off-by: Angazenn <supperccell@163.com>
Code Review
This pull request optimizes data-parallel communication in `_get_forward_metadata_across_dp` by replacing a CPU-based `all_reduce` with an NPU-based `all_gather`. This is a good optimization that moves the collective operation onto the accelerator. The new implementation also appears to correct the logic for determining `enable_dbo` across DP ranks, using an `any` reduction, which seems more appropriate. I have one suggestion to further improve performance by minimizing data transfer between the NPU and the CPU.
```python
                             dtype=torch.int32)
global_forward_metadata = get_dp_group().all_gather(
    local_forward_metadata, dim=0)
maybe_padded_num_tokens = global_forward_metadata[:, 0].cpu().max()
```
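To make the data flow concrete, here is a minimal sketch of what the `all_gather` along `dim=0` produces. The column layout (`num_tokens`, `with_prefill`, `enable_dbo`), the four-rank values, and the use of `torch.cat` as a stand-in for `get_dp_group().all_gather(..., dim=0)` are all hypothetical, chosen so the example runs without a real process group:

```python
import torch

# Hypothetical per-rank metadata rows: [num_tokens, with_prefill, enable_dbo].
# In the real code each DP rank contributes its own local_forward_metadata;
# torch.cat along dim=0 simulates the result of the NPU all_gather.
per_rank = [
    torch.tensor([[96, 1, 0]], dtype=torch.int32),
    torch.tensor([[128, 0, 1]], dtype=torch.int32),
    torch.tensor([[64, 0, 0]], dtype=torch.int32),
    torch.tensor([[112, 1, 0]], dtype=torch.int32),
]
global_forward_metadata = torch.cat(per_rank, dim=0)  # shape: (dp_size, 3)

# Pad target is the max token count across ranks; the flags are OR-reduced.
maybe_padded_num_tokens = global_forward_metadata[:, 0].max().item()
with_prefill = bool(global_forward_metadata[:, 1].any())
enable_dbo = bool(global_forward_metadata[:, 2].any())
```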
For better performance, it's generally recommended to perform reduction operations on the device (NPU) and transfer only the scalar result to the CPU. This avoids synchronizing and copying a larger tensor. You can change this line to perform the `max()` on the NPU before moving the result to the CPU with `.item()`.
```diff
- maybe_padded_num_tokens = global_forward_metadata[:, 0].cpu().max()
+ maybe_padded_num_tokens = global_forward_metadata[:, 0].max().item()
```
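As a quick illustration of why the order matters: reducing first leaves a single scalar to cross the device-to-host boundary, while `.cpu().max()` copies the whole column to the host before reducing. The tensor below is a hypothetical stand-in kept on CPU so the snippet runs anywhere; on an NPU the same `.max().item()` chain applies:

```python
import torch

# Hypothetical gathered metadata; on the NPU this would live in device memory.
global_forward_metadata = torch.tensor([[96, 1], [128, 0], [64, 0]],
                                       dtype=torch.int32)

# Suggested pattern: reduce on-device, then copy one scalar to the host.
maybe_padded_num_tokens = global_forward_metadata[:, 0].max().item()

# Original pattern: copies the entire column to the CPU before reducing.
same_result = global_forward_metadata[:, 0].cpu().max().item()
```

Both expressions yield the same value; only the amount of data transferred differs.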
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2492      +/-   ##
==========================================
+ Coverage   77.70%   78.49%   +0.78%
==========================================
  Files         132      132
  Lines       17521    17806     +285
==========================================
+ Hits        13615    13976     +361
+ Misses       3906     3830      -76
==========================================
```
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?