[Feature] Optimize forward metadata collection across dp ranks #1593
Conversation
@NeverRaR PTAL |
lgtm |
Pull Request Overview
This PR optimizes how forward-pass metadata is collected and communicated across data-parallel ranks by removing the previous all-reduce and introducing an HCCL-based all-gather approach.
- Enforce that dummy batch execution only runs under data parallelism and refactor `execute_dummy_batch` to use per-rank metadata.
- Replace `dist.all_reduce` with HCCL `all_gather` in `_get_forward_metadata_across_dp` and update callers to handle the Tensor of per-rank token counts.
- Propagate `num_tokens_across_dp` through dummy runs and forward contexts, masking sentinel values before the pass.
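The all-gather replacement described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact code: the function name, the `dp_group` parameter, and the use of the generic `torch.distributed` API (the PR runs over HCCL on Ascend devices) are all assumptions made so the idea is runnable anywhere.

```python
import torch
import torch.distributed as dist


def get_forward_metadata_across_dp(num_tokens: int, dp_size: int,
                                   dp_group=None) -> torch.Tensor:
    """Gather every dp rank's token count into one tensor of shape (dp_size,).

    Sketch only: the PR uses an HCCL all_gather; the generic
    torch.distributed.all_gather stands in for it here.
    """
    local = torch.tensor([num_tokens], dtype=torch.int32)
    buffers = [torch.empty_like(local) for _ in range(dp_size)]
    dist.all_gather(buffers, local, group=dp_group)  # one tensor per rank
    return torch.cat(buffers)
```

Each rank then sees the full vector of per-rank token counts, instead of only the reduced sum that `all_reduce` produced.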
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_ascend/worker/worker_v1.py | Added assertion for dp_size > 1, refactored dummy-run logic to use HCCL per-rank metadata. |
| vllm_ascend/worker/model_runner_v1.py | Swapped all_reduce for all_gather under get_dp_group(), changed method signature and updated callers to handle a Tensor of metadata. |
Comments suppressed due to low confidence (1)
vllm_ascend/worker/model_runner_v1.py:622
- Add unit or integration tests covering the `dp_size > 1` aggregation path to verify that `all_gather` produces the correct combined metadata and that the `masked_fill_` logic correctly replaces sentinel values.
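A test along the lines Copilot suggests could look like this minimal sketch; the sentinel value (`-1`) and the fill value (`1`) are assumptions about the masking convention, not taken from the PR.

```python
import torch

SENTINEL = -1  # assumed marker for "rank reported no real tokens"


def mask_sentinels(num_tokens_across_dp: torch.Tensor) -> torch.Tensor:
    """Replace sentinel entries so every rank runs at least one token.

    Mirrors the masked_fill_ logic under test; done out-of-place here so
    the gathered tensor is left untouched.
    """
    return num_tokens_across_dp.masked_fill(num_tokens_across_dp == SENTINEL, 1)


def test_mask_sentinels():
    gathered = torch.tensor([4, SENTINEL, 2, SENTINEL], dtype=torch.int32)
    assert mask_sentinels(gathered).tolist() == [4, 1, 2, 1]
```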
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
##             main    #1593       +/-   ##
===========================================
+ Coverage   27.39%   54.93%   +27.53%
===========================================
  Files          56       80       +24
  Lines        6191     9712     +3521
===========================================
+ Hits         1696     5335     +3639
+ Misses       4495     4377      -118
View full report in Codecov by Sentry.
@Yikun @wangxiyuan @ApsarasX @ganyi1996ppo ready to merge. |
Reviewed lines:
    max_num_tokens)
    runner._dummy_run(max_num_tokens,
else:
    num_tokens = 1
If graph mode is off, a dummy run only needs to be executed; computational requirements are not a factor.
Force-pushed dbaad95 to 7726146.
@Angazenn PTAL |
@wangxiyuan @ganyi1996ppo @Yikun ready to merge. |
torchair has made the code more and more complex and hard to maintain. I have a PR (#1661) to add a torchair module; all torchair-related code can be updated and changed there, and I'll make that PR available for review soon. Before that, I really don't want to merge any torchair change, because it's very hard to review (I'm not sure whether the change broke anything else). Sorry. |
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: Angazenn <92204292+Angazenn@users.noreply.github.com>
This was submitted from the wrong branch; please feel free to open a new PR. |
What this PR does / why we need it?
This PR introduces two optimizations for cases where data parallel size > 1:
- Replace the `all_reduce`-based forward metadata collection across dp ranks with an HCCL `all_gather`.
- Pass the gathered per-rank token counts (`num_tokens_across_dp`) into `set_forward_context`.
Does this PR introduce any user-facing change?
no
How was this patch tested?
CI passed.