fix gllo allreduce to hccl by zhaowei1936 · Pull Request #2102 · vllm-project/vllm-ascend

zhaowei1936 · 2025-07-30T03:21:51Z

What this PR does / why we need it?

Uses HCCL instead of Gloo for the has_unfinished_dp communication across data-parallel (DP) ranks, which significantly improves performance at large DP sizes.
Achieves a latency reduction of about 10 ms on A3 with a DP size of 128.
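
For readers unfamiliar with this exchange, here is a minimal sketch of what has_unfinished_dp is understood to compute. It assumes each DP rank contributes one integer flag and the flags are combined with a MAX all-reduce. The collective is simulated in plain Python, with no Gloo or HCCL calls, and the helper names are hypothetical rather than taken from this PR.

```python
from typing import List


def simulated_all_reduce_max(flags: List[int]) -> List[int]:
    """Simulate a MAX all-reduce: every rank receives the global maximum."""
    global_max = max(flags)
    return [global_max for _ in flags]


def has_unfinished_across_dp(local_unfinished: List[bool]) -> List[bool]:
    """Return, per rank, whether ANY DP rank still has unfinished requests.

    Index i of `local_unfinished` stands for the local state of DP rank i.
    """
    flags = [int(state) for state in local_unfinished]  # one small integer per rank
    reduced = simulated_all_reduce_max(flags)           # the only communication step
    return [bool(value) for value in reduced]


if __name__ == "__main__":
    # Rank 2 still has work, so every rank must keep stepping.
    print(has_unfinished_across_dp([False, False, True, False]))
    # No rank has work, so all ranks can stop.
    print(has_unfinished_across_dp([False, False]))
```

Because the payload is a single integer per rank, the cost of this step is dominated by latency rather than bandwidth, which is why the choice of communication backend matters more as the DP size grows.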

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: v0.10.0
vLLM main: vllm-project/vllm@6807af8

github-actions · 2025-07-30T03:56:24Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- A PR should do only one thing; smaller PRs enable faster reviews.
- Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
- Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

jianzs · 2025-07-31T13:41:05Z

#1857 I submitted a pull request with this feature, and eliminated the communication overhead for DP metadata when constructing the forward context. Now, I'm waiting for @wangxiyuan to refactor the model runner.

zhaowei1936 · 2025-08-01T06:41:58Z

> #1857 I submitted a pull request with this feature, and eliminated the communication overhead for DP metadata when constructing the forward context. Now, I'm waiting for @wangxiyuan to refactor the model runner.

OK, I will stop modifying the function _get_forward_metadata_across_dp. When will @wangxiyuan refactor the model runner? In the meantime, I will modify the function has_unfinished_dp first.

codecov · 2025-08-01T11:21:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.85%. Comparing base (992271b) to head (be22296).
⚠️ Report is 663 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2102      +/-   ##
==========================================
+ Coverage   75.74%   75.85%   +0.10%     
==========================================
  Files         118      119       +1     
  Lines       13525    13585      +60     
==========================================
+ Hits        10245    10305      +60     
  Misses       3280     3280

Flag	Coverage Δ
unittests	`75.85% <100.00%> (+0.10%)`	⬆️
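
As a quick arithmetic check, the percentages in the report can be recomputed from the Hits and Lines rows of the diff table. The sketch below uses exact fractions. The reported figures match truncation to two decimals rather than rounding, which is an observation about these numbers and not a statement about how Codecov computes them. The helper name is hypothetical.

```python
from fractions import Fraction
from math import floor


def truncate_to_hundredths(value: Fraction) -> str:
    """Format a non-negative value with two decimals, truncating instead of rounding."""
    hundredths = floor(value * 100)
    return f"{hundredths // 100}.{hundredths % 100:02d}"


# Hits / Lines taken from the coverage diff table.
base_coverage = Fraction(10245, 13525) * 100  # main
head_coverage = Fraction(10305, 13585) * 100  # #2102
delta = head_coverage - base_coverage

if __name__ == "__main__":
    print(truncate_to_hundredths(base_coverage))  # 75.74
    print(truncate_to_hundredths(head_coverage))  # 75.85
    print(truncate_to_hundredths(delta))          # 0.10
```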


jianzs · 2025-08-02T04:00:16Z

As far as I know, has_unfinished_dp runs inside the Scheduler process, not the Worker process. In that case, why does optimizing this function improve performance?

vllm_ascend/patch/platform/patch_common/patch_distributed.py

github-actions · 2025-08-05T03:39:10Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zhaowei1936 · 2025-08-05T14:38:55Z

> As far as I know, has_unfinished_dp runs inside the Scheduler process, not the Worker process. In that case, why does optimizing this function improve performance?

The benefit in this specific case comes from the following.

Limitations of Gloo:

1) Gloo was initially designed for small-scale clusters (such as a single machine with multiple cards) and uses relatively simple communication protocols. In multi-machine, cross-node scenarios, the protocol overhead (such as handshakes and message headers) increases significantly, which lowers the actual data transmission efficiency.
2) As typically deployed, Gloo relies on the standard TCP/IP protocol and does not benefit from dedicated network acceleration such as RDMA. In multi-machine environments, TCP's congestion control and retransmission mechanisms introduce additional delays, especially on high-bandwidth networks where the hardware potential cannot be fully utilized.
3) The core communication logic of Gloo is single-threaded and cannot fully utilize the multi-core CPU resources in a multi-machine environment. As the number of nodes increases, single-threaded processing becomes a significant bottleneck.

Advantages of HCCL communication on A3 super nodes:

1) HCCL is a communication library designed by Huawei specifically for the Ascend series of NPUs. It directly invokes the communication engine built into the chip, avoiding the overhead of the general TCP/IP protocol stack that Gloo relies on.
2) HCCL supports the RoCE (RDMA over Converged Ethernet) protocol, which provides RDMA-level high-performance communication over Ethernet. Latency is reduced to the microsecond level, whereas Gloo's TCP-based latency is typically in the millisecond range.
3) HCCL can automatically recognize the cluster network topology (such as switch hierarchy and link bandwidth) and dynamically select the optimal communication path. For example, in multi-machine, multi-node scenarios, HCCL adjusts the partitioning of the Ring algorithm for all_reduce based on the network distance between nodes, whereas Gloo uses a fixed strategy and cannot adapt to complex topologies.
4) For large-scale clusters (such as those with more than 64 nodes), HCCL employs hierarchical aggregation (Hierarchical AllReduce): nodes within a cabinet aggregate first, and cross-cabinet aggregation is performed afterwards, which significantly reduces the number of cross-switch communications. Gloo lacks this hierarchical mechanism.
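
The hierarchical aggregation idea mentioned above can be illustrated with a small simulation. This is a sketch of the general two-level technique under simplifying assumptions (one leader per cabinet, a sum reduction, plain Python lists instead of devices). It does not describe HCCL's actual implementation, and all function names are hypothetical.

```python
from typing import List


def flat_all_reduce_sum(values: List[int]) -> List[int]:
    """Reference result: every participant receives the global sum."""
    total = sum(values)
    return [total for _ in values]


def hierarchical_all_reduce_sum(cabinets: List[List[int]]) -> List[List[int]]:
    """Two-level all-reduce over values grouped by cabinet."""
    # Stage 1: reduce inside each cabinet (local, cheap links).
    partial_sums = [sum(cabinet) for cabinet in cabinets]
    # Stage 2: all-reduce among one leader per cabinet (cross-cabinet links).
    global_sum = sum(partial_sums)
    # Stage 3: each leader broadcasts the result inside its cabinet.
    return [[global_sum for _ in cabinet] for cabinet in cabinets]


def cross_cabinet_participants(cabinets: List[List[int]]) -> int:
    """Only one leader per cabinet takes part in the cross-cabinet stage."""
    return len(cabinets)


if __name__ == "__main__":
    cabinets = [[1, 2], [3, 4], [5]]
    print(hierarchical_all_reduce_sum(cabinets))  # [[15, 15], [15, 15], [15]]
    print(cross_cabinet_participants(cabinets))   # 3, versus 5 ranks in total
```

The result is identical to a flat all-reduce over all ranks, but only the cabinet leaders communicate across cabinets, which is the property the comment attributes to the hierarchical approach.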

Error reported

wangxiyuan · 2025-08-07T08:51:47Z

please take a look at this one. vllm-project/vllm#22243

github-actions · 2025-08-12T06:27:03Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: z00811365 <zhaowei6@huawei.com>

github-actions · 2025-08-14T01:41:26Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch from 8225c3b to ce79104 Compare July 31, 2025 12:46

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch 4 times, most recently from e121e1d to 2ae3461 Compare August 1, 2025 11:01

momo609 reviewed Aug 4, 2025

vllm_ascend/patch/platform/patch_common/patch_distributed.py (resolved)

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch from 2ae3461 to cbe946e Compare August 5, 2025 03:37

github-actions bot added the merge-conflicts label Aug 5, 2025

github-actions bot added module:tests module:core labels Aug 5, 2025

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch from cbe946e to c6c56ff Compare August 5, 2025 06:20

github-actions bot removed the merge-conflicts label Aug 5, 2025

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch 2 times, most recently from e93bbda to 245fead Compare August 5, 2025 14:19

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch 8 times, most recently from 4289a31 to 079fe9c Compare August 6, 2025 11:53

github-actions bot added the merge-conflicts label Aug 12, 2025

fix gllo allreduce to hccl

be22296

Signed-off-by: z00811365 <zhaowei6@huawei.com>

zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch from 079fe9c to be22296 Compare August 13, 2025 04:53

github-actions bot added merge-conflicts and removed merge-conflicts labels Aug 13, 2025

wangxiyuan closed this Aug 19, 2025

zhaowei1936 wants to merge 1 commit into vllm-project:main from zhaowei1936:br_main_gloo_fix_hccl

4 participants
