
fix gloo allreduce to hccl #2102

Closed
zhaowei1936 wants to merge 1 commit into vllm-project:main from zhaowei1936:br_main_gloo_fix_hccl

@zhaowei1936 zhaowei1936 commented Jul 30, 2025

What this PR does / why we need it?

Implements HCCL for the DP has_unfinished_dp communication in place of Gloo, which significantly improves performance at large DP sizes.
Achieves a latency reduction of about 10 ms on A3 with a DP size of 128.
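
For context, the DP "has unfinished requests" check amounts to a very small all-reduce: every DP rank contributes one flag, and every rank needs to learn whether any flag is set. The sketch below simulates only those semantics in pure Python. The helper names `all_reduce_max` and `has_unfinished_dp_sim` are hypothetical (they are not vLLM, Gloo, or HCCL APIs), no real communication backend is involved, and using MAX to express a logical OR is an assumption made for illustration.

```python
from typing import List


def all_reduce_max(values: List[int]) -> List[int]:
    """Simulate a MAX all-reduce: every rank ends up with the global maximum.

    Hypothetical helper; a real backend would exchange data between processes.
    """
    global_max = max(values)
    return [global_max for _ in values]


def has_unfinished_dp_sim(local_flags: List[bool]) -> List[bool]:
    """Given one 'has unfinished requests' flag per DP rank, return the
    value each rank would observe after the reduction (logical OR via MAX)."""
    reduced = all_reduce_max([int(flag) for flag in local_flags])
    return [bool(value) for value in reduced]


if __name__ == "__main__":
    # 128 simulated ranks; only rank 5 still has work to do.
    flags = [False] * 128
    flags[5] = True
    # Every rank should conclude that the DP group is not finished yet.
    print(all(has_unfinished_dp_sim(flags)))
```

Because the payload is a single integer per rank, the cost of this step is dominated by per-call latency rather than bandwidth, which is why the choice of backend matters at large DP sizes.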

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@zhaowei1936 zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch from 8225c3b to ce79104 Compare July 31, 2025 12:46
@jianzs
Collaborator

jianzs commented Jul 31, 2025

I submitted pull request #1857 with this feature; it also eliminates the communication overhead for DP metadata when constructing the forward context. I'm now waiting for @wangxiyuan to refactor the model runner.

@zhaowei1936
Author

zhaowei1936 commented Aug 1, 2025

> I submitted pull request #1857 with this feature; it also eliminates the communication overhead for DP metadata when constructing the forward context. I'm now waiting for @wangxiyuan to refactor the model runner.

OK, I'll drop my changes to the function _get_forward_metadata_across_dp. @wangxiyuan, when do you plan to refactor the model runner? In the meantime, I'll modify the function has_unfinished_dp first.

@zhaowei1936 zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch 4 times, most recently from e121e1d to 2ae3461 Compare August 1, 2025 11:01
@codecov

codecov bot commented Aug 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.85%. Comparing base (992271b) to head (be22296).
⚠️ Report is 663 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2102      +/-   ##
==========================================
+ Coverage   75.74%   75.85%   +0.10%     
==========================================
  Files         118      119       +1     
  Lines       13525    13585      +60     
==========================================
+ Hits        10245    10305      +60     
  Misses       3280     3280              
| Flag | Coverage Δ | |
|---|---|---|
| unittests | `75.85% <100.00%> (+0.10%)` | ⬆️ |


@jianzs
Collaborator

jianzs commented Aug 2, 2025

As far as I know, has_unfinished_dp runs inside the Scheduler process, not the Worker process. In that case, why does optimizing this function improve performance?

@github-actions
Contributor

github-actions bot commented Aug 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zhaowei1936 zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch from cbe946e to c6c56ff Compare August 5, 2025 06:20
@zhaowei1936 zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch 2 times, most recently from e93bbda to 245fead Compare August 5, 2025 14:19
@zhaowei1936
Author

> As far as I know, has_unfinished_dp runs inside the Scheduler process, not the Worker process. In that case, why does optimizing this function improve performance?

The gain in this particular case comes from the following limitations of Gloo:

1) Gloo was initially designed for small-scale clusters (such as a single machine with multiple cards) and uses relatively simple communication protocols. In multi-machine, cross-node scenarios, the protocol overhead (such as handshakes and message headers) increases significantly, which lowers the effective data-transfer efficiency.
2) Gloo does not use acceleration from dedicated network hardware or libraries (such as RDMA, NCCL) and relies on the standard TCP/IP protocol. In multi-machine environments, TCP's congestion control and retransmission mechanisms introduce additional delays, especially on high-bandwidth networks, where the hardware's potential cannot be fully utilized.
3) The core communication logic of Gloo is single-threaded, so it cannot fully utilize multi-core CPU resources in a multi-machine environment. As the number of nodes increases, single-threaded processing becomes a significant bottleneck.

The advantages of HCCL communication on A3 super nodes are mainly as follows:

1) HCCL is a communication library designed by Huawei specifically for the Ascend series of NPUs. It directly invokes the communication engine built into the chip, avoiding the overhead of the general TCP/IP protocol stack that Gloo relies on.
2) HCCL supports the RoCE (RDMA over Converged Ethernet) protocol, which provides RDMA-level high-performance communication over Ethernet. Latency drops to the microsecond level, whereas Gloo's TCP-based latency is typically in the millisecond range.
3) HCCL can automatically recognize the cluster network topology (such as switch hierarchy and link bandwidth) and dynamically select the optimal communication path. For example, in multi-machine, multi-node scenarios, HCCL adjusts the partitioning of the Ring algorithm for all_reduce based on the network distance between nodes, whereas Gloo uses a fixed strategy and cannot adapt to complex topologies.
4) For large-scale clusters (such as those with more than 64 nodes), HCCL employs hierarchical aggregation (Hierarchical AllReduce): nodes within a cabinet aggregate first, and cross-cabinet aggregation is performed afterwards, which significantly reduces the number of cross-switch communications. Gloo lacks this hierarchical mechanism.
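
To make the hierarchical aggregation idea in point 4 concrete, here is a toy pure-Python model. It is only a sketch under stated assumptions: the function names, the "one leader per cabinet" layout, and the message-counting scheme are all hypothetical, and it does not describe how HCCL or Gloo actually schedule their collectives. The naive baseline is a deliberately simple direct exchange, not Gloo's real algorithm.

```python
from typing import List, Tuple


def hierarchical_all_reduce(groups: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Sum-reduce per-rank values in two stages.

    groups[i] holds the per-rank values of cabinet i (hypothetical layout).
    Returns the per-rank results and the number of values that crossed
    cabinet boundaries in this toy model.
    """
    # Stage 1: each cabinet reduces locally; no cross-cabinet traffic.
    partial_sums = [sum(group) for group in groups]

    # Stage 2: one leader per cabinet sends its partial sum to every
    # other leader (modelled as a direct exchange between leaders).
    num_groups = len(groups)
    cross_messages = num_groups * (num_groups - 1)
    total = sum(partial_sums)

    # Stage 3: each leader broadcasts the total inside its own cabinet.
    results = [[total for _ in group] for group in groups]
    return results, cross_messages


def naive_cross_messages(groups: List[List[int]]) -> int:
    """Cross-cabinet messages if every rank sent its value directly to
    every rank in every other cabinet (a deliberately naive baseline)."""
    total_ranks = sum(len(group) for group in groups)
    return sum(len(group) * (total_ranks - len(group)) for group in groups)


if __name__ == "__main__":
    # Two cabinets with two ranks each.
    layout = [[1, 2], [3, 4]]
    results, cross = hierarchical_all_reduce(layout)
    print(results, cross, naive_cross_messages(layout))
```

In this model the cross-cabinet traffic depends only on the number of cabinets, not on the number of ranks inside each one, which is the property the hierarchical scheme is meant to exploit.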

Error

@zhaowei1936 zhaowei1936 force-pushed the br_main_gloo_fix_hccl branch 8 times, most recently from 4289a31 to 079fe9c Compare August 6, 2025 11:53
@wangxiyuan
Collaborator

Please take a look at this one: vllm-project/vllm#22243

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: z00811365 <zhaowei6@huawei.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangxiyuan wangxiyuan closed this Aug 19, 2025
4 participants