Description
This RFC proposes initial support for HCCL and CG (Compiled Graph) on Ascend NPU.
Original work by @Bye-legumes and @hipudding.
However, we need to split the work into several PRs, each with minor modifications, and set an example for supporting further hardware.
Notes:
- I previously submitted a PR in September 2024 to support HCCL and refactor NCCL into a communicator, but the feedback was that it was too large and complicated, and that it should be split into several PRs with minor modifications.
- We should avoid adding additional C code to Ray, as that would affect the build stage.
Plan for Decoupling into Several Stages:
PR1: Support HCCL on NPU
Ray Core supports scheduling on Ascend NPU devices, but the Ray Collective API does not yet support communication between NPUs using HCCL.
🔗 PR #50790
👤 @liuxsh9
PR2: Refactor CG to Support Multiple Devices
We can refer to this PR to decouple device-related modules.
Move the cupy dependency, and support rank mapping or different process groups.
👤 @hipudding
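One way to picture the decoupling described for PR2 is a small backend registry behind a device-agnostic interface. The sketch below is only an illustration under stated assumptions: `Communicator`, `register_backend`, `create_communicator`, `FakeCommunicator`, and `map_ranks` are hypothetical names, not Ray's actual classes or functions. In a real backend, device-specific imports such as cupy would presumably live inside the backend class, so that other devices never need them.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class Communicator(ABC):
    """Hypothetical device-agnostic interface (not Ray's actual class)."""

    def __init__(self, world_size: int, rank: int) -> None:
        self.world_size = world_size
        self.rank = rank

    @abstractmethod
    def allreduce_sum(self, values: List[float]) -> List[float]:
        """Return the element-wise sum of `values` across all ranks."""


# Registry from backend name to communicator class (hypothetical).
_BACKENDS: Dict[str, Type[Communicator]] = {}


def register_backend(name: str) -> Callable[[Type[Communicator]], Type[Communicator]]:
    """Class decorator that registers a communicator under `name`."""

    def decorator(cls: Type[Communicator]) -> Type[Communicator]:
        _BACKENDS[name] = cls
        return cls

    return decorator


def create_communicator(backend: str, world_size: int, rank: int) -> Communicator:
    """Look up a backend by name so callers never import device libraries."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; registered: {sorted(_BACKENDS)}"
        ) from None
    return cls(world_size, rank)


@register_backend("fake")
class FakeCommunicator(Communicator):
    """Stand-in backend: pretends every rank contributed the same values.

    A real backend would import its device library here (for example,
    lazily inside __init__) instead of at module import time.
    """

    def allreduce_sum(self, values: List[float]) -> List[float]:
        return [v * self.world_size for v in values]


def map_ranks(group_ranks: List[int]) -> Dict[int, int]:
    """Map global ranks to group-local ranks for one process group."""
    return {global_rank: local for local, global_rank in enumerate(sorted(group_ranks))}


if __name__ == "__main__":
    comm = create_communicator("fake", world_size=4, rank=1)
    print(comm.allreduce_sum([1.0, 2.0]))
    print(map_ranks([5, 2, 7]))
```

With a layout like this, a new device would only need to register its own class; the shared code path stays free of device-specific dependencies.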
PR3: CG Support for NPU
CG support will be added after HCCL support is merged, using the HCCL API from PR #47658.
👤 @Bye-legumes
Merge Strategy
- PR2 and PR3 can be merged independently.
- PR3 will be adjusted to build on PR2.
CANN+torch Version
Should this be based on the versions vLLM uses, or on the latest releases?
Use case
Support vllm-ascend: https://github.com/vllm-project/vllm-ascend