[CG, Core] Add Ascend NPU Support for RCCL and CG #51574

@Bye-legumes

Description

This RFC proposes initial support for RCCL and CG on Ascend NPUs.

Original work by @Bye-legumes and @hipudding.

However, the work needs to be split into several smaller PRs, each with minimal modifications, so that it can serve as an example for supporting further hardware.

Notes:

  • I previously submitted a PR in September 2024 to support HCCL and refactor NCCL into a communicator, but the feedback was that it was too large and complicated, and that it should be split into several PRs with minimal modifications each.
  • We should avoid adding additional C code to Ray, since that would affect the build stage.

Plan for Decoupling into Several Stages:

PR1: Support RCCL on NPU

Ray Core supports scheduling on Ascend NPU devices, but the Ray Collective API does not yet support communication between NPUs using HCCL.
🔗 PR #50790
👤 @liuxsh9
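
As a rough illustration of what this stage involves, the sketch below maps an accelerator type to a collective backend name. It is a minimal sketch only: `select_backend` and the `_BACKEND_BY_DEVICE` table are hypothetical names invented for this example, not part of Ray's API or of PR #50790.

```python
from typing import Dict

# Hypothetical table (an assumption for illustration, not Ray's actual code):
# which collective backend serves which accelerator type.
_BACKEND_BY_DEVICE: Dict[str, str] = {
    "GPU": "nccl",
    "NPU": "hccl",
    "CPU": "gloo",
}


def select_backend(device_type: str) -> str:
    """Return the collective backend name for a device type (case-insensitive)."""
    key = device_type.upper()
    if key not in _BACKEND_BY_DEVICE:
        supported = ", ".join(sorted(_BACKEND_BY_DEVICE))
        raise ValueError(
            f"unsupported device type {device_type!r}; expected one of: {supported}"
        )
    return _BACKEND_BY_DEVICE[key]
```

Keeping the mapping in one table means that adding a new device type would, under these assumptions, touch a single entry rather than every call site.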

PR2: Refactor CG to Support Multiple Devices

We can refer to this PR to decouple device-related modules.
Move the cupy dependency, and support rank mapping or different process groups.
👤 @hipudding
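
One possible shape for this decoupling is sketched below: a device-agnostic communicator interface plus a small rank-mapping helper. All names here (`ProcessGroup`, `Communicator`, `HcclCommunicator`) are hypothetical and chosen for illustration; they are not Ray's actual classes, and the HCCL subclass is a stub that performs no real communication.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class ProcessGroup:
    """Hypothetical helper: maps group-local ranks to global ranks."""

    def __init__(self, global_ranks: Sequence[int]) -> None:
        if len(set(global_ranks)) != len(global_ranks):
            raise ValueError("global ranks must be unique")
        self._global_ranks: List[int] = list(global_ranks)
        self._local_by_global: Dict[int, int] = {
            g: i for i, g in enumerate(self._global_ranks)
        }

    @property
    def size(self) -> int:
        return len(self._global_ranks)

    def to_global(self, local_rank: int) -> int:
        return self._global_ranks[local_rank]

    def to_local(self, global_rank: int) -> int:
        return self._local_by_global[global_rank]


class Communicator(ABC):
    """Hypothetical device-agnostic interface (not Ray's actual class)."""

    def __init__(self, group: ProcessGroup, global_rank: int) -> None:
        self.group = group
        # Rank of this participant inside the group, derived from the mapping.
        self.rank = group.to_local(global_rank)

    @abstractmethod
    def backend_name(self) -> str:
        """Name of the device-specific backend behind this communicator."""


class HcclCommunicator(Communicator):
    """Stub only: a real version would import the device library lazily,
    so that device-agnostic code never depends on it at import time."""

    def backend_name(self) -> str:
        return "hccl"
```

For example, in a group built from global ranks `[2, 5, 7]`, the participant with global rank 5 gets local rank 1, and device-specific imports stay confined to the subclass.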

PR3: CG Support for NPU

CG support will be added after RCCL is merged, utilizing the RCCL API from PR #47658.
👤 @Bye-legumes

Merge Strategy

  • PR2 and PR3 can be merged independently.
  • PR3 will be adjusted as needed based on PR2.

CANN+torch Version

Open question: should the versions follow those used by vLLM, or the latest releases?

Use case

Support vllm-ascend https://github.com/vllm-project/vllm-ascend
