[ADAG]Enable NPU (hccl) communication for CG #47658
Conversation
Signed-off-by: zhilong <[email protected]>
cc @ruisearch42
ruisearch42 left a comment
Having a round of review since I was tagged.
Overall looks good. Do you plan to add a test?
Let me know when this is ready to review.
```python
from ray.experimental.channel.nccl_group import _NcclGroup

else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
```
hmm, this looks like a hack. Do you plan to change to a cleaner approach?
OK, I just removed this hack and left a comment in the test. After we refactor the channel we can have a better solution.
```diff
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
```
We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.
I agree. As a next step I'd prefer to change it to AcceleratorCommunicator, or just Communicator for all devices. Currently this GPUCommunicator is also called from some top-level code, so I'll keep the name for now.
```python
self._device_id = device_id

if rank is not None:
    assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
```
ray.get_gpu_ids() seems to only get GPU IDs?
True, I just changed it. Also, I think there should be an API to get all accelerator IDs?
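Recent Ray releases do expose a device-agnostic lookup, `ray.get_runtime_context().get_accelerator_ids()`, which returns a mapping like `{"GPU": [...], "NPU": [...]}`. Assuming such a mapping, the assertion above can be written without hard-coding GPUs; the helper below is a hypothetical sketch, not code from the PR.

```python
# Hypothetical helper: assert that the actor was assigned accelerators of
# the requested kind, given a mapping such as the one returned by
# ray.get_runtime_context().get_accelerator_ids().

def assert_accelerator_assigned(accelerator_ids: dict, device: str) -> list:
    """Return the IDs assigned for `device`, raising if there are none."""
    ids = accelerator_ids.get(device, [])
    assert ids, f"HCCL actor has no {device}s assigned"
    return ids
```

In the HCCL group, the call site would pass `"NPU"` instead of relying on `ray.get_gpu_ids()`.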
```python
if self._closed:
    raise RayChannelError("HCCL group has been destroyed.")

self._comm.send(tensor=value, dst=peer_rank)
```
One question I have is: how is this different from nccl_collective_group send/recv? It seems nccl_collective_group just abstracts it at a higher level as _point2point, but is otherwise identical to nccl_group.
If it's supposed to be channel-only, then we can merge this hccl_group now, and later open another PR for hccl_collective_group.
I think collective is a more general module that can be used by other Ray modules, while here we need a module specific to the aDAG channel. We can have another PR for hccl_collective_group so it can be used as a utility, making NPUs easier to use. In collective we can also try to solve the double-import and other problems we've met.
So in yesterday's aDAG meeting someone mentioned that nccl_collective_group is actually old code, and nccl_group send/recv is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.
There is another PR to support collective fn as a node type: #47621.
I see they implemented collective/allreduce.py, which calls allreduce of the GPUCommunicator in nccl_group.py.
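To illustrate why the channel-level group and the collective-group `_point2point` wrapper end up nearly identical: point-to-point send/recv is just a lookup of a per-(src, dst) channel inside a group object. A toy, single-process sketch (all names hypothetical, in-memory queues standing in for the NCCL/HCCL transport):

```python
import queue

# Toy model of a communicator group: one channel per ordered (src, dst)
# pair, with send/recv reduced to a table lookup. A collective-group
# _point2point wrapper layered on top would do exactly the same lookup.


class P2PGroup:
    def __init__(self, world_size: int):
        self._chans = {
            (s, d): queue.Queue()
            for s in range(world_size)
            for d in range(world_size)
            if s != d
        }

    def send(self, value, src: int, dst: int) -> None:
        """Deliver `value` on the (src, dst) channel."""
        self._chans[(src, dst)].put(value)

    def recv(self, src: int, dst: int):
        """Take the next value from the (src, dst) channel."""
        return self._chans[(src, dst)].get()
```

True collectives (e.g. allreduce) need coordination across all ranks at once, which is the part neither p2p layer provides by itself.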
Signed-off-by: zhilong <[email protected]>
Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU for now, so it's an example to show how to run it.
Signed-off-by: zhilong <[email protected]>
```python
)

torch_npu.npu.set_device(rank)  # Set the NPU device according to the rank
self.ctx = dist.init_process_group(
```
Should we call this process_group?
Aha, this is different from process_group. Ascend's torch_npu handles distributed initialization a little differently, while the other parts are the same: https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
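For reference, in stock PyTorch `torch.distributed.init_process_group` returns `None`, and the default group handle is obtained separately, which is why naming the saved value `ctx` is debatable. A minimal, portable sketch using the `gloo` backend (the PR itself uses `backend="hccl"` via torch_npu on Ascend hardware, which may behave differently):

```python
import torch.distributed as dist

# Single-process init with the portable "gloo" backend so this runs
# without Ascend hardware; on NPU the PR swaps in backend="hccl".
ret = dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29507",
    rank=0,
    world_size=1,
)
# In stock PyTorch, init_process_group returns None; the default group
# handle is dist.group.WORLD (or the result of dist.new_group()).
world = dist.group.WORLD
dist.destroy_process_group()
```

So under stock PyTorch semantics, `self.ctx` would hold `None` rather than a process-group handle.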
Signed-off-by: zhilong <[email protected]>
@kevin85421 Hi, I just resolved conflicts and it works with vLLM now. Can you check this PR? Thanks!
Hi @ruisearch42, can you review this PR?
I also noticed another new PR: #51032
Yes, the goals of these two PRs are the same. I was also inspired by that PR; thanks to @Bye-legumes for the first step. But I prefer to access hccl in a way similar to cuda and make it more convenient for more accelerators, so the implementation is different. @ruisearch42, which way is better? Could you give us some suggestions? I think it's better to join these two PRs together. I did not submit code directly to that PR because I was worried it would affect its existing functionality; it may be related to the debugging of vLLM. What do you think, @Bye-legumes? (Of course, we are both authors of this feature.)
Performance Results (Model: Qwen2.5-14B)
@ruisearch42 what's the next step?
We will follow up with #51032
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I'm adding a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.