
Conversation

@Bye-legumes
Contributor

@Bye-legumes Bye-legumes commented Sep 13, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: zhilong <[email protected]>
@Bye-legumes Bye-legumes changed the title [WIP] Enable NPU communication for aDAG [WIP][ADAG]Enable NPU (hccl) communication for aDAG Sep 13, 2024
@rkooo567
Contributor

cc @ruisearch42

Contributor

@ruisearch42 ruisearch42 left a comment


Having a round of review since I was tagged.

Overall looks good. Do you plan to add a test?

Let me know when this is ready to review.

    from ray.experimental.channel.nccl_group import _NcclGroup
else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
Contributor

hmm, this looks like a hack. Do you plan to change to a cleaner approach?
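For illustration, the conditional import-aliasing above could be replaced by a small registry keyed by device type, so callers never see `_HcclGroup` pretending to be `_NcclGroup`. This is only a sketch; none of these names are real Ray APIs:

```python
# Hypothetical sketch: pick a communicator group class by device type
# instead of aliasing _HcclGroup as _NcclGroup at import time.
_COMM_GROUP_REGISTRY = {}

def register_comm_group(device_type):
    """Class decorator that registers a communicator group for a device type."""
    def wrap(cls):
        _COMM_GROUP_REGISTRY[device_type] = cls
        return cls
    return wrap

@register_comm_group("GPU")
class NcclGroupStub:
    """Stand-in for the real NCCL group class."""
    backend = "nccl"

@register_comm_group("NPU")
class HcclGroupStub:
    """Stand-in for the real HCCL group class."""
    backend = "hccl"

def get_comm_group_cls(device_type):
    """Look up the communicator class for a device type, failing loudly."""
    try:
        return _COMM_GROUP_REGISTRY[device_type]
    except KeyError:
        raise ValueError(f"No communicator registered for {device_type!r}")
```

With a registry like this, adding a new accelerator backend is one decorated class rather than another import-time branch.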

Contributor Author

OK, I just removed this hack and left a comment in the test. After we refactor the channel we can have a better solution.

class GPUCommunicator(ABC):
    """
    Communicator for a group of aDAG actors on Nvidia GPU.
    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
Contributor

We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.

Contributor Author

I agree. As a next step I'd prefer to change it to AcceleratorCommunicator, or just Communicator for all devices. Currently GPUCommunicator is also called from some top-level code, so I'll keep the name for now.
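A device-neutral interface along the lines discussed might look like the sketch below. The rename is hypothetical (the class in this PR is still `GPUCommunicator`), and `EchoCommunicator` is a toy in-process implementation that exists only to show the contract:

```python
from abc import ABC, abstractmethod

class Communicator(ABC):
    """Hypothetical device-neutral rename of GPUCommunicator (sketch only)."""

    @abstractmethod
    def send(self, tensor, peer_rank: int) -> None:
        """Send a tensor to the given peer rank."""

    @abstractmethod
    def recv(self, shape, dtype, peer_rank: int):
        """Receive a tensor of the given shape/dtype from peer_rank."""

class EchoCommunicator(Communicator):
    """Toy implementation: a dict stands in for the wire."""

    def __init__(self):
        self._mailbox = {}

    def send(self, tensor, peer_rank):
        self._mailbox[peer_rank] = tensor

    def recv(self, shape, dtype, peer_rank):
        # shape/dtype are ignored here; a real backend would allocate
        # a device buffer of that shape before receiving into it.
        return self._mailbox.pop(peer_rank)
```

The point of the ABC is that NCCL, HCCL, or any future backend only has to satisfy `send`/`recv`, and the aDAG channel code stays backend-agnostic.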

        self._device_id = device_id

        if rank is not None:
            assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
Contributor

ray.get_gpu_ids() seems to only get GPU IDs?

Contributor Author

True, I just changed it. Also, I think there should be an API to get all accelerator IDs?
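A device-neutral assignment check could be sketched as below. Recent Ray versions expose `ray.get_runtime_context().get_accelerator_ids()`, which returns a dict keyed by resource name, but treat the exact shape here as an assumption; `FakeCtx` is a stand-in used only so the sketch is self-contained:

```python
def assert_accelerators_assigned(runtime_ctx, device_type: str):
    """Raise if the current actor has no accelerators of device_type.

    runtime_ctx is assumed to expose get_accelerator_ids() returning a
    dict like {"GPU": [...], "NPU": [...]} (shape is an assumption).
    """
    ids = runtime_ctx.get_accelerator_ids().get(device_type, [])
    assert ids, f"HCCL actor has no {device_type}s assigned"
    return ids

class FakeCtx:
    """Stand-in for the Ray runtime context, for illustration only."""

    def get_accelerator_ids(self):
        return {"NPU": ["0"], "GPU": []}
```

This replaces the GPU-specific `ray.get_gpu_ids()` check with one parameterized by device type.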

@anyscalesam anyscalesam added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core compiled-graphs labels Sep 16, 2024
        if self._closed:
            raise RayChannelError("HCCL group has been destroyed.")

        self._comm.send(tensor=value, dst=peer_rank)
Contributor

One question I have is: how is this different from nccl_collective_group send/recv? It seems nccl_collective_group just abstracts it at a higher level as _point2point, but is otherwise identical to nccl_group.

If it's supposed to be channel-only, then we can merge this hccl_group and later open another PR for hccl_collective_group.

Contributor Author

I think collective is a more general module that can be used by all other Ray modules, while here we need a module specific to the aDAG channel. We can have another PR for hccl_collective_group so it can serve as a utility that makes NPUs easier to use. In collective we can try to solve the double-import and other problems we've met.
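The send path quoted earlier in this thread (check the closed flag, then delegate to the backend comm object) can be modeled in a few lines. All class names below are stand-ins for illustration, not Ray code:

```python
class ChannelClosedError(RuntimeError):
    """Stand-in for Ray's RayChannelError in this sketch."""

class RecordingComm:
    """Fake backend communicator that records sends instead of
    touching any device."""

    def __init__(self):
        self.sent = []

    def send(self, tensor, dst):
        self.sent.append((tensor, dst))

class P2PGroupSketch:
    """Toy model of the quoted send path: refuse to use a destroyed
    group, otherwise delegate to the backend communicator."""

    def __init__(self, comm):
        self._comm = comm
        self._closed = False

    def destroy(self):
        self._closed = True

    def send(self, value, peer_rank):
        if self._closed:
            raise ChannelClosedError("HCCL group has been destroyed.")
        self._comm.send(tensor=value, dst=peer_rank)
```

The closed-flag guard matters because an aDAG teardown can race with in-flight sends; failing with a channel error is cleaner than a backend crash.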

Contributor

So in yesterday's aDAG meeting someone mentioned that nccl_collective_group is actually old code, and nccl_group send/receive is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.

Contributor

@arcyleung arcyleung Sep 27, 2024

There is another PR to support collective fn as a node type: #47621

I see they implemented collective/allreduce.py, which calls allreduce of the GPUCommunicator in nccl_group.py.
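For readers unfamiliar with the collective being referenced: an allreduce leaves every rank holding the same reduction of all ranks' buffers. The toy below only illustrates the semantics; real nccl/hccl allreduce runs on-device and in-place:

```python
def allreduce_sum(rank_buffers):
    """Toy allreduce with a sum op.

    rank_buffers: one list of numbers per rank, all the same length.
    Returns the post-allreduce buffers: every rank ends up with the
    elementwise sum across all ranks.
    """
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]
```

So for three ranks holding [1, 2], [3, 4], and [5, 6], every rank ends with [9, 12] afterwards.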

Bye-legumes and others added 5 commits September 25, 2024 13:43
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
@Bye-legumes Bye-legumes changed the title [WIP][ADAG]Enable NPU (hccl) communication for aDAG [ADAG]Enable NPU (hccl) communication for aDAG Sep 25, 2024
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
@Bye-legumes
Contributor Author

Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU yet, so it's an example showing how to run it.
Our future plan is to refactor the aDAG channel first so it can support other devices more easily.

)

torch_npu.npu.set_device(rank) # Set the NPU device according to the rank
self.ctx = dist.init_process_group(
Contributor

Should we call this process_group?

Contributor Author

Aha, this is different from process_group... Ascend's torch_npu handles the distributed setup a little differently, while the other parts are the same. https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
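The quoted snippet sets the NPU device from the rank before initializing the process group. The device-selection part reduces to a rank-to-local-device mapping, sketched device-neutrally below (real code calls `torch_npu.npu.set_device(local_rank)` and then `dist.init_process_group` with the hccl backend; the helper name here is hypothetical):

```python
def device_for_rank(rank: int, devices_per_node: int, prefix: str = "npu") -> str:
    """Map a global rank to a local device string.

    E.g. rank 5 with 4 devices per node lands on the second device of
    its node: 'npu:1'. Sketch only; real code would pass the local
    index to torch_npu.npu.set_device().
    """
    if devices_per_node <= 0:
        raise ValueError("devices_per_node must be positive")
    return f"{prefix}:{rank % devices_per_node}"
```

Single-node setups like the quoted test can pass the rank straight through, since rank and local device index coincide there.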

Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
@Bye-legumes
Contributor Author

@kevin85421 Hi, I just resolved conflicts and it works with vLLM now. Can you check this PR? Thanks!

@Bye-legumes
Contributor Author

Hi @ruisearch42, can you review this PR?

@ruisearch42
Contributor

ruisearch42 commented Mar 6, 2025

I also noticed another new PR: #51032
Are they trying to do similar things?

@hipudding
Contributor

hipudding commented Mar 6, 2025

I also noticed another new PR: #51032 Are they trying to do similar things?

Yes, the goals of these two PRs are the same. I was also inspired by this PR; thanks to @Bye-legumes for taking the first step. But I prefer to access hccl in a way similar to cuda, and to make things more convenient for more accelerators, so the implementation is different. @ruisearch42, which way is better? Could you give us some suggestions?

I think it's better to merge these two PRs together. I did not submit code directly to this PR because I was worried it would affect its existing functionality. This PR may be related to the debugging of vLLM.

What do you think, @Bye-legumes? (Of course, we are both authors of this feature.)

@Bye-legumes
Contributor Author

I also noticed another new PR: #51032 Are they trying to do similar things?

Based on previous feedback, this is just a minor modification, as I also previously tried to refactor it to use a communicator: #48607

@Bye-legumes
Contributor Author

Performance Results (Model: Qwen2.5-14B)

Total Cards | TP/PP Configuration | default (mp) | ray         | ray + hccl
2           | TP = 2, PP = 1      | 1627.300708  | 1585.444219 | 1512.801163
2           | TP = 1, PP = 2      | <2           | <2          | 2440.919
4           | TP = 2, PP = 2      | <2           | <2          | 2007.215
4           | TP = 4, PP = 1      | 1658.146417  | 1682.88043  | 1707.24
4           | TP = 1, PP = 4      | <2           | <2          | 2715.516

@jjyao
Collaborator

jjyao commented May 6, 2025

@ruisearch42 what's the next step?

@ruisearch42
Contributor

@ruisearch42 what's the next step?

We will follow up with #51032

@github-actions

github-actions bot commented Jun 2, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 2, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 17, 2025
