[ADAG]Enable NPU (hccl) communication for CG #47658
Conversation
Signed-off-by: zhilong <[email protected]>
cc @ruisearch42
ruisearch42 left a comment
Having a round of review since I was tagged.
Overall looks good. Do you plan to add a test?
Let me know when this is ready to review.
```python
from ray.experimental.channel.nccl_group import _NcclGroup

else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
```
hmm, this looks like a hack. Do you plan to change to a cleaner approach?
OK, I just removed this hack and left a comment in the test. After we refactor the channel we can have a better solution.
```diff
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
```
We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.
I agree. As a next step I'd prefer to change it to AcceleratorCommunicator, or just Communicator for all devices. Currently this GPUCommunicator is also called from some top-level code, so I'll keep the name for now.
```python
self._device_id = device_id

if rank is not None:
    assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
```
ray.get_gpu_ids() seems to only get GPU IDs?
True, I just changed it. Also, I think there should be an API to get all accelerator IDs?
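Recent Ray releases do expose a device-agnostic lookup, `ray.get_runtime_context().get_accelerator_ids()`, which returns a mapping like `{"GPU": [...], "NPU": [...]}`. Assuming such a mapping, the assertion above can be written without hard-coding GPUs; the helper below is a hypothetical sketch, not code from the PR.

```python
# Hypothetical helper: assert that the actor was assigned accelerators of
# the requested kind, given a mapping such as the one returned by
# ray.get_runtime_context().get_accelerator_ids().

def assert_accelerator_assigned(accelerator_ids: dict, device: str) -> list:
    """Return the IDs assigned for `device`, raising if there are none."""
    ids = accelerator_ids.get(device, [])
    assert ids, f"HCCL actor has no {device}s assigned"
    return ids
```

In the HCCL group, the call site would pass `"NPU"` instead of relying on `ray.get_gpu_ids()`.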
```python
if self._closed:
    raise RayChannelError("HCCL group has been destroyed.")

self._comm.send(tensor=value, dst=peer_rank)
```
One question I have is: how is this different from nccl_collective_group send/recv? It seems nccl_collective_group just abstracts it at a higher level as _point2point, but is otherwise identical to nccl_group.
If it's supposed to be channel-only, then we can merge this hccl_group now, and later open another PR for hccl_collective_group.
I think collective is a more general module that can be used by other Ray modules, while here we need a module specific to the aDAG channel. We can have another PR for hccl_collective_group so it can be used as a utility, making NPUs easier to use. In collective we can also try to solve the double-import and other problems we've met.
So in yesterday's aDAG meeting someone mentioned that nccl_collective_group is actually old code, and nccl_group send/recv is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.
There is another PR to support collective fn as a node type: #47621.
I see they implemented collective/allreduce.py, which calls allreduce of the GPUCommunicator in nccl_group.py.
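To illustrate why the channel-level group and the collective-group `_point2point` wrapper end up nearly identical: point-to-point send/recv is just a lookup of a per-(src, dst) channel inside a group object. A toy, single-process sketch (all names hypothetical, in-memory queues standing in for the NCCL/HCCL transport):

```python
import queue

# Toy model of a communicator group: one channel per ordered (src, dst)
# pair, with send/recv reduced to a table lookup. A collective-group
# _point2point wrapper layered on top would do exactly the same lookup.


class P2PGroup:
    def __init__(self, world_size: int):
        self._chans = {
            (s, d): queue.Queue()
            for s in range(world_size)
            for d in range(world_size)
            if s != d
        }

    def send(self, value, src: int, dst: int) -> None:
        """Deliver `value` on the (src, dst) channel."""
        self._chans[(src, dst)].put(value)

    def recv(self, src: int, dst: int):
        """Take the next value from the (src, dst) channel."""
        return self._chans[(src, dst)].get()
```

True collectives (e.g. allreduce) need coordination across all ranks at once, which is the part neither p2p layer provides by itself.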
Signed-off-by: zhilong <[email protected]>
Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU for now, so it's an example to show how to run it.
Signed-off-by: zhilong <[email protected]>
```python
)

torch_npu.npu.set_device(rank)  # Set the NPU device according to the rank
self.ctx = dist.init_process_group(
```
Should we call this process_group?
Aha, this is different from process_group. Ascend's torch_npu handles distributed initialization a little differently, while the other parts are the same: https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
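For reference, in stock PyTorch `torch.distributed.init_process_group` returns `None`, and the default group handle is obtained separately, which is why naming the saved value `ctx` is debatable. A minimal, portable sketch using the `gloo` backend (the PR itself uses `backend="hccl"` via torch_npu on Ascend hardware, which may behave differently):

```python
import torch.distributed as dist

# Single-process init with the portable "gloo" backend so this runs
# without Ascend hardware; on NPU the PR swaps in backend="hccl".
ret = dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29507",
    rank=0,
    world_size=1,
)
# In stock PyTorch, init_process_group returns None; the default group
# handle is dist.group.WORLD (or the result of dist.new_group()).
world = dist.group.WORLD
dist.destroy_process_group()
```

So under stock PyTorch semantics, `self.ctx` would hold `None` rather than a process-group handle.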
Signed-off-by: zhilong <[email protected]>
@kevin85421 Hi, I just resolved conflicts and it works with vLLM now. Can you check this PR? Thanks!
Hi @ruisearch42, can you review this PR?
I also noticed another new PR: #51032
Yes, the goals of these two PRs are the same. I was also inspired by that PR; thanks to @Bye-legumes for the first step. But I prefer to access hccl in a way similar to cuda and make it more convenient for more accelerators, so the implementation is different. @ruisearch42, which way is better? Could you give us some suggestions? I think it's better to join these two PRs together. I did not submit code directly to that PR because I was worried it would affect its existing functionality; it may be related to the debugging of vLLM. What do you think, @Bye-legumes? (Of course, we are both authors of this feature.)
Performance Results (Model: Qwen2.5-14B)
@ruisearch42 what's the next step?
We will follow up with #51032
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I'm adding a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.