
[Feature Request] Allow specifying nccl{Send,Recv} peer via CUDA tensor #1648

Open
garrett361 opened this issue Mar 19, 2025 · 0 comments

The current nccl{Send,Recv} API requires specifying an int peer for P2P ops. This forces unavoidable and undesirable device-host syncs whenever the P2P routing is determined dynamically on the device.
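
For reference, these are the existing P2P signatures from nccl.h, where `peer` is a plain host-side int whose value must be known on the CPU at enqueue time:

```c
ncclResult_t ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype,
                      int peer, ncclComm_t comm, cudaStream_t stream);
ncclResult_t ncclRecv(void* recvbuff, size_t count, ncclDataType_t datatype,
                      int peer, ncclComm_t comm, cudaStream_t stream);
```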

Example: mixture-of-experts models (like DeepSeek v3) have dynamically determined computation, which may require communication depending on the parallelization scheme. The communication pattern can be encoded in, say, int64 CUDA tensors containing the ranks of the peers that different tensors should be sent to. Using that CUDA-tensor peer info to launch the corresponding P2P ops currently requires moving the tensors back to the host so the int peer values can be read off, which introduces an undesirable device-host sync point (sketched below).
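
A minimal sketch of the sync this forces today (names are illustrative, not from any particular codebase): the peer rank lives in device memory, so it must be copied back and the stream synchronized before the send can be enqueued.

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Illustrative only: `d_peer` holds one dynamically computed destination
 * rank in device memory. To enqueue the send today, that rank must first
 * be brought back to the host. */
void send_with_dynamic_peer(const float* sendbuff, size_t count, const int* d_peer,
                            ncclComm_t comm, cudaStream_t stream) {
    int h_peer;
    cudaMemcpyAsync(&h_peer, d_peer, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  /* <-- the device-host sync this issue is about */
    ncclSend(sendbuff, count, ncclFloat, h_peer, comm, stream);
}
```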

Request: allow specifying the peer via a pointer to a buffer held on the device.
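
One possible shape for such an API (purely hypothetical, this function does not exist in NCCL today): the peer argument becomes a device pointer that NCCL reads on the device at execution time, so no host round-trip is needed.

```c
/* Hypothetical extension, not part of the current NCCL API.
 * `peerDev` points to a single int rank held in device memory;
 * NCCL would dereference it on the device when the op executes. */
ncclResult_t ncclSendDevicePeer(const void* sendbuff, size_t count,
                                ncclDataType_t datatype, const int* peerDev,
                                ncclComm_t comm, cudaStream_t stream);
```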
