
[Feature Request] Allow specifying nccl{Send,Recv} peer via CUDA tensor #1648

Open
garrett361 opened this issue Mar 19, 2025 · 0 comments

The current nccl{Send,Recv} API requires specifying an int peer for P2P ops. This forces unavoidable and undesirable device-host syncs whenever the P2P routing is determined dynamically on the device.
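
For reference, these are the existing P2P signatures from nccl.h, where `peer` is a plain host-side int whose value must be known on the CPU at enqueue time:

```c
ncclResult_t ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype,
                      int peer, ncclComm_t comm, cudaStream_t stream);
ncclResult_t ncclRecv(void* recvbuff, size_t count, ncclDataType_t datatype,
                      int peer, ncclComm_t comm, cudaStream_t stream);
```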

Example: mixture-of-experts models (like DeepSeek v3) have dynamically determined computation, which may require communication depending on the parallelization scheme. The communication pattern can be encoded in, say, int64 CUDA tensors containing the ranks of the peers that different tensors should be sent to. Using that CUDA-tensor peer info to launch the corresponding P2P ops currently requires moving the tensors back to the host so the int peer values can be read off, which introduces an undesirable device-host sync point (sketched below).
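
A minimal sketch of the sync this forces today (names are illustrative, not from any particular codebase): the peer rank lives in device memory, so it must be copied back and the stream synchronized before the send can be enqueued.

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Illustrative only: `d_peer` holds one dynamically computed destination
 * rank in device memory. To enqueue the send today, that rank must first
 * be brought back to the host. */
void send_with_dynamic_peer(const float* sendbuff, size_t count, const int* d_peer,
                            ncclComm_t comm, cudaStream_t stream) {
    int h_peer;
    cudaMemcpyAsync(&h_peer, d_peer, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  /* <-- the device-host sync this issue is about */
    ncclSend(sendbuff, count, ncclFloat, h_peer, comm, stream);
}
```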

Request: allow specifying the peer via a pointer to a buffer held on the device.
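
One possible shape for such an API (purely hypothetical, this function does not exist in NCCL today): the peer argument becomes a device pointer that NCCL reads on the device at execution time, so no host round-trip is needed.

```c
/* Hypothetical extension, not part of the current NCCL API.
 * `peerDev` points to a single int rank held in device memory;
 * NCCL would dereference it on the device when the op executes. */
ncclResult_t ncclSendDevicePeer(const void* sendbuff, size_t count,
                                ncclDataType_t datatype, const int* peerDev,
                                ncclComm_t comm, cudaStream_t stream);
```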
