Commit f44ac75

NCCL 2.26.2-1
Profiler improvements
 * Add events for CUDA kernel start and end.
 * Allow network plugins to generate profiling events
 * Enable profiling on a per-operation basis, rather than per-communicator.
 * Add support for graph capturing.

Add implicit launch order
 * Allow to prevent deadlocks when using multiple NCCL communicators per device by implicitly ordering NCCL operations using the host program order. Disabled by default, set NCCL_LAUNCH_ORDER_IMPLICIT=1 to enable.
 * Add a complementary mechanism to detect host threads racing to launch to the same device. Enabled by default, set NCCL_LAUNCH_RACE_FATAL=0 to disable.

Optimize the PAT algorithm
 * Separate the computation and execution of PAT steps on different warps, allowing to run up to 16 PAT steps in parallel to significantly accelerate PAT and reduce its linear part.

Add support for setting QoS per communicator
 * Add a new trafficClass field to the communicator configuration, to allow the application to select a particular traffic class for a given communicator. The meaning of the traffic class is network-specific and should be set in accordance with the network configuration.
 * For the IB/RoCE plugin, existing config variables such as NCCL_IB_SL and NCCL_IB_TC take precedence.

Allow to enable GPU Direct RDMA specifically on C2C platforms
 * Disabled by default, set NCCL_NET_GDR_C2C=1 to enable.

Do not disable user buffer registration unless PXN is really used
 * Only disable UB when a communicator has more than one rank per node on any node.

RAS subsystem improvements
 * Report operation counts separately for each collective operation type.
 * Provide details about missing communicator ranks and reliably distinguish ranks that are no longer a given communicator's members (now reported as NOCOMM) from those that failed to respond.

Add support for timestamps to NCCL diagnostic messages
 * On by default for WARN messages; NCCL_DEBUG_TIMESTAMP_LEVELS can be used to enable them for other debug levels as well.
 * The format can be changed using the NCCL_DEBUG_TIMESTAMP_FORMAT config variable.

Reduce the memory usage with NVLink SHARP (NVLS)
 * Potentially save hundreds of MBs of device memory, considering the multicast buffer size granularity separately from the address alignment.

Update performance tuning for recent Intel CPUs
 * Improve algorithm/protocol selection on recent CPUs such as Emerald Rapids and Sapphire Rapids.

Improve channel scheduling when mixing LL and Simple operations.
 * Make LL operations account for 4x more traffic to ensure LL and simple operations complete at the same time.

Refactor the plugin code
 * Clean up and harmonize the support code across the network, tuner, and profiler plugins.

Add support for comment lines (starting with #) in the nccl.conf file
 * Issue #1540.

Make user buffer registration problems print an INFO instead of a WARN.

Drop support for network plugin interface version 5.

Fix a race condition with split-shared communicators
 * NCCL could hang during connection setup if multiple communicators were grouped together that share resources.

Fix a performance regression when using NCCL_CROSS_NIC=1
 * NCCL would unnecessarily alternate rings, breaking the GPU-NIC associations.

Make GID index detection code more resilient
 * Dynamic GID detection code was giving up too soon if the detected index was not available (e.g., wasn't mapped to the container's sysfs).
 * Issues #1538, #1573.

Fix a race condition with non-blocking operation
 * Fix issue when creating a non-blocking communicator after a non-blocking collective operation on another communicator.

Fix shared memory usage on recent Blackwell GPUs.
 * Issues NVIDIA/nccl-tests#287, NVIDIA/nccl-tests#291, #1637.

Fix an error with NIC fusion and IB SHARP when recreating communicators
 * Disable the unloading of network plugins

Make the auto-merge failures in the NIC fusion non-fatal
 * This could happen when trying to merge IB and RoCE devices.

Fixes to ncclCommAbort
 * Fix hangs due to the progress thread spinning indefinitely on the network progress.
 * Reduce the abort time by up to two orders of magnitude.

Fix a crash when libnccl.so was dynamically unloaded
 * The RAS subsystem was missing a clean-up handler.

Fix a hang if the network plugin's test() call returns an error.

Fix a hang on heterogeneous architectures
 * Ensure we harmonize the tuning to avoid different tuning choices, causing a hang.

Fix double-free on failed ncclCommInitRank and ncclCommFinalize.

Fix a potential list traversal bug during a group launch of multiple communicators
 * Issue #1599.

Unify the handling of NCCL configuration variables
 * Under rare circumstances, some variables specified in the config file could be ignored.
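The launch-order switches named in this message are environment variables. As a rough sketch, assuming NCCL picks them up from the process environment when it initializes, an application could export them before its first NCCL call. `configure_nccl_launch_order` is a hypothetical helper name, not an NCCL function; only the two variable names come from the commit message.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict ISO C modes */
#endif
#include <stdlib.h>

/* Hypothetical helper (not part of NCCL): export the two launch-ordering
 * switches before any NCCL call is made in this process.
 * Returns 0 on success, -1 if setenv() fails. */
int configure_nccl_launch_order(int implicit_order, int race_fatal) {
  /* The last argument (overwrite = 1) replaces any inherited value. */
  if (setenv("NCCL_LAUNCH_ORDER_IMPLICIT", implicit_order ? "1" : "0", 1) != 0)
    return -1;
  if (setenv("NCCL_LAUNCH_RACE_FATAL", race_fatal ? "1" : "0", 1) != 0)
    return -1;
  return 0;
}
```

Calling `configure_nccl_launch_order(1, 1)` at the top of `main` would opt in to implicit ordering while keeping the race detector at its default; the same values can equally be set in the shell or in `nccl.conf`.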
1 parent 80f6bda commit f44ac75

116 files changed, +7498 -5254 lines changed

ext-net/README.md (+22 -8)
@@ -60,20 +60,20 @@ of newer ones.
 The `nccl/` directory is populated with `net_vX.h` files extracting all relevant definitions
 from old API versions. It also provides error codes in `err.h`.

-# API (v9)
+# API (v10)

-Below is the main `ncclNet_v9` struct. Each function is explained in later sections.
+Below is the main `ncclNet_v10` struct. Each function is explained in later sections.

 ```
 typedef struct {
   // Name of the network (mainly for logs)
   const char* name;
   // Initialize the network.
-  ncclResult_t (*init)(ncclDebugLogger_t logFunction);
+  ncclResult_t (*init)(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
   // Return the number of adapters.
   ncclResult_t (*devices)(int* ndev);
   // Get various device properties.
-  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v9_t* props);
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v10_t* props);
   // Create a receiving object and provide a handle to connect to it. The
   // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
   // between ranks to create a connection.
@@ -83,13 +83,13 @@ typedef struct {
   // should return successfully with sendComm == NULL with the expectation that
   // it will be called again until sendComm != NULL.
   // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
-  ncclResult_t (*connect)(int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v8_t** sendDevComm);
+  ncclResult_t (*connect)(int dev, ncclNetCommConfig_v10_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_v10_t** sendDevComm);
   // Finalize connection establishment after remote peer has called connect.
   // This call must not block for the connection to be established, and instead
   // should return successfully with recvComm == NULL with the expectation that
   // it will be called again until recvComm != NULL.
   // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
-  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v8_t** recvDevComm);
+  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v10_t** recvDevComm);
   // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
   // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
   ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
@@ -98,10 +98,10 @@ typedef struct {
   ncclResult_t (*deregMr)(void* comm, void* mhandle);
   // Asynchronous send to a peer.
   // May return request == NULL if the call cannot be performed (or would block)
-  ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void** request);
+  ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* pHandle, void** request);
   // Asynchronous recv from a peer.
   // May return request == NULL if the call cannot be performed (or would block)
-  ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** request);
+  ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** pHandles, void** request);
   // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
   // visible to the GPU
   ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);
@@ -200,6 +200,9 @@ the plugin code adding the following definitions:
 #define INFO(FLAGS, ...) logFunction(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
 ```

+The `ncclProfilerCallback_t` argument is a NCCL core callback that allows the plugin to define and
+record its own events with the NCCL profiler plugin.
+
 `devices`

 Once the plugin is initialized, NCCL will query the number of devices available. It should not
@@ -301,6 +304,11 @@ the `listen` call previously. If the sender did not connect yet, `accept` should
 should return `ncclSuccess`, setting `recvComm` to `NULL`. NCCL will call `accept` again until it
 succeeds.

+The `connect` API takes a `ncclNetCommConfig_t`, which contains a trafficClass field.
+This field can be used by the network plugin to specify the QoS level of the connection. By default,
+`trafficClass` is set to -1 but can be configured by the application during communicator initialization
+to select a plugin-supported QoS level.
+
 `closeListen`/`closeSend`/`closeRecv`

 Once a `listenComm`/`sendComm`/`recvComm` is no longer needed, NCCL will call
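The `trafficClass` paragraph added in this hunk leaves open how a plugin should treat the default of -1. One possible reading, shown as a minimal sketch: the plugin falls back to its own default when the field equals `NCCL_NET_TRAFFIC_CLASS_UNDEF`. The struct and macro are local copies of the definitions this commit adds in `net_v10.h`; `exampleResolveTrafficClass` and `EXAMPLE_DEFAULT_TRAFFIC_CLASS` are hypothetical names, and the fallback policy itself is an assumption.

```c
#include <stddef.h>

/* Local copies of the definitions added in ext-net/example/nccl/net_v10.h. */
#define NCCL_NET_TRAFFIC_CLASS_UNDEF -1
typedef struct {
  int trafficClass;  /* Plugin-specific TC value */
} ncclNetCommConfig_v10_t;

/* Hypothetical plugin-side default, used when the application chose nothing. */
#define EXAMPLE_DEFAULT_TRAFFIC_CLASS 0

/* Decide which traffic class a new connection should use.
 * Assumption: a NULL config is treated the same as an undefined class. */
int exampleResolveTrafficClass(const ncclNetCommConfig_v10_t* config) {
  if (config == NULL || config->trafficClass == NCCL_NET_TRAFFIC_CLASS_UNDEF)
    return EXAMPLE_DEFAULT_TRAFFIC_CLASS;
  return config->trafficClass;
}
```

A plugin's `connect` would call such a helper once with the `config` argument it receives and apply the result when it creates the underlying connection.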
@@ -354,6 +362,9 @@ The `isend` operation returns a handle in the `request` argument for further cal
 the `isend` operation cannot be initiated, `request` can be set to `NULL` and NCCL will call
 `isend` again later.

+The `pHandle` argument allows NCCL to pass an opaque handle that can be used by the network plugin
+to support network defined events.
+
 `irecv`

 To receive data, NCCL will call `irecv` with the `recvComm` returned by `accept`. The argument
@@ -375,6 +386,9 @@ of irecv and is resilient to redundant network writes. This allows the plugin to
 completions on such irecvs (for example, complete the request immediately). The plugin is still
 expected to set a valid request pointer on return which NCCL can poll to check for completion.

+The `pHandle` argument allows NCCL to pass an array of opaque handles that can be used by the
+network plugin to support network defined events.
+
 Note: for a given connection, send/receive operations should always match in the order they were
 posted. Tags provided for receive operations are only used to assign a given send operation to one
 of the buffers of the first (multi-)receive in the queue, not to allow for out-of-order tag
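To make the README additions about `profFunction`, `pHandle` and `pHandles` more concrete, here is a hedged sketch of how a plugin might tie them together: save the callback received in `init`, open one event per handle in `isend`/`irecv`, and close the events when the request completes. Only the `ncclProfilerCallback_t` signature is taken from this commit. The event type codes, the plugin id, the `example*` names and the convention of passing `NULL` on stop are all assumptions made for illustration, and `exampleCountingProfiler` is a stand-in for the callback NCCL would supply.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in so the sketch compiles without the NCCL headers. */
typedef enum { ncclSuccess = 0, ncclInvalidArgument = 4 } ncclResult_t;
/* Signature copied from ext-net/example/nccl/net.h in this commit. */
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle,
                                               int64_t pluginId, void* extData);

/* Assumptions: event type codes and plugin id are illustrative only. */
enum { EXAMPLE_EVENT_START = 0, EXAMPLE_EVENT_STOP = 1 };
#define EXAMPLE_PLUGIN_ID 1
#define EXAMPLE_MAX_RECVS 8

typedef struct {
  int n;                              /* number of buffers in this request */
  void* eHandles[EXAMPLE_MAX_RECVS];  /* event handles returned by the profiler */
} exampleRequest;

static ncclProfilerCallback_t exampleProf = NULL;

/* Would be called from the plugin's init(logFunction, profFunction). */
void exampleSaveProfiler(ncclProfilerCallback_t profFunction) { exampleProf = profFunction; }

/* Start one plugin-defined event per buffer that has a profiler handle. */
static ncclResult_t exampleStartEvents(exampleRequest* req, int n, void** pHandles) {
  req->n = n;
  for (int i = 0; i < n; i++) {
    req->eHandles[i] = NULL;
    if (exampleProf == NULL || pHandles == NULL || pHandles[i] == NULL) continue;
    ncclResult_t ret = exampleProf(&req->eHandles[i], EXAMPLE_EVENT_START,
                                   pHandles[i], EXAMPLE_PLUGIN_ID, NULL);
    if (ret != ncclSuccess) return ret;
  }
  return ncclSuccess;
}

/* isend carries a single opaque handle. */
ncclResult_t exampleIsend(void* pHandle, exampleRequest* req) {
  return exampleStartEvents(req, 1, &pHandle);
}

/* irecv carries one opaque handle per receive buffer. */
ncclResult_t exampleIrecv(int n, void** pHandles, exampleRequest* req) {
  if (n < 0 || n > EXAMPLE_MAX_RECVS) return ncclInvalidArgument;
  return exampleStartEvents(req, n, pHandles);
}

/* Close every event that was started, e.g. once test() reports completion. */
ncclResult_t exampleComplete(exampleRequest* req) {
  for (int i = 0; i < req->n; i++) {
    if (exampleProf == NULL || req->eHandles[i] == NULL) continue;
    ncclResult_t ret = exampleProf(&req->eHandles[i], EXAMPLE_EVENT_STOP,
                                   NULL, EXAMPLE_PLUGIN_ID, NULL);
    if (ret != ncclSuccess) return ret;
    req->eHandles[i] = NULL;
  }
  return ncclSuccess;
}

/* Stand-in for the NCCL-provided callback: tracks how many events are open. */
static int exampleOpenEvents = 0;
ncclResult_t exampleCountingProfiler(void** eHandle, int type, void* phandle,
                                     int64_t pluginId, void* extData) {
  (void)pluginId; (void)extData;
  if (type == EXAMPLE_EVENT_START) { *eHandle = phandle; exampleOpenEvents++; }
  else if (type == EXAMPLE_EVENT_STOP) { *eHandle = NULL; exampleOpenEvents--; }
  return ncclSuccess;
}
int exampleOpenEventCount(void) { return exampleOpenEvents; }
```

Skipping buffers whose handle is `NULL` keeps the data path unchanged when profiling is not active, which is the main design point of the sketch; a real plugin would also attach its own event description through the `extData` argument.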

ext-net/example/nccl/net.h (+11 -2)
@@ -2,14 +2,15 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_H_
-#define NCCL_NET_H_
+#ifndef NET_H_
+#define NET_H_

 #include <stdint.h>
 #include <stdlib.h>

 #include "common.h"
 #include "err.h"
+#include "net_device.h"

 #define NCCL_NET_HANDLE_MAXSIZE 128
 #define NCCL_MAX_NET_SIZE_BYTES (1*1024*1024*1024*1024L) //1TB
@@ -22,6 +23,9 @@
 // Maximum number of requests per comm object
 #define NCCL_NET_MAX_REQUESTS 32

+typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
+
+#include "net_v10.h"
 #include "net_v9.h"
 #include "net_v8.h"
 #include "net_v7.h"
@@ -31,4 +35,9 @@
 #include "net_v3.h"
 #include "net_v2.h"

+typedef ncclNet_v10_t ncclNet_t;
+typedef ncclNetProperties_v10_t ncclNetProperties_t;
+typedef ncclNetVDeviceProps_v10_t ncclNetVDeviceProps_t;
+typedef ncclNetCommConfig_v10_t ncclNetCommConfig_t;
+
 #endif // end include guard

ext-net/example/nccl/net_device.h (+2 -1)
@@ -26,6 +26,7 @@ typedef struct {

 typedef ncclNetDeviceHandle_v7_t ncclNetDeviceHandle_v8_t;
 typedef ncclNetDeviceHandle_v8_t ncclNetDeviceHandle_v9_t;
-typedef ncclNetDeviceHandle_v9_t ncclNetDeviceHandle_t;
+typedef ncclNetDeviceHandle_v9_t ncclNetDeviceHandle_v10_t;
+typedef ncclNetDeviceHandle_v10_t ncclNetDeviceHandle_t;

 #endif

ext-net/example/nccl/net_v10.h (+101)
@@ -0,0 +1,101 @@
+/*
+ * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
+ */
+
+#ifndef NET_V10_H_
+#define NET_V10_H_
+
+#define NCCL_NET_MAX_DEVS_PER_NIC_V10 4
+typedef struct {
+  int ndevs;
+  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V10];
+} ncclNetVDeviceProps_v10_t;
+
+
+#define NCCL_NET_TRAFFIC_CLASS_UNDEF -1
+typedef struct {
+  // Plugin-specific TC value
+  int trafficClass;
+} ncclNetCommConfig_v10_t;
+
+
+typedef struct {
+  char* name; // Used mostly for logging.
+  char* pciPath; // Path to the PCI device in /sys.
+  uint64_t guid; // Unique identifier for the NIC chip. Important for
+  // cards with multiple PCI functions (Physical or virtual).
+  int ptrSupport; // [NCCL_PTR_HOST|NCCL_PTR_CUDA|NCCL_PTR_DMABUF]
+  int regIsGlobal; // regMr is not tied to a particular comm
+  int forceFlush; // Force a flush on receives
+  int speed; // Port speed in Mbps.
+  int port; // Port number.
+  float latency; // Network latency
+  int maxComms; // Maximum number of comms we can create
+  int maxRecvs; // Maximum number of grouped receives.
+  ncclNetDeviceType netDeviceType; // Network offload type
+  int netDeviceVersion; // Version number for network offload
+  ncclNetVDeviceProps_v10_t vProps;
+  size_t maxP2pBytes; // Max transfer size for point-to-point operations
+  size_t maxCollBytes; // Max transfer size for collective operations
+} ncclNetProperties_v10_t;
+
+typedef struct {
+  // Name of the network (mainly for logs)
+  const char* name;
+  // Initialize the network.
+  ncclResult_t (*init)(ncclDebugLogger_t logFunction, ncclProfilerCallback_t profFunction);
+  // Return the number of adapters.
+  ncclResult_t (*devices)(int* ndev);
+  // Get various device properties.
+  ncclResult_t (*getProperties)(int dev, ncclNetProperties_v10_t* props);
+  // Create a receiving object and provide a handle to connect to it. The
+  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
+  // between ranks to create a connection.
+  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
+  // Connect to a handle and return a sending comm object for that peer.
+  // This call must not block for the connection to be established, and instead
+  // should return successfully with sendComm == NULL with the expectation that
+  // it will be called again until sendComm != NULL.
+  // If *sendDevComm points to a valid object, then NCCL is requesting device offload for this connection
+  ncclResult_t (*connect)(int dev, ncclNetCommConfig_v10_t* config, void* handle, void** sendComm, ncclNetDeviceHandle_v10_t** sendDevComm);
+  // Finalize connection establishment after remote peer has called connect.
+  // This call must not block for the connection to be established, and instead
+  // should return successfully with recvComm == NULL with the expectation that
+  // it will be called again until recvComm != NULL.
+  // If *recvDevComm points to a valid object, then NCCL is requesting device offload for this connection
+  ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v10_t** recvDevComm);
+  // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
+  // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
+  ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);
+  /* DMA-BUF support */
+  ncclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);
+  ncclResult_t (*deregMr)(void* comm, void* mhandle);
+  // Asynchronous send to a peer.
+  // May return request == NULL if the call cannot be performed (or would block)
+  ncclResult_t (*isend)(void* sendComm, void* data, size_t size, int tag, void* mhandle, void* phandle, void** request);
+  // Asynchronous recv from a peer.
+  // May return request == NULL if the call cannot be performed (or would block)
+  ncclResult_t (*irecv)(void* recvComm, int n, void** data, size_t* sizes, int* tags, void** mhandles, void** phandles, void** request);
+  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
+  // visible to the GPU
+  ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);
+  // Test whether a request is complete. If size is not NULL, it returns the
+  // number of bytes sent/received.
+  ncclResult_t (*test)(void* request, int* done, int* sizes);
+  // Close and free send/recv comm objects
+  ncclResult_t (*closeSend)(void* sendComm);
+  ncclResult_t (*closeRecv)(void* recvComm);
+  ncclResult_t (*closeListen)(void* listenComm);
+
+  // Copy the given mhandle to a dptr in a format usable by this plugin's device code
+  ncclResult_t (*getDeviceMr)(void* comm, void* mhandle, void** dptr_mhandle);
+
+  // Notify the plugin that a recv has completed by the device
+  ncclResult_t (*irecvConsumed)(void* recvComm, int n, void* request);
+
+  // Virtual NIC APIs. makeVDevice will create a virtual NIC given the specified properties, and tell the caller
+  // what index this new vNIC exists at
+  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v10_t* props);
+} ncclNet_v10_t;
+
+#endif // end include guard
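As an illustration of the `ncclNetProperties_v10_t` fields declared in this header, the sketch below fills them for a single host-memory-only device. The two structs are local copies of the definitions above. The `ncclResult_t`, `ncclNetDeviceType` and `NCCL_PTR_HOST` stand-ins are assumed values for definitions that live in other NCCL headers, and every `example*` name and every concrete number (speed, limits, device name) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for definitions from other NCCL headers (assumed values). */
typedef enum { ncclSuccess = 0, ncclInvalidArgument = 4 } ncclResult_t;
typedef enum { NCCL_NET_DEVICE_HOST = 0 } ncclNetDeviceType;
#define NCCL_PTR_HOST 0x1

/* Local copies of the structs declared in net_v10.h. */
#define NCCL_NET_MAX_DEVS_PER_NIC_V10 4
typedef struct {
  int ndevs;
  int devs[NCCL_NET_MAX_DEVS_PER_NIC_V10];
} ncclNetVDeviceProps_v10_t;

typedef struct {
  char* name;
  char* pciPath;
  uint64_t guid;
  int ptrSupport;
  int regIsGlobal;
  int forceFlush;
  int speed;
  int port;
  float latency;
  int maxComms;
  int maxRecvs;
  ncclNetDeviceType netDeviceType;
  int netDeviceVersion;
  ncclNetVDeviceProps_v10_t vProps;
  size_t maxP2pBytes;
  size_t maxCollBytes;
} ncclNetProperties_v10_t;

static char exampleName[] = "example0";  /* hypothetical device name */

/* Describe the single device (index 0) exposed by this imaginary plugin. */
ncclResult_t exampleGetProperties(int dev, ncclNetProperties_v10_t* props) {
  if (dev != 0 || props == NULL) return ncclInvalidArgument;
  props->name = exampleName;
  props->pciPath = NULL;            /* no PCI device behind this example */
  props->guid = 0;
  props->ptrSupport = NCCL_PTR_HOST; /* host memory only, no GPU Direct */
  props->regIsGlobal = 0;
  props->forceFlush = 0;
  props->speed = 100000;            /* 100 Gb/s, expressed in Mbps */
  props->port = 1;
  props->latency = 0;
  props->maxComms = 1024;
  props->maxRecvs = 1;              /* no grouped receives */
  props->netDeviceType = NCCL_NET_DEVICE_HOST;
  props->netDeviceVersion = 0;
  props->vProps.ndevs = 1;          /* a physical device maps to itself */
  props->vProps.devs[0] = dev;
  props->maxP2pBytes = (size_t)1 << 30;   /* hypothetical 1 GiB limits */
  props->maxCollBytes = (size_t)1 << 30;
  return ncclSuccess;
}
```

A real plugin would assign a function like this to the `getProperties` member of its `ncclNet_v10_t` instance and derive the values from the hardware it drives.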

ext-net/example/nccl/net_v2.h (+2 -2)
@@ -2,8 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V2_H_
-#define NCCL_NET_V2_H_
+#ifndef NET_V2_H_
+#define NET_V2_H_

 typedef struct {
   // Name of the network (mainly for logs)

ext-net/example/nccl/net_v3.h (+2 -2)
@@ -2,8 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V3_H_
-#define NCCL_NET_V3_H_
+#ifndef NET_V3_H_
+#define NET_V3_H_

 #define NCCL_NET_MAX_REQUESTS_V3 16

ext-net/example/nccl/net_v4.h (+2 -2)
@@ -2,8 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V4_H_
-#define NCCL_NET_V4_H_
+#ifndef NET_V4_H_
+#define NET_V4_H_

 #define NCCL_NET_HANDLE_MAXSIZE_V4 64

ext-net/example/nccl/net_v5.h (+2 -2)
@@ -2,8 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V5_H_
-#define NCCL_NET_V5_H_
+#ifndef NET_V5_H_
+#define NET_V5_H_

 typedef ncclNetProperties_v6_t ncclNetProperties_v5_t;
 typedef struct {

ext-net/example/nccl/net_v6.h (+2 -4)
@@ -2,10 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V6_H_
-#define NCCL_NET_V6_H_
-
-#define NCCL_NET_MAX_REQUESTS_V6 8
+#ifndef NET_V6_H_
+#define NET_V6_H_

 typedef struct {
   char* name; // Used mostly for logging.

ext-net/example/nccl/net_v7.h (+2 -4)
@@ -2,10 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V7_H_
-#define NCCL_NET_V7_H_
-
-#include "net_device.h"
+#ifndef NET_V7_H_
+#define NET_V7_H_

 typedef struct {
   char* name; // Used mostly for logging.

ext-net/example/nccl/net_v8.h (+2 -4)
@@ -2,10 +2,8 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V8_H_
-#define NCCL_NET_V8_H_
-
-#include "net_device.h"
+#ifndef NET_V8_H_
+#define NET_V8_H_

 typedef struct {
   char* name; // Used mostly for logging.

ext-net/example/nccl/net_v9.h (+3 -9)
@@ -2,18 +2,14 @@
  * Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
  */

-#ifndef NCCL_NET_V9_H_
-#define NCCL_NET_V9_H_
-
-#include "net_device.h"
+#ifndef NET_V9_H_
+#define NET_V9_H_

 #define NCCL_NET_MAX_DEVS_PER_NIC_V9 4
-#define NCCL_NET_MAX_DEVS_PER_NIC NCCL_NET_MAX_DEVS_PER_NIC_V9
 typedef struct {
   int ndevs;
   int devs[NCCL_NET_MAX_DEVS_PER_NIC_V9];
 } ncclNetVDeviceProps_v9_t;
-typedef ncclNetVDeviceProps_v9_t ncclNetVDeviceProps_t;

 typedef struct {
   char* name; // Used mostly for logging.
@@ -35,8 +31,6 @@ typedef struct {
   size_t maxCollBytes; // Max transfer size for collective operations
 } ncclNetProperties_v9_t;

-typedef ncclNetProperties_v9_t ncclNetProperties_t;
-
 typedef struct {
   // Name of the network (mainly for logs)
   const char* name;
@@ -93,7 +87,7 @@ typedef struct {

   // Virtual NIC APIs. makeVDevice will create a virtual NIC given the specified properties, and tell the caller
   // what index this new vNIC exists at
-  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_t* props);
+  ncclResult_t (*makeVDevice)(int* d, ncclNetVDeviceProps_v9_t* props);
 } ncclNet_v9_t;

 #endif // end include guard
