Commit a16af40

Merge pull request #201 from youkaichao/no_gdrcopy
remove the dependency of gdrcopy
2 parents 1157693 + b9b7ce3 commit a16af40

2 files changed, with 65 additions and 70 deletions

third-party/README.md

Lines changed: 28 additions & 70 deletions
@@ -8,74 +8,27 @@
 
 ## Prerequisites
 
-1. [GDRCopy](https://github.com/NVIDIA/gdrcopy) (v2.4 and above recommended) is a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology, and *it requires kernel module installation with root privileges.*
-
-2. Hardware requirements
-   - GPUDirect RDMA capable devices, see [GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/)
+Hardware requirements:
+   - GPUs inside one node needs to be connected by NVLink
+   - GPUs across different nodes needs to be connected by RDMA devices, see [GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/)
    - InfiniBand GPUDirect Async (IBGDA) support, see [IBGDA Overview](https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/)
    - For more detailed requirements, see [NVSHMEM Hardware Specifications](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html#hardware-requirements)
 
 ## Installation procedure
 
-### 1. Install GDRCopy
-
-GDRCopy requires kernel module installation on the host system. Complete these steps on the bare-metal host before container deployment:
-
-#### Build and installation
-
-```bash
-wget https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz
-cd gdrcopy-2.4.4/
-make -j$(nproc)
-sudo make prefix=/opt/gdrcopy install
-```
-
-#### Kernel module installation
-
-After compiling the software, you need to install the appropriate packages based on your Linux distribution.
-For instance, using Ubuntu 22.04 and CUDA 12.3 as an example:
-
-```bash
-pushd packages
-CUDA=/path/to/cuda ./build-deb-packages.sh
-sudo dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb \
-libgdrapi_2.4.4_amd64.Ubuntu22_04.deb \
-gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.3.deb \
-gdrcopy_2.4.4_amd64.Ubuntu22_04.deb
-popd
-sudo ./insmod.sh # Load kernel modules on the bare-metal system
-```
-
-#### Container environment notes
-
-For containerized environments:
-1. Host: keep kernel modules loaded (`gdrdrv`)
-2. Container: install DEB packages *without* rebuilding modules:
-```bash
-sudo dpkg -i gdrcopy_2.4.4_amd64.Ubuntu22_04.deb \
-libgdrapi_2.4.4_amd64.Ubuntu22_04.deb \
-gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.3.deb
-```
-
-#### Verification
-
-```bash
-gdrcopy_copybw # Should show bandwidth test results
-```
-
-### 2. Acquiring NVSHMEM source code
+### 1. Acquiring NVSHMEM source code
 
 Download NVSHMEM v3.2.5 from the [NVIDIA NVSHMEM OPEN SOURCE PACKAGES](https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz).
 
-### 3. Apply our custom patch
+### 2. Apply our custom patch
 
 Navigate to your NVSHMEM source directory and apply our provided patch:
 
 ```bash
 git apply /path/to/deep_ep/dir/third-party/nvshmem.patch
 ```
 
-### 4. Configure NVIDIA driver
+### 3. Configure NVIDIA driver (required by inter-node communication)
 
 Enable IBGDA by modifying `/etc/modprobe.d/nvidia.conf`:
 

@@ -92,26 +45,31 @@ sudo reboot
 
 For more detailed configurations, please refer to the [NVSHMEM Installation Guide](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html).
 
-### 5. Build and installation
+### 4. Build and installation
 
-The following example demonstrates building NVSHMEM with IBGDA support:
+DeepEP uses NVLink for intra-node communication and IBGDA for inter-node communication. All the other features are disabled to reduce the dependencies.
 
 ```bash
-CUDA_HOME=/path/to/cuda \
-GDRCOPY_HOME=/path/to/gdrcopy \
-NVSHMEM_SHMEM_SUPPORT=0 \
-NVSHMEM_UCX_SUPPORT=0 \
-NVSHMEM_USE_NCCL=0 \
-NVSHMEM_MPI_SUPPORT=0 \
-NVSHMEM_IBGDA_SUPPORT=1 \
-NVSHMEM_PMIX_SUPPORT=0 \
-NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
-NVSHMEM_USE_GDRCOPY=1 \
-cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install
-
-cd build
-make -j$(nproc)
-make install
+export CUDA_HOME=/path/to/cuda
+# disable all features except IBGDA
+export NVSHMEM_IBGDA_SUPPORT=1
+
+export NVSHMEM_SHMEM_SUPPORT=0
+export NVSHMEM_UCX_SUPPORT=0
+export NVSHMEM_USE_NCCL=0
+export NVSHMEM_PMIX_SUPPORT=0
+export NVSHMEM_TIMEOUT_DEVICE_POLLING=0
+export NVSHMEM_USE_GDRCOPY=0
+export NVSHMEM_IBRC_SUPPORT=0
+export NVSHMEM_BUILD_TESTS=0
+export NVSHMEM_BUILD_EXAMPLES=0
+export NVSHMEM_MPI_SUPPORT=0
+export NVSHMEM_BUILD_HYDRA_LAUNCHER=0
+export NVSHMEM_BUILD_TXZ_PACKAGE=0
+export NVSHMEM_TIMEOUT_DEVICE_POLLING=0
+
+cmake -G Ninja -S . -B build -DCMAKE_INSTALL_PREFIX=/path/to/your/dir/to/install
+cmake --build build/ --target install
 ```
 
 ## Post-installation configuration
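
The new build step in `third-party/README.md` leaves IBGDA as the only enabled NVSHMEM feature and turns every other listed flag off. As a minimal sketch that is not part of this commit, the same settings can be exported in a loop so that each variable is set exactly once (the added lines export `NVSHMEM_TIMEOUT_DEVICE_POLLING=0` twice, which is harmless but redundant). The helper name `export_nvshmem_flags` is hypothetical; the variable names are copied from the diff.

```bash
# Hypothetical helper (not part of the commit): export the NVSHMEM build
# flags listed in the README diff, each exactly once.
export_nvshmem_flags() {
    # IBGDA is the only feature left enabled.
    export NVSHMEM_IBGDA_SUPPORT=1

    # Every other flag named in the diff is switched off.
    local flag
    for flag in \
        NVSHMEM_SHMEM_SUPPORT NVSHMEM_UCX_SUPPORT NVSHMEM_USE_NCCL \
        NVSHMEM_PMIX_SUPPORT NVSHMEM_TIMEOUT_DEVICE_POLLING \
        NVSHMEM_USE_GDRCOPY NVSHMEM_IBRC_SUPPORT NVSHMEM_BUILD_TESTS \
        NVSHMEM_BUILD_EXAMPLES NVSHMEM_MPI_SUPPORT \
        NVSHMEM_BUILD_HYDRA_LAUNCHER NVSHMEM_BUILD_TXZ_PACKAGE
    do
        export "$flag=0"
    done
}

export_nvshmem_flags
# Print the resulting settings for a quick visual check before running cmake.
env | grep '^NVSHMEM_' | sort
```

`CUDA_HOME` and the two `cmake` invocations from the README still have to be run afterwards; the sketch only covers the feature flags.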

third-party/nvshmem.patch

Lines changed: 37 additions & 0 deletions
@@ -435,3 +435,40 @@ index c89f408..f99018a 100644
  
      return NVSHMEMX_ERROR_INTERNAL;
  }
+
+
+From 099f608fcd9a1d34c866ad75d0af5d02d2020374 Mon Sep 17 00:00:00 2001
+From: Kaichao You <[email protected]>
+Date: Tue, 10 Jun 2025 00:35:03 -0700
+Subject: [PATCH] remove gdrcopy dependency
+
+---
+ src/modules/transport/ibgda/ibgda.cpp | 6 ++++++
+ 1 file changed, 6 insertions(+)
+
+diff --git a/src/modules/transport/ibgda/ibgda.cpp b/src/modules/transport/ibgda/ibgda.cpp
+index ef325cd..16ee09c 100644
+--- a/src/modules/transport/ibgda/ibgda.cpp
++++ b/src/modules/transport/ibgda/ibgda.cpp
+@@ -406,6 +406,7 @@ static size_t ibgda_get_host_page_size() {
+     return host_page_size;
+ }
+ 
++#ifdef NVSHMEM_USE_GDRCOPY
+ int nvshmemt_ibgda_progress(nvshmem_transport_t t) {
+     nvshmemt_ibgda_state_t *ibgda_state = (nvshmemt_ibgda_state_t *)t->state;
+     int n_devs_selected = ibgda_state->n_devs_selected;
+@@ -459,6 +460,11 @@ int nvshmemt_ibgda_progress(nvshmem_transport_t t) {
+     }
+     return 0;
+ }
++#else
++int nvshmemt_ibgda_progress(nvshmem_transport_t t) {
++    return NVSHMEMX_ERROR_NOT_SUPPORTED;
++}
++#endif
+ 
+ int nvshmemt_ibgda_show_info(struct nvshmem_transport *transport, int style) {
+     NVSHMEMI_ERROR_PRINT("ibgda show info not implemented");
+--
+2.34.1
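
Because `third-party/nvshmem.patch` now carries an extra commit, an NVSHMEM tree patched with the older file will not match the new one. The following is a minimal sketch, not part of this commit, of a dry run that reports whether a patch still needs to be applied. It relies on `git apply --check` and `git apply --reverse --check`, which test applicability without modifying any file. The function name `patch_state` is hypothetical, and the patch path must be absolute because `git -C` changes directory first.

```bash
# Hypothetical helper (not part of the commit): classify a patch against a
# source tree without modifying any file.
#   $1: source root (for example the unpacked NVSHMEM directory)
#   $2: absolute path to the patch file
# Prints "applicable", "applied", or "conflict"; returns 1 on conflict.
patch_state() {
    local src_root="$1" patch_file="$2"
    if git -C "$src_root" apply --check "$patch_file" 2>/dev/null; then
        # The patch applies cleanly, so it has not been applied yet.
        echo "applicable"
    elif git -C "$src_root" apply --reverse --check "$patch_file" 2>/dev/null; then
        # The patch can be reverted cleanly, so it is already in the tree.
        echo "applied"
    else
        echo "conflict"
        return 1
    fi
}
```

A possible use is `[ "$(patch_state "$PWD" /path/to/deep_ep/dir/third-party/nvshmem.patch)" = "applicable" ]` as a guard before the `git apply` step from the README.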
