Copy 2 arrays into GPU memory using GDRCOPY #242

bonbax · 2023-01-17T14:50:33Z

I know this is not a programming class, but please let me know.

I want to copy 2 arrays into GPU memory using GDRCOPY.
I read gdrapi.h but could not understand it. What is the reason why the following does not work?
The sample programs sanity, copylat, and copylat use a working GPU.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"

using namespace std;

int main(){

   float N = 10.0;

   CUdeviceptr devPtrA, devPtrB;
   float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));
   cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   devPtrA = (CUdeviceptr)d_arrayA;
   devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   gdr_pin_buffer(g, devPtrA, (sizeof(float) * N), 0, 0, &handleA);
   gdr_pin_buffer(g, devPtrB, (sizeof(float) * N), 0, 0, &handleB);

   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_map(g, handleA, &map_d_ptrA, (sizeof(float) * N));
   gdr_map(g, handleB, &map_d_ptrB, (sizeof(float) * N));

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, (sizeof(float) * N));
   gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, (sizeof(float) * N));

   /* CUDA Kernel <<< >>>*/

   gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, (sizeof(float) * N));
   gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, (sizeof(float) * N));

   gdr_unmap(g, handleA, map_d_ptrA, (sizeof(float) * N));
   gdr_unmap(g, handleB, map_d_ptrB, (sizeof(float) * N));
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

}

Result

device ptrA: 7fcba6800000
device ptrB: 7fcba6800200
map_d_ptrA: 0
map_d_ptrB: 0
infoA.va: 7fcba6800000
infoA.mapped_size: 552
infoA.page_size: 65536
infoA.mapped: 0
infoA.wc_mapping: 0
infoB.va: 7fcba6800000
infoB.mapped_size: 552
infoB.page_size: 65536
infoB.mapped: 0
infoB.wc_mapping: 0
page offsetA: 0
page offsetB: -512
user-space pointerA: 0
user-space pointerB: 0xfffffffffffffe00

The text was updated successfully, but these errors were encountered:

pakmarkthub · 2023-01-18T03:38:00Z

Hi @bonbax,

Based on the output, I guess that there was an error in gdr_map.

A few requests:

Can you capture the return error code and tell us what that error code is?
Can you reload gdrdrv.ko with dbg_enabled=1 info_enabled=1? You can modify https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 and use that script to reload the driver for you.
Please run your application GDRCOPY_ENABLE_LOGGING=1 <your-app> and provide us the output.

bonbax · 2023-01-18T14:51:31Z

Thanks for the reply.

specs

Rocky Linux release 8.6 (Green Obsidian)
Linux devel 4.18.0-372.26.1.el8_6.x86_64 #1 SMP Tue Sep 13 18:09:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Driver Version: 515.65.01

Maybe because I installed dgrcopy kernelmodule with RPM, it was not possible.
Please let me know if there is a way to change this in /etc/rc.d/init.d/gdrcopy.

insmod: ERROR: could not load module src/gdrdrv/gdrdrv.ko: No such file or directory
INFO: driver major is
INFO: creating /dev/gdrdrv inode
mknod: missing operand after '0'
Try 'mknod --help' for more information.
chmod: cannot access '/dev/gdrdrv': No such file or directory

installed package

gdrcopy-2.3-1.x86_64
gdrcopy-kmod-4.18.0-372.26.1.el8_6.x86_64-nvidia-515.65.01-2.3-1.x86_64
gdrcopy-devel-2.3-1.noarch

Result(My Program)

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
device ptrA: 7efd12800000
device ptrB: 7efd12800200
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  error Invalid argument(22) while mapping handle 2466d50, rounded_size=4096 offset=0
ERR:  error Invalid argument(22) while mapping handle 2466d80, rounded_size=4096 offset=0
map_d_ptrA: 0
map_d_ptrB: 0
infoA.va: 7efd12800000
infoA.mapped_size: 552
infoA.page_size: 65536
infoA.mapped: 0
infoA.wc_mapping: 0
infoB.va: 7efd12800000
infoB.mapped_size: 552
infoB.page_size: 65536
infoB.mapped: 0
infoB.wc_mapping: 0
page offsetA: 0
page offsetB: -512
user-space pointerA: 0
user-space pointerB: 0xfffffffffffffe00
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet

copybw succeeds.

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 copybw
GPU id:0; name: Tesla P4; Bus id: 0000:13:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7fe1bc800000
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
DBG:  wc_mapping=1
map_d_ptr: 0x7fe1e9065000
info.va: 7fe1bc800000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7fe1e9065000
writing test, size=131072 offset=0 num_iters=10000
DBG:  using AVX implementation of gdr_copy_to_bar
write BW: 1548.99MB/s
reading test, size=131072 offset=0 num_iters=100
DBG:  using SSE4_1 implementation of gdr_copy_from_bar
read BW: 15.688MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

pakmarkthub · 2023-01-19T02:47:57Z

It failed in gdr_map, which called mmap to gdrdrv. We need to see the debug messages of the gdrdrv driver to understand what's going on.

insmod: ERROR: could not load module src/gdrdrv/gdrdrv.ko: No such file or directory
You need to compile GDRCopy locally. Please do:

cd <gdrcopy-root-folder>
make -j
sudo ./insmod.sh   # Please use the modified version with dbg_enabled=1 info_enabled=1
<run-your-app>

Please provide us the output from dmesg. The lines that are relevant to GDRCopy (around the end of the log) are sufficient. We don't need the whole log.

bonbax · 2023-01-19T05:39:39Z

I uninstalled the RPM package and performed a build installation.

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
device ptrA: 7f0a26800000
device ptrB: 7f0a26800200
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  error Invalid argument(22) while mapping handle b9bf40, rounded_size=4096 offset=0
ERR:  error Invalid argument(22) while mapping handle b9bf70, rounded_size=4096 offset=0
map_d_ptrA: 0
map_d_ptrB: 0
infoA.va: 7f0a26800000
infoA.mapped_size: 552
infoA.page_size: 65536
infoA.mapped: 0
infoA.wc_mapping: 0
infoB.va: 7f0a26800000
infoB.mapped_size: 552
infoB.page_size: 65536
infoB.mapped: 0
infoB.wc_mapping: 0
page offsetA: 0
page offsetB: -512
user-space pointerA: 0
user-space pointerB: 0xfffffffffffffe00
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet

[ 1610.951847] gdrdrv:gdrdrv_open:minor=0 filep=0xffff88a2dbd12400
[ 1610.951851] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[ 1610.951861] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[ 1610.951862] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f0a26800000 len=40 p2p_tok=0 va_tok=0)
[ 1610.951946] gdrdrv:__gdrdrv_pin_buffer:page table entries: 1
[ 1610.951947] gdrdrv:__gdrdrv_pin_buffer:page[0]=0x00000000c0080000
[ 1610.951949] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[ 1610.951949] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f0a26800000 len=552 p2p_tok=0 va_tok=0)
[ 1610.951953] gdrdrv:__gdrdrv_pin_buffer:page table entries: 1
[ 1610.951954] gdrdrv:__gdrdrv_pin_buffer:page[0]=0x00000000c0080000
[ 1610.951957] gdrdrv:gdrdrv_mmap:mmap filp=0xffff88a2dbd12400 vma=0xffff88a2e03ac9f8 vm_file=0xffff88a2dbd12400 start=0x7f0a535ae000 size=4096 off=0x0
[ 1610.951959] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.951960] gdrdrv:gdrdrv_mmap:offset != 0 is not supported
[ 1610.951970] gdrdrv:gdrdrv_mmap:mmap filp=0xffff88a2dbd12400 vma=0xffff88a2e03ac9f8 vm_file=0xffff88a2dbd12400 start=0x7f0a535ae000 size=4096 off=0x0
[ 1610.951971] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.951971] gdrdrv:gdrdrv_mmap:offset != 0 is not supported
[ 1610.951982] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da04)
[ 1610.951982] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.951988] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da04)
[ 1610.951988] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.952011] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da02)
[ 1610.952012] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.952013] gdrdrv:gdr_free_mr_unlocked:invoking nvidia_p2p_put_pages(va=0x7f0a26800000 p2p_tok=0 va_tok=0)
[ 1610.952020] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da02)
[ 1610.952020] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.952020] gdrdrv:gdr_free_mr_unlocked:invoking nvidia_p2p_put_pages(va=0x7f0a26800000 p2p_tok=0 va_tok=0)
[ 1610.952037] gdrdrv:gdrdrv_release:closing

pakmarkthub · 2023-01-19T07:16:14Z

Hi @bonbax,

Thank you for the debug info. In your case, the failure is because your second buffer is not GPU-page-aligned. We documented this restriction in README.md in the second paragraph of the "Restrictions and known issues" section. So, the failure in gdr_map(..., handleB, ...) is expected.

Nevertheless, it should not cause gdr_map(..., handleA, ...) to fail. Thanks to your help, we have discovered a bug and I filed it in #244.

To solve your issue, I would like to suggest that you properly align both buffers. You will not run into #244 if you do that. You can look at https://github.com/NVIDIA/gdrcopy/blob/master/tests/common.cpp#L46 to see how we allocate aligned CUDA buffers.

bonbax · 2023-01-19T13:36:21Z

We are happy to have found the bug.
I tried to incorporate what is being done with gpu_mem_alloc, but it did not work.

   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   float *h_arrayA, *h_arrayB;
   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));

   cuMemAlloc(&devPtrA_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   cuMemAlloc(&devPtrB_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
device ptrA: 0
device ptrB: 0
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  ioctl error (errno=22)
ERR:  mh is mapped already
ERR:  ioctl error (errno=22)
Segmentation fault

[29680.356553] gdrdrv:gdrdrv_open:minor=0 filep=0xffff88a61ea0f000
[29680.356557] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[29680.356569] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[29680.356570] gdrdrv:gdrdrv_pin_buffer:NULL device pointer
[29680.356573] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[29680.356574] gdrdrv:gdrdrv_pin_buffer:NULL device pointer
[29680.356576] gdr[4954]: segfault at 19 ip 00007ff4eb7bf554 sp 00007ffd7297c160 error 4 in libgdrapi.so.2.3[7ff4eb7be000+4000]
[29680.356579] Code: 84 00 00 00 00 00 41 57 66 0f ef c0 41 56 41 55 41 54 55 53 48 83 ec 38 0f 29 04 24 0f 29 44 24 10 48 c7 44 24 20 00 00 00 00 <f6> 46 18 01 0f 85 ba 00 00 00 48 8b 47 10 44 8b 2e 48 89 fd 48 89
[29680.438399] gdrdrv:gdrdrv_release:closing

pakmarkthub · 2023-01-20T01:50:25Z

Your devPtrA and devPtrB were 0. Can you check the following?

Does cuMemAlloc return successfully? What are the values of your devPtrA_tmp and devPtrB_tmp?
What is the value of GPU_PAGE_SIZE? This is just in case a different header replaces what gdrapi.h defines.
Does PAGE_ROUND_UP work correctly? Can you double-check that the implementation is from GDRCopy?
Are you using the master branch? If not, please try the master branch. I remember that we merged a fix in the rounding algorithm after v2.3 release.

bonbax · 2023-01-21T07:26:05Z

cuMemAlloc returned "201".
CUDA_ERROR_INVALID_CONTEXT = 201
GPU_PAGE_SIZE was "65536".

#define PAGE_ROUND_UP(x, n) (((x) + ((n) - 1)) & ~((n) - 1)) defines.

[root@devel gdrcopy]# git branch

master

Result

GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
malloc size = 40.000000
GPU_PAGE_SIZE = 65536
malloc size + GPU_PAGE_SIZE - 1 = 65575.000000
result res1 = 201 , devPtrA_tmp address (nil)
result res2 = 201 , devPtrA_tmp address (nil)
device ptrA: 0
device ptrB: 0
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  ioctl error (errno=22)
Segmentation fault

Program

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"
#define PAGE_ROUND_UP(x, n)     (((x) + ((n) - 1)) & ~((n) - 1))

using namespace std;

int main(){

   float N = 10.0;
   CUresult res1, res2;
   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   //float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   float *h_arrayA, *h_arrayB;

   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));

   res1 = cuMemAlloc(&devPtrA_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   res2 = cuMemAlloc(&devPtrB_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);

   printf("malloc size = %lf \n", (sizeof(float) * N));
   printf("GPU_PAGE_SIZE = %d \n", GPU_PAGE_SIZE);
   printf("malloc size + GPU_PAGE_SIZE - 1 = %lf \n", (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   printf("result res1 = %d , devPtrA_tmp address %p \n", res1, (void *)devPtrA_tmp);
   printf("result res2 = %d , devPtrA_tmp address %p \n", res2, (void *)devPtrB_tmp);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   //cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   //cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   //devPtrA = (CUdeviceptr)d_arrayA;
   //devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_pin_buffer(g, devPtrA, (sizeof(float) * N), 0, 0, &handleA);
   gdr_map(g, handleA, &map_d_ptrA, (sizeof(float) * N));
   gdr_pin_buffer(g, devPtrB, (sizeof(float) * N), 0, 0, &handleB);
   gdr_map(g, handleB, &map_d_ptrB, (sizeof(float) * N));

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   //gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, (sizeof(float) * N));
   //gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, (sizeof(float) * N));

   /* CUDA Kernel <<< >>>*/

   //gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, (sizeof(float) * N));
   //gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, (sizeof(float) * N));

   gdr_unmap(g, handleA, map_d_ptrA, (sizeof(float) * N));
   gdr_unmap(g, handleB, map_d_ptrB, (sizeof(float) * N));
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

}

pakmarkthub · 2023-01-23T06:20:44Z

You problem this time is in CUDA, not in GDRCopy. If you want to use the driver API (cuMemAlloc), you need to follow the proper CUDA initialization process such as calling cuInit before calling other CUDA driver API. copybw.cpp can give you an example. Otherwise, you can use CUDA runtime API such as cudaMalloc instead.

bonbax · 2023-01-23T07:27:58Z

Sorry.
I made a mistake with the code I should have put.
The actual program I ran is below.
I was executing cuInit before reading cuMemAlloc.
I would check copybw.cpp.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"
#define PAGE_ROUND_UP(x, n)     (((x) + ((n) - 1)) & ~((n) - 1))

using namespace std;

int main(){

   float N = 10.0;
   CUresult res1, res2;
   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   //float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   float *h_arrayA, *h_arrayB;

   cuInit(0);
   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));

   res1 = cuMemAlloc(&devPtrA_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   res2 = cuMemAlloc(&devPtrB_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);

   printf("malloc size = %lf \n", (sizeof(float) * N));
   printf("GPU_PAGE_SIZE = %d \n", GPU_PAGE_SIZE);
   printf("malloc size + GPU_PAGE_SIZE - 1 = %lf \n", (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   printf("result res1 = %d , devPtrA_tmp address %p \n", res1, (void *)devPtrA_tmp);
   printf("result res2 = %d , devPtrA_tmp address %p \n", res2, (void *)devPtrB_tmp);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   //cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   //cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   //devPtrA = (CUdeviceptr)d_arrayA;
   //devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_pin_buffer(g, devPtrA, (sizeof(float) * N), 0, 0, &handleA);
   gdr_map(g, handleA, &map_d_ptrA, (sizeof(float) * N));
   gdr_pin_buffer(g, devPtrB, (sizeof(float) * N), 0, 0, &handleB);
   gdr_map(g, handleB, &map_d_ptrB, (sizeof(float) * N));

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   //gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, (sizeof(float) * N));
   //gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, (sizeof(float) * N));

   /* CUDA Kernel <<< >>>*/

   //gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, (sizeof(float) * N));
   //gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, (sizeof(float) * N));

   gdr_unmap(g, handleA, map_d_ptrA, (sizeof(float) * N));
   gdr_unmap(g, handleB, map_d_ptrB, (sizeof(float) * N));
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

}

bonbax · 2023-01-24T09:22:55Z

I didn't understand.
Is there any documentation that details gdrcopy?
sample

bonbax · 2023-01-24T15:14:04Z

They could find the bug, but they couldn't fix mine.

pakmarkthub · 2023-01-25T05:47:51Z

Hi @bonbax,

This should work. Basically, your CUDA driver API setup was incomplete.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"
#define PAGE_ROUND_UP(x, n)     (((x) + ((n) - 1)) & ~((n) - 1))

using namespace std;

int main(){

   float N = 10.0;
   CUresult res1, res2;
   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   //float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   float *h_arrayA, *h_arrayB;

   size_t size = sizeof(float) * N;
   size_t rounded_size = PAGE_ROUND_UP(size, GPU_PAGE_SIZE);

   cuInit(0);

   CUdevice dev;
   cuDeviceGet(&dev, 0);

   CUcontext dev_ctx;
   cuDevicePrimaryCtxRetain(&dev_ctx, dev);
   cuCtxSetCurrent(dev_ctx);

   cuMemAllocHost((void **)&h_arrayA, size);
   cuMemAllocHost((void **)&h_arrayB, size);

   res1 = cuMemAlloc(&devPtrA_tmp, rounded_size + GPU_PAGE_SIZE - 1);
   res2 = cuMemAlloc(&devPtrB_tmp, rounded_size + GPU_PAGE_SIZE - 1);

   printf("malloc size = %zu \n", size);
   printf("GPU_PAGE_SIZE = %zu \n", GPU_PAGE_SIZE);
   printf("malloc size + GPU_PAGE_SIZE - 1 = %zu \n", size + GPU_PAGE_SIZE - 1);
   printf("result res1 = %d , devPtrA_tmp address %p \n", res1, (void *)devPtrA_tmp);
   printf("result res2 = %d , devPtrA_tmp address %p \n", res2, (void *)devPtrB_tmp);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   //cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   //cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   //devPtrA = (CUdeviceptr)d_arrayA;
   //devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_pin_buffer(g, devPtrA, rounded_size, 0, 0, &handleA);
   gdr_map(g, handleA, &map_d_ptrA, size);
   gdr_pin_buffer(g, devPtrB, rounded_size, 0, 0, &handleB);
   gdr_map(g, handleB, &map_d_ptrB, size);

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, size);
   gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, size);

   /* CUDA Kernel <<< >>>*/

   gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, size);
   gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, size);

   gdr_unmap(g, handleA, map_d_ptrA, size);
   gdr_unmap(g, handleB, map_d_ptrB, size);
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

   cout << "Done" << endl;
}

bonbax · 2023-01-28T17:11:48Z

Thank you very much.
The code you wrote works.
I have found that I need to learn CUDA well first.

pakmarkthub mentioned this issue Jan 19, 2023

gdrdrv returns the same handle for two pinned buffers #244

Closed

pakmarkthub closed this as completed May 11, 2023

tylerjereddy mentioned this issue May 1, 2024

BUG, MAINT: segfaults through libfabric->ucx ofiwg/libfabric#10001

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copy 2 arrays into GPU memory using GDRCOPY #242

Copy 2 arrays into GPU memory using GDRCOPY #242

bonbax commented Jan 17, 2023

pakmarkthub commented Jan 18, 2023

bonbax commented Jan 18, 2023

pakmarkthub commented Jan 19, 2023

bonbax commented Jan 19, 2023

pakmarkthub commented Jan 19, 2023 •

edited

Loading

bonbax commented Jan 19, 2023

pakmarkthub commented Jan 20, 2023

bonbax commented Jan 21, 2023

pakmarkthub commented Jan 23, 2023

bonbax commented Jan 23, 2023

bonbax commented Jan 24, 2023

bonbax commented Jan 24, 2023

pakmarkthub commented Jan 25, 2023 •

edited

Loading

bonbax commented Jan 28, 2023

Copy 2 arrays into GPU memory using GDRCOPY #242

Copy 2 arrays into GPU memory using GDRCOPY #242

Comments

bonbax commented Jan 17, 2023

pakmarkthub commented Jan 18, 2023

bonbax commented Jan 18, 2023

pakmarkthub commented Jan 19, 2023

bonbax commented Jan 19, 2023

pakmarkthub commented Jan 19, 2023 • edited Loading

bonbax commented Jan 19, 2023

pakmarkthub commented Jan 20, 2023

bonbax commented Jan 21, 2023

pakmarkthub commented Jan 23, 2023

bonbax commented Jan 23, 2023

bonbax commented Jan 24, 2023

bonbax commented Jan 24, 2023

pakmarkthub commented Jan 25, 2023 • edited Loading

bonbax commented Jan 28, 2023

pakmarkthub commented Jan 19, 2023 •

edited

Loading

pakmarkthub commented Jan 25, 2023 •

edited

Loading