Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy 2 arrays into GPU memory using GDRCOPY #242

Closed
bonbax opened this issue Jan 17, 2023 · 14 comments
Closed

Copy 2 arrays into GPU memory using GDRCOPY #242

bonbax opened this issue Jan 17, 2023 · 14 comments

Comments

@bonbax
Copy link

bonbax commented Jan 17, 2023

I know this is not a programming class, but please let me know.

I want to copy 2 arrays into GPU memory using GDRCOPY.
I read gdrapi.h but could not understand it. What is the reason why the following does not work?
The sample programs sanity, copylat, and copylat use a working GPU.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"

using namespace std;

int main(){

   float N = 10.0;

   CUdeviceptr devPtrA, devPtrB;
   float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));
   cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   devPtrA = (CUdeviceptr)d_arrayA;
   devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   gdr_pin_buffer(g, devPtrA, (sizeof(float) * N), 0, 0, &handleA);
   gdr_pin_buffer(g, devPtrB, (sizeof(float) * N), 0, 0, &handleB);

   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_map(g, handleA, &map_d_ptrA, (sizeof(float) * N));
   gdr_map(g, handleB, &map_d_ptrB, (sizeof(float) * N));

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, (sizeof(float) * N));
   gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, (sizeof(float) * N));

   /* CUDA Kernel <<< >>>*/

   gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, (sizeof(float) * N));
   gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, (sizeof(float) * N));

   gdr_unmap(g, handleA, map_d_ptrA, (sizeof(float) * N));
   gdr_unmap(g, handleB, map_d_ptrB, (sizeof(float) * N));
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

}

Result

device ptrA: 7fcba6800000
device ptrB: 7fcba6800200
map_d_ptrA: 0
map_d_ptrB: 0
infoA.va: 7fcba6800000
infoA.mapped_size: 552
infoA.page_size: 65536
infoA.mapped: 0
infoA.wc_mapping: 0
infoB.va: 7fcba6800000
infoB.mapped_size: 552
infoB.page_size: 65536
infoB.mapped: 0
infoB.wc_mapping: 0
page offsetA: 0
page offsetB: -512
user-space pointerA: 0
user-space pointerB: 0xfffffffffffffe00
@pakmarkthub
Copy link
Collaborator

Hi @bonbax,

Based on the output, I guess that there was an error in gdr_map.

A few requests:

  1. Can you capture the return error code and tell us what that error code is?
  2. Can you reload gdrdrv.ko with dbg_enabled=1 info_enabled=1? You can modify https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 and use that script to reload the driver for you.
  3. Please run your application GDRCOPY_ENABLE_LOGGING=1 <your-app> and provide us the output.

@bonbax
Copy link
Author

bonbax commented Jan 18, 2023

Thanks for the reply.

specs

Rocky Linux release 8.6 (Green Obsidian)
Linux devel 4.18.0-372.26.1.el8_6.x86_64 #1 SMP Tue Sep 13 18:09:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Driver Version: 515.65.01

Maybe because I installed dgrcopy kernelmodule with RPM, it was not possible.
Please let me know if there is a way to change this in /etc/rc.d/init.d/gdrcopy.

insmod: ERROR: could not load module src/gdrdrv/gdrdrv.ko: No such file or directory
INFO: driver major is
INFO: creating /dev/gdrdrv inode
mknod: missing operand after '0'
Try 'mknod --help' for more information.
chmod: cannot access '/dev/gdrdrv': No such file or directory

installed package

gdrcopy-2.3-1.x86_64
gdrcopy-kmod-4.18.0-372.26.1.el8_6.x86_64-nvidia-515.65.01-2.3-1.x86_64
gdrcopy-devel-2.3-1.noarch

Result(My Program)

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
device ptrA: 7efd12800000
device ptrB: 7efd12800200
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  error Invalid argument(22) while mapping handle 2466d50, rounded_size=4096 offset=0
ERR:  error Invalid argument(22) while mapping handle 2466d80, rounded_size=4096 offset=0
map_d_ptrA: 0
map_d_ptrB: 0
infoA.va: 7efd12800000
infoA.mapped_size: 552
infoA.page_size: 65536
infoA.mapped: 0
infoA.wc_mapping: 0
infoB.va: 7efd12800000
infoB.mapped_size: 552
infoB.page_size: 65536
infoB.mapped: 0
infoB.wc_mapping: 0
page offsetA: 0
page offsetB: -512
user-space pointerA: 0
user-space pointerB: 0xfffffffffffffe00
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet

copybw succeeds.

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 copybw
GPU id:0; name: Tesla P4; Bus id: 0000:13:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7fe1bc800000
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
DBG:  wc_mapping=1
map_d_ptr: 0x7fe1e9065000
info.va: 7fe1bc800000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7fe1e9065000
writing test, size=131072 offset=0 num_iters=10000
DBG:  using AVX implementation of gdr_copy_to_bar
write BW: 1548.99MB/s
reading test, size=131072 offset=0 num_iters=100
DBG:  using SSE4_1 implementation of gdr_copy_from_bar
read BW: 15.688MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

@pakmarkthub
Copy link
Collaborator

It failed in gdr_map, which called mmap to gdrdrv. We need to see the debug messages of the gdrdrv driver to understand what's going on.

insmod: ERROR: could not load module src/gdrdrv/gdrdrv.ko: No such file or directory
You need to compile GDRCopy locally. Please do:

cd <gdrcopy-root-folder>
make -j
sudo ./insmod.sh   # Please use the modified version with dbg_enabled=1 info_enabled=1
<run-your-app>

Please provide us the output from dmesg. The lines that are relevant to GDRCopy (around the end of the log) are sufficient. We don't need the whole log.

@bonbax
Copy link
Author

bonbax commented Jan 19, 2023

I uninstalled the RPM package and performed a build installation.

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
device ptrA: 7f0a26800000
device ptrB: 7f0a26800200
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  error Invalid argument(22) while mapping handle b9bf40, rounded_size=4096 offset=0
ERR:  error Invalid argument(22) while mapping handle b9bf70, rounded_size=4096 offset=0
map_d_ptrA: 0
map_d_ptrB: 0
infoA.va: 7f0a26800000
infoA.mapped_size: 552
infoA.page_size: 65536
infoA.mapped: 0
infoA.wc_mapping: 0
infoB.va: 7f0a26800000
infoB.mapped_size: 552
infoB.page_size: 65536
infoB.mapped: 0
infoB.wc_mapping: 0
page offsetA: 0
page offsetB: -512
user-space pointerA: 0
user-space pointerB: 0xfffffffffffffe00
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
ERR:  mh is not mapped yet
[ 1610.951847] gdrdrv:gdrdrv_open:minor=0 filep=0xffff88a2dbd12400
[ 1610.951851] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[ 1610.951861] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[ 1610.951862] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f0a26800000 len=40 p2p_tok=0 va_tok=0)
[ 1610.951946] gdrdrv:__gdrdrv_pin_buffer:page table entries: 1
[ 1610.951947] gdrdrv:__gdrdrv_pin_buffer:page[0]=0x00000000c0080000
[ 1610.951949] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[ 1610.951949] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f0a26800000 len=552 p2p_tok=0 va_tok=0)
[ 1610.951953] gdrdrv:__gdrdrv_pin_buffer:page table entries: 1
[ 1610.951954] gdrdrv:__gdrdrv_pin_buffer:page[0]=0x00000000c0080000
[ 1610.951957] gdrdrv:gdrdrv_mmap:mmap filp=0xffff88a2dbd12400 vma=0xffff88a2e03ac9f8 vm_file=0xffff88a2dbd12400 start=0x7f0a535ae000 size=4096 off=0x0
[ 1610.951959] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.951960] gdrdrv:gdrdrv_mmap:offset != 0 is not supported
[ 1610.951970] gdrdrv:gdrdrv_mmap:mmap filp=0xffff88a2dbd12400 vma=0xffff88a2e03ac9f8 vm_file=0xffff88a2dbd12400 start=0x7f0a535ae000 size=4096 off=0x0
[ 1610.951971] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.951971] gdrdrv:gdrdrv_mmap:offset != 0 is not supported
[ 1610.951982] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da04)
[ 1610.951982] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.951988] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da04)
[ 1610.951988] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.952011] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da02)
[ 1610.952012] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.952013] gdrdrv:gdr_free_mr_unlocked:invoking nvidia_p2p_put_pages(va=0x7f0a26800000 p2p_tok=0 va_tok=0)
[ 1610.952020] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da02)
[ 1610.952020] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[ 1610.952020] gdrdrv:gdr_free_mr_unlocked:invoking nvidia_p2p_put_pages(va=0x7f0a26800000 p2p_tok=0 va_tok=0)
[ 1610.952037] gdrdrv:gdrdrv_release:closing

@pakmarkthub
Copy link
Collaborator

pakmarkthub commented Jan 19, 2023

Hi @bonbax,

Thank you for the debug info. In your case, the failure is because your second buffer is not GPU-page-aligned. We documented this restriction in README.md in the second paragraph of the "Restrictions and known issues" section. So, the failure in gdr_map(..., handleB, ...) is expected.

Nevertheless, it should not cause gdr_map(..., handleA, ...) to fail. Thanks to your help, we have discovered a bug and I filed it in #244.

To solve your issue, I would like to suggest that you properly align both buffers. You will not run into #244 if you do that. You can look at https://github.com/NVIDIA/gdrcopy/blob/master/tests/common.cpp#L46 to see how we allocate aligned CUDA buffers.

@bonbax
Copy link
Author

bonbax commented Jan 19, 2023

We are happy to have found the bug.
I tried to incorporate what is being done with gpu_mem_alloc, but it did not work.

   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   float *h_arrayA, *h_arrayB;
   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));

   cuMemAlloc(&devPtrA_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   cuMemAlloc(&devPtrB_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

# GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
device ptrA: 0
device ptrB: 0
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  ioctl error (errno=22)
ERR:  mh is mapped already
ERR:  ioctl error (errno=22)
Segmentation fault
[29680.356553] gdrdrv:gdrdrv_open:minor=0 filep=0xffff88a61ea0f000
[29680.356557] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[29680.356569] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[29680.356570] gdrdrv:gdrdrv_pin_buffer:NULL device pointer
[29680.356573] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[29680.356574] gdrdrv:gdrdrv_pin_buffer:NULL device pointer
[29680.356576] gdr[4954]: segfault at 19 ip 00007ff4eb7bf554 sp 00007ffd7297c160 error 4 in libgdrapi.so.2.3[7ff4eb7be000+4000]
[29680.356579] Code: 84 00 00 00 00 00 41 57 66 0f ef c0 41 56 41 55 41 54 55 53 48 83 ec 38 0f 29 04 24 0f 29 44 24 10 48 c7 44 24 20 00 00 00 00 <f6> 46 18 01 0f 85 ba 00 00 00 48 8b 47 10 44 8b 2e 48 89 fd 48 89
[29680.438399] gdrdrv:gdrdrv_release:closing

@pakmarkthub
Copy link
Collaborator

Your devPtrA and devPtrB were 0. Can you check the following?

  1. Does cuMemAlloc return successfully? What are the values of your devPtrA_tmp and devPtrB_tmp?
  2. What is the value of GPU_PAGE_SIZE? This is just in case a different header replaces what gdrapi.h defines.
  3. Does PAGE_ROUND_UP work correctly? Can you double-check that the implementation is from GDRCopy?
  4. Are you using the master branch? If not, please try the master branch. I remember that we merged a fix in the rounding algorithm after v2.3 release.

@bonbax
Copy link
Author

bonbax commented Jan 21, 2023

  1. cuMemAlloc returned "201".
    CUDA_ERROR_INVALID_CONTEXT = 201

  2. GPU_PAGE_SIZE was "65536".

#define PAGE_ROUND_UP(x, n) (((x) + ((n) - 1)) & ~((n) - 1)) defines.

[root@devel gdrcopy]# git branch

  • master

Result

GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 ./gdr
malloc size = 40.000000
GPU_PAGE_SIZE = 65536
malloc size + GPU_PAGE_SIZE - 1 = 65575.000000
result res1 = 201 , devPtrA_tmp address (nil)
result res2 = 201 , devPtrA_tmp address (nil)
device ptrA: 0
device ptrB: 0
DBG:  sse4_1=1 avx=1 sse=1 sse2=1
ERR:  ioctl error (errno=22)
Segmentation fault

Program

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"
#define PAGE_ROUND_UP(x, n)     (((x) + ((n) - 1)) & ~((n) - 1))

using namespace std;

int main(){

   float N = 10.0;
   CUresult res1, res2;
   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   //float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   float *h_arrayA, *h_arrayB;

   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));

   res1 = cuMemAlloc(&devPtrA_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   res2 = cuMemAlloc(&devPtrB_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);

   printf("malloc size = %lf \n", (sizeof(float) * N));
   printf("GPU_PAGE_SIZE = %d \n", GPU_PAGE_SIZE);
   printf("malloc size + GPU_PAGE_SIZE - 1 = %lf \n", (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   printf("result res1 = %d , devPtrA_tmp address %p \n", res1, (void *)devPtrA_tmp);
   printf("result res2 = %d , devPtrA_tmp address %p \n", res2, (void *)devPtrB_tmp);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   //cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   //cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   //devPtrA = (CUdeviceptr)d_arrayA;
   //devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_pin_buffer(g, devPtrA, (sizeof(float) * N), 0, 0, &handleA);
   gdr_map(g, handleA, &map_d_ptrA, (sizeof(float) * N));
   gdr_pin_buffer(g, devPtrB, (sizeof(float) * N), 0, 0, &handleB);
   gdr_map(g, handleB, &map_d_ptrB, (sizeof(float) * N));

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   //gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, (sizeof(float) * N));
   //gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, (sizeof(float) * N));

   /* CUDA Kernel <<< >>>*/

   //gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, (sizeof(float) * N));
   //gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, (sizeof(float) * N));

   gdr_unmap(g, handleA, map_d_ptrA, (sizeof(float) * N));
   gdr_unmap(g, handleB, map_d_ptrB, (sizeof(float) * N));
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

}

@pakmarkthub
Copy link
Collaborator

You problem this time is in CUDA, not in GDRCopy. If you want to use the driver API (cuMemAlloc), you need to follow the proper CUDA initialization process such as calling cuInit before calling other CUDA driver API. copybw.cpp can give you an example. Otherwise, you can use CUDA runtime API such as cudaMalloc instead.

@bonbax
Copy link
Author

bonbax commented Jan 23, 2023

Sorry.
I made a mistake with the code I should have put.
The actual program I ran is below.
I was executing cuInit before reading cuMemAlloc.
I would check copybw.cpp.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"
#define PAGE_ROUND_UP(x, n)     (((x) + ((n) - 1)) & ~((n) - 1))

using namespace std;

int main(){

   float N = 10.0;
   CUresult res1, res2;
   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   //float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   float *h_arrayA, *h_arrayB;

   cuInit(0);
   cuMemAllocHost((void **)&h_arrayA, (sizeof(float) * N));
   cuMemAllocHost((void **)&h_arrayB, (sizeof(float) * N));

   res1 = cuMemAlloc(&devPtrA_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   res2 = cuMemAlloc(&devPtrB_tmp, (sizeof(float) * N) + GPU_PAGE_SIZE - 1);

   printf("malloc size = %lf \n", (sizeof(float) * N));
   printf("GPU_PAGE_SIZE = %d \n", GPU_PAGE_SIZE);
   printf("malloc size + GPU_PAGE_SIZE - 1 = %lf \n", (sizeof(float) * N) + GPU_PAGE_SIZE - 1);
   printf("result res1 = %d , devPtrA_tmp address %p \n", res1, (void *)devPtrA_tmp);
   printf("result res2 = %d , devPtrA_tmp address %p \n", res2, (void *)devPtrB_tmp);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   //cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   //cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   //devPtrA = (CUdeviceptr)d_arrayA;
   //devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_pin_buffer(g, devPtrA, (sizeof(float) * N), 0, 0, &handleA);
   gdr_map(g, handleA, &map_d_ptrA, (sizeof(float) * N));
   gdr_pin_buffer(g, devPtrB, (sizeof(float) * N), 0, 0, &handleB);
   gdr_map(g, handleB, &map_d_ptrB, (sizeof(float) * N));

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   //gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, (sizeof(float) * N));
   //gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, (sizeof(float) * N));

   /* CUDA Kernel <<< >>>*/

   //gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, (sizeof(float) * N));
   //gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, (sizeof(float) * N));

   gdr_unmap(g, handleA, map_d_ptrA, (sizeof(float) * N));
   gdr_unmap(g, handleB, map_d_ptrB, (sizeof(float) * N));
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

}

@bonbax
Copy link
Author

bonbax commented Jan 24, 2023

I didn't understand.
Is there any documentation that details gdrcopy?
sample

@bonbax
Copy link
Author

bonbax commented Jan 24, 2023

They could find the bug, but they couldn't fix mine.

@pakmarkthub
Copy link
Collaborator

pakmarkthub commented Jan 25, 2023

Hi @bonbax,

This should work. Basically, your CUDA driver API setup was incomplete.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <iostream>
#include "gdrapi.h"
#define PAGE_ROUND_UP(x, n)     (((x) + ((n) - 1)) & ~((n) - 1))

using namespace std;

int main(){

   float N = 10.0;
   CUresult res1, res2;
   CUdeviceptr devPtrA, devPtrA_tmp, devPtrB, devPtrB_tmp;
   //float *h_arrayA, *d_arrayA, *h_arrayB, *d_arrayB;
   float *h_arrayA, *h_arrayB;

   size_t size = sizeof(float) * N;
   size_t rounded_size = PAGE_ROUND_UP(size, GPU_PAGE_SIZE);

   cuInit(0);

   CUdevice dev;
   cuDeviceGet(&dev, 0);

   CUcontext dev_ctx;
   cuDevicePrimaryCtxRetain(&dev_ctx, dev);
   cuCtxSetCurrent(dev_ctx);

   cuMemAllocHost((void **)&h_arrayA, size);
   cuMemAllocHost((void **)&h_arrayB, size);

   res1 = cuMemAlloc(&devPtrA_tmp, rounded_size + GPU_PAGE_SIZE - 1);
   res2 = cuMemAlloc(&devPtrB_tmp, rounded_size + GPU_PAGE_SIZE - 1);

   printf("malloc size = %zu \n", size);
   printf("GPU_PAGE_SIZE = %zu \n", GPU_PAGE_SIZE);
   printf("malloc size + GPU_PAGE_SIZE - 1 = %zu \n", size + GPU_PAGE_SIZE - 1);
   printf("result res1 = %d , devPtrA_tmp address %p \n", res1, (void *)devPtrA_tmp);
   printf("result res2 = %d , devPtrA_tmp address %p \n", res2, (void *)devPtrB_tmp);

   unsigned int flag = 1;
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrA_tmp);
   cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, devPtrB_tmp);

   devPtrA = PAGE_ROUND_UP(devPtrA_tmp, GPU_PAGE_SIZE);
   devPtrB = PAGE_ROUND_UP(devPtrB_tmp, GPU_PAGE_SIZE);

   //cudaMalloc((void **)&d_arrayA, (sizeof(float) * N));
   //cudaMalloc((void **)&d_arrayB, (sizeof(float) * N));
   //devPtrA = (CUdeviceptr)d_arrayA;
   //devPtrB = (CUdeviceptr)d_arrayB;
   //cuMemAlloc(&devPtrA, sizeof(float) * N);
   //cuMemAlloc(&devPtrB, sizeof(float) * N);

   cout << "device ptrA: " << hex << devPtrA << dec << endl;
   cout << "device ptrB: " << hex << devPtrB << dec << endl;

   gdr_t g = gdr_open();
   gdr_mh_t handleA, handleB;
   void *map_d_ptrA = NULL, *map_d_ptrB = NULL;
   gdr_pin_buffer(g, devPtrA, rounded_size, 0, 0, &handleA);
   gdr_map(g, handleA, &map_d_ptrA, size);
   gdr_pin_buffer(g, devPtrB, rounded_size, 0, 0, &handleB);
   gdr_map(g, handleB, &map_d_ptrB, size);

   cout << "map_d_ptrA: " << map_d_ptrA << endl;
   cout << "map_d_ptrB: " << map_d_ptrB << endl;

   gdr_info_t infoA, infoB;
   gdr_get_info(g, handleA, &infoA);
   cout << "infoA.va: " << hex << infoA.va << dec << endl;
   cout << "infoA.mapped_size: " << infoA.mapped_size << endl;
   cout << "infoA.page_size: " << infoA.page_size << endl;
   cout << "infoA.mapped: " << infoA.mapped << endl;
   cout << "infoA.wc_mapping: " << infoA.wc_mapping << endl;

   gdr_get_info(g, handleB, &infoB);
   cout << "infoB.va: " << hex << infoB.va << dec << endl;
   cout << "infoB.mapped_size: " << infoB.mapped_size << endl;
   cout << "infoB.page_size: " << infoB.page_size << endl;
   cout << "infoB.mapped: " << infoB.mapped << endl;
   cout << "infoB.wc_mapping: " << infoB.wc_mapping << endl;

   int offsetA = infoA.va - devPtrA, offsetB = infoB.va - devPtrB;

   cout << "page offsetA: " << offsetA << endl;
   cout << "page offsetB: " << offsetB << endl;

   uint32_t *buf_ptrA = (uint32_t *)((char *)map_d_ptrA + offsetA);
   uint32_t *buf_ptrB = (uint32_t *)((char *)map_d_ptrB + offsetB);

   cout << "user-space pointerA: " << buf_ptrA << endl;
   cout << "user-space pointerB: " << buf_ptrB << endl;

   gdr_copy_to_mapping(handleA, buf_ptrA, h_arrayA, size);
   gdr_copy_to_mapping(handleB, buf_ptrB, h_arrayB, size);

   /* CUDA Kernel <<< >>>*/

   gdr_copy_from_mapping(handleA, h_arrayA, buf_ptrA, size);
   gdr_copy_from_mapping(handleB, h_arrayB, buf_ptrB, size);

   gdr_unmap(g, handleA, map_d_ptrA, size);
   gdr_unmap(g, handleB, map_d_ptrB, size);
   gdr_unpin_buffer(g, handleA);
   gdr_unpin_buffer(g, handleB);
   gdr_close(g);

   cout << "Done" << endl;
}

@bonbax
Copy link
Author

bonbax commented Jan 28, 2023

Thank you very much.
The code you wrote works.
I have found that I need to learn CUDA well first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants