Assertion "(cuStreamQuery(0)) == (CUDA_ERROR_NOT_READY)" #298
Every other test works fine. Results are attached below.
gdrcopy_sanity
Total: 28, Passed: 28, Failed: 0, Waived: 0
gdrcopy_copybw
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7f8293a00000
map_d_ptr: 0x7f82b2423000
info.va: 7f8293a00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f82b2423000
writing test, size=131072 offset=0 num_iters=10000
write BW: 8680.15MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 379.824MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
gdrcopy_copylat
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
device ptr: 0x7f68b4000000
allocated size: 16777216
gpu alloc fn: cuMemAlloc
map_d_ptr: 0x7f68e1000000
info.va: 7f68b4000000
info.mapped_size: 16777216
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7f68e1000000
gdr_copy_to_mapping num iters for each size: 10000
WARNING: Measuring the API invocation overhead as observed by the CPU. Data might not be ordered all the way to the GPU internal visibility.
Test Size(B) Avg.Time(us)
gdr_copy_to_mapping 1 0.1021
gdr_copy_to_mapping 2 0.1021
gdr_copy_to_mapping 4 0.1020
gdr_copy_to_mapping 8 0.1027
gdr_copy_to_mapping 16 0.1028
gdr_copy_to_mapping 32 0.1020
gdr_copy_to_mapping 64 0.1037
gdr_copy_to_mapping 128 0.1152
gdr_copy_to_mapping 256 0.1187
gdr_copy_to_mapping 512 0.1374
gdr_copy_to_mapping 1024 0.1998
gdr_copy_to_mapping 2048 0.2580
gdr_copy_to_mapping 4096 0.4537
gdr_copy_to_mapping 8192 0.9071
gdr_copy_to_mapping 16384 1.8081
gdr_copy_to_mapping 32768 3.6079
gdr_copy_to_mapping 65536 7.2086
gdr_copy_to_mapping 131072 14.4026
gdr_copy_to_mapping 262144 28.7971
gdr_copy_to_mapping 524288 57.6994
gdr_copy_to_mapping 1048576 115.3423
gdr_copy_to_mapping 2097152 230.9106
gdr_copy_to_mapping 4194304 462.4430
gdr_copy_to_mapping 8388608 925.5537
gdr_copy_to_mapping 16777216 1851.2054
gdr_copy_from_mapping num iters for each size: 100
Test Size(B) Avg.Time(us)
gdr_copy_from_mapping 1 1.0830
gdr_copy_from_mapping 2 1.9810
gdr_copy_from_mapping 4 2.0370
gdr_copy_from_mapping 8 1.9580
gdr_copy_from_mapping 16 0.3330
gdr_copy_from_mapping 32 0.3730
gdr_copy_from_mapping 64 0.7690
gdr_copy_from_mapping 128 0.7300
gdr_copy_from_mapping 256 1.1240
gdr_copy_from_mapping 512 1.5940
gdr_copy_from_mapping 1024 3.5380
gdr_copy_from_mapping 2048 6.2421
gdr_copy_from_mapping 4096 11.5121
gdr_copy_from_mapping 8192 20.8663
gdr_copy_from_mapping 16384 40.7785
gdr_copy_from_mapping 32768 81.3170
gdr_copy_from_mapping 65536 158.9489
gdr_copy_from_mapping 131072 323.6429
gdr_copy_from_mapping 262144 710.4047
gdr_copy_from_mapping 524288 1422.3003
gdr_copy_from_mapping 1048576 2838.1456
gdr_copy_from_mapping 2097152 5688.7214
gdr_copy_from_mapping 4194304 12608.9298
gdr_copy_from_mapping 8388608 28866.0632
gdr_copy_from_mapping 16777216 57983.8880
unmapping buffer
unpinning buffer
closing gdrdrv
gdrcopy_apiperf
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0001:00:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0002:00:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0003:00:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0004:00:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0005:00:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0006:00:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0007:00:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0008:00:00
selecting device 0
device ptr: 0x7fde32000000
allocated size: 16777216
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
65536 393.693800 4.849070 0.586010 7.779070 208.910540
Histogram of gdr_pin_buffer latency for 65536 bytes
[386.005000 - 772.010000] 85
[772.010000 - 1158.015000] 12
[1158.015000 - 1544.020000] 1
[1544.020000 - 1930.025000] 0
[1930.025000 - 2316.030000] 0
[2316.030000 - 2702.035000] 0
[2702.035000 - 3088.040000] 0
[3088.040000 - 3474.045000] 1
[3474.045000 - 3860.050000] 0
[3860.050000 - 4246.055000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
131072 390.340740 4.922060 0.586010 5.587070 209.741560
Histogram of gdr_pin_buffer latency for 131072 bytes
[374.904000 - 749.808000] 58
[749.808000 - 1124.712000] 37
[1124.712000 - 1499.616000] 3
[1499.616000 - 1874.520000] 0
[1874.520000 - 2249.424000] 0
[2249.424000 - 2624.328000] 1
[2624.328000 - 2999.232000] 0
[2999.232000 - 3374.136000] 0
[3374.136000 - 3749.040000] 0
[3749.040000 - 4123.944000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
262144 384.364700 5.000060 0.579010 5.766080 205.910470
Histogram of gdr_pin_buffer latency for 262144 bytes
[359.904000 - 719.808000] 15
[719.808000 - 1079.712000] 11
[1079.712000 - 1439.616000] 33
[1439.616000 - 1799.520000] 35
[1799.520000 - 2159.424000] 2
[2159.424000 - 2519.328000] 0
[2519.328000 - 2879.232000] 1
[2879.232000 - 3239.136000] 1
[3239.136000 - 3599.040000] 1
[3599.040000 - 3958.944000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
524288 385.165720 5.195100 0.588020 6.491070 205.447430
Histogram of gdr_pin_buffer latency for 524288 bytes
[361.104000 - 722.208000] 53
[722.208000 - 1083.312000] 42
[1083.312000 - 1444.416000] 2
[1444.416000 - 1805.520000] 1
[1805.520000 - 2166.624000] 1
[2166.624000 - 2527.728000] 0
[2527.728000 - 2888.832000] 0
[2888.832000 - 3249.936000] 0
[3249.936000 - 3611.040000] 0
[3611.040000 - 3972.144000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
1048576 400.635850 5.586050 0.584000 7.938140 210.336570
Histogram of gdr_pin_buffer latency for 1048576 bytes
[362.405000 - 724.810000] 96
[724.810000 - 1087.215000] 1
[1087.215000 - 1449.620000] 1
[1449.620000 - 1812.025000] 0
[1812.025000 - 2174.430000] 0
[2174.430000 - 2536.835000] 0
[2536.835000 - 2899.240000] 0
[2899.240000 - 3261.645000] 1
[3261.645000 - 3624.050000] 0
[3624.050000 - 3986.455000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
2097152 391.923760 8.034110 0.573000 13.988180 208.904540
Histogram of gdr_pin_buffer latency for 2097152 bytes
[386.905000 - 773.810000] 92
[773.810000 - 1160.715000] 4
[1160.715000 - 1547.620000] 1
[1547.620000 - 1934.525000] 1
[1934.525000 - 2321.430000] 0
[2321.430000 - 2708.335000] 0
[2708.335000 - 3095.240000] 0
[3095.240000 - 3482.145000] 0
[3482.145000 - 3869.050000] 1
[3869.050000 - 4255.955000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
4194304 396.802860 10.452100 0.576010 19.818270 209.164510
Histogram of gdr_pin_buffer latency for 4194304 bytes
[388.105000 - 776.210000] 98
[776.210000 - 1164.315000] 1
[1164.315000 - 1552.420000] 0
[1552.420000 - 1940.525000] 0
[1940.525000 - 2328.630000] 0
[2328.630000 - 2716.735000] 0
[2716.735000 - 3104.840000] 0
[3104.840000 - 3492.945000] 0
[3492.945000 - 3881.050000] 0
[3881.050000 - 4269.155000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
8388608 397.254870 14.905130 0.584010 31.263530 213.712470
Histogram of gdr_pin_buffer latency for 8388608 bytes
[370.704000 - 741.408000] 8
[741.408000 - 1112.112000] 14
[1112.112000 - 1482.816000] 69
[1482.816000 - 1853.520000] 6
[1853.520000 - 2224.224000] 0
[2224.224000 - 2594.928000] 2
[2594.928000 - 2965.632000] 0
[2965.632000 - 3336.336000] 0
[3336.336000 - 3707.040000] 0
[3707.040000 - 4077.744000] 0
Size(B) pin.Time(us) map.Time(us) get_info.Time(us) unmap.Time(us) unpin.Time(us)
16777216 396.702820 25.480310 0.573010 54.088660 209.703560
Histogram of gdr_pin_buffer latency for 16777216 bytes
[379.205000 - 758.410000] 72
[758.410000 - 1137.615000] 20
[1137.615000 - 1516.820000] 5
[1516.820000 - 1896.025000] 1
[1896.025000 - 2275.230000] 1
[2275.230000 - 2654.435000] 0
[2654.435000 - 3033.640000] 0
[3033.640000 - 3412.845000] 0
[3412.845000 - 3792.050000] 0
[3792.050000 - 4171.255000] 0
closing gdrdrv
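As a rough cross-check of the logs above, the gdrcopy_copylat latencies can be converted to throughput and compared with the gdrcopy_copybw figures for the same 131072-byte size. The sketch below assumes that "MB" in the copybw output means 2**20 bytes; that is inferred from the numbers lining up, not confirmed from the test source. The helper name `throughput_mib_s` is made up for this illustration.

```python
# Effective throughput implied by rows of the gdrcopy_copylat output.
# Assumption: 1 MB = 2**20 bytes (inferred from the logs, not from the source).
MIB = 1 << 20

# (size in bytes, average time in microseconds), copied from the log above
to_mapping = [(131072, 14.4026), (16777216, 1851.2054)]
from_mapping = [(131072, 323.6429), (16777216, 57983.8880)]


def throughput_mib_s(size_bytes, avg_us):
    """Convert one latency row (bytes, microseconds) to MiB/s."""
    return (size_bytes / MIB) / (avg_us * 1e-6)


for name, rows in (("to_mapping", to_mapping), ("from_mapping", from_mapping)):
    for size, us in rows:
        rate = throughput_mib_s(size, us)
        print(f"gdr_copy_{name} {size:>9} B: {rate:9.1f} MiB/s")
```

For 131072 bytes this gives a write rate close to the 8680.15MB/s reported by gdrcopy_copybw and a read rate in the same range as the reported 379.824MB/s, so the two tests look mutually consistent on this machine.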
Hi @osayamenja, Did you have
Hey @pakmarkthub thanks for the quick response! I installed gdrcopy using the deb packages, so I am not using
If you use the deb packages, the binary should be compiled with the correct flags. Some requests / questions:
vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Just in case you need these.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
nvidia-smi
Fri Jun 28 03:08:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000001:00:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 103MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2-32GB On | 00000002:00:00.0 Off | 0 |
| N/A 41C P0 43W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2-32GB On | 00000003:00:00.0 Off | 0 |
| N/A 38C P0 42W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2-32GB On | 00000004:00:00.0 Off | 0 |
| N/A 39C P0 42W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2-32GB On | 00000005:00:00.0 Off | 0 |
| N/A 37C P0 41W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2-32GB On | 00000006:00:00.0 Off | 0 |
| N/A 39C P0 42W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2-32GB On | 00000007:00:00.0 Off | 0 |
| N/A 38C P0 42W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2-32GB On | 00000008:00:00.0 Off | 0 |
| N/A 41C P0 42W / 300W | 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1848 G /usr/lib/xorg/Xorg 33MiB |
| 1 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 1848 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Thank you for the additional information. GDRCopy does not rely on CUDA. Can you try compiling from source? You can still use libgdrapi.so as well as the gdrdrv driver from the deb packages. What you need to compile is just the test applications, but it might be easier to compile the whole project.
No. I meant the gdrcopy-tests deb package you are using. For example, you probably downloaded something like https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/ubuntu20_04/x64/gdrcopy-tests_2.4-1_amd64.Ubuntu20_04+cuda12.2.deb. That one is a gdrcopy-tests deb package compiled with CUDA 12.2 on Ubuntu 20.04 for x86-64. This is just an example. Can you check which CUDA version the gdrcopy-tests deb package you are using was compiled with?
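The example file name above encodes the package version, architecture, distribution and CUDA version. A minimal sketch of pulling those fields out follows. The regular expression is only fitted to this one example name, and the function name `describe_package` is made up for illustration; other gdrcopy packages may follow a different naming scheme.

```python
import re
from urllib.parse import unquote, urlparse

# Example URL quoted in the comment above.
URL = (
    "https://developer.download.nvidia.com/compute/redist/gdrcopy/"
    "CUDA%2012.2/ubuntu20_04/x64/"
    "gdrcopy-tests_2.4-1_amd64.Ubuntu20_04+cuda12.2.deb"
)

# Pattern fitted to the example file name only (an assumption, not a spec).
PATTERN = re.compile(
    r"(?P<name>[A-Za-z-]+)_(?P<version>[\d.]+-\d+)_(?P<arch>\w+)"
    r"\.(?P<distro>\w+)\+cuda(?P<cuda>\d+\.\d+)\.deb$"
)


def describe_package(url):
    """Return the fields encoded in the .deb file name, or None if it does not match."""
    filename = unquote(urlparse(url).path).rsplit("/", 1)[-1]
    match = PATTERN.match(filename)
    return match.groupdict() if match else None


info = describe_package(URL)
print(info)
```

The `cuda` field can then be compared with the toolkit and driver versions reported by nvcc --version and nvidia-smi.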
I installed gdrcopy following the README instructions, meaning the script automatically detected my CUDA toolkit and Ubuntu version, which is 20.04. I will try recompiling from source and get back soon. Thanks for your effort and quick response!
Running gdrcopy_pplat fails with Assertion "(cuStreamQuery(0)) == (CUDA_ERROR_NOT_READY)" failed at pplat.cu:257. See complete logs below.
If useful, I have GPUDirectAsync configured and nvidia-peermem activated.