Failed run sanity test in docker container. #224

Open
myanzhang opened this issue Mar 9, 2022 · 4 comments

myanzhang commented Mar 9, 2022

Docker container environment:
OS: CentOS 7.2
CUDA: 11.0
GPU: A100-SXM4-40GB

When I run the sanity test, I get 2 failures, as shown in this screenshot:
[WeCom screenshot 16467983167868]

However, the copybw, copylat, and apiperf tests all pass:
[WeCom screenshot 16467986165047]
[WeCom screenshot 16467987035247]
[WeCom screenshot 16467987369786]

Do you have any suggestions for tracking down the cause?
Thanks a lot for your help!

myanzhang changed the title from "Failed run sanity test." to "Failed run sanity test in docker container." on Mar 9, 2022
@pakmarkthub
Collaborator

Hi @myanzhang ,

Can you run sanity -v and post the output?

@myanzhang
Author

myanzhang commented Mar 9, 2022

[root@ts-6ab12923e4f84b41a1dec977fcf2a978-launcher ~/gdrcopy/tests]# ./sanity -v
Running suite(s): Sanity
&&&& RUNNING basic_cumemalloc
buffer size: 327680
&&&& PASSED basic_cumemalloc
&&&& RUNNING basic_with_tokens
buffer size: 327680
&&&& PASSED basic_with_tokens
&&&& RUNNING basic_unaligned_mapping
First allocation: d_fa=0x7f1263200000, size=4
Second allocation: d_A=0x7f1263220200, size=65540, GPU-page-boundary 0x7f1263220000
d_A is unaligned
Try mapping d_A as is.
Mapping d_A failed as expected.
Align d_A and try mapping it again.
Pin and map aligned address: d_aligned_A=0x7f1263230000, offset=65024, size=516
&&&& PASSED basic_unaligned_mapping
&&&& RUNNING basic_child_thread_pins_buffer_cumemalloc
spawning single child thread
pinning
Assertion "(gdr_pin_buffer(pt->g, pt->d_buf, pt->size, 0, 0, &pt->mh)) == (0)" failed at sanity.cpp:1751
&&&& FAILED basic_child_thread_pins_buffer_cumemalloc
&&&& RUNNING basic_vmmalloc
buffer size: 327680
&&&& PASSED basic_vmmalloc
&&&& RUNNING basic_child_thread_pins_buffer_vmmalloc
spawning single child thread
pinning
Assertion "(gdr_pin_buffer(pt->g, pt->d_buf, pt->size, 0, 0, &pt->mh)) == (0)" failed at sanity.cpp:1751
&&&& FAILED basic_child_thread_pins_buffer_vmmalloc
&&&& RUNNING data_validation_cumemalloc
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size 327669 is not dword aligned, ignoring trailing bytes
unmapping
unpinning
&&&& PASSED data_validation_cumemalloc
&&&& RUNNING data_validation_vmmalloc
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size 327669 is not dword aligned, ignoring trailing bytes
unmapping
unpinning
&&&& PASSED data_validation_vmmalloc
&&&& RUNNING invalidation_access_after_gdr_close_cumemalloc
Mapping bar1
Writing 254 into buf_ptr[0]
Calling gdr_close
Trying to read buf_ptr[0] after gdr_close
Get signal 7 as expected
&&&& PASSED invalidation_access_after_gdr_close_cumemalloc
&&&& RUNNING invalidation_access_after_free_cumemalloc
Mapping bar1
Writing 269 into buf_ptr[0]
Calling gpuMemFree
Trying to read buf_ptr[0] after gpuMemFree
Get signal 7 as expected
&&&& PASSED invalidation_access_after_free_cumemalloc
&&&& RUNNING invalidation_two_mappings_cumemalloc
Mapping bar1
Writing data to both mappings 954 and 955 respectively
Validating that we can read the data back
gpuMemFree and thus destroying the first mapping
Trying to read and validate the data from the second mapping after the first mapping has been destroyed
&&&& PASSED invalidation_two_mappings_cumemalloc
&&&& RUNNING invalidation_fork_access_after_free_cumemalloc
parent: Start
child: Start
child: waiting for cont signal from parent
parent: writing buf_ptr[0] with 689
parent: read buf_ptr[0] before gpuMemFree get 689
parent: calling gpuMemFree
parent: waiting for child write signal
child: receive cont signal 1 from parent
child: writing buf_ptr[0] with 699
child: signal parent that I have written
child: waiting for signal from parent before calling gpuMemFree
parent: trying to read buf_ptr[0]
Get signal 7 as expected
&&&& PASSED invalidation_fork_access_after_free_cumemalloc
&&&& RUNNING invalidation_fork_after_gdr_map_cumemalloc
parent: Start
parent: writing buf_ptr[0] with 557
parent: trying to read buf_ptr[0]
parent: read buf_ptr[0] get 557
parent: signaling child
parent: waiting for child to exit
child: Start
child: waiting for cont signal from parent
child: receive cont signal 1 from parent
child: trying to read buf_ptr[0]
Get signal 11 as expected
parent: trying to read buf_ptr[0] after child exits
parent: read buf_ptr[0] after child exits get 557
&&&& PASSED invalidation_fork_after_gdr_map_cumemalloc
&&&& RUNNING invalidation_fork_child_gdr_map_parent_cumemalloc
parent: Start
child: Start
child: attempting to gdr_map parent's pinned GPU memory
child: cannot do gdr_map as expected
&&&& PASSED invalidation_fork_child_gdr_map_parent_cumemalloc
&&&& RUNNING invalidation_fork_map_and_free_cumemalloc
parent: Start
child: Start
child: writing buf_ptr[0] with 305
child: calling gpuMemFree
child: signal parent that I have called gpuMemFree
parent: writing buf_ptr[0] with 305
parent: waiting for signal from child
parent: received cont signal 1 from child
parent: trying to read buf_ptr[0]
parent: read buf_ptr[0] get 305
&&&& PASSED invalidation_fork_map_and_free_cumemalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
parent: Start
child: Start
child: Receiving fd from parent via unix socket
parent: Calling gdr_open
parent: Extracted fd from gdr_t got fd 4
parent: Sending fd to child via unix socket
parent: Waiting for child to finish
child: Got fd 5
child: Converting fd to gdr_t
child: Trying to do gdr_pin_buffer with the received fd
child: Cannot do gdr_pin_buffer with the received fd as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
parent: Start
child: Start
child: Receiving fd from parent via unix socket
parent: Calling gdr_open
parent: Calling gdr_pin_buffer
parent: Extracted fd from gdr_t got fd 8
parent: Sending fd to child via unix socket
parent: Extracted gdr_memh_t from gdr_mh_t got handle 0x0
parent: Sending gdr_memh_t to child
parent: Waiting for child to finish
child: Got fd 9
child: Converting fd to gdr_t
child: Receiving gdr_memh_t from parent
child: Got handle 0x0
child: Converting gdr_memh_t to gdr_mh_t
child: Attempting gdr_map
child: Cannot do gdr_map as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
&&&& RUNNING invalidation_fork_child_gdr_pin_parent_with_tokens
parent: Start
child: Start
parent: CUDA generated tokens.p2pToken 0, tokens.vaSpaceToken 65024
child: Received from parent tokens.p2pToken 0, tokens.vaSpaceToken 65024
&&&& PASSED invalidation_fork_child_gdr_pin_parent_with_tokens
&&&& RUNNING invalidation_access_after_gdr_close_vmmalloc
Mapping bar1
Writing 646 into buf_ptr[0]
Calling gdr_close
Trying to read buf_ptr[0] after gdr_close
Get signal 7 as expected
&&&& PASSED invalidation_access_after_gdr_close_vmmalloc
&&&& RUNNING invalidation_access_after_free_vmmalloc
Mapping bar1
Writing 766 into buf_ptr[0]
Calling gpuMemFree
Trying to read buf_ptr[0] after gpuMemFree
Get signal 7 as expected
&&&& PASSED invalidation_access_after_free_vmmalloc
&&&& RUNNING invalidation_two_mappings_vmmalloc
Mapping bar1
Writing data to both mappings 32 and 33 respectively
Validating that we can read the data back
gpuMemFree and thus destroying the first mapping
Trying to read and validate the data from the second mapping after the first mapping has been destroyed
&&&& PASSED invalidation_two_mappings_vmmalloc
&&&& RUNNING invalidation_fork_access_after_free_vmmalloc
parent: Start
child: Start
child: waiting for cont signal from parent
parent: writing buf_ptr[0] with 310
parent: read buf_ptr[0] before gpuMemFree get 310
parent: calling gpuMemFree
parent: waiting for child write signal
child: receive cont signal 1 from parent
child: writing buf_ptr[0] with 320
child: signal parent that I have written
child: waiting for signal from parent before calling gpuMemFree
parent: trying to read buf_ptr[0]
Get signal 7 as expected
&&&& PASSED invalidation_fork_access_after_free_vmmalloc
&&&& RUNNING invalidation_fork_after_gdr_map_vmmalloc
parent: Start
parent: writing buf_ptr[0] with 669
parent: trying to read buf_ptr[0]
parent: read buf_ptr[0] get 669
parent: signaling child
parent: waiting for child to exit
child: Start
child: waiting for cont signal from parent
child: receive cont signal 1 from parent
child: trying to read buf_ptr[0]
Get signal 11 as expected
parent: trying to read buf_ptr[0] after child exits
parent: read buf_ptr[0] after child exits get 669
&&&& PASSED invalidation_fork_after_gdr_map_vmmalloc
&&&& RUNNING invalidation_fork_child_gdr_map_parent_vmmalloc
parent: Start
child: Start
child: attempting to gdr_map parent's pinned GPU memory
child: cannot do gdr_map as expected
&&&& PASSED invalidation_fork_child_gdr_map_parent_vmmalloc
&&&& RUNNING invalidation_fork_map_and_free_vmmalloc
parent: Start
child: Start
child: writing buf_ptr[0] with 757
child: calling gpuMemFree
child: signal parent that I have called gpuMemFree
parent: writing buf_ptr[0] with 387
parent: waiting for signal from child
parent: received cont signal 1 from child
parent: trying to read buf_ptr[0]
parent: read buf_ptr[0] get 387
&&&& PASSED invalidation_fork_map_and_free_vmmalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc
parent: Start
child: Start
parent: Calling gdr_open
parent: Extracted fd from gdr_t got fd 4
parent: Sending fd to child via unix socket
parent: Waiting for child to finish
child: Receiving fd from parent via unix socket
child: Got fd 5
child: Converting fd to gdr_t
child: Trying to do gdr_pin_buffer with the received fd
child: Cannot do gdr_pin_buffer with the received fd as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
parent: Start
child: Start
parent: Calling gdr_open
parent: Calling gdr_pin_buffer
parent: Extracted fd from gdr_t got fd 8
parent: Sending fd to child via unix socket
parent: Extracted gdr_memh_t from gdr_mh_t got handle 0x0
parent: Sending gdr_memh_t to child
parent: Waiting for child to finish
child: Receiving fd from parent via unix socket
child: Got fd 9
child: Converting fd to gdr_t
child: Receiving gdr_memh_t from parent
child: Got handle 0x0
child: Converting gdr_memh_t to gdr_mh_t
child: Attempting gdr_map
child: Cannot do gdr_map as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
92%: Checks: 27, Failures: 2, Errors: 0
sanity.cpp:1856:F:Basic:basic_child_thread_pins_buffer_cumemalloc:0: Failed
sanity.cpp:1862:F:Basic:basic_child_thread_pins_buffer_vmmalloc:0: Failed

@pakmarkthub Here is the output info, thanks!
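
For readers triaging a long log like the one above, the per-test status lines can be pulled out mechanically. The following is only a minimal Python sketch, assuming the `&&&& PASSED <name>` / `&&&& FAILED <name>` line format seen in this particular output; the function name `summarize` is hypothetical and is not part of gdrcopy.

```python
import re

# Matches the per-test status lines seen in the `sanity -v` output above,
# e.g. "&&&& FAILED basic_child_thread_pins_buffer_cumemalloc".
STATUS_RE = re.compile(r"^&&&& (RUNNING|PASSED|FAILED) (\S+)$")


def summarize(log_text):
    """Return (passed, failed) lists of test names found in the log text."""
    passed, failed = [], []
    for line in log_text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match is None:
            continue  # ordinary test output, not a status line
        status, name = match.groups()
        if status == "PASSED":
            passed.append(name)
        elif status == "FAILED":
            failed.append(name)
    return passed, failed
```

Calling `summarize()` on the saved log text returns the two name lists, so the failing tests can be read off without scrolling through the full output.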

PS: Here is how I installed gdrcopy and ran the tests:

  1. Install gdrcopy on the physical machine and map gdrdrv into the container. The container shows the following:
    [screenshot 1646802825(1)]
  2. Install the gdrcopy library in the container:
    cd ~/gdrcopy/ && sudo make lib_install
    Afterwards, the .so files are in the /usr/local/lib directory:
    [WeCom screenshot 16468030325682]
  3. Build the sanity test:
    cd ~/gdrcopy/tests && make
    Then run the test.

@pakmarkthub
Collaborator

Thank you for the info. Which gdrdrv version are you using? Also, can you confirm that libgdrapi is version 2.3?

Based on your post, I guess that you have sudo access to the physical machine. Could you do the following?

  1. On your physical machine, change this line https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 to sudo /sbin/insmod src/gdrdrv/gdrdrv.ko dbg_enabled=1 info_enabled=1.
  2. Do make driver && sudo ./insmod.sh.
  3. In your container, run ./sanity -v.
  4. Can you post the output from dmesg from your physical machine here? I want to see the output during the sanity run. You don't need to post the whole log if you don't want to.
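
Since only the part of the kernel log produced during the sanity run is of interest, the dmesg output can be trimmed before posting. The following is a minimal Python sketch, assuming that the driver's messages contain the string "gdrdrv" (this thread does not confirm the exact message format) and that dmesg is readable by the current user; the helper names `read_dmesg` and `filter_driver_lines` are hypothetical.

```python
import subprocess


def read_dmesg():
    """Return the current kernel log as text (may require root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def filter_driver_lines(dmesg_text, needle="gdrdrv"):
    """Keep only the kernel log lines that mention the driver name.

    Assumption: messages from the driver include the substring `needle`.
    """
    return [line for line in dmesg_text.splitlines() if needle in line]
```

On the physical machine, `filter_driver_lines(read_dmesg())` run right after the sanity test would give the subset of lines to post.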

@myanzhang
Author

@pakmarkthub Thanks, I will reply when I confirm.
