prov/ucx: fi_rdm_tagged_peek cleanup race condition #10126

Open
zachdworkin opened this issue Jun 25, 2024 · 1 comment

Comments

@zachdworkin
Contributor

Describe the bug
fi_rdm_tagged_peek intermittently segfaults during cleanup, while closing the endpoint. The failure appears to be a race condition.

To Reproduce
Build with UCX
server_cmd: fi_rdm_tagged_peek -p ucx -E
client_cmd: fi_rdm_tagged_peek -p ucx -E server_address

Expected behavior
The test passes successfully.

Output
Server output:
server_cmd: /path_to_fabtests_install/fi_rdm_tagged_peek -p "ucx" -E
server_stdout: |
Sending 10 tagged messages
Waiting for messages to complete
[node:3176869:0:3176869] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:3176869) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000033210 ucp_ep_destroy_base() ???:0
2 0x000000000004b3ee ucp_worker_discard_uct_ep_progress() ???:0
3 0x000000000004b4b5 ucp_worker_destroy() ???:0
4 0x00000000000ca7fa ucx_ep_close() ucx_ep.c:0
5 0x0000000000404081 fi_close() /path_to_libfabric_install/include/rdma/fabric.h:632
6 0x0000000000404081 ft_close_fids() /path_to_libfabric_source/fabtests/common/shared.c:1792
7 0x0000000000404b6a ft_free_res() /path_to_libfabric_source/fabtests/common/shared.c:1862
8 0x0000000000401bfa main() /path_to_libfabric_source/fabtests/functional/rdm_tagged_peek.c:364
9 0x0000000000401bfa main() /path_to_libfabric_source/fabtests/functional/rdm_tagged_peek.c:365
10 0x000000000003ad85 __libc_start_main() ???:0
11 0x000000000040203e _start() ???:0

Client output:
client_cmd: /path_to_fabtests_install/fi_rdm_tagged_peek -p "ucx" -E server_address
client_stdout: |
Peek for a bad msg
Peek w/ claim for a bad msg
Peek msg 1
Receive msg 1
Peek w/ claim msg 2
Receive claimed msg 2
Peek & discard msg 3
Checking to see if msg 3 was discarded
Peek w/ claim msg 4
Claim and discard msg 4
Receive msg 5
Receive msg 6
Receive msg 10
Receive msg 9
Receive msg 8
Receive msg 7

Environment:
rocky 8.7

Additional context
The failure is intermittent, which is consistent with a race condition. There is no known way to reproduce it 100% of the time.
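
Because there is no deterministic reproducer, one way to hit the crash is to rerun the server/client pair in a loop until a run fails. The following is only a rough sketch, not part of fabtests: the helper names `repeat_until_fail` and `run_pair`, the `SERVER_ADDRESS` variable, and the one-second startup delay are all assumptions made for illustration. It also assumes both processes can be started from the same shell and that `fi_rdm_tagged_peek` is on `PATH`. Only the `-p ucx -E` options come from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to MAX_RUNS times and stop at the
# first failing run.
# Usage: repeat_until_fail MAX_RUNS COMMAND [ARGS...]
repeat_until_fail() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        if ! "$@"; then
            echo "run $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}

# Hypothetical helper: one server/client iteration of the reproducer.
# Assumes SERVER_ADDRESS is set to an address the client can reach.
run_pair() {
    fi_rdm_tagged_peek -p ucx -E &
    server_pid=$!
    sleep 1   # assumption: give the server time to start listening
    fi_rdm_tagged_peek -p ucx -E "$SERVER_ADDRESS"
    client_rc=$?
    wait "$server_pid"   # a server killed by SIGSEGV yields a nonzero status
    server_rc=$?
    [ "$client_rc" -eq 0 ] && [ "$server_rc" -eq 0 ]
}
```

With those functions defined, something like `SERVER_ADDRESS=server_address repeat_until_fail 100 run_pair` would stop at the first run in which either side exits with a nonzero status.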

@zachdworkin
Contributor Author

Revert #10124's ucx test disable when this is resolved.
