We found a problem when sending a message (by calling fi_tsend) to a failed/killed target process over the libfabric sockets provider.
The call hangs in that case. When the connection fails, fi_tsend() returns “-FI_EAGAIN”, and our code simply retries whenever it sees EAGAIN, which results in an infinite loop.
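As a caller-side workaround, the retry can be bounded so that a dead peer eventually surfaces as an error instead of an endless loop. The following is only a minimal sketch under stated assumptions: `try_send_fn`, `send_with_retry_limit`, and the two stub callbacks are hypothetical names, the callback stands in for a wrapper around fi_tsend(), and `-EAGAIN` stands in for “-FI_EAGAIN”. It is not part of libfabric.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: performs one send attempt and returns 0 on
 * success, -EAGAIN when the caller may retry, or another negative errno. */
typedef int (*try_send_fn)(void *arg);

/* Retry on -EAGAIN at most max_attempts times, then report -ETIMEDOUT so
 * the caller can treat the peer as unreachable instead of spinning. */
static int send_with_retry_limit(try_send_fn try_send, void *arg,
                                 int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int rc = try_send(arg);
        if (rc != -EAGAIN)
            return rc; /* success or a hard error: stop retrying */
        /* A real caller would typically drive progress here, for example
         * by reading its completion queue, before the next attempt. */
    }
    return -ETIMEDOUT;
}

/* Stub that simulates a dead peer: always asks the caller to retry.
 * arg points to an int that counts the attempts. */
static int always_busy(void *arg)
{
    (*(int *)arg)++;
    return -EAGAIN;
}

/* Stub that simulates a peer that becomes ready on the third attempt. */
static int ready_on_third(void *arg)
{
    int *calls = arg;
    (*calls)++;
    return (*calls < 3) ? -EAGAIN : 0;
}
```

With `always_busy` the helper gives up after `max_attempts` calls and returns `-ETIMEDOUT`; with `ready_on_third` it returns 0 after three calls. The limit itself is an arbitrary choice here, and a time-based deadline would work just as well.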
The back trace is:
#0 0x00007f1a0e343efd in nanosleep () from /lib64/libc.so.6
#1 0x00007f1a0e343d94 in sleep () from /lib64/libc.so.6
#2 0x00007f1a0cc34df7 in sock_ep_connect (ep_attr=0x18e9830, index=2) at prov/sockets/src/sock_conn.c:496
#3 0x00007f1a0cc21e4d in sock_ep_get_conn (attr=0x18e9830, tx_ctx=0x18ea080, index=2, pconn=0x7ffe324c4688) at prov/sockets/src/sock_ep.c:1819
#4 0x00007f1a0cc3669b in sock_ep_tsendmsg (ep=0x18e9740, msg=0x7ffe324c4740, flags=2305843009213693952) at prov/sockets/src/sock_msg.c:558
#5 0x00007f1a0cc36a2f in sock_ep_tsend (ep=0x18e9740, buf=0x194d000, len=170, desc=0x194a340, dest_addr=2, tag=4, context=0x194c8c0) at prov/sockets/src/sock_msg.c:646
#6 0x00007f1a0ea60fca in fi_tsend (context=0x194c8c0, tag=4, dest_addr=, desc=0x194a340, len=170, buf=0x194d000, ep=0x18e9740)
at /home/xliu9/src/daos_m/install/include/rdma/fi_tagged.h:116
sock_ep_connect retried 5 times (sleeping 10 seconds each time) and then returned NULL to sock_ep_get_conn(), which in turn returned “-FI_EAGAIN” to the user (because errno == EINPROGRESS).
Two questions related to this issue:
First, when the target is already dead or unreachable, could the send API return a more appropriate error code than “-FI_EAGAIN”? EAGAIN seems to mean that the user is free to simply retry.
Second, there is a sleep(10) in the retry path of sock_ep_connect:
retry:
do_retry--;
sleep(10);
if (!do_retry)
goto err;
Could this be refined to remove the long sleep()? It causes a huge delay when the user tries to connect to a peer that is not reachable.
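To illustrate the kind of change being asked about, one common alternative to a fixed sleep() is a non-blocking connect() that waits with poll() for at most a caller-chosen timeout and then reports the socket's actual error. The sketch below is an assumption about one possible approach, not the provider's code: `connect_with_timeout` and `probe_ipv4` are hypothetical helpers written only for this example.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: connect fd to addr, waiting at most timeout_ms.
 * Returns 0 on success or a negative errno value (for example
 * -ECONNREFUSED or -ETIMEDOUT). Leaves fd in non-blocking mode. */
static int connect_with_timeout(int fd, const struct sockaddr *addr,
                                socklen_t addrlen, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -errno;

    if (connect(fd, addr, addrlen) == 0)
        return 0;                 /* connected immediately */
    if (errno != EINPROGRESS)
        return -errno;            /* hard failure, no point in waiting */

    /* Wait until the socket becomes writable or the timeout expires.
     * Simplification: the full timeout restarts if poll() is interrupted. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int n;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    if (n < 0)
        return -errno;
    if (n == 0)
        return -ETIMEDOUT;

    /* Writable does not imply success: fetch the real connect result. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -errno;
    return err ? -err : 0;
}

/* Hypothetical convenience wrapper: try an IPv4 TCP connection to ip:port
 * and close the socket again, returning the connect result. */
static int probe_ipv4(const char *ip, uint16_t port, int timeout_ms)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -EINVAL;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -errno;
    int rc = connect_with_timeout(fd, (struct sockaddr *)&sa, sizeof(sa),
                                  timeout_ms);
    close(fd);
    return rc;
}
```

With this shape, an unreachable or dead peer yields a specific negative errno within the chosen timeout, which a caller could then map to something other than a retryable code.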