
fabtests: Synchronize on Initialization #10108

Merged · 2 commits merged into ofiwg:main on Jun 25, 2024
Conversation

zachdworkin
Contributor

@zachdworkin zachdworkin commented Jun 19, 2024

fabtests: Synchronize on Initialization
fabtests/functional: Add manual init sync to fi_rdm_multiclient

Some providers (e.g., verbs ud) may require the server to be fully initialized before the client process calls getinfo with the server address. Otherwise the fi_info call fails during initialization because it cannot find the name on the server, producing a No Data Available error. This is seen most often when a socket (usually oob [out of band]) is initialized before getinfo in the ft_init_fabric sequence. Adding a sync, only when an oob socket has been initialized, orders initialization correctly and prevents this failure.

The synchronization ensures the client starts getinfo only after the server is done initializing everything.

The fi_rdm_multiclient test needs a manual sync in its client startup since it does not follow the normal codepath through ft_init_fabric like the server and other tests do. This sync happens only for the first client that connects, because it is only needed to give the server enough time to spin up all of its resources; later clients do not need it since the server has already started those resources.

@zachdworkin
Contributor Author

bot:aws:retest

@zachdworkin
Contributor Author

@shijin-aws can you please tell me why the first run of AWS CI failed?

@shijin-aws
Contributor

shijin-aws commented Jun 20, 2024

We have a multi-client test involving fi_rdm that failed:

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.8.166 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_stdout:
libfabric:69633:1718843593::efa:ep_ctrl:efa_rdm_peer_destruct():56<warn> Closing EP with unacked CONNREQs in flight
Sending message...
Send completion received

client returncode: 0

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_stdout:

client returncode: 107

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -C 2 -p efa -E=9228'"'"''
server_stdout:

client_0_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.8.166 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_0_stdout:
libfabric:69633:1718843593::efa:ep_ctrl:efa_rdm_peer_destruct():56<warn> Closing EP with unacked CONNREQs in flight
Sending message...
Send completion received

client_1_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.13.34 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10108/install/fabtests/bin/fi_rdm -p efa -E=9228 172.31.13.34'"'"''
client_1_stdout:

@shijin-aws
Contributor

I think the failure is related. Need to look into it further.

@zachdworkin
Contributor Author

I agree that these changes broke the test. I think it might need some manual progression like how I had to add it to fi_rdm_multi_client. It looks like this test is just doing fi_rdm which goes through the main codepath and will have the second client hang while waiting to recv the oob socket from the server (the server only sends it once). Do you have any ideas for how to solve this problem?

@j-xiong
Contributor

j-xiong commented Jun 20, 2024

OOB for multiple clients is handled in ft_accept_next_client, which calls ft_reset_oob(), which in turn calls ft_init_oob(). You can try adding a send inside ft_accept_next_client after ft_reset_oob is called.

Make the socket to synchronize on an argument so that any
socket can be specified for synchronization instead of only
sock.

Signed-off-by: Zach Dworkin <[email protected]>
@zachdworkin
Contributor Author

@j-xiong how is this?

Comment on lines 56 to 59
if (oob_sock >= 0 && !opts.dst_addr) {
ret = ft_sock_sync(oob_sock, 1);
if (ret)
return ret;
}
Contributor

This should be moved into the if block above, otherwise the server would call extra sync for the first client.

@j-xiong
Contributor

j-xiong commented Jun 20, 2024

Looks good now. Let's see if CI would find any issue.

@zachdworkin
Contributor Author

@shijin-aws can you share the AWS CI failure?

@shijin-aws
Contributor

The same test failed again, looking at it now

@shijin-aws
Contributor

shijin-aws commented Jun 21, 2024

The test is run like this.

The server expects to get messages from 2 clients:

fi_rdm -C 2 -p efa -E

Clients 1 and 2 come and go, each doing a ping-pong with the server (just run the following command twice):

fi_rdm -p efa -E <server_ip>

The server is fine when the first client comes and goes.

But when the second client connects, it starts to get the error:

ft_sock_recv(): common/shared.c:3951, ret=-107 (Transport endpoint is not connected)

@shijin-aws
Contributor

shijin-aws commented Jun 21, 2024

I can reproduce the error with the TCP provider as well:

ubuntu@ip-172-31-39-234:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ fi_rdm -C 2 -p tcp -E
Waiting for message from client...
Data check OK
Received data from client: Hello from Client!
ft_sock_recv(): common/shared.c:3951, ret=-107 (Transport endpoint is not connected)

So it looks like the change broke the multi-client mode of fi_rdm.

fabtests/functional: Add manual init sync to fi_rdm_multiclient and fi_rdm

Some providers (e.g., verbs ud) may require the server to be fully initialized
before the client process calls getinfo with the server address.
Otherwise the fi_info call fails during initialization because it cannot
find the name on the server, producing a No Data Available error.
This is seen most often when a socket (usually oob [out of band]) is
initialized before getinfo in the ft_init_fabric sequence. Adding a sync,
only when an oob socket has been initialized, orders initialization
correctly and prevents this failure.

The synchronization ensures the client starts getinfo only after the server
is done initializing everything.

The fi_rdm_multiclient test needs a manual sync in its client startup since it
does not follow the normal codepath through ft_init_fabric like the server and
other tests do. This sync happens only for the first client that connects,
because it is only needed to give the server enough time to spin up all of its
resources; later clients do not need it since the server has already started
those resources.

fi_rdm needs the synchronization added after accepting a new client because the
client will be waiting for a sock_send and the server only does it once during
its initialization. Adding a ft_sock_sync forces the server to do it for each
new client.

Signed-off-by: Zach Dworkin <[email protected]>
@zachdworkin zachdworkin changed the title fabtests: Syncronize on Initialization fabtests: Synchronize on Initialization Jun 25, 2024
@j-xiong j-xiong merged commit 3285b8b into ofiwg:main Jun 25, 2024
13 checks passed
This pull request was closed.
3 participants