[Bug] Hanging bootstrap communication #92

Closed
chhwang opened this issue Jun 2, 2023 · 4 comments

@chhwang
Contributor

chhwang commented Jun 2, 2023

mscclpp-test AllGather fails in the current main (9cee6c4), during void AllGatherTestEngine::setupConnections().

@Binyang2014
Contributor

Uploaded debug log:

mscclpp/debug-out

Lines 4049 to 4053 in 0ac1476

[1,8]<stdout>:mscclpp-000002:2098:2098 [0] MSCCLPP INFO rank 8 - unexpected message from 3 with tag 0 size 44
[1,8]<stdout>:mscclpp-000002:2098:2098 [0] MSCCLPP INFO IBConnection endSetup: recv qp info, size 44, remote rank 3, tag 0, qpInfo: port 0, lid 49, qpn 0, spn 0, link layer 0
[1,10]<stdout>:mscclpp-000002:2100:2100 [2] MSCCLPP INFO rank 10 - try to get message from 6 with tag 0 size 8
[1,10]<stdout>:mscclpp-000002:2100:2100 [2] MSCCLPP INFO rank 10 - unexpected message from 6 with tag 0 size 8
[1,10]<stdout>:mscclpp-000002:2100:2100 [2] MSCCLPP INFO rank 10 - try to get message from 6 with tag 1 size 49

At line 4050, the qp info received by rank 8 differs from what the sender (rank 3) logged at line 733:

mscclpp/debug-out

Lines 732 to 735 in 0ac1476

[1,3]<stdout>:mscclpp-000001:53149:53149 [3] MSCCLPP INFO rank 3 - send message to 7 with tag 1 size 125
[1,3]<stdout>:mscclpp-000001:53149:53149 [3] MSCCLPP INFO IBConnection beginSetup: send qp info, size 44, remote rank 8, tag 0, qpInfo: port 1, lid 2732, qpn 1111, spn 0, link layer 1
[1,3]<stdout>:mscclpp-000001:53149:53149 [3] 84.337740 mscclppSocketConnect:651 MSCCLPP TRACE Connecting to socket 172.16.3.250<56111>
[1,7]<stdout>:mscclpp-000001:53161:53161 [7] MSCCLPP INFO rank 7 - send message to 12 with tag 1 size 49

I suspect the problem is that we use a different connection for each send. It also seems that the server is not guaranteed to accept the sockets in the same order in which the client opened them.

Refer to https://www.man7.org/linux/man-pages/man2/accept.2.html
accept extracts the first connection request on the queue of pending connections for the listening socket. However, how connections are ordered in that queue is not clear. It seems that a connection is moved to the queue when the server receives the ACK from the client side. On the client side, though, connect returns once the SYN+ACK from the server has been received, which means we don't know when the client will send its ACK to the server. This seems to depend on the kernel implementation and the hardware driver.

To avoid this, we should reuse the same connection for every send. The order is then guaranteed by the TCP protocol.
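
A minimal sketch of the idea, not mscclpp's actual bootstrap code: one TCP connection is accepted once and reused for every message, so the receiver sees messages in the order they were sent. The 12-byte header (rank, tag, size) and the helper names `send_msg`, `recv_msg`, and `run_demo` are assumptions made up for this illustration.

```python
import socket
import struct
import threading

# Hypothetical wire format for this sketch (not mscclpp's real one):
# a 12-byte header of (source rank, tag, payload size), then the payload.
HEADER = struct.Struct("!iiI")


def send_msg(sock, rank, tag, payload):
    """Send one framed message on an already-connected socket."""
    sock.sendall(HEADER.pack(rank, tag, len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; TCP is a byte stream, so recv may return less."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Receive one framed message and return (rank, tag, payload)."""
    rank, tag, size = HEADER.unpack(recv_exact(sock, HEADER.size))
    return rank, tag, recv_exact(sock, size)


def run_demo(num_msgs=100):
    """Send num_msgs messages over ONE connection; return them as received."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []

    def server():
        conn, _ = listener.accept()  # accepted once, reused for every message
        with conn:
            for _ in range(num_msgs):
                received.append(recv_msg(conn))

    t = threading.Thread(target=server)
    t.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        for tag in range(num_msgs):
            send_msg(client, 3, tag, b"x" * (tag % 7))
    t.join()
    listener.close()
    return received


if __name__ == "__main__":
    msgs = run_demo()
    print([tag for _, tag, _ in msgs[:5]])  # [0, 1, 2, 3, 4]
```

Because every message travels on the same byte stream, the receiver does not depend on the order in which the listening socket hands out pending connections. The explicit size in the header is what lets it split the stream back into messages.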

@chhwang
Contributor Author

chhwang commented Jun 5, 2023

Added a failing test case that is probably related to this issue: https://github.com/microsoft/mscclpp/blob/chhwang/ut/test/mp_unit_tests.cu#L172-L184

chhwang changed the title from "[Bug] AllGather test failure" to "[Bug] Hanging bootstrap communication" on Jun 6, 2023
@chhwang
Contributor Author

chhwang commented Jun 12, 2023

@Binyang2014 Do you think #98 solves this issue?

@Binyang2014
Contributor

Tested more than 10 times. The bootstrap works well. We can close this issue.

chhwang closed this as completed on Jun 12, 2023