[Bug] Hanging bootstrap communication #92
Comments
Uploaded debug log: Lines 4049 to 4053 in 0ac1476
At line L4050, the received QP info differs from what the sender sent at L733: Lines 732 to 735 in 0ac1476
I suspect the problem is caused by using a different connection for each send. It also seems that the server is not guaranteed to accept the sockets in the same order in which the client opened them; refer to https://www.man7.org/linux/man-pages/man2/accept.2.html. To avoid this, we should reuse the same connection for every send, since TCP guarantees ordering within a single connection.
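To illustrate the proposed fix, here is a minimal sketch of sending several messages over one reused TCP connection with length-prefixed framing. It is not the mscclpp bootstrap code: the helper names (`sendAll`, `recvAll`, `sendMsg`, `recvMsg`, `roundTripOverOneConnection`) and the loopback setup are assumptions made for the example, and error handling is deliberately minimal (sockets are not closed on the error paths).

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Write exactly len bytes, looping over short writes.
static void sendAll(int fd, const void* buf, size_t len) {
  const char* p = static_cast<const char*>(buf);
  while (len > 0) {
    ssize_t n = ::send(fd, p, len, 0);
    if (n <= 0) throw std::runtime_error("send failed");
    p += n;
    len -= static_cast<size_t>(n);
  }
}

// Read exactly len bytes, looping over short reads.
static void recvAll(int fd, void* buf, size_t len) {
  char* p = static_cast<char*>(buf);
  while (len > 0) {
    ssize_t n = ::recv(fd, p, len, 0);
    if (n <= 0) throw std::runtime_error("recv failed");
    p += n;
    len -= static_cast<size_t>(n);
  }
}

// Frame each message as a 4-byte big-endian length followed by the payload.
static void sendMsg(int fd, const std::string& msg) {
  uint32_t len = htonl(static_cast<uint32_t>(msg.size()));
  sendAll(fd, &len, sizeof(len));
  sendAll(fd, msg.data(), msg.size());
}

static std::string recvMsg(int fd) {
  uint32_t len = 0;
  recvAll(fd, &len, sizeof(len));
  std::string msg(ntohl(len), '\0');
  if (!msg.empty()) recvAll(fd, &msg[0], msg.size());
  return msg;
}

// Hypothetical demo: send every message over ONE loopback TCP connection
// and return the messages in the order the receiving side reads them.
// Messages are assumed small enough to fit in the socket buffers, since
// sender and receiver run in the same thread here.
std::vector<std::string> roundTripOverOneConnection(
    const std::vector<std::string>& msgs) {
  int listener = ::socket(AF_INET, SOCK_STREAM, 0);
  if (listener < 0) throw std::runtime_error("socket failed");

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a free port
  socklen_t alen = sizeof(addr);
  if (::bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      ::listen(listener, 1) != 0 ||
      ::getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &alen) != 0) {
    throw std::runtime_error("listener setup failed");
  }

  // Connect once, accept once; every message below reuses this connection.
  int client = ::socket(AF_INET, SOCK_STREAM, 0);
  if (client < 0 ||
      ::connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    throw std::runtime_error("connect failed");
  }
  int server = ::accept(listener, nullptr, nullptr);
  if (server < 0) throw std::runtime_error("accept failed");

  for (const std::string& m : msgs) sendMsg(client, m);

  std::vector<std::string> received;
  for (size_t i = 0; i < msgs.size(); ++i) received.push_back(recvMsg(server));

  ::close(client);
  ::close(server);
  ::close(listener);
  return received;
}
```

Because all frames travel over a single byte stream, the receiver reads them in the order they were sent. With one connection per send, the receiver instead depends on the order in which `accept` returns the connections, which is the ordering this comment suspects is not reliable.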
Added a failing test case that is probably related to this issue. https://github.com/microsoft/mscclpp/blob/chhwang/ut/test/mp_unit_tests.cu#L172-L184
@Binyang2014 Do you think #98 solves this issue?
Tested more than 10 times. The bootstrap works well. We can close this issue.
mscclpp-test AllGather fails in the current main (9cee6c4), during void AllGatherTestEngine::setupConnections().