Skip dead nodes to avoid connection timeout. #4154
Conversation
ericl left a comment
Looks great. @hartikainen could you try this out?
Are there instructions somewhere for how to build ray with the new bazel stuff?
Nevermind, after installing bazel,
@hartikainen Does this fix your case? @ericl It looks like there is a test hanging. I need to take a look later.
This does not seem to fix my case. The autoscaling still stops and I see this on the newly started worker:
@hartikainen sorry for that. I will reproduce it locally and try to fix it.
According to https://github.com/ray-project/ray/blob/master/src/ray/raylet/raylet.cc#L71,
@hartikainen Please help to test this PR using the old way
Force-pushed 2c711a9 to 97c001e
I currently get the following: One thing I'm not sure about is what signal is used when I stop/preempt a node. |
python/ray/tests/test_basic.py
Outdated
I think this should go in python/ray/tests/test_multi_node_2.py. Did you add the allow_graceful flag for consistency or for some other reason? It doesn't seem to be used anywhere.
Yes, this flag is only used for this test, which reproduces the problem @hartikainen mentioned. I can check the raylet log to see whether there is a `Failed to connect to client` message. However, this requires `allow_graceful=True` to send SIGTERM to the raylet instead of SIGKILL. Actually, even without `allow_graceful=True` the test still passes, due to #2905. The option is only there for reproducing the issue, and I can remove the flag after this problem is fixed.
Do you have any comments on adding a callback that closes the main service, to guarantee the closing entry is added to the GCS?
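A rough sketch of the reproduction described here, assuming the `ray.tests.cluster_utils.Cluster` helper and the `allow_graceful` flag discussed in this thread (names and arguments are illustrative, not the PR's actual test):

```python
import ray
from ray.tests.cluster_utils import Cluster

# Start a small local cluster: one head node plus one worker node.
cluster = Cluster(initialize_head=True, connect=True)
worker = cluster.add_node(num_cpus=1)
cluster.wait_for_nodes()

# allow_graceful=True is the flag discussed above: the raylet receives SIGTERM
# instead of SIGKILL, so it can write its closing entry to the GCS on exit.
cluster.remove_node(worker, allow_graceful=True)

# A node joining now should skip the removed node instead of blocking on a
# connection timeout; the raylet log can then be checked for the
# "Failed to connect to client" message mentioned above.
cluster.add_node(num_cpus=1)
cluster.wait_for_nodes()

ray.shutdown()
cluster.shutdown()
```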
@robertnishihara I think the `allow_graceful` option of `ray.tests.cluster_utils` will be kept to make the test case work.
@hartikainen It looks like there is a duplicated removal message. It probably comes from
@hartikainen I modified the
Looks like the stopped nodes now get cleaned up correctly and this solves the immediate issue: I do see some weird behavior with the autoscaler when a node gets preempted, though. It took almost 30 minutes for the autoscaling to resume after nodes were stopped. Several trials (7 in total) that were still running on healthy nodes failed 1-2 times for some reason. In the tune trial logs, I see: This problem is probably unrelated to this pull request, however. Thanks @guoyuhong!
@hartikainen Thanks for the confirmation. I will change this PR according to the comments.
I think it would also make sense to write tests for the two failed cases above.
Force-pushed dd96b66 to d2371c6
The PR is ready for review.
ericl left a comment
Just one question
src/ray/raylet/node_manager.cc
Outdated
This can happen if there is a race condition, right? Maybe we could just say "Failed to connect to ray node <client_id>. This may be since the node was recently removed."
I have changed it according to your comment.
Yes, there could be two cases (a rough sketch of the first case follows this list):
- If the raylet crashes or is killed by SIGKILL and the heartbeat timeout (30 seconds) has not elapsed yet, this can happen. This is likely the most common case, since 30 seconds is quite long.
- Race condition: the new node has just received the client information and is trying to connect to the existing nodes when the disconnect message arrives. This is very rare, since the time window is very short.
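For illustration, the first case could be reproduced along these lines (again assuming the `cluster_utils` helper; this is a sketch, not a test from this PR):

```python
import ray
from ray.tests.cluster_utils import Cluster

cluster = Cluster(initialize_head=True, connect=True)
worker = cluster.add_node(num_cpus=1)
cluster.wait_for_nodes()

# Unlike the graceful case sketched earlier, plain removal kills the raylet
# with SIGKILL, so no disconnect entry reaches the GCS and the node is only
# marked dead once the 30-second heartbeat timeout fires.
cluster.remove_node(worker)

# A node joining within that window still sees the dead client in the client
# table; without the "skip dead nodes" change its raylet would stall trying
# to connect to the dead raylet.
cluster.add_node(num_cpus=1)

ray.shutdown()
cluster.shutdown()
```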
@guoyuhong Do you know why the Java test failed?
Jenkins retest this please
@pcmoritz It looks like the callback function is not called and the raylet socket file is not deleted. I added code to handle this.
What do these changes do?
Skip dead nodes when a new node connects to the cluster.
The time sequence is not as expected because there is a callback inside another callback.
In `Raylet::RegisterGcs`, `client_table().Connect` is called first, and `node_manager_.RegisterGcs()`, which contains `client_table().RegisterClientAddedCallback`, is called later. If the callback function `notification_callback` in `client_table().Connect` finishes before `client_table().RegisterClientAddedCallback`, the logic is correct. However, this function is called in the callback function of `Append`. That is to say, `client_table().RegisterClientAddedCallback` could be called ahead of `notification_callback`. In this case `HandleNotification` will call `client_added_callback_` to connect to the dead node unexpectedly (a toy sketch of the intended behavior follows the related issue below).
Related issue number
#4140
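To make the intended behavior concrete, here is a small self-contained toy in Python (made-up names; not Ray's actual C++ raylet code) of the idea behind skipping dead nodes: when a freshly announced client cannot be reached, log the warning suggested in the review above and move on instead of aborting.

```python
import socket

def try_connect(address, timeout_s=0.5):
    """Toy stand-in for connecting to another raylet; returns a socket or None."""
    host, port = address.rsplit(":", 1)
    try:
        return socket.create_connection((host, int(port)), timeout=timeout_s)
    except OSError:
        return None

def client_added(client_id, address):
    """Skip dead nodes: warn and continue when the announced client is gone,
    instead of treating the failed connection as a fatal error."""
    conn = try_connect(address)
    if conn is None:
        print("Failed to connect to ray node {}. This may be since the node "
              "was recently removed.".format(client_id))
        return
    print("Connected to ray node {}".format(client_id))
    conn.close()

# Nothing is listening on this port, so the dead node is skipped with a warning.
client_added("node-B", "127.0.0.1:65001")
```

In the real raylet, the analogous change would live in the client-added handling in `src/ray/raylet/node_manager.cc`, which is where the review comments above were made.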