-
Notifications
You must be signed in to change notification settings - Fork 7.2k
Fix node manager failure when ClientTable has a disconnected entry. #2905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test FAILed. |
|
Test PASSed. |
|
So just to make sure I understand, previously we would optimistically update resource tracking before we actually verified that the client was connected? And now we would update the resource tracking only after connection check returns True? Is it possible to add a test for this? |
src/ray/raylet/node_manager.cc
Outdated
| TcpConnect(socket, client_info.node_manager_address, client_info.node_manager_port); | ||
| // ClientTable is Log<ActorID, ActorTableData>, which has multiple entries for | ||
| // a disconnected client. The first one has is_insertion=true and the second | ||
| // one has is_insertion=false. We must make sure this is not a close client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this comment. ClientAdded should never be called if is_insertion=true. The case is_insertion=true should be handled by ClientRemoved, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right ClientAdded would never be called with is_insertion=false. The comments is misleading.
| // a disconnected client. The first one has is_insertion=true and the second | ||
| // one has is_insertion=false. We must make sure this is not a close client. | ||
| if (!status.ok()) { | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances would status not be ok? I agree that failing to connect to the remote client shouldn't be a fatal error, but won't things go wrong later if we don't succeed at connecting to it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return here and don't connect, when will the connection be established?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClientTableDataT may come from a disconnected client. The connection should never be established.
src/ray/raylet/node_manager.cc
Outdated
|
|
||
| ResourceSet resources_total(client_data.resources_total_label, | ||
| client_data.resources_total_capacity); | ||
| this->cluster_resource_map_.emplace(client_id, SchedulingResources(resources_total)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid using this (I realize this is copied from above).
|
@robertnishihara @richardliaw By my understanding, when a raylet disconnected, it will write a message to ClientTable with Another way to solve this could be cleaning up both entries for the disconnected client. |
|
Sorry to mis-express the understanding. For a entry with The problem is that, after a raylet client disconnected, it only append an entry with |
|
Test PASSed. |
|
This seems related to #2852 |
|
Test PASSed. |
|
Test PASSed. |
|
@robertnishihara Is this PR OK for you? Is |
|
@guoyuhong I agree it feels fragile. I removed it. I think this is pretty good now. Would you be able to add a test in a follow up PR? E.g., the test should go in
What do you think about something like that? |
src/ray/raylet/node_manager.cc
Outdated
| // The first one has is_insertion=true and the second one has is_insertion=false. | ||
| // When a new raylet starts, ClientAdded will be called with the disconnected client's | ||
| // first entry, which will cause IOError and "Connection refused". | ||
| if (status.code() == StatusCode::IOError) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably preferable to just check !status.ok() like you were doing before
src/ray/raylet/node_manager.cc
Outdated
| RAY_CHECK(status.message() == "Connection refused"); | ||
| RAY_LOG(WARNING) << "ClientAdded: " << status.message() | ||
| << " (conde=" << static_cast<int>(status.code()) << ")" | ||
| << " which may be caused by a disconnected client."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAY_LOG(WARNING) << "Failed to connect to client " << client_id
<< " in ClientAdded. TcpConnect returned status: "
<< status.ToString() << ". This may be caused by "
<< "trying to connect to a node manager that has failed.";
src/ray/raylet/node_manager.cc
Outdated
| // Establish a new NodeManager connection to this GCS client. | ||
| auto client_info = gcs_client_->client_table().GetClient(client_id); | ||
| RAY_LOG(DEBUG) << "[ClientAdded] CONNECTING TO: " | ||
| RAY_LOG(DEBUG) << "[ClientAdded] TRY TO CONNECTING TO: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAY_LOG(DEBUG) << "[ClientAdded] Trying to connect to client " << client_id
<< " at " << client_info.node_manager_address << ":"
<< client_info.node_manager_port;
|
@robertnishihara Yes. I will add the test in later PR. |
|
Test PASSed. |
What do these changes do?
ClientTable is Log<ActorID, ActorTableData>, which has multiple entries for a disconnected client. The first one has is_insertion=true and the second one has is_insertion=false. We must make sure this is not a closed client.
When a new raylet starts, the ClientAdded will be called with the disconnected client data. However, since the client was closed, the connection will fail. The following message shows the example log:
Client
8fcca9d6b3122effeb2424faa315713b79829ca9and397783a6eddf33c74007a4cbabb156cbcfd0220dhas 2 entries.Related issue number
#2878