[core] Fix the spilling back failure in case of node missing #19564

Merged: 26 commits, Oct 22, 2021

Contributor

@fishbone fishbone commented Oct 20, 2021

Why are these changes needed?

When Ray spills back a task, it checks through the GCS whether the target node still exists. That check is racy, and the raylet sometimes crashes because of it.

This PR filters out nodes that are not available when selecting the node.

Related issue number

#19438

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone changed the title from "[wip] Spill back node fix" to "[core] Fix the spilling back failure in case of node missing" Oct 21, 2021
@fishbone fishbone marked this pull request as ready for review October 21, 2021 06:44
Contributor

@rkooo567 rkooo567 left a comment

LGTM in general. Just a couple nits. How do you plan to verify this works btw? Will you just run the nightly test?

@@ -462,8 +462,8 @@ class NodeInfoAccessor {
   /// \param filter_dead_nodes Whether or not if this method will filter dead nodes.
   /// \return The item returned by GCS. If the item to read doesn't exist or the node is
   /// dead, this optional object is empty.
-  virtual absl::optional<rpc::GcsNodeInfo> Get(const NodeID &node_id,
-                                               bool filter_dead_nodes = true) const = 0;
+  virtual const rpc::GcsNodeInfo *Get(const NodeID &node_id,

Contributor

Any reason why we moved back to a raw pointer? I feel raw pointers make mistakes really easy, because callers commonly forget to handle a returned nullptr.

Contributor Author

I think it's the same with optional. An optional can also be nullopt, right? So in either case the caller has to handle the empty result (nullptr or nullopt). If we want to model data returned by value, we should use optional; otherwise a pointer works just as well, since the caller has to do the same check.

The reason I made this a pointer is that I'd like to avoid copying GcsNodeInfo. The scheduler checks every node, and seeing the copy here bothered me a little, so I changed it. optional<rpc::GcsNodeInfo&> is not allowed.

Contributor

I think one of the cons is that if we use nullptr the program will crash, but not with nullopt? (Am I correct?)

But if performance is the concern, it makes sense. I remember the syntax for passing references through an optional being too complex and ugly.

Contributor Author

https://en.cppreference.com/w/cpp/utility/optional/value According to the docs, value() throws an exception when the optional is empty. We are not catching it, so it will crash as well.

I changed this not purely for performance, but because I feel:

  1. it's the same (both will crash).
  2. the copy might be a performance issue (not tested).

It looks like pure gain, so I made the change.
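To make the comparison above concrete, here is a minimal sketch of the two return styles. It assumes `std::optional` as a stand-in for `absl::optional`, and `NodeInfo`, `GetByValue`, `GetByPointer`, and `ValueThrowsOnEmpty` are hypothetical names for illustration, not Ray's API. It only shows that both styles need an explicit emptiness check in the caller, and that an unchecked `value()` on an empty optional throws.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for rpc::GcsNodeInfo; not the real Ray type.
struct NodeInfo {
  std::string address;
};

// Returns a copy wrapped in an optional; empty when the node is unknown.
std::optional<NodeInfo> GetByValue(bool found) {
  if (!found) return std::nullopt;
  return NodeInfo{"10.0.0.1"};
}

// Returns a non-owning pointer to an existing object (no copy);
// nullptr when the node is unknown.
const NodeInfo *GetByPointer(const NodeInfo *stored, bool found) {
  return found ? stored : nullptr;
}

// Demonstrates that calling value() on an empty optional throws
// std::bad_optional_access. If nobody catches it, the process terminates.
bool ValueThrowsOnEmpty() {
  try {
    (void)GetByValue(false).value();
  } catch (const std::bad_optional_access &) {
    return true;
  }
  return false;
}
```

In both cases the safe pattern is the same: check `has_value()` (or compare against `nullptr`) before touching the result.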

Contributor

Ah, makes sense. So technically having optional doesn't really have any meaning within Ray... good to know

Contributor

This makes me wonder whether we should have an exception handler in the outermost file (main.cc) that at least prints the exception, even if we don't handle it. I assume some exceptions happen but are never printed because we don't catch them.
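A rough sketch of what such an outermost handler could look like follows. `RunWithTopLevelHandler` is a hypothetical helper name, and logging to `std::cerr` is an assumption; a real process would presumably use its own logging facility.

```cpp
#include <exception>
#include <functional>
#include <iostream>

// Runs `body` and prints any exception that would otherwise escape unreported.
// Returns the body's exit code, or 1 if an exception was caught.
int RunWithTopLevelHandler(const std::function<int()> &body) {
  try {
    return body();
  } catch (const std::exception &e) {
    // Standard exceptions carry a message we can print.
    std::cerr << "Unhandled exception: " << e.what() << std::endl;
  } catch (...) {
    // Anything else: we can only report that something was thrown.
    std::cerr << "Unhandled non-standard exception" << std::endl;
  }
  return 1;
}
```

With this in place, `main` could be reduced to `return RunWithTopLevelHandler([&] { return RealMain(argc, argv); });`, where `RealMain` is a hypothetical name for the existing entry-point body.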

@@ -12,6 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.

// clang-format off

Contributor

Is it needed?

Contributor Author

Yes, because I need mock/ray/gcs/gcs_client.h to be the last one.

Member

Why does include order matter here? Is it because mock/ray/gcs/gcs_client.h has incomplete includes?

Contributor Author

Yes. It's a mock type, so it assumes the including context already provides all the necessary structures.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 21, 2021

best_node = iter->first;
break;
}
++iter;

Member

Just to be sure, dead nodes are pretty rare, right? Otherwise, since we always search forward after hitting a dead node, the distribution will no longer be uniform, which may or may not matter.
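The non-uniformity can be illustrated with a small sketch. It assumes the selection starts at a uniformly random position and scans forward, wrapping around, until it finds an alive node; the wrap-around and the names `PickForward` and `PickCounts` are assumptions for illustration, not the PR's actual code.

```cpp
#include <cstddef>
#include <vector>

// Starting from `start`, scan forward (wrapping around) and return the index
// of the first alive node, or -1 if every node is dead.
int PickForward(const std::vector<bool> &alive, std::size_t start) {
  const std::size_t n = alive.size();
  for (std::size_t step = 0; step < n; ++step) {
    const std::size_t i = (start + step) % n;
    if (alive[i]) return static_cast<int>(i);
  }
  return -1;
}

// For each node, count how many of the n equally likely start positions
// end up selecting it. Equal counts among alive nodes would mean a uniform pick.
std::vector<int> PickCounts(const std::vector<bool> &alive) {
  std::vector<int> counts(alive.size(), 0);
  for (std::size_t start = 0; start < alive.size(); ++start) {
    const int picked = PickForward(alive, start);
    if (picked >= 0) ++counts[static_cast<std::size_t>(picked)];
  }
  return counts;
}
```

For four nodes where only node 1 is dead, the counts come out as 1, 0, 2, 1: the alive node right after the dead one absorbs the dead node's share and is picked twice as often as the others.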

Contributor Author

It's rare. If it weren't, I think that would be a bug.

Contributor

Ah, this is a really good point. I think we will start doing chaos testing, so we should handle this better. If the distribution is bad, it can easily cause memory issues, I think. If it is not an easy fix, can we add a TODO?

Contributor Author

I'm thinking about how to fix this, and I can't come up with an easy, elegant algorithm for it :( Any ideas? Otherwise, I'll just add a TODO.

Contributor Author

Let me add a TODO for now.

Member

Not sure about a way besides scanning all nodes once. Leaving a TODO seems fine.
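For reference, one way to scan all nodes once and still pick uniformly among the alive ones is single-pass reservoir sampling. This is only a sketch of that general technique, not what the PR implements; `PickUniformAlive` is a hypothetical name, and the node list is modeled as a plain `std::vector<bool>` of liveness flags.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Single pass over all nodes: the k-th alive node seen replaces the current
// choice with probability 1/k, which leaves every alive node equally likely.
// Returns -1 if no node is alive.
int PickUniformAlive(const std::vector<bool> &alive, std::mt19937 &rng) {
  int chosen = -1;
  std::size_t seen = 0;  // number of alive nodes encountered so far
  for (std::size_t i = 0; i < alive.size(); ++i) {
    if (!alive[i]) continue;
    ++seen;
    std::uniform_int_distribution<std::size_t> dist(0, seen - 1);
    if (dist(rng) == 0) chosen = static_cast<int>(i);
  }
  return chosen;
}
```

The trade-off is the one mentioned above: every selection costs a full pass over the nodes, compared with the forward scan, which usually stops at the first candidate.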

Contributor

@rkooo567 rkooo567 left a comment

I might still prefer an optional return, but if this is for perf reasons, I have no strong preference (maybe we can follow up in the thread where we are discussing it).

@fishbone fishbone merged commit 48fb86a into ray-project:master Oct 22, 2021
4 participants