@elibol
Contributor

elibol commented May 30, 2018

Implements ray.wait for xray via the local scheduler. This includes both back-end and front-end changes.

@elibol changed the title from "[xray] Implements ray.wait." to "[xray] Implements ray.wait" on May 30, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5712/

@ericl
Contributor

ericl commented May 30, 2018

Btw, does this also fix #1128?

@elibol
Contributor Author

elibol commented May 30, 2018

Chatted with @ericl offline. The existing semantics of num_returns are to return at most num_returns object ids when a timeout is specified, and exactly num_returns when no timeout is specified. This implementation changes the semantics to return at least num_returns object ids. For instance, if you call ray.wait on 100 objects with num_returns=2 and 50 objects are done, ray.wait returns 50 objects instead of just 2.

Here are some options:

  1. Change the semantics of num_returns, which solves #1128 ("ray.wait doesn't prioritize returning tasks that have finished earlier"). @ericl notes this breaks syntax such as ([x,y], z) = ray.wait(object_ids, num_returns=2), which waits indefinitely until at least 2 objects are found and unpacks correctly only if exactly 2 are returned.
  2. Keep existing semantics, but come up with a way to be "fair" with which objects are returned. Some ideas include random sampling or stratified sampling over nodes. We should include a test that captures these semantics.

@robertnishihara can you provide your thoughts on this?

@stephanie-wang
Contributor

I think we should keep the existing semantics since changing it now would be a surprise to other Ray users.

I'm not sure if "fairness" is necessary for ray.wait semantics? I would be okay with just taking the first num_returns objects, in whatever order the object manager happens to process them in.

@robertnishihara
Collaborator

Yeah, I think the existing semantics make sense, especially since ray.wait is often used to wait for exactly 1 object, and so having to handle more than 1 object would be confusing.

If you want to just get all of the objects that are ready after a certain timeout, then you can specify a timeout along with num_returns=len(object_ids).
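The existing semantics discussed above can be sketched as follows. This is a minimal illustration, not the code in this PR: SplitReady, WaitResult, and the use of std::string as a stand-in for an object id are all hypothetical names chosen for the example. It returns at most num_returns ready objects, in request order, and reports everything else as remaining.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an object id, used only in this sketch.
using ObjectId = std::string;

struct WaitResult {
  std::vector<ObjectId> found;      // At most num_returns ready objects.
  std::vector<ObjectId> remaining;  // Everything else, in request order.
};

// Splits object_ids into (found, remaining), capping found at num_returns.
// Ready objects beyond the cap are reported as remaining.
WaitResult SplitReady(const std::vector<ObjectId> &object_ids,
                      const std::unordered_set<ObjectId> &ready,
                      std::size_t num_returns) {
  // Mirrors the check that num_returns cannot exceed the number of objects.
  assert(num_returns <= object_ids.size());
  WaitResult result;
  for (const auto &id : object_ids) {
    if (result.found.size() < num_returns && ready.count(id) > 0) {
      result.found.push_back(id);
    } else {
      result.remaining.push_back(id);
    }
  }
  return result;
}
```

Calling SplitReady with num_returns equal to object_ids.size() returns every ready object, which corresponds to the timeout plus num_returns=len(object_ids) pattern mentioned above.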


@unittest.skipIf(
    os.environ.get("RAY_USE_XRAY") == "1",
    "This test does not work with xray yet.")
Collaborator

I think there are other tests that can be added back in, e.g., testMultipleWaitsAndGets and testWait in stress_tests.py.

if num_returns > len(object_ids):
    raise Exception("num_returns cannot be greater than the number "
                    "of objects provided to ray.wait.")
timeout = timeout if timeout is not None else 2**30
Collaborator

All of the above code should run in both code paths, not just the raylet code path.

object_id_strs = [
    plasma.ObjectID(object_id.id()) for object_id in object_ids
]
timeout = timeout if timeout is not None else 2**30
Collaborator

duplicated

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5829/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5826/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5827/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5828/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5835/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5836/

Contributor

@stephanie-wang left a comment

The most pressing comment is the one about the SubscribeObjectLocations callbacks getting invoked in the same call stack. It would be great to add a regression test that triggers the issue I described in that comment. If it's not feasible to write that kind of test right now, we should figure out how to restructure the code to make it feasible.

}
}

if (wait_state.remaining.empty()) {
Contributor

Can you add a comment to this if block detailing the reason for this condition? I had the same question as Robert when reading this code, so it'd be good to explain to the reader why you still have to do lookups for the remaining objects.

status = gcs_client_->object_table().RequestNotifications(
JobID::nil(), object_id, gcs_client_->client_table().GetLocalClientId());
}
if (listeners_[object_id].callbacks.count(callback_id) > 0) {
Contributor

Let's avoid the bracket accessor wherever we can and use listeners_.find(object_id) instead.
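One likely reason for this suggestion is that the bracket accessor on a standard map inserts a default-constructed entry when the key is missing, while find() leaves the map untouched. A minimal sketch, assuming a simplified listener map; ListenerState, ListenerMap, and both helper functions are hypothetical and are not the object directory's real types:

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical, simplified listener state; not the real object directory types.
struct ListenerState {
  std::unordered_set<std::string> callbacks;
};

using ListenerMap = std::unordered_map<std::string, ListenerState>;

// Bracket accessor: inserts an empty ListenerState if object_id is absent,
// so the map must be non-const and may grow as a side effect of a lookup.
bool HasCallbackWithBrackets(ListenerMap &listeners, const std::string &object_id,
                             const std::string &callback_id) {
  return listeners[object_id].callbacks.count(callback_id) > 0;
}

// find(): never modifies the map and works on a const reference.
bool HasCallbackWithFind(const ListenerMap &listeners, const std::string &object_id,
                         const std::string &callback_id) {
  const auto it = listeners.find(object_id);
  return it != listeners.end() && it->second.callbacks.count(callback_id) > 0;
}
```

Both helpers return the same answer, but only the bracket version leaves a stray empty entry behind after querying an unknown object id.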

status = gcs_client_->object_table().RequestNotifications(
JobID::nil(), object_id, gcs_client_->client_table().GetLocalClientId());
}
if (listeners_[object_id].callbacks.count(callback_id) > 0) {
Contributor

I see, this can happen when you call SubscribeObjectLocations twice, for a Pull?


/// Unfulfilled Push tasks.
/// The timer is for removing a push task due to unsatisfied local object.
UniqueID object_directory_pull_callback_id_ = UniqueID::from_random();
Contributor

Document this field.

current_wait_test += 1;
switch (current_wait_test) {
case 0: {
TestWait(100, 5, 3, 0, false, false);
Contributor

Can you document why you chose these values for each of the calls to TestWait? It'd be good if the reader can tell immediately from reading this code what is different about each of these cases.

@unittest.skipIf(
    os.environ.get("RAY_USE_XRAY") == "1",
    "This test does not work with xray yet.")
def testWaitIterables(self):
Contributor

Why did this test case get removed?

void ObjectManager::WaitComplete(const UniqueID &wait_id) {
  auto &wait_state = active_wait_requests_.find(wait_id)->second;
  // If we complete with outstanding requests, then wait_ms should be non-zero.
  RAY_CHECK(!(wait_state.requested_objects.size() > 0) || wait_state.wait_ms > 0);
Contributor

Nit, but this is confusing to read. Can we change it to something like:

if (!wait_state.requested_objects.empty()) {
  RAY_CHECK(wait_state.wait_ms > 0);
}

wait_id, oid, [this, wait_id](const std::vector<ClientID> &client_ids,
const ObjectID &object_id) {
auto &wait_state = active_wait_requests_.find(wait_id)->second;
if (wait_state.remaining.count(object_id) != 0) {
Contributor

If I'm reading the code correctly, this condition should always be true, right? If yes, we should change it to skip the if check and just do a RAY_CHECK that the wait_state.remaining.erase(object_id) succeeds.

RAY_CHECK_OK(object_directory_->SubscribeObjectLocations(
    wait_id, oid, [this, wait_id](const std::vector<ClientID> &client_ids,
                                  const ObjectID &object_id) {
      auto &wait_state = active_wait_requests_.find(wait_id)->second;
Contributor

There is an issue with the way this code is currently structured, because the callback registered in SubscribeObjectLocations may now get called directly. We may not actually see a bug right now, but only because of the specific ordering of calls that the object manager makes on the object directory, and that seems quite brittle. Here is a scenario where I think the code would break:

  1. SubscribeObjectLocations gets called on objects A and B (e.g., for a Pull, or for a different Wait). Locations for both are cached in the object directory.
  2. Wait is called on objects A and B, with 1 object required.
  3. In the same call stack, SubscribeObjectLocations is called on object A. The cached locations are found, this callback fires, and the wait request completes and is erased from active_wait_requests_.
  4. Again, in the same call stack, SubscribeObjectLocations is called on object B. active_wait_requests_.find(wait_id) will fail (silently).
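The scenario above can be reproduced in miniature. The sketch below is an assumption-laden model, not the object manager's code: FakeObjectDirectory, FakeObjectManager, WaitState, and the std::string object ids are hypothetical stand-ins. The callback checks the result of find() so that the failed lookup in step 4 is counted instead of dereferenced.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the real Ray types.
using ObjectId = std::string;
using WaitId = int;
using LocationCallback = std::function<void(const ObjectId &)>;

// A directory that invokes the callback synchronously when the object's
// location is already cached, as in step 3 of the scenario.
class FakeObjectDirectory {
 public:
  void AddCachedLocation(const ObjectId &id) { cached_.insert(id); }

  void SubscribeObjectLocations(const ObjectId &id, const LocationCallback &callback) {
    if (cached_.count(id) > 0) {
      callback(id);  // Runs inside the caller's call stack.
    }
  }

 private:
  std::unordered_set<ObjectId> cached_;
};

struct WaitState {
  std::size_t num_required;
  std::unordered_set<ObjectId> found;
};

class FakeObjectManager {
 public:
  explicit FakeObjectManager(FakeObjectDirectory &directory) : directory_(directory) {}

  void Wait(WaitId wait_id, const std::vector<ObjectId> &ids, std::size_t num_required) {
    active_wait_requests_.emplace(wait_id, WaitState{num_required, {}});
    for (const auto &id : ids) {
      directory_.SubscribeObjectLocations(id, [this, wait_id](const ObjectId &object_id) {
        const auto it = active_wait_requests_.find(wait_id);
        if (it == active_wait_requests_.end()) {
          // Step 4: the request was erased by an earlier callback in this
          // same call stack. Dereferencing `it` here would be undefined.
          ++lookups_after_erase_;
          return;
        }
        it->second.found.insert(object_id);
        if (it->second.found.size() >= it->second.num_required) {
          active_wait_requests_.erase(it);  // The wait request completes.
        }
      });
    }
  }

  bool IsActive(WaitId wait_id) const { return active_wait_requests_.count(wait_id) > 0; }
  int lookups_after_erase() const { return lookups_after_erase_; }

 private:
  FakeObjectDirectory &directory_;
  std::unordered_map<WaitId, WaitState> active_wait_requests_;
  int lookups_after_erase_ = 0;
};
```

With both A and B cached and one object required, the callback for A completes and erases the request, and the callback for B then finds no entry. A test built along these lines could use a mocked directory to force that ordering.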

@unittest.skipIf(
    os.environ.get("RAY_USE_XRAY") == "1",
    "This test does not work with xray yet.")
def testWait(self):
Contributor

What happened to the other test cases that should now pass? I forget exactly which ones, but I think there was one like testMultipleWaitsAndGets?

Contributor Author

I've enabled all wait-related tests.

elibol added 3 commits June 4, 2018 17:33
Add test for ObjectManager.Wait during subscribe to a common same object.
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5871/

int wait_local;

if (!PyArg_ParseTuple(args, "Oili", &py_object_ids, &num_returns, &timeout_ms,
                      &wait_local)) {
Collaborator

I think the correct way to do this is to parse it as O and then use PyObject_IsTrue.

Contributor

@stephanie-wang left a comment

I believe undefined behavior is still possible due to the way SubscribeObjectLocations is implemented. Although the scenario might be rare because of the current ordering of calls on the object directory, we should really try to come up with a way to trigger the failure before we merge this PR. Since the bug depends on a particular order of calls and callbacks from the object directory, perhaps we can trigger the failure by mocking the object directory, by calling the methods (e.g., AllWaitLookupsComplete) on the object manager directly, or both?

for (auto &oid : object_ids) {
if (local_objects_.count(oid) > 0) {
wait_state.found.insert(oid);
for (auto &object_id : object_ids) {
Contributor

Use const auto wherever you can.

} else {
for (auto &oid : wait_state.remaining) {
// Subscribe to objects in order to ensure Wait-related tests are deterministic.
for (auto &object_id : wait_state.object_id_order) {
Contributor

Use const auto wherever you can.

for (auto &oid : wait_state.remaining) {
// Subscribe to objects in order to ensure Wait-related tests are deterministic.
for (auto &object_id : wait_state.object_id_order) {
if (wait_state.remaining.count(object_id) == 0) {
Contributor

The bug that I described in the earlier comment is still an issue here. wait_state is a reference to the value at active_wait_requests_. The reference will become invalid if the entry is erased from active_wait_requests_ between iterations of this for loop, so this line can produce undefined behavior.

Contributor Author

This is the relevant check that corrects the bug you described earlier: active_wait_requests_.find(wait_id) == active_wait_requests_.end()

Contributor

Yes, I don't think it will break in that particular way anymore, but undefined behavior is still possible because of the reference to wait_state. Same underlying issue, but it will break at a different line.
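One way to avoid holding a reference across calls that may erase the entry is to copy what the loop needs up front and look the entry up again on every iteration. The following is only a sketch of that idea under stated assumptions, not the fix that landed in this PR: VisitObjects, WaitMap, and the simplified WaitState are hypothetical, and visit stands in for a subscription whose callback may fire synchronously.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; not the real object manager types.
using ObjectId = std::string;

struct WaitState {
  std::vector<ObjectId> object_id_order;
};

using WaitMap = std::unordered_map<int, WaitState>;

// Visits each object of a wait request in order. `visit` may erase the
// request from `requests`, so no reference into the map is kept across a
// call to it. Returns the number of objects visited.
std::size_t VisitObjects(WaitMap &requests, int wait_id,
                         const std::function<void(const ObjectId &)> &visit) {
  const auto first = requests.find(wait_id);
  if (first == requests.end()) {
    return 0;
  }
  // Copy the order up front; the copy stays valid even if the entry is erased.
  const std::vector<ObjectId> order = first->second.object_id_order;
  std::size_t visited = 0;
  for (const auto &object_id : order) {
    // Look the entry up again instead of reusing a reference taken earlier.
    if (requests.find(wait_id) == requests.end()) {
      break;
    }
    visit(object_id);
    ++visited;
  }
  return visited;
}
```

The copy costs one vector allocation per call, which seems a reasonable price for removing the dependency on the order in which callbacks fire.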

Contributor Author

@elibol, Jun 6, 2018

Nice catch. I've made the fix and added a regression test.

@elibol
Copy link
Contributor Author

elibol commented Jun 6, 2018

@stephanie-wang Before each SubscribeObjectLocations is issued, we check whether the executing wait request is still active. I believe this corrects the issue we discussed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5879/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5882/

Contributor

@stephanie-wang left a comment

Thanks, the regression test looks good! Just a few small comments, and then we can merge.

std::vector<ObjectID> found;
std::vector<ObjectID> remaining;
- for (auto item : wait_state.object_id_order) {
+ for (const auto item : wait_state.object_id_order) {
Contributor

const auto & to avoid copy.
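The difference can be made visible with an element type that counts its own copies. This is a small illustrative sketch; CountedId, SumByValue, and SumByReference are hypothetical names and have nothing to do with the real ObjectID type.

```cpp
#include <vector>

// Hypothetical element type that counts how often it is copy-constructed.
struct CountedId {
  static int copies;
  int value = 0;

  CountedId() = default;
  explicit CountedId(int v) : value(v) {}
  CountedId(const CountedId &other) : value(other.value) { ++copies; }
  CountedId &operator=(const CountedId &) = default;
};

int CountedId::copies = 0;

// `const auto id` copy-constructs every element on each iteration.
int SumByValue(const std::vector<CountedId> &ids) {
  int sum = 0;
  for (const auto id : ids) {
    sum += id.value;
  }
  return sum;
}

// `const auto &id` binds a reference to each element; nothing is copied.
int SumByReference(const std::vector<CountedId> &ids) {
  int sum = 0;
  for (const auto &id : ids) {
    sum += id.value;
  }
  return sum;
}
```

Both loops compute the same sum; only the by-value loop increments CountedId::copies, once per element.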

}
}

return ray::Status::OK();
Contributor

If this method always returns OK, I would make it void.

Contributor Author

It actually also returns ray::Status::NotImplemented currently if wait_local=true.

/// \param success_cb Invoked with non-empty list of client ids and object_id.
/// \return Status of whether subscription succeeded.
- virtual ray::Status SubscribeObjectLocations(const ObjectID &object_id,
+ virtual ray::Status SubscribeObjectLocations(const UniqueID &callback_id,
Contributor

Can you add a NOTE to the documentation here that the callback may fire in the invocation of SubscribeObjectLocations? Until we can figure out a better way to do it, it'd be good to warn the user.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5894/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5895/

5 participants