[xray] Implements ray.wait #2162

elibol · 2018-05-30T08:00:34Z

Implements ray.wait for xray via the local scheduler. This includes both back-end and front-end changes.

AmplabJenkins · 2018-05-30T08:31:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5712/
Test FAILed.

ericl · 2018-05-30T19:55:21Z

Btw does this also fix #1128 ?

elibol · 2018-05-30T20:30:05Z

Chatted with @ericl offline: The existing semantics of num_returns is to return at most the number of object ids specified by num_returns when timeout is specified, and exactly num_returns when timeout is not specified. This implementation changes the semantics to return at least the number of object ids specified by num_returns. For instance, if you call ray.wait on 100 objects with num_returns=2 and 50 objects are done, then ray.wait returns 50 objects instead of just 2.

Here are some options:

Change the semantics of num_returns, which solves #1128 (ray.wait doesn't prioritize returning tasks that have finished earlier). @ericl notes this breaks syntax such as ([x, y], z) = ray.wait(object_ids, num_returns=2), which waits indefinitely until at least 2 objects are found.
Keep existing semantics, but come up with a way to be "fair" with which objects are returned. Some ideas include random sampling or stratified sampling over nodes. We should include a test that captures these semantics.
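
For illustration only, the "random sampling" idea in the second option could look roughly like the sketch below. This is not Ray code: pick_ready and its arguments are hypothetical names, and uniform sampling is just one possible stand-in for a fairness policy.

```python
import random


def pick_ready(ready_ids, num_returns, rng=None):
    """Return at most num_returns ids, sampled uniformly from ready_ids.

    Hypothetical helper: sampling avoids always favoring the ids that
    happen to come first in the input list.
    """
    rng = rng or random.Random()
    ready_ids = list(ready_ids)
    if len(ready_ids) <= num_returns:
        return ready_ids
    return rng.sample(ready_ids, num_returns)


# 50 of the objects are ready, but the caller only asked for 2.
ready = ["obj%d" % i for i in range(50)]
chosen = pick_ready(ready, 2, rng=random.Random(0))
```

Stratified sampling over nodes would additionally need to know where each object lives, which this sketch does not model.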

@robertnishihara can you provide your thoughts on this?

stephanie-wang · 2018-05-30T20:32:40Z

I think we should keep the existing semantics since changing it now would be a surprise to other Ray users.

I'm not sure if "fairness" is necessary for ray.wait semantics? I would be okay with just taking the first num_returns objects, in whatever order the object manager happens to process them in.

robertnishihara · 2018-05-30T21:29:05Z

Yeah, I think the existing semantics make sense, especially since ray.wait is often used to wait for exactly 1 object, and so having to handle more than 1 object would be confusing.

If you want to just get all of the objects that are ready after a certain timeout, then you can specify a timeout along with num_returns=len(object_ids).
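
A toy model of the existing contract, in case it helps the discussion. This is a sketch under assumptions, not Ray's implementation: wait_model, ready, and timed_out are hypothetical names, and returning ids in input order is a simplification of "whatever order the object manager happens to process them in".

```python
def wait_model(object_ids, ready, num_returns=1, timed_out=False):
    """Toy model of the existing ray.wait contract (not Ray code).

    object_ids: ids passed to wait, in call order.
    ready: set of ids that are available when the call returns.
    timed_out: True if the timeout fired before num_returns ids were ready.
    """
    if num_returns > len(object_ids):
        raise ValueError("num_returns cannot exceed len(object_ids)")
    # Never hand back more than num_returns ids, even if more are ready.
    found = [oid for oid in object_ids if oid in ready][:num_returns]
    if not timed_out and len(found) < num_returns:
        raise RuntimeError("a call without a timeout would still be blocking")
    remaining = [oid for oid in object_ids if oid not in found]
    return found, remaining


ids = ["a", "b", "c", "d"]

# No timeout: exactly num_returns ids come back, although three are ready.
found, remaining = wait_model(ids, ready={"a", "c", "d"}, num_returns=2)

# Timeout plus num_returns=len(ids): everything that is ready so far.
all_ready, not_ready = wait_model(
    ids, ready={"a", "c"}, num_returns=len(ids), timed_out=True)
```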

robertnishihara · 2018-05-30T21:31:39Z

test/runtest.py


-    @unittest.skipIf(
-        os.environ.get("RAY_USE_XRAY") == "1",
-        "This test does not work with xray yet.")


I think there are other tests that can be added back in, e.g., testMultipleWaitsAndGets and testWait in stress_tests.py.

robertnishihara · 2018-05-30T21:33:25Z

python/ray/worker.py

+            if num_returns > len(object_ids):
+                raise Exception("num_returns cannot be greater than the number "
+                                "of objects provided to ray.wait.")
+            timeout = timeout if timeout is not None else 2**30


all the above code should run in both code paths, not just the raylet code path
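
One way to read this suggestion, as a minimal sketch: hoist the checks into a helper that runs before the backend is chosen. normalize_wait_args, the use_raylet flag, and the returned tuple are hypothetical; only the num_returns check and the 2**30 default are taken from the diff above.

```python
def normalize_wait_args(object_ids, num_returns=1, timeout=None):
    """Validate ray.wait-style arguments once, before choosing a backend.

    Hypothetical helper; the check and the 2**30 default mirror the diff.
    """
    if num_returns > len(object_ids):
        raise Exception("num_returns cannot be greater than the number "
                        "of objects provided to ray.wait.")
    timeout = timeout if timeout is not None else 2**30
    return num_returns, timeout


def wait(object_ids, num_returns=1, timeout=None, use_raylet=False):
    # Shared validation runs first, regardless of the code path taken.
    num_returns, timeout = normalize_wait_args(object_ids, num_returns, timeout)
    backend = "raylet" if use_raylet else "legacy"
    # A real implementation would dispatch to the chosen backend here.
    return backend, num_returns, timeout
```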

robertnishihara · 2018-05-30T21:33:39Z

python/ray/worker.py

+            object_id_strs = [
+                plasma.ObjectID(object_id.id()) for object_id in object_ids
+            ]
+            timeout = timeout if timeout is not None else 2**30


AmplabJenkins · 2018-06-02T01:24:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5829/
Test FAILed.

AmplabJenkins · 2018-06-02T01:28:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5826/
Test PASSed.

AmplabJenkins · 2018-06-02T01:32:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5827/
Test PASSed.

AmplabJenkins · 2018-06-02T01:37:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5828/
Test PASSed.

AmplabJenkins · 2018-06-02T20:39:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5835/
Test PASSed.

AmplabJenkins · 2018-06-02T22:03:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5836/
Test PASSed.

stephanie-wang

Most pressing comment is the one about the issue with the SubscribeObjectLocations callbacks getting invoked in the same call stack. It would be great to add a regression test that triggers the issue I described in the comment. If it's not feasible to write that kind of test right now, we should figure out how to restructure the code to make it feasible.

stephanie-wang · 2018-06-04T22:18:59Z

src/ray/object_manager/object_manager.cc

+    }
+  }
+
+  if (wait_state.remaining.empty()) {


Can you add a comment to this if block detailing the reason for this condition? I had the same question as Robert when reading this code, so it'd be good to explain to the reader why you still have to do lookups for the remaining objects.

stephanie-wang · 2018-06-04T22:32:11Z

src/ray/object_manager/object_directory.cc

+    status = gcs_client_->object_table().RequestNotifications(
+        JobID::nil(), object_id, gcs_client_->client_table().GetLocalClientId());
+  }
+  if (listeners_[object_id].callbacks.count(callback_id) > 0) {


If we can, let's try to avoid using the bracket accessor wherever we can, and use listeners_.find(object_id) instead.

stephanie-wang · 2018-06-04T22:33:19Z

src/ray/object_manager/object_directory.cc

+    status = gcs_client_->object_table().RequestNotifications(
+        JobID::nil(), object_id, gcs_client_->client_table().GetLocalClientId());
+  }
+  if (listeners_[object_id].callbacks.count(callback_id) > 0) {


I see, this can happen when you call SubscribeObjectLocations twice, for a Pull?

stephanie-wang · 2018-06-04T22:34:50Z

src/ray/object_manager/object_manager.h


-  /// Unfulfilled Push tasks.
-  /// The timer is for removing a push task due to unsatisfied local object.
+  UniqueID object_directory_pull_callback_id_ = UniqueID::from_random();


Document this field.

stephanie-wang · 2018-06-04T22:39:14Z

src/ray/object_manager/test/object_manager_test.cc

+    current_wait_test += 1;
+    switch (current_wait_test) {
+    case 0: {
+      TestWait(100, 5, 3, 0, false, false);


Can you document why you chose these values for each of the calls to TestWait? It'd be good if the reader can tell immediately from reading this code what is different about each of these cases.

stephanie-wang · 2018-06-04T22:40:22Z

test/runtest.py

-    @unittest.skipIf(
-        os.environ.get("RAY_USE_XRAY") == "1",
-        "This test does not work with xray yet.")
-    def testWaitIterables(self):


Why did this test case get removed?

stephanie-wang · 2018-06-04T22:43:23Z

src/ray/object_manager/object_manager.cc

+void ObjectManager::WaitComplete(const UniqueID &wait_id) {
+  auto &wait_state = active_wait_requests_.find(wait_id)->second;
+  // If we complete with outstanding requests, then wait_ms should be non-zero.
+  RAY_CHECK(!(wait_state.requested_objects.size() > 0) || wait_state.wait_ms > 0);


Nit, but this is confusing to read. Can we change it to something like:

if (!wait_state.requested_objects.empty()) { RAY_CHECK(wait_state.wait_ms > 0); }

stephanie-wang · 2018-06-04T22:47:04Z

src/ray/object_manager/object_manager.cc

+          wait_id, oid, [this, wait_id](const std::vector<ClientID> &client_ids,
+                                        const ObjectID &object_id) {
+            auto &wait_state = active_wait_requests_.find(wait_id)->second;
+            if (wait_state.remaining.count(object_id) != 0) {


If I'm reading the code correctly, this condition should always be true, right? If yes, we should change it to skip the if check and just do a RAY_CHECK that the wait_state.remaining.erase(object_id) succeeds.

stephanie-wang · 2018-06-04T23:06:59Z

src/ray/object_manager/object_manager.cc

+      RAY_CHECK_OK(object_directory_->SubscribeObjectLocations(
+          wait_id, oid, [this, wait_id](const std::vector<ClientID> &client_ids,
+                                        const ObjectID &object_id) {
+            auto &wait_state = active_wait_requests_.find(wait_id)->second;


There is an issue with the way this code is currently structured because of the fact that the callback registered in SubscribeObjectLocations may now get called directly. It's possible that right now we will not actually see a bug, but that is only because of the specific ordering of calls made on the object directory by the object manager, and that seems quite brittle. Here is a scenario where I think the code would break:

SubscribeObjectLocations gets called on objects A and B (e.g., for a Pull, or for a different Wait). Locations for both are cached in the object directory.

Wait is called on objects A and B, with 1 object required.

In the same call stack, SubscribeObjectLocations is called on object A. The cached locations are found, this callback fires, and the wait request completes and is erased from active_wait_requests_.

Again, in the same call stack, SubscribeObjectLocations is called on object B. active_wait_requests_.find(wait_id) will fail (silently).

stephanie-wang · 2018-06-04T23:07:37Z

test/runtest.py

-    @unittest.skipIf(
-        os.environ.get("RAY_USE_XRAY") == "1",
-        "This test does not work with xray yet.")
    def testWait(self):


What happened to the other test cases that should now pass? I forget exactly which ones, but I think there was one like testMultipleWaitsAndGets?

I've enabled all wait-related tests.

Add test for ObjectManager.Wait during subscribe to a common same object.

AmplabJenkins · 2018-06-05T21:57:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5871/
Test PASSed.

robertnishihara · 2018-06-05T23:02:51Z

src/local_scheduler/lib/python/local_scheduler_extension.cc

+  int wait_local;
+
+  if (!PyArg_ParseTuple(args, "Oili", &py_object_ids, &num_returns, &timeout_ms,
+                        &wait_local)) {


I think the correct way to do this is to parse it as O and then use PyObject_IsTrue

stephanie-wang

I believe undefined behavior is still possible due to the way SubscribeObjectLocations is implemented. Although the scenario might be rare because of the current ordering of calls on the object directory, we should really try to come up with a way to trigger the failure before we merge this PR. Since the bug depends on a particular order of calls and callbacks from the object directory, perhaps there's a way we can trigger the failure by either mocking the object directory and/or calling the methods (e.g., AllWaitLookupsComplete) on the object manager directly?

stephanie-wang · 2018-06-05T23:07:14Z

src/ray/object_manager/object_manager.cc

-  for (auto &oid : object_ids) {
-    if (local_objects_.count(oid) > 0) {
-      wait_state.found.insert(oid);
+  for (auto &object_id : object_ids) {


const auto wherever you can.

stephanie-wang · 2018-06-05T23:09:04Z

src/ray/object_manager/object_manager.cc

  } else {
-    for (auto &oid : wait_state.remaining) {
+    // Subscribe to objects in order to ensure Wait-related tests are deterministic.
+    for (auto &object_id : wait_state.object_id_order) {


const auto wherever you can.

stephanie-wang · 2018-06-05T23:16:50Z

src/ray/object_manager/object_manager.cc

-    for (auto &oid : wait_state.remaining) {
+    // Subscribe to objects in order to ensure Wait-related tests are deterministic.
+    for (auto &object_id : wait_state.object_id_order) {
+      if (wait_state.remaining.count(object_id) == 0) {


The bug that I described in the earlier comment is still an issue here. wait_state is a reference to the value at active_wait_requests_. The reference will become invalid if the entry is erased from active_wait_requests_ between iterations of this for loop, so this line can produce undefined behavior.

This is the relevant check that corrects the bug you described earlier: active_wait_requests_.find(wait_id) == active_wait_requests_.end()

Yes, I don't think it will break in that particular way anymore, but undefined behavior is still possible because of the reference to wait_state. Same underlying issue, but it will break at a different line.

Nice catch. I've made the fix and added a regression test.

elibol · 2018-06-06T00:00:32Z

@stephanie-wang Before each SubscribeObjectLocations is issued, we check whether the executing wait request is still active. I believe this corrects the issue we discussed.
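
For readers following the thread, here is a toy Python model of the re-entrancy scenario and of the "is the wait still active?" guard. It is a sketch, not the object manager: ToyDirectory, ToyManager, and check_active are hypothetical names. Note also that Python raises KeyError where the C++ lookup would fail silently or leave a dangling reference.

```python
class ToyDirectory:
    """Stand-in for the object directory: cached locations fire inline."""

    def __init__(self, cached):
        self.cached = set(cached)

    def subscribe(self, object_id, callback):
        if object_id in self.cached:
            callback(object_id)  # Runs in the caller's stack, before returning.


class ToyManager:
    def __init__(self, directory):
        self.directory = directory
        self.active_waits = {}  # wait_id -> {"found": list, "needed": int}
        self.completed = {}

    def _on_location(self, wait_id, object_id):
        state = self.active_waits[wait_id]  # KeyError if already erased.
        state["found"].append(object_id)
        if len(state["found"]) >= state["needed"]:
            self.completed[wait_id] = list(state["found"])
            del self.active_waits[wait_id]  # The wait request is finished.

    def wait(self, wait_id, object_ids, needed, check_active):
        self.active_waits[wait_id] = {"found": [], "needed": needed}
        for object_id in object_ids:
            # Guard: an earlier callback may already have completed the wait.
            if check_active and wait_id not in self.active_waits:
                break
            self.directory.subscribe(
                object_id, lambda oid: self._on_location(wait_id, oid))


# Locations for A and B are cached, and the wait needs only 1 object.
unguarded = ToyManager(ToyDirectory({"A", "B"}))
try:
    unguarded.wait("w1", ["A", "B"], needed=1, check_active=False)
    unguarded_failed = False
except KeyError:
    unguarded_failed = True  # The second subscribe hit an erased wait.

guarded = ToyManager(ToyDirectory({"A", "B"}))
guarded.wait("w1", ["A", "B"], needed=1, check_active=True)
```

In the guarded run the loop stops after object A completes the wait, so no callback is ever registered against the erased entry.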

AmplabJenkins · 2018-06-06T01:27:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5879/
Test PASSed.

AmplabJenkins · 2018-06-06T02:47:48Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5882/
Test PASSed.

stephanie-wang

Thanks, the regression test looks good! Just a few small comments, and then we can merge.

stephanie-wang · 2018-06-06T17:59:52Z

src/ray/object_manager/object_manager.cc

  std::vector<ObjectID> found;
  std::vector<ObjectID> remaining;
-  for (auto item : wait_state.object_id_order) {
+  for (const auto item : wait_state.object_id_order) {


const auto & to avoid copy.

stephanie-wang · 2018-06-06T18:03:00Z

src/ray/object_manager/object_manager.cc

    }
  }

+  return ray::Status::OK();


If this method always returns OK, I would make it void.

It actually also returns ray::Status::NotImplemented currently if wait_local=true.

stephanie-wang · 2018-06-06T18:06:32Z

src/ray/object_manager/object_directory.h

  /// \param success_cb Invoked with non-empty list of client ids and object_id.
  /// \return Status of whether subscription succeeded.
-  virtual ray::Status SubscribeObjectLocations(const ObjectID &object_id,
+  virtual ray::Status SubscribeObjectLocations(const UniqueID &callback_id,


Can you add a NOTE to the documentation here that the callback may fire in the invocation of SubscribeObjectLocations? Until we can figure out a better way to do it, it'd be good to warn the user.

AmplabJenkins · 2018-06-06T19:27:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5894/
Test PASSed.

AmplabJenkins · 2018-06-06T20:39:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5895/
Test PASSed.

elibol added 19 commits May 16, 2018 16:16

Use pubsub instead of timeout.

0e18ca7

Correct status message.

fc2572c

eric's feedback!

2e8af60

Changes from Stephanie's review.

b57a548

object directory changes for ray.wait.

a128698

Merge branch 'master' into om_pubsub

fa5c32d

Merge branch 'master' into om_pubsub

0ccf46b

Merge branch 'om_pubsub' into om_wait

b02de4f

wait without testing or timeout=0.

f9a9e16

Handle remaining cases for wait.

15b7f61

linting

a22263b

added tests of om wait imp.

8ab41f0

add local test.

98bacfa

Merge branch 'master' into om_wait

d518a89

plasma imp.

53f33e0

block worker as with pull.

8ef35f7

local scheduler implementation of wait.

6e10f9e

with passing tests.

9a95c65

minor adjustments.

aa12bd7

elibol changed the title from "[xray] Implements ray.wait." to "[xray] Implements ray.wait" May 30, 2018

Merge branch 'master' into om_wait_local_scheduler

9e1602d

handle return statuses.

304b39c

robertnishihara reviewed May 30, 2018

c++ style casting.

8e1947c

linting.

83d04dd

linting.

080282f

robertnishihara mentioned this pull request Jun 2, 2018

Experimental asyncio support #2015

Merged

stephanie-wang requested changes Jun 4, 2018

elibol added 3 commits June 4, 2018 17:33

incorporate second round of feedback.

a58f5c9

correct python tests.

c6d8ba5

Add test for ObjectManager.Wait during subscribe to a common same object.

test comments.

7d8d756

robertnishihara reviewed Jun 5, 2018

stephanie-wang reviewed Jun 5, 2018

incorporate reviews.

6b6e2f3

Fixes with regression tests.

3a86c93

stephanie-wang approved these changes Jun 6, 2018

elibol added 2 commits June 6, 2018 11:17

update documentation.

1a99f25

reference to avoid copy.

00eafd7

elibol merged commit 7246ff8 into ray-project:master Jun 6, 2018

pcmoritz mentioned this pull request Jun 7, 2018

[xray] Fix compilation on mac #2199

Merged

stephanie-wang mentioned this pull request Nov 6, 2018

Refactor ObjectDirectory to reduce and fix callback usage #3227

Merged
