Refactor ObjectDirectory to reduce and fix callback usage #3227

stephanie-wang · 2018-11-04T22:14:42Z

What do these changes do?

This refactors the ObjectDirectory to use callbacks only where necessary, and to post those callbacks to an event loop instead of invoking them on the caller's stack.

Related issue number

#2959

AmplabJenkins · 2018-11-04T23:10:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9054/
Test FAILed.

robertnishihara · 2018-11-05T01:05:00Z

src/ray/object_manager/object_directory.cc

-  callback(client_id_vec, object_id);
+  io_service_.post([this, callback, client_id_vec, object_id]() {
+    callback(client_id_vec, object_id);
+  });


This is the key fix, right?

Yes, at least for #3201.
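To illustrate the diff above: a minimal sketch of what posting a callback changes, assuming a toy event loop. ToyEventLoop and DeferredOrder are hypothetical names made up for this example. ToyEventLoop only stands in for the io_service_ member in the diff and is not the real Boost.Asio class. The point is the ordering: the posted callback runs only after the function that posted it has returned.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event loop: post() only enqueues a handler,
// and run() invokes the queued handlers later, on the loop's own stack.
class ToyEventLoop {
 public:
  void post(std::function<void()> handler) { handlers_.push(std::move(handler)); }

  void run() {
    while (!handlers_.empty()) {
      auto handler = std::move(handlers_.front());
      handlers_.pop();
      handler();
    }
  }

 private:
  std::queue<std::function<void()>> handlers_;
};

// Records the order in which the "subscribe" code and the callback execute
// when the callback is posted rather than called directly.
std::vector<std::string> DeferredOrder() {
  ToyEventLoop loop;
  std::vector<std::string> trace;
  auto callback = [&trace]() { trace.push_back("callback"); };

  trace.push_back("subscribe begin");
  loop.post(callback);  // Enqueue only; the callback has not run yet.
  trace.push_back("subscribe end");

  loop.run();  // The callback runs here, after the subscribing code finished.
  return trace;
}
```

Calling DeferredOrder() yields the trace "subscribe begin", "subscribe end", "callback". With a direct call in place of post, "callback" would land between the two subscribe entries.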

Thank you Stephanie. Could you explain what's happening, and how this change fixes #3201?

The problem is basically the same one that was discussed when ray.wait was first implemented here. The callback deletes from a data structure shared with the caller, so an iterator held by the caller is invalidated by the time the callback returns.

Sorry, I'm still confused, because I thought we addressed that issue before merging. The iterator in SubscribeRemainingWaitObjects makes a copy of the object ids it iterates over (see here), which is not shared with the callback. The callback does not modify the vector of object ids (see here). Could you explain the failure scenario precisely?

Yes, but the memory referenced by wait_state in that function gets invalidated because the callback deletes the wait_id from active_wait_requests.

If I understand correctly, one failure scenario is as follows:

1. The final object id's callback is invoked immediately.

2. The wait_id is removed within the callback (i.e. WaitComplete is invoked).

3. At this point, the loop exits, so this is not invoked.

4. This is invoked, which references memory that doesn't exist.

Does that make sense?
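The scenario above can be sketched in a simplified form. This is a hedged illustration, not the object manager's actual code: Manager, WaitState, Subscribe, SubscribeRemaining, RunPending and EntrySurvivesSubscribe are hypothetical, and only the names active_wait_requests and wait_id are taken from the discussion. To stay free of undefined behavior, the sketch does not dereference a dangling reference. It only reports whether the wait entry still exists when the subscribe call returns, which is the moment a reference such as wait_state would be used again.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified wait state.
struct WaitState {
  std::vector<int> remaining;
};

struct Manager {
  explicit Manager(bool defer) : defer(defer) {}

  std::unordered_map<int, WaitState> active_wait_requests;
  std::queue<std::function<void()>> pending;  // Stand-in for the event loop.
  bool defer;

  // Stand-in for a directory subscription whose result is already available:
  // the callback is either invoked right away or queued for later.
  void Subscribe(std::function<void()> callback) {
    if (defer) {
      pending.push(std::move(callback));
    } else {
      callback();
    }
  }

  // Returns true if the entry for wait_id still exists when Subscribe returns,
  // i.e. if a reference to it taken before the call would still be valid.
  bool SubscribeRemaining(int wait_id) {
    // The callback plays the role of WaitComplete: it erases the entry.
    Subscribe([this, wait_id]() { active_wait_requests.erase(wait_id); });
    return active_wait_requests.count(wait_id) == 1;
  }

  // Drains the queued callbacks, as the event loop would do later.
  void RunPending() {
    while (!pending.empty()) {
      auto callback = std::move(pending.front());
      pending.pop();
      callback();
    }
  }
};

bool EntrySurvivesSubscribe(bool defer) {
  Manager manager(defer);
  manager.active_wait_requests[7] = WaitState{{1}};
  bool alive = manager.SubscribeRemaining(7);
  manager.RunPending();
  assert(manager.active_wait_requests.empty());  // Erased in both modes.
  return alive;
}
```

EntrySurvivesSubscribe(false) returns false: the immediate callback erased the entry while the caller was still running, so any reference the caller held would dangle. EntrySurvivesSubscribe(true) returns true, because the erase is deferred until the caller has finished.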

robertnishihara · 2018-11-05T01:08:47Z

src/ray/object_manager/object_manager.cc

+    flatbuffers::FlatBufferBuilder fbb;
+    bool is_transfer = (type == ConnectionPool::ConnectionType::TRANSFER);
+    auto message = object_manager_protocol::CreateConnectClientMessage(
+        fbb, fbb.CreateString(client_id_.binary()), is_transfer);


Minor, but this can be to_flatbuf(fbb, client_id_).

AmplabJenkins · 2018-11-06T21:06:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9118/
Test FAILed.

AmplabJenkins · 2018-11-06T21:16:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9119/
Test PASSed.

AmplabJenkins · 2018-11-06T23:02:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9121/
Test PASSed.

elibol

In general, this looks good to me. I'd prefer to avoid passing the main service into the object directory, and instead use some other asynchronous mechanism to invoke the callback on a separate stack, but I think this solution is better than further complicating the logic of the object manager's wait implementation.

With these changes in place, we can clean up some of the code that originally dealt with the immediate callback issue within SubscribeRemainingWaitObjects, like this condition.

We ought to consider cleaning up that function in this PR, or at least adding a TODO to do so. Leaving it as-is is harmless, but it unnecessarily complicates the code, and some of the comments are no longer valid.

stephanie-wang · 2018-11-07T00:42:00Z

Thanks, @elibol, I removed the old code.

AmplabJenkins · 2018-11-07T01:25:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9133/
Test FAILed.

AmplabJenkins · 2018-11-07T02:45:54Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9140/
Test PASSed.

stephanie-wang added 5 commits November 4, 2018 13:09

Add event loop as member of ObjectDirectory

4a5d788

Post object subscription callback to event loop

3e69e44

Convert GetInformation for a client to not use callbacks

db191f5

Convert RunFunctionOnEachClient to not use callbacks and other cleanups

a98f086

todo

c6ba1d2

stephanie-wang requested review from elibol, ericl and robertnishihara November 4, 2018 22:15

robertnishihara reviewed Nov 5, 2018

View reviewed changes

stephanie-wang added 2 commits November 6, 2018 11:37

Fix object free test, clean up GetClient interface

edcade8

Use to_flatbuf

fae7fae

fix osx build

b1c459c

ericl approved these changes Nov 6, 2018

View reviewed changes

lint

d7733da

elibol reviewed Nov 7, 2018

View reviewed changes

Clean up object manager Wait code

463ac57

robertnishihara merged commit ca58570 into ray-project:master Nov 7, 2018

robertnishihara deleted the object-directory-callbacks branch November 7, 2018 04:33

stephanie-wang mentioned this pull request Nov 7, 2018

[xray] ObjectDirectory fires callbacks in the same function #2959

Closed

clarkzinzow mentioned this pull request Feb 10, 2021

[Core] Ownership-based Object Directory - Added support for object spilling in the ownership-based object directory. #13948

Merged

6 tasks


5 participants
