Skip to content

Conversation

@stephanie-wang
Copy link
Contributor

What do these changes do?

This refactors the ObjectDirectory to only use callbacks where necessary and to post callbacks on an event loop instead of calling the callbacks in the same stack.

Related issue number

#2959

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9054/
Test FAILed.

callback(client_id_vec, object_id);
io_service_.post([this, callback, client_id_vec, object_id]() {
callback(client_id_vec, object_id);
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key fix, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at least for #3201.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Stephanie. Could you explain what's happening, and how this change fixes #3201?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is basically the same one that was discussed when ray.wait was first implemented here. Basically the callback deletes from a data structure shared with the caller, so an iterator held by the caller gets invalidated when the callback returns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm still confused because I thought we addressed that issue before merging. The iterator in SubscribeRemainingWaitObjects makes a copy of the object ids it iterates over (see here), which is not shared with the callback. The callback does not modify the vector of object ids (see here). Could you explain precisely the failure scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the memory referenced by wait_state in that function gets invalidated because the callback deletes the wait_id from active_wait_requests.

Copy link
Contributor

@elibol elibol Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, one failure scenario is as follows:

  1. The final object id's callback is invoked immediately.
  2. The wait_id is removed within the callback (i.e. WaitComplete is invoked).
  3. At this point, the loop exits, so this is not invoked.
  4. This is invoked, which references memory that doesn't exist.

Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

flatbuffers::FlatBufferBuilder fbb;
bool is_transfer = (type == ConnectionPool::ConnectionType::TRANSFER);
auto message = object_manager_protocol::CreateConnectClientMessage(
fbb, fbb.CreateString(client_id_.binary()), is_transfer);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but can be to_flatbuf(fbb, client_id_).

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9118/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9119/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9121/
Test PASSed.

Copy link
Contributor

@elibol elibol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this looks good to me. I'd prefer to avoid passing in the main service into the object directory by using some other asynchronous mechanism to invoke the callback on a separate stack, but I think this solution is better than further complicating the logic of object manager's wait implementation.

With these changes in place, we can clean up some of the code that originally dealt with the immediate callback issue within SubscribeRemainingWaitObjects, like this condition.

We ought to consider cleaning up that function in this PR, or at least add a TODO to do so. Leaving it as-is is harmless, but it unnecessarily complicates the code, and some of the comments are no longer valid.

@stephanie-wang
Copy link
Contributor Author

Thanks, @elibol, I removed the old code.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9133/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9140/
Test PASSed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants