Skip to content

Conversation

@suquark
Copy link
Member

@suquark suquark commented Dec 17, 2018

Related issue number

#3524

Changes include:

  • Implement raylet events subscription. (https://docs.google.com/document/d/11WJ6eQs_IC_Ne1VUjf7Tx8KQaPbD4aR7ewom6XiTdRA/edit?usp=sharing)
  • Introduce a new socket called raylet_events.
  • Implemented SimpleConnection for raylet client. We can replace the legacy io.h methods with this.
  • Reuse RegisterClientReply to sync RegisterClientRequest. RegisterClientReply was slightly modified to remove the legacy gpu_ids.
  • Implemented RayletEventTransport. Now async_api can be initialized synchronously.
  • Use new RayletEventProtocol to replace the old PlasmaProtocol. Improve performance using bytearray.

@suquark suquark changed the title Raylet events subscription support [WIP] Raylet events subscription support Dec 17, 2018
@suquark suquark self-assigned this Dec 18, 2018
@suquark suquark changed the title [WIP] Raylet events subscription support Raylet events subscription support Dec 19, 2018
@suquark
Copy link
Member Author

suquark commented Dec 21, 2018

The CI test still failing on Linux builds (macos is ok). I cannot reproduce it in local machine/ubuntu docker. When shutting down a pytest, it will print

*** Error in `python -m pytest -v --durations=10 python/ray/tune/test/trial_scheduler_test.py': corrupted size vs. prev_size: 0x000055aaeebc32b0 ***
/home/travis/.travis/job_stages: line 104: 15098 Aborted                 (core dumped) python -m pytest -v --durations=10 python/ray/tune/test/trial_scheduler_test.py
The command "python -m pytest -v --durations=10 python/ray/tune/test/trial_scheduler_test.py" exited with 134.

All failing tests are related to rllibs & raytune, and I cannot find where my implementation includes something could also cause a memory leak. So I suspect that if there is something wrong in rllibs & raytune.

@richardliaw
Copy link
Contributor

Is there a longer stack trace? Maybe temporary use -s on the pytest

@suquark
Copy link
Member Author

suquark commented Dec 22, 2018

OK. I have temporarily changed the pytest. It could take some time to see the results.

@suquark suquark force-pushed the subscription branch 3 times, most recently from d23fe22 to 6da980e Compare January 1, 2019 05:22
@suquark
Copy link
Member Author

suquark commented Jan 1, 2019

I squashed these commits because there were too many conflicts between this branch and upstream/master due to the interface change.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10544/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10547/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10548/
Test PASSed.

@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@ray-project ray-project deleted a comment from AmplabJenkins Jan 3, 2019
@suquark
Copy link
Member Author

suquark commented Jan 3, 2019

I have removed some jenkins comments since related commits have been squashed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10559/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10642/
Test PASSed.

@suquark suquark force-pushed the subscription branch 2 times, most recently from 1438e6f to 2d6499c Compare January 7, 2019 08:57
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10648/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10649/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10651/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10654/
Test PASSed.

.travis.yml Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through the TF issue, but it's not clear to me why we need this. What goes wrong without it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will get something like

*** Error in `python -m pytest -svvv python/ray/tune/test/ray_trial_executor_test.py -k test_start_stopu': corrupted size vs. prev_size: 0x000055bd463f1f00 ***
Fatal Python error: Aborted
Stack (most recent call first):
/home/travis/.travis/job_stages: line 104: 17219 Aborted                 (core dumped) python -m pytest -svvv python/ray/tune/test/ray_trial_executor_test.py -k test_start_stopu
The command "python -m pytest -svvv python/ray/tune/test/ray_trial_executor_test.py -k test_start_stopu" exited with 134.

when exiting a pytest like test_start_stopu.

The test test_start_stopu only includes:

def test_start_stopu():
    import ray.rllib.models
    import scipy.signal

    ray.init()
    ray.shutdown()

It only happens when:

  1. Use trusty linux under travis CI
  2. Use both tensorflow and scipy.signal

I have checked my code and profiled with valgrind but I cannot see any memory leaks in my code, so I suppose it could come from tensorflow. Use libtcmalloc-minimal4 in the tensorflow issue solves this problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming #3674 is merged soon, we should remove this and instead use ray.ObjectID.from_random().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand all of this graph terminology. What is an "edge"? Does the node manager really need anything more complex than a mapping from raylet client to a set of ObjectIDs that that client is subscribed to?

@suquark
Copy link
Member Author

suquark commented Jan 8, 2019

I have rebased this branch due to upstream updates. It could take me sometime to simplify or remove the "graph" stuff. It seems to be a excessive abstraction.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10681/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10682/
Test PASSed.

event socket support

integrate events manager into node manager

maintaining local objects

Unsubscribe events on finishing

create new API

use LocalClientConnection

more robust connection

receiving events from the event socket

use C++ random ID & use async write

Make API init sync. Fix bugs.

add tests

Implement `unsubscribe`

cleanup files

cleanup python code

use bytearray for message buffering

fix memory leak

remove unused headers

fix client table; use generated flatbuffer config

remove comment

fix tempfile test

implement a simple client

enable futures return ObjectIDs only

improve event transport

remove legacy code

fix java

fix doc
@suquark
Copy link
Member Author

suquark commented Jan 9, 2019

@robertnishihara I have rethought about the graph terminology. My conclusion is that it is still necessary for more general event processing. Under the event processing context, what I need to do is:

  1. The interface for event subscription & cancelling. That's the RayletEvents class.
  2. Classes to manage different kind of events (so later we could properly implement a panel in our webui to watch these events). That's the subclasses of EventsManager.
  3. The classes represent the concrete event, that's the subclasses of EventNode.
  4. The classes represent the dependency/interaction of events, that's the subclasses of EventEdge.

These classes are necessary, and we cannot replace them with functions. This is because we need to know which events to finish(typically a top-down process) and which events to cancel (typically a bottom-up process). This requires class instances (lambda closures won't work here because of their lifetime).

Another benefit of the graph representation is that we only need to focus on the local stuff. For example, if we need to create a chain of events that one event is triggered by another, we only need to create the last event so that the whole chain will be built based on the graph dependency. This will happen when we move more components into the event module (like ray.wait, ray.get, Pull).

The last but not the least, the graph representation allows us to use multiple threads to process events in the future (parallel graph walking).

Some details are included in
https://docs.google.com/document/d/11WJ6eQs_IC_Ne1VUjf7Tx8KQaPbD4aR7ewom6XiTdRA/edit?usp=sharing

I am also waiting for #3674 to use the new ObjectID interface.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10706/
Test PASSed.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

2 similar comments
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@ericl ericl closed this Dec 29, 2019
@suquark suquark deleted the subscription branch July 9, 2020 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants