Raylet events subscription support #3557
718b982 to
cc80408
Compare
|
The CI tests are still failing on Linux builds (macOS is OK). I cannot reproduce the failure on my local machine or in an Ubuntu Docker container. The failure shows up when a pytest run shuts down. All failing tests are related to RLlib and Ray Tune, and I cannot find anything in my implementation that could cause a memory leak there. So I suspect that something is wrong in RLlib or Ray Tune. |
|
Is there a longer stack trace? Maybe temporarily use |
|
OK. I have temporarily changed the pytest. It could take some time to see the results. |
d23fe22 to
6da980e
Compare
|
I squashed these commits because there were too many conflicts between this branch and upstream/master due to the interface change. |
|
Test FAILed. |
|
Test FAILed. |
|
Test PASSed. |
|
I have removed some Jenkins comments because the related commits have been squashed. |
|
Test FAILed. |
|
Test PASSed. |
1438e6f to
2d6499c
Compare
|
Test PASSed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test PASSed. |
.travis.yml
Outdated
I read through the TF issue, but it's not clear to me why we need this. What goes wrong without it?
I will get something like
*** Error in `python -m pytest -svvv python/ray/tune/test/ray_trial_executor_test.py -k test_start_stopu': corrupted size vs. prev_size: 0x000055bd463f1f00 ***
Fatal Python error: Aborted
Stack (most recent call first):
/home/travis/.travis/job_stages: line 104: 17219 Aborted (core dumped) python -m pytest -svvv python/ray/tune/test/ray_trial_executor_test.py -k test_start_stopu
The command "python -m pytest -svvv python/ray/tune/test/ray_trial_executor_test.py -k test_start_stopu" exited with 134.
when exiting a pytest like test_start_stopu.
The test test_start_stopu only includes:
def test_start_stopu():
    import ray.rllib.models
    import scipy.signal
    ray.init()
    ray.shutdown()

It only happens when:
- The test runs on Trusty Linux under Travis CI
- Both tensorflow and scipy.signal are imported
I have checked my code and profiled it with valgrind, but I cannot see any memory leaks in my code, so I suppose the problem could come from TensorFlow. Using libtcmalloc-minimal4, as described in the TensorFlow issue, solves this problem.
Assuming #3674 is merged soon, we should remove this and use ray.ObjectID.from_random() instead.
src/ray/raylet/event/events.h
Outdated
I don't understand all of this graph terminology. What is an "edge"? Does the node manager really need anything more complex than a mapping from raylet client to a set of ObjectIDs that that client is subscribed to?
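For reference, the simpler structure the reviewer describes could look roughly like the sketch below. It is only an illustration in Python, not code from this PR: the class name `SubscriptionTable`, its methods, and the one-shot notification behaviour are all assumptions, and the real node manager is written in C++.

```python
class SubscriptionTable:
    """Hypothetical sketch: which client is subscribed to which object IDs."""

    def __init__(self):
        self._by_client = {}  # client -> set of object IDs it waits on
        self._by_object = {}  # object ID -> set of clients waiting on it

    def subscribe(self, client, object_id):
        self._by_client.setdefault(client, set()).add(object_id)
        self._by_object.setdefault(object_id, set()).add(client)

    def unsubscribe(self, client, object_id):
        self._discard(self._by_client, client, object_id)
        self._discard(self._by_object, object_id, client)

    def remove_client(self, client):
        # Called when a client disconnects: forget all of its subscriptions.
        for object_id in self._by_client.pop(client, set()):
            self._discard(self._by_object, object_id, client)

    def pop_subscribers(self, object_id):
        # Assumption: a subscription is one-shot, so it is cleared once the
        # object becomes available and the clients have been looked up.
        clients = self._by_object.pop(object_id, set())
        for client in clients:
            self._discard(self._by_client, client, object_id)
        return clients

    @staticmethod
    def _discard(index, key, value):
        members = index.get(key)
        if members is None:
            return
        members.discard(value)
        if not members:
            del index[key]  # keep the index free of empty sets
```

The reverse index (`_by_object`) is kept so that looking up the subscribers of an object does not require scanning every client.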
|
I have rebased this branch due to upstream updates. It could take me some time to simplify or remove the "graph" stuff. It seems to be an excessive abstraction. |
|
Test PASSed. |
|
Test PASSed. |
event socket support
integrate events manager into node manager
maintaining local objects
Unsubscribe events on finishing
create new API
use LocalClientConnection
more robust connection
receiving events from the event socket
use C++ random ID & use async write
Make API init sync. Fix bugs.
add tests
Implement `unsubscribe`
cleanup files
cleanup python code
use bytearray for message buffering
fix memory leak
remove unused headers
fix client table; use generated flatbuffer config
remove comment
fix tempfile test
implement a simple client
enable futures
return ObjectIDs only
improve event transport
remove legacy code
fix java
fix doc
|
@robertnishihara I have reconsidered the graph terminology. My conclusion is that it is still necessary for more general event processing. In the event-processing context, what I need to do is:
These classes are necessary, and we cannot replace them with functions, because we need to know which events to finish (typically a top-down process) and which events to cancel (typically a bottom-up process). This requires class instances; lambda closures won't work here because of their lifetime.
Another benefit of the graph representation is that we only need to focus on local structure. For example, to create a chain of events in which one event is triggered by another, we only need to create the last event, and the whole chain is built from the graph dependencies. This will happen when we move more components into the event module.
Last but not least, the graph representation allows us to use multiple threads to process events in the future (parallel graph walking). Some details are included in
I am also waiting for #3674 to use the new ObjectID interface. |
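One possible reading of the finish/cancel argument above is sketched below in Python. This is an interpretation, not the PR's design: the `Event` class, its `upstream`/`downstream` fields, and the exact propagation rules are assumptions made for illustration. Finishing propagates from an event to the events that depend on it, and cancelling propagates from an event back to dependencies that no other live event still needs.

```python
class Event:
    """Hypothetical event node in a dependency graph (names are illustrative)."""

    def __init__(self, name, depends_on=()):
        self.name = name
        self.state = "pending"  # one of: pending, finished, cancelled
        self.upstream = list(depends_on)
        self.downstream = []
        # Registering with the dependencies is what builds the graph edges.
        for dep in self.upstream:
            dep.downstream.append(self)

    def finish(self, log):
        """Top-down: finishing an event may complete the events waiting on it."""
        if self.state != "pending":
            return
        self.state = "finished"
        log.append(("finished", self.name))
        for child in self.downstream:
            ready = all(dep.state == "finished" for dep in child.upstream)
            if child.state == "pending" and ready:
                child.finish(log)

    def cancel(self, log):
        """Bottom-up: cancelling an event releases dependencies nobody else needs."""
        if self.state != "pending":
            return
        self.state = "cancelled"
        log.append(("cancelled", self.name))
        for dep in self.upstream:
            unused = all(child.state == "cancelled" for child in dep.downstream)
            if dep.state == "pending" and unused:
                dep.cancel(log)
```

Both operations need the node to remember its neighbours and its own state after it was created, which is the kind of bookkeeping that is awkward to keep inside a short-lived closure.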
|
Test PASSed. |
|
Can one of the admins verify this patch? |
Related issue number
#3524
Changes include:
- `raylet_events.SimpleConnection` for raylet client. We can replace the legacy `io.h` methods with this.
- `RegisterClientReply` to sync `RegisterClientRequest`. `RegisterClientReply` was slightly modified to remove the legacy `gpu_ids`.
- `RayletEventTransport`. Now async_api can be initialized synchronously.
- `RayletEventProtocol` to replace the old `PlasmaProtocol`. Improve performance using `bytearray`.
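
The `bytearray` point in the last item can be illustrated with a minimal sketch of an `asyncio.Protocol` that buffers incoming bytes. This is not the PR's `RayletEventProtocol`: the class name `BufferedMessageProtocol`, the `on_message` callback, and the 8-byte little-endian length prefix are assumptions chosen only to make the example self-contained.

```python
import asyncio
import struct

# Assumed framing for this sketch: an 8-byte little-endian payload length,
# followed by the payload. The real wire format is not shown in this thread.
HEADER = struct.Struct("<Q")


class BufferedMessageProtocol(asyncio.Protocol):
    """Hypothetical protocol that reassembles messages from a byte stream."""

    def __init__(self, on_message):
        self._buffer = bytearray()
        self._on_message = on_message

    def data_received(self, data):
        # extend() appends in place; concatenating immutable bytes objects
        # would copy the whole buffer on every call.
        self._buffer.extend(data)
        offset = 0
        while len(self._buffer) - offset >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer, offset)
            end = offset + HEADER.size + length
            if len(self._buffer) < end:
                break  # the payload has not fully arrived yet
            self._on_message(bytes(self._buffer[offset + HEADER.size:end]))
            offset = end
        # Drop all consumed bytes in a single step.
        del self._buffer[:offset]
```

Tracking an offset and trimming once per call avoids shifting the buffer after every message when several messages arrive in one chunk.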