@robertnishihara (Collaborator) commented Jan 29, 2019

This fixes #2173.

In this PR:

  • By default, worker stdout/stderr on all machines (including the single-machine case) is written to files and streamed to the driver along the path file on disk -> log_monitor -> redis -> driver.
  • If the log monitor dies, an exception is pushed to the driver.
  • We now shut down driver threads when ray.shutdown() is called. This is most relevant for our tests.
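
The file on disk -> log_monitor -> redis -> driver path in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `LogMonitor`, `poll_once`, and the in-memory `queue.Queue` standing in for the Redis channel are all hypothetical names chosen for the example.

```python
import os
import queue
import tempfile


class LogMonitor:
    """Hypothetical monitor that tails log files and publishes new lines."""

    def __init__(self, log_dir, channel):
        self.log_dir = log_dir
        self.channel = channel  # stand-in for a Redis pub/sub channel
        self.offsets = {}  # file name -> number of bytes already forwarded

    def poll_once(self):
        """Publish one batch per file that has new complete lines."""
        for name in sorted(os.listdir(self.log_dir)):
            path = os.path.join(self.log_dir, name)
            with open(path, "rb") as f:
                f.seek(self.offsets.get(name, 0))
                data = f.read()
            # Only forward complete lines; a trailing partial line waits
            # until the next poll.
            end = data.rfind(b"\n") + 1
            if end == 0:
                continue
            self.offsets[name] = self.offsets.get(name, 0) + end
            lines = data[:end].decode("utf-8", errors="replace").splitlines()
            self.channel.put({"file": name, "lines": lines})


log_dir = tempfile.mkdtemp()
channel = queue.Queue()
monitor = LogMonitor(log_dir, channel)

# A "worker" appends to its stdout file.
with open(os.path.join(log_dir, "worker-1.out"), "a") as f:
    f.write("hello!\nhello again!\n")

monitor.poll_once()
batch = channel.get_nowait()
print(batch["file"], batch["lines"])

# Nothing new was written, so a second poll publishes nothing.
monitor.poll_once()
second_poll_empty = channel.empty()
```

Tracking a byte offset per file is one simple way to avoid re-sending lines; the driver side would then just subscribe to the channel and print each batch.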

Possible TODO:

  • Make sure we don't ignore files once a large number of workers have been created.
  • Think through relevant thread safety issues.
  • Format print statements to include node and process ID of process that logged the statement, maybe function name as well.
  • Code cleanups (hard coded strings).
  • Isolation between drivers, ideally only print logging statements for the relevant driver.
  • Think through the API (naming, and in particular, the interaction with redirect_output and redirect_worker_output).
  • Tests
  • Push error to all drivers if log monitor fails.
  • Test log monitor failure.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11276/

@robertnishihara changed the title from "[WIP] Prototype streaming logs to driver." to "Stream logs to driver by default." on Feb 1, 2019
@robertnishihara (Collaborator, Author)

@richardliaw @pcmoritz @ericl the main thing that's missing here is isolation between drivers. That is, by default, if you connect multiple drivers to the same cluster, then they will all see each other's stdout/stderr.

I don't see a good way to address this at the moment. Two possibilities are

  • Use separate workers for separate drivers.
  • Include extra annotations in the log files so that the log monitor can figure out which log statements belong to which drivers.

I think it's OK to merge without this feature, though in the multi-tenant setting it will be necessary to use ray.init(log_to_driver=False). What do you think? I'd like to leave log_to_driver=True as the default so that things just work out of the box and behave intuitively the first time you run on a cluster. Then, when you need more complex workloads, you can figure out how to turn off the logging. The opposite approach makes the simple behavior hard and the complex setting easy.
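
The second possibility above (annotating log lines so the log monitor can tell drivers apart) could look roughly like this. It is only a sketch of the idea under assumed conventions: the `[driver=<id>]` prefix and the helper names `annotate` and `route` are hypothetical, and nothing like this is implemented in this PR.

```python
import re
from collections import defaultdict

# Hypothetical prefix a worker would prepend to each line it logs.
PREFIX = re.compile(r"^\[driver=(?P<driver_id>[^\]]+)\] (?P<text>.*)$")


def annotate(driver_id, text):
    """Tag a log line with the ID of the driver whose task produced it."""
    return "[driver={}] {}".format(driver_id, text)


def route(lines):
    """Group annotated lines by driver ID; unannotated lines go under None."""
    by_driver = defaultdict(list)
    for line in lines:
        match = PREFIX.match(line)
        if match:
            by_driver[match.group("driver_id")].append(match.group("text"))
        else:
            by_driver[None].append(line)
    return dict(by_driver)


log_file = [
    annotate("A", "hello!"),
    annotate("B", "hi"),
    annotate("A", "done"),
    "raw line with no annotation",
]
routed = route(log_file)
print(routed)
```

With routing like this, the monitor could publish each group on a per-driver channel instead of broadcasting everything to every driver.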

@robertnishihara (Collaborator, Author) commented Feb 1, 2019

This PR lets you do

import ray
import sys

ray.init(redis_address=...)

@ray.remote
def f():
    print("hello!")
    print("hi", file=sys.stderr)
    sys.stdout.flush()
    sys.stderr.flush()

[f.remote() for _ in range(10)]

You will see something like

worker (stdout) (pid=33570, host=Roberts-MBP)
hello!
worker (stdout) (pid=33571, host=Roberts-MBP)
hello!
worker (stderr) (pid=33568, host=Roberts-MBP)
hi
worker (stderr) (pid=33570, host=Roberts-MBP)
hi
worker (stdout) (pid=33567, host=Roberts-MBP)
hello!
worker (stderr) (pid=33564, host=Roberts-MBP)
hi
worker (stdout) (pid=33564, host=Roberts-MBP)
hello!
worker (stderr) (pid=33567, host=Roberts-MBP)
hi
worker (stdout) (pid=33568, host=Roberts-MBP)
hello!
worker (stderr) (pid=33565, host=Roberts-MBP)
hi
worker (stdout) (pid=33566, host=Roberts-MBP)
hello!
worker (stderr) (pid=33571, host=Roberts-MBP)
hi
worker (stderr) (pid=33566, host=Roberts-MBP)
hi
worker (stdout) (pid=33565, host=Roberts-MBP)
hello!
worker (stderr) (pid=33563, host=Roberts-MBP)
hi
worker (stdout) (pid=33563, host=Roberts-MBP)
hello!
worker (stdout) (pid=33567, host=Roberts-MBP)
hello!
worker (stderr) (pid=33567, host=Roberts-MBP)
hi
worker (stderr) (pid=33565, host=Roberts-MBP)
hi
worker (stdout) (pid=33565, host=Roberts-MBP)
hello!

Currently all of this goes to stderr, though maybe worker stdout should go to the driver's stdout instead? I'm not sure. If a worker prints a lot at once, then more lines will be batched together.
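
For reference, the header-plus-lines layout shown above could be produced by something like the following. This is a sketch, not the driver code in this PR; `format_batch`, `print_batch`, and the shape of the `batch` dictionary are assumptions made for the example.

```python
import io
import sys


def format_batch(batch):
    """Render one batch of worker output as a header followed by its lines."""
    header = "worker ({}) (pid={}, host={})".format(
        batch["stream"], batch["pid"], batch["host"])
    return "\n".join([header] + batch["lines"])


def print_batch(batch, out=None):
    # By default everything goes to stderr, for stdout and stderr batches alike.
    print(format_batch(batch), file=out if out is not None else sys.stderr)


# A worker that printed two lines at once yields one batch with one header.
batch = {"stream": "stdout", "pid": 33570, "host": "Roberts-MBP",
         "lines": ["hello!", "hello!"]}
buffer = io.StringIO()
print_batch(batch, out=buffer)
rendered = buffer.getvalue()
```

Sending stdout batches to the driver's stdout instead would only require choosing `out` based on `batch["stream"]`.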

Note that the sys.stdout.flush() and sys.stderr.flush() calls are necessary. There seems to be a lot of buffering otherwise, and trying to make the files unbuffered doesn't work out of the box.
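
The buffering is easy to reproduce outside of Ray. The snippet below is only an illustration with an ordinary file (the temporary path is made up for the example, and this is not how workers open their log files): a small write stays in the writer's buffer and is invisible to a separate reader, such as a log monitor, until `flush()` is called.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "worker.out")

writer = open(path, "w")  # default buffering: block-buffered for regular files
writer.write("hello!\n")

# A separate reader sees nothing yet; the line is still in the writer's buffer.
with open(path) as reader:
    before_flush = reader.read()

writer.flush()

# After the flush, the line has reached the file and the reader can see it.
with open(path) as reader:
    after_flush = reader.read()

writer.close()
print(repr(before_flush), repr(after_flush))
```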

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11387/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11392/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11391/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11406/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11414/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11426/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11451/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11452/

Collaborator Author

https://github.com/andymccurdy/redis-py says that Redis clients can safely be shared between threads (however pubsub clients cannot).

Member

We should leave a comment here about this thread-safe behavior.

Collaborator Author

Fixed.
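
One way to respect the constraint discussed in this thread is to share the client across threads but give each thread its own pubsub object. The following is a sketch only: `FakeClient` is a stand-in so the example runs without a Redis server, and `get_pubsub` is a hypothetical helper. The one thing assumed about redis-py is that the client exposes a `pubsub()` method.

```python
import threading


class FakeClient:
    """Stand-in for a Redis client; only mimics the `pubsub()` factory."""

    def pubsub(self):
        return object()  # a real client would return a PubSub instance


client = FakeClient()  # shared by all threads
_local = threading.local()


def get_pubsub():
    """Return this thread's own pubsub object, creating it on first use."""
    if not hasattr(_local, "pubsub"):
        _local.pubsub = client.pubsub()
    return _local.pubsub


seen = []
seen_lock = threading.Lock()


def worker():
    first, second = get_pubsub(), get_pubsub()
    with seen_lock:
        # Record the object and whether repeated calls returned the same one.
        seen.append((first, first is second))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each thread got its own pubsub object, reused within that thread.
distinct = len({id(p) for p, _ in seen})
```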

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11538/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11539/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11542/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11544/

Collaborator Author

@richardliaw please take a look, this no longer seems necessary.

Contributor

Oh, because of dup2. Yeah, this is probably fine; this was basically there to pass one of the tests in run_test, so as long as that test passes it's fine.
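
For readers unfamiliar with `dup2`: presumably the reference is to redirecting a process's standard file descriptors into log files at the OS level. Below is a minimal sketch of that technique, not taken from the PR (the helper name `redirect_fd` and the temporary file are made up for the example).

```python
import contextlib
import os
import sys
import tempfile


@contextlib.contextmanager
def redirect_fd(fd, path):
    """Temporarily point file descriptor `fd` at the file `path`."""
    saved = os.dup(fd)  # keep a copy so the original target can be restored
    target = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.dup2(target, fd)
        yield
    finally:
        os.dup2(saved, fd)
        os.close(saved)
        os.close(target)


path = os.path.join(tempfile.mkdtemp(), "worker.out")

sys.stdout.flush()  # don't let already-buffered output land in the file
with redirect_fd(1, path):
    # Writes to fd 1 now go to the file, even from code that bypasses sys.stdout.
    os.write(1, b"hello from fd 1\n")

with open(path) as f:
    captured = f.read()
```

Because the redirection happens at the descriptor level, it also captures output from C extensions and child processes, which reassigning `sys.stdout` alone would miss.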

Collaborator Author

@richardliaw I would use the logger module, but the output seemed uglier.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11546/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11545/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11655/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11654/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11662/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11663/

Successfully merging this pull request may close these issues.

Provide option to print task stdout to driver on a cluster in real time

5 participants