@guoyuhong
Contributor

What do these changes do?

The Jenkins test mnist_example.py uses multiple threads inside an actor, and we have recently seen frequent Jenkins test failures that fall into two patterns:

  1. Pattern 1
  2. Pattern 2

Pattern 1 is caused by a NIL driver ID when an actor class is exported from a non-main thread. In this case, the export is called before worker.task_driver_id is set in the main thread. As a result, load_actor cannot get the right actor key from GCS.

File "/usr/local/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
File "/home/admin/yuhong/ray/python/ray/tune/function_runner.py", line 80, in run
    self._entrypoint(*self._entrypoint_args)
File "python/ray/experimental/sgd/mnist_example.py", line 112, in train_mnist
    strategy=args.strategy)
File "/home/admin/yuhong/ray/python/ray/experimental/sgd/sgd.py", line 106, in __init__
    all_reduce_alg=all_reduce_alg))
File "/home/admin/yuhong/ray/python/ray/actor.py", line 329, in remote
    return self._remote(args=args, kwargs=kwargs)
File "/home/admin/yuhong/ray/python/ray/actor.py", line 392, in _remote
    self._checkpoint_interval)
File "/home/admin/yuhong/ray/python/ray/function_manager.py", line 520, in export_actor_class
    for line in traceback.format_stack():
 export_actor_class: FunctionDescriptor:ray.experimental.sgd.sgd_worker.SGDWorker.__init__., driver id: ObjectID(fffffffffffffffffffffffffffffffffffffff)
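
A minimal sketch of the ordering problem behind Pattern 1, not Ray's actual code. `FakeWorker`, `NIL_ID`, and this simplified `export_actor_class` are hypothetical stand-ins; the only point is that an export running on a background thread reads whatever driver ID is set at that moment.

```python
import threading

NIL_ID = b"\xff" * 20  # assumed stand-in for a NIL driver ID


class FakeWorker:
    """Hypothetical stand-in for the worker object; not Ray's class."""

    def __init__(self):
        self.task_driver_id = NIL_ID


def export_actor_class(worker, exported):
    # Records the shared driver ID as it is at call time.
    exported.append(worker.task_driver_id)


worker = FakeWorker()
exported = []

# The background thread exports before the main thread has set the ID.
thread = threading.Thread(target=export_actor_class, args=(worker, exported))
thread.start()
thread.join()

# Only now does the main thread learn the real driver ID.
worker.task_driver_id = b"\x01" * 20
export_actor_class(worker, exported)
```

The first recorded ID is NIL and the second is the real one, which mirrors the NIL driver ID shown in the trace above.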

Pattern 2 occurs because worker.task_driver_id becomes NIL in _publish_actor_class_to_key. The actor key is right, but the driver_id information is incorrect. In this case the worker goes on to hit the following failure, but push_error_to_driver suppresses the error message and the task is reconstructed indefinitely.

Traceback (most recent call last):
  File "/home/admin/yuhong/ray/python/ray/workers/default_worker.py", line 106, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/admin/yuhong/ray/python/ray/worker.py", line 963, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/admin/yuhong/ray/python/ray/worker.py", line 891, in _wait_for_and_process_task
    driver_id, function_descriptor)
  File "/home/admin/yuhong/ray/python/ray/function_manager.py", line 453, in get_execution_info
    raise KeyError(message)
KeyError: "Error occurs in get_execution_info: driver_id: 44b1341be3f9ee0bdfb8d249f82eaa4e9cab6930, function_descriptor: FunctionDescriptor:ray.experimental.sgd.sgd_worker.SGDWorker.__init__.. Message: b'j\\xceZ\\xbd\\xba\\x98\\xf0\\x91\\xc6\\xa7m\\x80|\\xfd\\x16m$\\x94\\xf6\\xac'"

This PR includes the following changes:

  1. The function manager waits for the driver ID and keeps it as soon as it is not NIL.
  2. actor_class_info now has the driver_id item from the beginning. Otherwise, if actor_class_info is put into _actors_to_export, the driver_id information will be missing.
  3. Add an error log to push_error_to_driver.
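
The first two changes can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `FunctionManagerSketch`, `observe_driver_id`, `queue_actor_class`, and `NIL_ID` are hypothetical names, and the waiting step is left out.

```python
NIL_ID = b"\xff" * 20  # assumed stand-in for a NIL driver ID


class FunctionManagerSketch:
    """Hypothetical illustration; not Ray's function manager."""

    def __init__(self):
        self._driver_id = NIL_ID
        self._actors_to_export = []

    def observe_driver_id(self, current_id):
        # Keep the first non-NIL ID; a later NIL value cannot erase it.
        if self._driver_id == NIL_ID and current_id != NIL_ID:
            self._driver_id = current_id
        return self._driver_id

    def queue_actor_class(self, class_name, current_id):
        # Store driver_id in the info dict at creation time, so a deferred
        # export still knows which driver the class belongs to.
        info = {
            "driver_id": self.observe_driver_id(current_id),
            "class_name": class_name,
        }
        self._actors_to_export.append(info)
        return info
```

With this shape, an entry queued while the current ID is NIL still carries the previously observed driver ID.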

Related issue number

N/A

Contributor

Assume I have an actor that only creates background threads to do work. Its main thread never receives tasks.
Sleeping here will block forever.
Instead of sleeping here, I think an easier solution is to not reset task_driver_id for actors, because:

  1. If a worker is an actor, all the tasks should have the same driver id.
  2. If I have a background thread whose lifecycle spans multiple main-thread tasks, the worker must be an actor. (Normal tasks shouldn't create a thread that outlives the task. We can raise an error in this case.)
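
A rough sketch of that suggestion, assuming a worker that resets its driver ID after each task. `WorkerSketch`, `run_task`, and `NIL_ID` are hypothetical names used only for illustration.

```python
NIL_ID = b"\xff" * 20  # assumed stand-in for a NIL driver ID


class WorkerSketch:
    """Hypothetical worker; not Ray's Worker class."""

    def __init__(self, is_actor):
        self.is_actor = is_actor
        self.task_driver_id = NIL_ID

    def run_task(self, driver_id, fn):
        self.task_driver_id = driver_id
        try:
            return fn()
        finally:
            # Actors keep the driver ID between tasks, so background
            # threads never observe NIL; normal workers still reset it.
            if not self.is_actor:
                self.task_driver_id = NIL_ID
```

No sleep is needed in this version, so an actor whose main thread never receives another task cannot block on it.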

Collaborator

That sounds right. If this is happening on an actor, then we should be able to figure out the correct driver ID. If it's happening on a non-actor worker, then we should raise an error.
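
One way to picture the rule above, as a sketch only: `resolve_driver_id`, `actor_driver_id`, and `NIL_ID` are hypothetical, and how an actor would actually remember its driver ID is an assumption here.

```python
NIL_ID = b"\xff" * 20  # assumed stand-in for a NIL driver ID


def resolve_driver_id(current_id, actor_driver_id=None):
    """Return a usable driver ID or raise.

    actor_driver_id is the ID an actor remembered from its tasks;
    None means the worker is not an actor.
    """
    if current_id != NIL_ID:
        return current_id
    if actor_driver_id is not None:
        # On an actor, every task shares one driver, so fall back to it.
        return actor_driver_id
    raise RuntimeError(
        "NIL driver ID on a non-actor worker; a normal task should not "
        "leave a thread running after it finishes"
    )
```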

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10446/
Test PASSed.

@guoyuhong
Contributor Author

@raulchen Thanks for the suggestion! I have added the logic to this PR. It also looks like your test script in #3651 has passed. Shall I add this script to runtest.py, or will you open another PR to refine the multi-thread handling and add the test case later?

Contributor

@raulchen left a comment

@guoyuhong thanks! I'll do that in another PR. Something else needs to be fixed as well.

Contributor

this sleep should be removed now?
I think we can assert driver id != nil here.

Contributor Author

Logically, yes. When an actor starts another thread, worker.task_driver_id will not be nil.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10497/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10539/
Test PASSed.

@guoyuhong
Contributor Author

The ObjectID refactoring code has been moved to #3674.

if (self._worker.task_driver_id.is_nil()):
    logger.warning("export_actor_class with NIL task_driver_id, this "
                   "may happen when export_actor_class runs not in "
                   "the main thread. Will wait for the driver id.")

Contributor

This warning message needs to be changed as well?

if driver_id is None:
    driver_id = ray_constants.NIL_JOB_ID.id()
data = {} if data is None else data
logging.error("push_error_to_driver with message: %s" % message)

Contributor

also print error_type?
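
For example, something along these lines would include both fields. This is only a sketch: `log_pushed_error` and the logger name are hypothetical, and it passes the values as logging arguments instead of pre-formatting the string with `%`.

```python
import logging

logger = logging.getLogger("push_error_sketch")  # hypothetical logger name


def log_pushed_error(error_type, message):
    # Let the logging module do the formatting, and include the error
    # type so the log line identifies what kind of failure was pushed.
    logger.error(
        "push_error_to_driver with type: %s, message: %s",
        error_type,
        message,
    )
```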

Yuhong Guo and others added 2 commits January 2, 2019 17:46
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10546/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10551/
Test PASSed.

@raulchen raulchen merged commit 4b23a34 into ray-project:master Jan 3, 2019