-
Notifications
You must be signed in to change notification settings - Fork 7.2k
Fix multi-thread problem of function manager and Jenkins test #3648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
python/ray/function_manager.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume I have an actor that only creates background threads to do work. Its main thread never receives tasks.
Sleeping here will block forever.
Instead of sleeping here, I think an easier solution is not resetting task_driver_id for actors. Because:
- If a worker is an actor, all the tasks should have the same driver id.
- If I have a background thread whose lifecycle expands across multiple main-thread tasks, the worker must be an actor. (Normal tasks shouldn't create a thread whose lifecycle is longer than itself. We can raise an error in this case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right. If this is happening on an actor, then we should be able to figure out the correct driver ID. If it's happening on a non-actor worker, then we should raise an error.
|
Test PASSed. |
1fe0995 to
44e1614
Compare
raulchen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guoyuhong thanks! I'll do that in another PR. Something else needs to be fixed as well.
python/ray/function_manager.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sleep should be removed now?
I think we can assert driver id != nil here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logically, yes. When a actor started another thread, worker.task_driver_id will not be nil.
|
Test FAILed. |
44e1614 to
0f2f852
Compare
|
Test PASSed. |
|
This |
python/ray/function_manager.py
Outdated
| if (self._worker.task_driver_id.is_nil()): | ||
| logger.warning("export_actor_class with NIL task_driver_id, this " | ||
| "may happen when export_actor_class runs not in " | ||
| "the main thread. Will wait for the driver id.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning message needs to be changed as well?
python/ray/utils.py
Outdated
| if driver_id is None: | ||
| driver_id = ray_constants.NIL_JOB_ID.id() | ||
| data = {} if data is None else data | ||
| logging.error("push_error_to_driver with message: %s" % message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also print error_type?
|
Test PASSed. |
|
Test PASSed. |
What do these changes do?
The Jenkins test
mnist_example.pyuses multi-thread in actor and we found frequent Jenkins test failures recently.Pattern 1 is caused by NIL driver ID in multi-thread actor exporting. In this case, exporting is called before
worker.task_driver_idis set in main thread. Finally,load_actorwon't get the right actor key from GCS.Pattern 2 is because that
worker.task_driver_idbecomes NIL in_publish_actor_class_to_key. The actor key is right, butdriver_idinformation is not correct. Actually, in this case, worker continues with following failures, butpush_error_to_driversuppresses this error message and the task is reconstructed indefinitely.In this PR, the following changes are included,
actor_class_infowill has thedriver_iditem at the beginning. Otherwise, ifactor_class_infois put to_actors_to_exportthedriver_idinfor will be missing.push_error_to_driver.Related issue number
N/A