[Serve] ServeHandle detects ActorError and drop replicas from target group #26685
Conversation
Force-pushed from f93df4e to 53ab3a6.
Signed-off-by: simon-mo <[email protected]>
Force-pushed from 53ab3a6 to 451b3dd.
@edoakes ready for review
ray.get(handle.remote(do_crash=True))

pids = ray.get([handle.remote() for _ in range(10)])
assert len(set(pids)) == 1
assert len(handle.router._replica_set.in_flight_queries) == 1
handle = serve.run(f.bind())
pids = ray.get([handle.remote() for _ in range(2)])
assert len(set(pids)) == 2
Add one more assert to double check on the client side:
assert len(handle.router._replica_set.in_flight_queries) == 2
Signed-off-by: simon-mo <[email protected]>
python/ray/serve/router.py (outdated)
@@ -87,6 +88,12 @@ def __init__(
            {"deployment": self.deployment_name}
        )

    def _reset_replica_iterator(self):
add docstring with the behavior here (what happens to inflight & subsequent requests)
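For illustration only, a minimal sketch of what that docstring might say, wrapped in a stub class so it runs on its own. The attribute names (in_flight_queries, replica_iterator) are taken from the comments in this review, but the method body is an assumption, not the PR's actual diff.

import itertools
import random


class ReplicaSet:
    """Stub of the router's replica set, only to host the docstring sketch."""

    def __init__(self, replica_tags):
        # Queries assigned to a replica but not yet finished.
        self.in_flight_queries = {tag: set() for tag in replica_tags}
        self.replica_iterator = itertools.cycle(replica_tags)

    def _reset_replica_iterator(self):
        """Reset the iterator used to load balance requests across replicas.

        In-flight queries are left untouched: requests already assigned to a
        replica keep running (or fail) where they are. Only subsequent
        requests see the change; they are load balanced over the refreshed
        membership, so a replica that was just dropped no longer receives
        new traffic.
        """
        replicas = list(self.in_flight_queries.keys())
        random.shuffle(replicas)
        self.replica_iterator = itertools.cycle(replicas)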
python/ray/serve/router.py (outdated)
logger.exception(
    "Handle received unexpected error when processing request."
)
this will print the traceback, right?
yes
client = get_global_client()
ray.kill(client._controller, no_restart=True)
what are we testing by killing the controller? add a comment pls
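One possible way to answer this in the test itself is a comment like the one below; the two code lines are the snippet above repeated, and the comment wording is an assumption based on the PR description, not the comment actually added.

client = get_global_client()
# Kill the controller so replica membership updates are paused. The handle
# must now notice the dead replica on its own (via RayActorError) instead of
# waiting for the controller to push an updated replica set.
ray.kill(client._controller, no_restart=True)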
@@ -701,5 +702,31 @@ def ready(self):
        )


def test_handle_early_detect_failure(shutdown_ray):
please add a header comment describing the behavior that's being tested (let's try to do this in general, really helps readers in the future)
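A sketch of how the test and its header comment might look, pieced together from the snippets quoted in this review. The deployment definition, the do_crash handling via os._exit, and the import path for get_global_client are assumptions rather than the PR's literal code.

import os

import pytest
import ray
from ray import serve
from ray.exceptions import RayActorError
from ray.serve.context import get_global_client  # import path assumed


def test_handle_early_detect_failure(shutdown_ray):
    """Handle detects replica death and drops it from its replica set.

    The controller is killed first so that membership updates are paused;
    the handle must notice the RayActorError from the crashed replica on
    its own and stop routing new requests to it.
    """
    ray.init()

    @serve.deployment(num_replicas=2)
    def f(do_crash: bool = False):
        if do_crash:
            os._exit(1)  # hard-kill the replica process
        return os.getpid()

    handle = serve.run(f.bind())
    pids = ray.get([handle.remote() for _ in range(2)])
    assert len(set(pids)) == 2

    client = get_global_client()
    ray.kill(client._controller, no_restart=True)

    with pytest.raises(RayActorError):
        ray.get(handle.remote(do_crash=True))

    # Only the surviving replica should receive subsequent requests.
    pids = ray.get([handle.remote() for _ in range(10)])
    assert len(set(pids)) == 1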
I have one concern: what if the RayActorError is caused by network issues and the actor is actually still alive? Will this lead to a leak? I think we shouldn't just remove the replica; instead we should move it out and move it back in after x seconds. If the controller later removes this replica, then we remove it for good.
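A rough sketch of the alternative proposed here (quarantine a suspected-dead replica and re-add it after a delay, rather than dropping it permanently). Every name below is hypothetical; nothing like it exists in the PR.

import time


class ReplicaQuarantine:
    """Hypothetical helper: temporarily sideline suspected-dead replicas."""

    def __init__(self, requeue_after_s: float = 30.0):
        self.requeue_after_s = requeue_after_s
        self._quarantined = {}  # replica tag -> time it was sidelined

    def quarantine(self, replica_tag: str) -> None:
        # First RayActorError: stop routing to the replica, but remember it
        # in case the error was only a transient network failure.
        self._quarantined[replica_tag] = time.monotonic()

    def confirm_removed(self, replica_tag: str) -> None:
        # The controller confirmed the replica is gone: forget it permanently.
        self._quarantined.pop(replica_tag, None)

    def pop_ready_for_retry(self) -> list:
        # Replicas whose quarantine period elapsed without the controller
        # removing them are returned so the router can add them back.
        now = time.monotonic()
        ready = [
            tag
            for tag, since in self._quarantined.items()
            if now - since >= self.requeue_after_s
        ]
        for tag in ready:
            del self._quarantined[tag]
        return ready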
Signed-off-by: simon-mo <[email protected]>
@iycheng I'm a bit confused now about the semantics of RayActorError. Is this error string now out of date? (See line 269 in 0d49901.)
Signed-off-by: simon-mo <[email protected]>
@sihanwang41 @edoakes ready for another look, comments added
@simon-mo how about this? Do we plan to fix it? I think in the case of a network partition, this is going to lead to an instance leak.
I'm trying to find documentation about all cases of RayActorError, but I can't. Maybe we should have a doc about this. @jjyao do we have this?
I think it's more like we think the actor died (code), but somehow it's not. So whether the actor died depends on GCS. |
Signed-off-by: simon-mo <[email protected]>
@iycheng I'm going to track the network error case as a follow-up.
…cas from target group (ray-project#26685)" (ray-project#27283)" This reverts commit 1a10b53.
…icas from target group (ray-project#26685)" (ray-project#27283)" (ray-project#27348) Signed-off-by: simon-mo <[email protected]>
…group (ray-project#26685) Signed-off-by: Stefan van der Kleij <[email protected]>
… target group (ray-project#26685)" (ray-project#27283) This reverts commit 545c516. Signed-off-by: Stefan van der Kleij <[email protected]>
…icas from target group (ray-project#26685)" (ray-project#27283)" (ray-project#27348) Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
When the ServeController crashes, replica membership updates are paused. This means the ServeHandle will continue to send requests to replicas that have also crashed during this time. This PR shows how we can detect actor failures locally from within the handle and take those replicas out of the group it load balances across.
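In outline, the mechanism could look like the sketch below: the router's completion path catches RayActorError, removes the dead replica from its local set, and resets its iterator. The function name and the replica_set interface are modeled on the snippets reviewed above and are assumptions, not the PR's literal diff.

import logging

import ray
from ray.exceptions import RayActorError

logger = logging.getLogger("ray.serve")


def on_query_completed(replica_set, replica_tag, object_ref):
    """Hypothetical completion hook run by the handle's router."""
    # Bookkeeping: the query is no longer in flight for this replica.
    replica_set.in_flight_queries.get(replica_tag, set()).discard(object_ref)
    try:
        ray.get(object_ref)
    except RayActorError:
        # The replica actor died. Drop it locally instead of waiting for the
        # (possibly crashed) controller to push a membership update, so that
        # subsequent requests only go to live replicas.
        replica_set.in_flight_queries.pop(replica_tag, None)
        replica_set._reset_replica_iterator()
    except Exception:
        # Matches the log line reviewed above: unexpected errors are logged
        # with a traceback but do not change replica membership.
        logger.exception(
            "Handle received unexpected error when processing request."
        )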
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.