[Serve] ServeHandle detects ActorError and drop replicas from target group #26685
Conversation
Force-pushed from f93df4e to 53ab3a6.
Signed-off-by: simon-mo <[email protected]>
Force-pushed from 53ab3a6 to 451b3dd.
@edoakes ready for review
ray.get(handle.remote(do_crash=True))

pids = ray.get([handle.remote() for _ in range(10)])
assert len(set(pids)) == 1
assert len(handle.router._replica_set.in_flight_queries) == 1
handle = serve.run(f.bind())
pids = ray.get([handle.remote() for _ in range(2)])
assert len(set(pids)) == 2
Add one more assert to double check on the client side:
assert len(handle.router._replica_set.in_flight_queries) == 2
Signed-off-by: simon-mo <[email protected]>
python/ray/serve/router.py (outdated)
@@ -87,6 +88,12 @@ def __init__(
            {"deployment": self.deployment_name}
        )

    def _reset_replica_iterator(self):
add docstring with the behavior here (what happens to inflight & subsequent requests)
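For illustration only, a minimal sketch of what that docstring might say, wrapped in a stub class so it runs on its own. The attribute names (in_flight_queries, replica_iterator) are taken from the comments in this review, but the method body is an assumption, not the PR's actual diff.

import itertools
import random


class ReplicaSet:
    """Stub of the router's replica set, only to host the docstring sketch."""

    def __init__(self, replica_tags):
        # Queries assigned to a replica but not yet finished.
        self.in_flight_queries = {tag: set() for tag in replica_tags}
        self.replica_iterator = itertools.cycle(replica_tags)

    def _reset_replica_iterator(self):
        """Reset the iterator used to load balance requests across replicas.

        In-flight queries are left untouched: requests already assigned to a
        replica keep running (or fail) where they are. Only subsequent
        requests see the change; they are load balanced over the refreshed
        membership, so a replica that was just dropped no longer receives
        new traffic.
        """
        replicas = list(self.in_flight_queries.keys())
        random.shuffle(replicas)
        self.replica_iterator = itertools.cycle(replicas)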
python/ray/serve/router.py (outdated)
logger.exception(
    "Handle received unexpected error when processing request."
)
this will print the traceback, right?
yes
client = get_global_client()
ray.kill(client._controller, no_restart=True)
what are we testing by killing the controller? add a comment pls
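One possible way to answer this in the test itself is a comment like the one below; the two code lines are the snippet above repeated, and the comment wording is an assumption based on the PR description, not the comment actually added.

client = get_global_client()
# Kill the controller so replica membership updates are paused. The handle
# must now notice the dead replica on its own (via RayActorError) instead of
# waiting for the controller to push an updated replica set.
ray.kill(client._controller, no_restart=True)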
@@ -701,5 +702,31 @@ def ready(self):
        )


def test_handle_early_detect_failure(shutdown_ray):
please add a header comment describing the behavior that's being tested (let's try to do this in general, really helps readers in the future)
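A sketch of how the test and its header comment might look, pieced together from the snippets quoted in this review. The deployment definition, the do_crash handling via os._exit, and the import path for get_global_client are assumptions rather than the PR's literal code.

import os

import pytest
import ray
from ray import serve
from ray.exceptions import RayActorError
from ray.serve.context import get_global_client  # import path assumed


def test_handle_early_detect_failure(shutdown_ray):
    """Handle detects replica death and drops it from its replica set.

    The controller is killed first so that membership updates are paused;
    the handle must notice the RayActorError from the crashed replica on
    its own and stop routing new requests to it.
    """
    ray.init()

    @serve.deployment(num_replicas=2)
    def f(do_crash: bool = False):
        if do_crash:
            os._exit(1)  # hard-kill the replica process
        return os.getpid()

    handle = serve.run(f.bind())
    pids = ray.get([handle.remote() for _ in range(2)])
    assert len(set(pids)) == 2

    client = get_global_client()
    ray.kill(client._controller, no_restart=True)

    with pytest.raises(RayActorError):
        ray.get(handle.remote(do_crash=True))

    # Only the surviving replica should receive subsequent requests.
    pids = ray.get([handle.remote() for _ in range(10)])
    assert len(set(pids)) == 1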
I have one concern: what if the RayActorError is caused by network issues and the actor is actually still alive? Will this lead to a leak? I think we shouldn't just remove the replica; instead we should move it out and move it back in after x seconds. If the controller later removes this replica, then we remove it for good.
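A rough sketch of the alternative proposed here (quarantine a suspected-dead replica and re-add it after a delay, rather than dropping it permanently). Every name below is hypothetical; nothing like it exists in the PR.

import time


class ReplicaQuarantine:
    """Hypothetical helper: temporarily sideline suspected-dead replicas."""

    def __init__(self, requeue_after_s: float = 30.0):
        self.requeue_after_s = requeue_after_s
        self._quarantined = {}  # replica tag -> time it was sidelined

    def quarantine(self, replica_tag: str) -> None:
        # First RayActorError: stop routing to the replica, but remember it
        # in case the error was only a transient network failure.
        self._quarantined[replica_tag] = time.monotonic()

    def confirm_removed(self, replica_tag: str) -> None:
        # The controller confirmed the replica is gone: forget it permanently.
        self._quarantined.pop(replica_tag, None)

    def pop_ready_for_retry(self) -> list:
        # Replicas whose quarantine period elapsed without the controller
        # removing them are returned so the router can add them back.
        now = time.monotonic()
        ready = [
            tag
            for tag, since in self._quarantined.items()
            if now - since >= self.requeue_after_s
        ]
        for tag in ready:
            del self._quarantined[tag]
        return ready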
Signed-off-by: simon-mo <[email protected]>
@iycheng I'm a bit confused now about the semantics of RayActorError. Is this error string now out of date? (See line 269 in 0d49901.)
Signed-off-by: simon-mo <[email protected]>
@sihanwang41 @edoakes ready for another look, comments added
@simon-mo how about this? Do we plan to fix it? I think in the case of a network partition, this is going to lead to an instance leak.
I'm trying to find documentation about all cases of RayActorError, but I can't. Maybe we should have a doc about this. @jjyao do we have this?
I think it's more like we think the actor died (code), but somehow it's not. So whether the actor died depends on GCS. |
Signed-off-by: simon-mo <[email protected]>
@iycheng I'm going to track the network error case as a follow-up.
…cas from target group (ray-project#26685)" (ray-project#27283)" This reverts commit 1a10b53.
…icas from target group (ray-project#26685)" (ray-project#27283)" (ray-project#27348) Signed-off-by: simon-mo <[email protected]>
…group (ray-project#26685) Signed-off-by: Stefan van der Kleij <[email protected]>
… target group (ray-project#26685)" (ray-project#27283) This reverts commit 545c516. Signed-off-by: Stefan van der Kleij <[email protected]>
…icas from target group (ray-project#26685)" (ray-project#27283)" (ray-project#27348) Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
When the ServeController crashes, replica membership updates are paused. This means the ServeHandle will continue to send requests to replicas that have also crashed during this time. This PR shows how we can detect actor failures locally from within the handle and take those replicas out of the group it load balances across.
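In outline, the mechanism could look like the sketch below: the router's completion path catches RayActorError, removes the dead replica from its local set, and resets its iterator. The function name and the replica_set interface are modeled on the snippets reviewed above and are assumptions, not the PR's literal diff.

import logging

import ray
from ray.exceptions import RayActorError

logger = logging.getLogger("ray.serve")


def on_query_completed(replica_set, replica_tag, object_ref):
    """Hypothetical completion hook run by the handle's router."""
    # Bookkeeping: the query is no longer in flight for this replica.
    replica_set.in_flight_queries.get(replica_tag, set()).discard(object_ref)
    try:
        ray.get(object_ref)
    except RayActorError:
        # The replica actor died. Drop it locally instead of waiting for the
        # (possibly crashed) controller to push a membership update, so that
        # subsequent requests only go to live replicas.
        replica_set.in_flight_queries.pop(replica_tag, None)
        replica_set._reset_replica_iterator()
    except Exception:
        # Matches the log line reviewed above: unexpected errors are logged
        # with a traceback but do not change replica membership.
        logger.exception(
            "Handle received unexpected error when processing request."
        )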
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.