[RayService] Support Incremental Zero-Downtime Upgrades #3166
Conversation
Force-pushed from 8f9a396 to 486f98b.
I've now added unit tests and one basic e2e test for the incremental upgrade feature, so this should be good to start reviewing. In addition to the unit tests, here are some instructions for manually testing this feature in your cluster.
I'll post more comments with manual test results and add more e2e test cases, but this should be good to start reviewing/iterating on to get merge-ready before the v1.4 release.
Tried to test this manually and I'm not seeing the Gateway reconcile with this log line: Do I need to set
No, it should be called automatically when
Force-pushed from 83f265f to 17f7aa4.
I'm running into some issues now with the allowed ports/protocols for listeners across different Gateway controllers (e.g. the GKE controller is pretty restrictive). I'm working out how to send traffic from the Serve service to the Gateway and on to the active and pending RayCluster head services through the HTTPRoute. An alternative would be to have users send traffic directly to the Gateway, which would be set to HTTP and port
Discussed with Ryan offline: there's a validation in the GKE gateway controller that disallows port 8000 for Serve, but this validation will be removed soon. For now we'll test with allowed ports like port 80 and change it back to 8000 before merging.
I moved the e2e test to its own folder since it's an experimental feature and shouldn't be part of the pre-submit tests yet.
@ryanaoleary can you resolve all the merge conflicts? I can do some testing on this branch once the conflicts are resolved.
Force-pushed from adc6236 to 7694114.
All the conflicts have been resolved. This is the image I'm currently using for testing: us-docker.pkg.dev/ryanaoleary-gke-dev/kuberay/kuberay:latest
Fixed a
Gateway: HTTPRoute:
The behavior that
```go
const (
	// During upgrade, IncrementalUpgrade strategy will create an upgraded cluster to gradually scale
	// and migrate traffic to using Gateway API.
	IncrementalUpgrade RayServiceUpgradeType = "IncrementalUpgrade"
```
Maybe too late to change this, but wondering if RollingUpgrade would be a more appropriate name? I assume most people are more familiar with this term. WDYT @ryanaoleary @kevin85421 @MortalHappiness
(not blocking this PR, we can change it during the alpha phase)
Late to reply to this, but I have no strong preference either way. IncrementalUpgrade is what was used in the feature request and REP so that's why I stuck with it, but if there's a preference from any KubeRay maintainers or users I'm down to go through and change the feature name / all the related variable names.
cc @rueian for sharing your opinion.
I think RollingUpgrade is a more straightforward name for me too.
cc: @kevin85421 since from offline discussion you seemed to have a preference against using RollingUpgrade here
@kevin85421 what do you think about ClusterUpgrade and ClusterUpgradeOptions? I prefer to keep the upgrade term generic as the exact behavior could be changed in the future.
@Future-Outlier was also wondering about the history of why we called it "incremental" upgrades.
Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?
This PR is targeted for KubeRay v1.5. It still needs review, and I'll prioritize iterating and resolving comments to get it merged.
Force-pushed from d4ef45d to be581d6.
@andrewsykim Fixed all the merge conflicts and updated the PR; this should be good to re-review.
The failing RayJob CI test seems unrelated.
Last week, @kevin85421, @rueian, and I were thinking about the load test. In the video, you can find that
based on these 2 metrics, I think the RPS is correct. 2025-10-20.23-49-56.mp4
Agree, I think
```go
hasAccepted := false
hasProgrammed := false

for _, condition := range gatewayInstance.Status.Conditions {
```
Can you add some comments about GatewayConditionAccepted and GatewayConditionProgrammed? In addition, add comments explaining what "ready" in IsGatewayReady means.
From the GEP: https://gateway-api.sigs.k8s.io/geps/gep-1364/
> To capture the behavior that Ready currently captures, Programmed will be introduced. This means that the implementation has seen the config, has everything it needs, parsed it, and sent configuration off to the data plane. The configuration should be available "soon". We'll leave "soon" undefined for now.
The condition alone doesn't seem enough to determine whether the gateway is "ready" or not.
In addition, if the Gateway API has a related public API, we should consider using it instead of implementing this ourselves.
I don't see any utils in the public API to check the Gateway readiness status besides the existing fields we're checking here. If I'm missing them I can add them instead of this logic, but I didn't see one I can use.
Added comments explaining this helper and the status conditions we check in 71f19a9
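For readers following along, here is a minimal sketch of what such a readiness check can look like using the upstream `sigs.k8s.io/gateway-api/apis/v1` types and apimachinery's condition helpers; it mirrors the hasAccepted/hasProgrammed loop above rather than reproducing the PR's exact code:

```go
package utils

import (
	"k8s.io/apimachinery/pkg/api/meta"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// IsGatewayReady treats a Gateway as "ready" once the controller has both
// accepted its configuration (Accepted) and handed it to the data plane
// (Programmed). Per GEP-1364, Programmed only promises the config "soon",
// so callers should still verify the backends separately.
func IsGatewayReady(gw *gwv1.Gateway) bool {
	accepted := meta.IsStatusConditionTrue(gw.Status.Conditions, string(gwv1.GatewayConditionAccepted))
	programmed := meta.IsStatusConditionTrue(gw.Status.Conditions, string(gwv1.GatewayConditionProgrammed))
	return accepted && programmed
}
```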
```go
}

// IsHTTPRouteReady returns whether the HTTPRoute associated with a given Gateway has a ready condition
func IsHTTPRouteReady(gatewayInstance *gwv1.Gateway, httpRouteInstance *gwv1.HTTPRoute) bool {
```
What does "ready" refer to here? Please explain the logic behind RouteConditionAccepted and RouteConditionResolvedRefs.
"Ready" means the HTTPRoute has a parent ref for the Gateway object and that the parent has accepted and resolved the refs of the HTTPRoute:
- `RouteConditionAccepted`: the reason this can be set varies across Gateway controllers, but generally it means the HTTPRoute has a valid `Gateway` object as the parent and the route is allowed by the Gateway's listener. This condition mainly checks that the syntax of the rules is valid, but it doesn't guarantee that the backend service exists.
- `RouteConditionResolvedRefs`: all the references within the HTTPRoute have been resolved by the Gateway controller. This means that the HTTPRoute's object references are valid, exist, and the Gateway can use them. In our case it checks the RayCluster Serve service we use as a backend ref.

I can add comments explaining why we check these statuses here. I didn't see any utils where I could directly check whether an HTTPRoute is ready to serve traffic, but checking these two conditions seemed like they'd give reasonable confidence that the HTTPRoute is created and in a good state. Since we also check that the Serve service (the backend ref of the HTTPRoute) exists and that the Ray Serve deployment is healthy before migrating traffic with the HTTPRoute, I think we're sufficiently validating that the HTTPRoute can be used to serve traffic.
Added comments in 71f19a9
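As a rough illustration of the check described above (assuming readiness is evaluated per parent ref; a sketch, not the exact helper added in 71f19a9):

```go
package utils

import (
	"k8s.io/apimachinery/pkg/api/meta"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// IsHTTPRouteReady reports whether the HTTPRoute's parent status entry for the
// given Gateway has both Accepted (valid parent/listener match) and
// ResolvedRefs (all referenced backends exist and are usable) set to True.
func IsHTTPRouteReady(gw *gwv1.Gateway, route *gwv1.HTTPRoute) bool {
	for _, parent := range route.Status.Parents {
		// Match the parent ref by name; a fuller check would also compare
		// namespace, group, and kind.
		if string(parent.ParentRef.Name) != gw.Name {
			continue
		}
		accepted := meta.IsStatusConditionTrue(parent.Conditions, string(gwv1.RouteConditionAccepted))
		resolved := meta.IsStatusConditionTrue(parent.Conditions, string(gwv1.RouteConditionResolvedRefs))
		return accepted && resolved
	}
	return false
}
```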
```go
	return headlessService
}

// GetServePort finds the container port named "serve" in the RayCluster's head group spec.
```
Use utils.FindContainerPort instead of GetServePort?
Done in 71f19a9; we now call utils.FindContainerPort from the GetServePort helper. I left the helper function in place because it's still useful and encapsulates the container-port logic rather than copy-pasting this code multiple times in the createHTTPRoute function.
In 71f19a9 I also changed utils.FindContainerPort to return an int32, since that's what's required for the port number and int32->int conversions are safer.
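A sketch of the resulting shape, under the assumption that `FindContainerPort` takes (container, port name, default port) and now returns int32; the head container index (0) and the 8000 fallback are illustrative:

```go
package utils

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// GetServePort returns the head group's "serve" container port, delegating the
// lookup to the shared FindContainerPort utility so the logic isn't duplicated
// inside createHTTPRoute. 8000 is Ray Serve's conventional default port.
func GetServePort(cluster *rayv1.RayCluster) int32 {
	headContainer := &cluster.Spec.HeadGroupSpec.Template.Spec.Containers[0]
	return FindContainerPort(headContainer, "serve", 8000)
}
```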
```go
// getHTTPRouteTrafficWeights fetches the HTTPRoute associated with a RayService and returns
// the traffic weights for the active and pending clusters.
func (r *RayServiceReconciler) getHTTPRouteTrafficWeights(ctx context.Context, rayServiceInstance *rayv1.RayService) (activeWeight int32, pendingWeight int32, err error) {
```
It's better to pass the route instance you reconcile in this reconciliation into calculateStatus instead of associating it again; re-fetching may introduce inconsistency.
Changed it to follow this pattern in c23f901
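For illustration, a minimal sketch of deriving the weights from the already-reconciled HTTPRoute instance (the helper name and the two-backend layout are assumptions on my part):

```go
package ray

import (
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// getTrafficWeightsFromRoute reads the active and pending weights straight off
// the HTTPRoute reconciled in this pass, avoiding a second GET that could
// observe a different version of the object. It assumes rule 0 holds the
// active (index 0) and pending (index 1) backend refs.
func getTrafficWeightsFromRoute(route *gwv1.HTTPRoute) (activeWeight, pendingWeight int32) {
	if len(route.Spec.Rules) == 0 || len(route.Spec.Rules[0].BackendRefs) < 2 {
		return 0, 0
	}
	refs := route.Spec.Rules[0].BackendRefs
	if refs[0].Weight != nil {
		activeWeight = *refs[0].Weight
	}
	if refs[1].Weight != nil {
		pendingWeight = *refs[1].Weight
	}
	return activeWeight, pendingWeight
}
```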
```go
// reconcilePromotionAndServingStatus handles the promotion logic after an upgrade, returning
// isPendingClusterServing: True if the main Kubernetes services are pointing to the pending cluster.
func (r *RayServiceReconciler) reconcilePromotionAndServingStatus(ctx context.Context, headSvc, serveSvc *corev1.Service, rayServiceInstance *rayv1.RayService, pendingCluster *rayv1.RayCluster) (isPendingClusterServing bool) {
```
Does incremental upgrade rely on the logic between L279 - L297? If not, we should consider separating them into two different functions, like:
```go
should_promote = false
if incremental_upgrade_enabled {
  should_promote = should_promote_1(...)
} else {
  // zero downtime upgrade
  should_promote = should_promote_2(...)
}
if should_promote {
  promote ...
}
```
I wouldn't say the upgrade relies on that logic, but they're existing safety checks that I think are still relevant for both the upgrade and non-upgrade paths. We probably want to keep the checks in L279-297 for incremental upgrade, like this one:
```go
if clusterSvcPointsTo != utils.GetRayClusterNameFromService(serveSvc) {
	panic("headSvc and serveSvc are not pointing to the same cluster")
}
```
If the service that we're reconciling does not point to either RayCluster (during either an incremental upgrade or the regular, existing code path), this indicates a broken state in the controller that we should panic on.
Hi @ryanaoleary, I'm doing a load test with Ray Serve. To do this, I need to:
Here's the test script I'm trying to use; I am still working on the setup:
How to do a load test on Ray Serve with 1500+ RPS?

Note: this is the code we are going to run.

```python
import ray
from ray import serve
from starlette.requests import Request


@serve.deployment()
class SimpleDeployment:
    def __init__(self):
        self.counter = 0

    async def __call__(self, request: Request):
        self.counter += 1
        return {
            "status": "ok",
            "counter": self.counter,
            "message": "processed"
        }


app = SimpleDeployment.bind()
```

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 10
      maxSurgePercent: 5
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 3
              max_replicas: 4
              target_ongoing_requests: 300
              max_ongoing_requests: 1500
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
  rayClusterConfig:
    rayVersion: "2.46.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              resources:
                requests:
                  cpu: "1"
                  memory: "1Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 0
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "2"
                    memory: "1Gi"
```
```
kubectl port-forward svc/stress-test-serve-gateway-istio 8080:80 -n default
```
```python
from locust import HttpUser, task, constant


class AppUser(HttpUser):
    wait_time = constant(0)  # Each user has a constant wait time of 0

    def on_start(self):
        """Called when a user starts"""
        self.client.verify = False  # Disable SSL verification if needed

    @task
    def test_endpoint(self):
        """Test the main fruit endpoint"""
        response = self.client.get("/")
        if response.status_code != 200:
            print(f"Error: {response.status_code} - {response.text}")
```

```
locust -f ./locust_example.py --host http://localhost:8080/
```
```
kubectl port-forward stress-test-serve-45dn2-head 8265:8265 -n default
open http://localhost:8265/#/actors
```
(You should see that all 4 proxy actors are hitting the CPU usage limit.)
1st test (maxSurgePercent is large): looks good!
```yaml
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 20
```
2nd test (maxSurgePercent is small):
```yaml
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      intervalSeconds: 10
      maxSurgePercent: 5
      stepSizePercent: 20
```
cc @rueian @ryanaoleary @kevin85421 I believe this is the load test you want to see!
Hi, I think we can get this merged.
cc @ryanaoleary @andrewsykim @rueian @kevin85421
The following are tests I've done:
- extreme case (behavior should act as `type: NewCluster`): #3166 (comment)
- 1 min_replica, 3 max_replica in serveConfigV2: #3166 (comment)
- other cases + reproduce script: #3166 (comment)
How did I find the RPS limit?
I used binary search to find the RPS limit. Take #3166 (comment) as an example:
(1) with 30 users, the response time is stable
(2) with 50 users, the response time starts climbing
(3) with 40 users, the response time is stable
The reason we see RPS fluctuation is:
1. we change the `target_capacity` in Ray Serve
2. we keep sending lots of requests to the Ray cluster (which triggers autoscaling)
3. the Ray autoscaler then tries to:
   a. in the old cluster, delete a worker pod (maps to a Ray Serve replica in our example)
   b. in the new cluster, create a worker pod (maps to a Ray Serve replica in our example)
4. while step 3 is happening but the new cluster's worker pod is not yet ready:
   a. we already route requests to the new cluster's serve svc (which doesn't have enough capacity)
   b. we need to wait for the new worker pod to start before we get the expected RPS
Testing 1 more case, similar to #3166 (comment):
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 10
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 1
              max_replicas: 3
              target_ongoing_requests: 1
              max_ongoing_requests: 2
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
```
My logs:
```
INFO 2025-10-24 23:41:23,125 controller 940 -- Controller starting (version='2.47.0').
INFO 2025-10-24 23:41:23,133 controller 940 -- Starting proxy on node '1ab47ad6bc01fa77162ef59a45fc70027f38a1424843a5a9762c1299' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:41:23,935 controller 940 -- Target capacity scaling up from None to 0.0.
INFO 2025-10-24 23:41:23,935 controller 940 -- Deploying new app 'stress_test_app'.
INFO 2025-10-24 23:41:23,936 controller 940 -- Importing and building app 'stress_test_app'.
INFO 2025-10-24 23:41:23,963 controller 940 -- Target capacity scaling up from 0.0 to 10.0.
INFO 2025-10-24 23:41:23,964 controller 940 -- Received new config for application 'stress_test_app'. Cancelling previous request.
INFO 2025-10-24 23:41:23,965 controller 940 -- Importing and building app 'stress_test_app'.
INFO 2025-10-24 23:41:26,120 controller 940 -- Imported and built app 'stress_test_app' successfully.
INFO 2025-10-24 23:41:26,122 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:41:26,228 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:41:26,228 controller 940 -- Starting Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:41:42,972 controller 940 -- Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') started successfully on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c' after 16.7s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:41:42,977 controller 940 -- Starting proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:41:48,666 controller 940 -- Target capacity scaling up from 10.0 to 20.0.
INFO 2025-10-24 23:41:48,745 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:42:19,552 controller 940 -- Target capacity scaling up from 20.0 to 30.0.
INFO 2025-10-24 23:42:19,613 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:42:50,264 controller 940 -- Target capacity scaling up from 30.0 to 40.0.
INFO 2025-10-24 23:42:50,339 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:43:21,000 controller 940 -- Target capacity scaling up from 40.0 to 50.0.
INFO 2025-10-24 23:43:21,054 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:43:21,684 controller 940 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 1 to 2 replicas. Current ongoing requests: 55.97, current running replicas: 1.
INFO 2025-10-24 23:43:21,685 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:43:21,685 controller 940 -- Starting Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:43:39,655 controller 940 -- Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63' after 18.0s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:43:39,659 controller 940 -- Starting proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:43:49,717 controller 940 -- Target capacity scaling up from 50.0 to 60.0.
INFO 2025-10-24 23:43:49,751 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:44:20,697 controller 940 -- Target capacity scaling up from 60.0 to 70.0.
INFO 2025-10-24 23:44:20,778 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:44:52,723 controller 940 -- Target capacity scaling up from 70.0 to 80.0.
INFO 2025-10-24 23:44:52,737 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:45:23,480 controller 940 -- Target capacity scaling up from 80.0 to 90.0.
INFO 2025-10-24 23:45:23,501 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:45:24,135 controller 940 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 3 replicas. Current ongoing requests: 58.21, current running replicas: 2.
INFO 2025-10-24 23:45:24,135 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:45:24,135 controller 940 -- Starting Replica(id='zrlqe7u5', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:45:39,408 controller 940 -- Replica(id='zrlqe7u5', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '5ec8a7acf6f2aecfeb0b8e773ab3ab723f6d80fee726aaa8adfd231d' after 15.3s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:45:39,412 controller 940 -- Starting proxy on node '5ec8a7acf6f2aecfeb0b8e773ab3ab723f6d80fee726aaa8adfd231d' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:45:53,896 controller 940 -- Target capacity scaling up from 90.0 to 100.0.
INFO 2025-10-24 23:45:53,917 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 3).
INFO 2025-10-24 23:57:29,069 controller 940 -- Downscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 3 to 2 replicas. Current ongoing requests: 1.29, current running replicas: 3.
INFO 2025-10-24 23:57:29,070 controller 940 -- Removing 1 replica from Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:57:29,070 controller 940 -- Stopping Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') (currently RUNNING).
INFO 2025-10-24 23:57:29,074 controller 940 -- Draining proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c'.
INFO 2025-10-24 23:57:31,128 controller 940 -- Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') is stopped.
INFO 2025-10-24 23:57:32,398 controller 940 -- Downscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 1 replicas. Current ongoing requests: 0.41, current running replicas: 2.
INFO 2025-10-24 23:57:32,399 controller 940 -- Removing 1 replica from Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:57:32,399 controller 940 -- Stopping Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') (currently RUNNING).
INFO 2025-10-24 23:57:32,403 controller 940 -- Draining proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63'.
INFO 2025-10-24 23:57:34,441 controller 940 -- Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') is stopped.
INFO 2025-10-24 23:58:00,379 controller 940 -- Removing drained proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c'.
INFO 2025-10-24 23:58:03,575 controller 940 -- Removing drained proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63'.
```
I believe we still need to copy the current num_replicas for each Ray Serve deployment from the active cluster to the pending cluster before applying target_capacity for traffic migration, to better smooth out those traffic drops when Ray Serve autoscaling is enabled. But this can be done in a follow-up PR.
Makes a lot of sense, thank you!
kevin85421 left a comment:
I only reviewed the APIs and discussed the tests with @Future-Outlier; I didn't review the implementation details. I will merge the PR because Han-Ru and Rueian have already approved the implementation.
We should be very careful about this. I strongly prefer not to do that until we receive enough signals. This should be Ray's responsibility, in my opinion. If KubeRay gets that deeply involved with the data plane logic, the behavior will be very hard to get right.
For zero-downtime upgrade, we also have the same issue. We don't create Ray Serve applications in the new RayCluster based on the current status of Ray Serve applications in the old RayCluster, and it still works well. In addition, this doesn't align with Kubernetes's philosophy. We should keep the logic as stateless as possible; KubeRay should reconcile CRs based on their YAML whenever possible.
…58293) This PR adds a guide for the new zero-downtime incremental upgrade feature in KubeRay v1.5. This feature was implemented in ray-project/kuberay#3166. ray-project/kuberay#3209
Docs link: https://anyscale-ray--58293.com.readthedocs.build/en/58293/serve/advanced-guides/incremental-upgrade.html#rayservice-zero-downtime-incremental-upgrades











Why are these changes needed?
This PR implements an alpha version of the RayService Incremental Upgrade REP.
The RayService controller logic to reconcile a RayService during an incremental upgrade is as follows:
1. Validate the `IncrementalUpgradeOptions` and accept/reject the RayService CR accordingly.
2. `reconcileGateway`: on the first call this should create a new Gateway CR, and subsequent calls will update the `Listeners` as necessary based on any changes to the RayService.Spec.
3. `reconcileHTTPRoute`: on the first call this should create an HTTPRoute CR with two `backendRefs`, one pointing to the old cluster and one to the pending cluster, with `weight`s 100 and 0 respectively. Every subsequent call to `reconcileHTTPRoute` will update the HTTPRoute by changing the `weight` of each `backendRef` by `StepSizePercent` until the `weight` associated with each cluster equals the `TargetCapacity` associated with that cluster. The `backendRef` `weight` is exposed through the RayService status field `TrafficRoutedPercent`. The `weight` is only changed if at least `IntervalSeconds` have passed since `RayService.Status.LastTrafficMigratedTime`; otherwise the controller waits until the next iteration and checks again.
4. Check whether `TrafficRoutedPercent == TargetCapacity`; if so, the `target_capacity` can be updated for one of the clusters in `reconcileServeTargetCapacity`. If the total `target_capacity` of both Serve configs is less than or equal to 100%, the pending cluster's `target_capacity` can be safely scaled up by `MaxSurgePercent`. If the total `target_capacity` is greater than 100%, the active cluster's `target_capacity` can be decreased by `MaxSurgePercent`. (A sketch of this rule follows after the list.)
5. Once the `TargetCapacity` and `TrafficRoutedPercent` of the pending RayService RayCluster equal 100%, the upgrade is complete.
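A sketch of the `MaxSurgePercent` rule from step 4 above (hypothetical helper, not the PR's exact code; uses Go 1.21's built-in min/max):

```go
// nextTargetCapacities applies one reconcile step of the MaxSurgePercent rule:
// while the combined target_capacity is at or under 100%, grow the pending
// cluster; otherwise shrink the active cluster, each by at most maxSurgePercent.
func nextTargetCapacities(active, pending, maxSurgePercent int32) (newActive, newPending int32) {
	if active+pending <= 100 {
		pending = min(pending+maxSurgePercent, 100) // scale pending up toward 100%
	} else {
		active = max(active-maxSurgePercent, 0) // scale active down toward 0%
	}
	return active, pending
}
```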
Related issue number
#3209