[RayService] Support Incremental Zero-Downtime Upgrades #3166
Conversation
Force-pushed from 8f9a396 to 486f98b.
I've now added unit tests and one basic e2e test for the incremental upgrade feature, so this should be good to start reviewing. In addition to the unit tests, here are some instructions for manually testing this feature in your cluster.
I'll post more comments with manual test results and add more e2e test cases, but this should be good to start reviewing/iterating on to get merge-ready before the v1.4 release.
Tried to test this manually and I'm not seeing the Gateway reconcile with this log line: Do I need to set
No, it should be called automatically when
Force-pushed from 83f265f to 17f7aa4.
I'm running into some issues now with the allowed ports/protocols for listeners across different Gateway controllers (e.g. the GKE controller is pretty restrictive). I'm working out how to send traffic from the Serve service to the Gateway and on to the active and pending RayCluster head services through the HTTPRoute. An alternative would be to have users send traffic directly to the Gateway, which would be set to HTTP and port
Discussed with Ryan offline: there's a validation in the GKE gateway controller that disallows port 8000 for Serve, but this validation will be removed soon. For now we'll test with allowed ports like port 80 and change it back to 8000 before merging.
I moved the e2e test to its own folder since it's an experimental feature and shouldn't be part of the pre-submit tests yet.
@ryanaoleary can you resolve all the merge conflicts? I can do some testing on this branch once the conflicts are resolved.
Force-pushed from adc6236 to 7694114.
All the conflicts have been resolved. This is the image I'm currently using for testing: us-docker.pkg.dev/ryanaoleary-gke-dev/kuberay/kuberay:latest
Fixed a
Gateway: HTTPRoute:
The behavior that
```go
const (
	// During upgrade, IncrementalUpgrade strategy will create an upgraded cluster to gradually scale
	// and migrate traffic to using Gateway API.
	IncrementalUpgrade RayServiceUpgradeType = "IncrementalUpgrade"
```
Maybe too late to change this, but wondering if RollingUpgrade would be a more appropriate name? I assume most people are more familiar with this term. WDYT @ryanaoleary @kevin85421 @MortalHappiness
(not blocking this PR, we can change it during the alpha phase)
Late to reply to this, but I have no strong preference either way. IncrementalUpgrade is what was used in the feature request and REP so that's why I stuck with it, but if there's a preference from any KubeRay maintainers or users I'm down to go through and change the feature name / all the related variable names.
cc @rueian for sharing your opinion.
I think RollingUpgrade is a more straightforward name for me too.
cc: @kevin85421 since from offline discussion you seemed to have a preference against using RollingUpgrade here
@kevin85421 what do you think about ClusterUpgrade and ClusterUpgradeOptions? I prefer to keep the upgrade term generic as the exact behavior could be changed in the future.
@Future-Outlier was also wondering about the history of why we called it "incremental" upgrades.
Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?
This PR is targeted for KubeRay v1.5. It still needs review, and I'll prioritize iterating and resolving comments to get it merged.
Force-pushed from d4ef45d to be581d6.
@andrewsykim Fixed all the merge conflicts and updated the PR; this should be good to re-review.
The failing RayJob CI test seems unrelated.
Last week, @kevin85421, @rueian, and I were thinking about the load test. In the video, you can find that
based on these 2 metrics, I think the RPS is correct. 2025-10-20.23-49-56.mp4
Agree, I think
```go
hasAccepted := false
hasProgrammed := false

for _, condition := range gatewayInstance.Status.Conditions {
```
Can you add some comments about GatewayConditionAccepted and GatewayConditionProgrammed? In addition, add comments explaining what "ready" in IsGatewayReady means.
From the GEP: https://gateway-api.sigs.k8s.io/geps/gep-1364/
> To capture the behavior that Ready currently captures, Programmed will be introduced. This means that the implementation has seen the config, has everything it needs, parsed it, and sent configuration off to the data plane. The configuration should be available "soon". We'll leave "soon" undefined for now.
The condition alone doesn't seem enough to determine whether the gateway is "ready" or not.
In addition, if the Gateway API has a related public API, we should consider using it instead of implementing this ourselves.
I don't see any utils in the public API to check the Gateway readiness status besides the existing fields we're checking here. If I'm missing them I can add them instead of this logic, but I didn't see one I can use.
Added comments explaining this helper and the status conditions we check in 71f19a9
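For readers following along, here is a minimal sketch of what such a readiness check can look like using the upstream `sigs.k8s.io/gateway-api/apis/v1` types and apimachinery's condition helpers; it mirrors the hasAccepted/hasProgrammed loop above rather than reproducing the PR's exact code:

```go
package utils

import (
	"k8s.io/apimachinery/pkg/api/meta"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// IsGatewayReady treats a Gateway as "ready" once the controller has both
// accepted its configuration (Accepted) and handed it to the data plane
// (Programmed). Per GEP-1364, Programmed only promises the config "soon",
// so callers should still verify the backends separately.
func IsGatewayReady(gw *gwv1.Gateway) bool {
	accepted := meta.IsStatusConditionTrue(gw.Status.Conditions, string(gwv1.GatewayConditionAccepted))
	programmed := meta.IsStatusConditionTrue(gw.Status.Conditions, string(gwv1.GatewayConditionProgrammed))
	return accepted && programmed
}
```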
```go
}

// IsHTTPRouteReady returns whether the HTTPRoute associated with a given Gateway has a ready condition
func IsHTTPRouteReady(gatewayInstance *gwv1.Gateway, httpRouteInstance *gwv1.HTTPRoute) bool {
```
What does "ready" refer to here? Please explain the logic behind RouteConditionAccepted and RouteConditionResolvedRefs.
"Ready" means the HTTPRoute has a parent ref for the Gateway object and that the parent has accepted and resolved the refs of the HTTPRoute:
- `RouteConditionAccepted`: the reason this can be set varies across Gateway controllers, but generally it means the HTTPRoute has a valid `Gateway` object as the parent and the route is allowed by the Gateway's listener. This condition mainly checks that the syntax of the rules is valid, but it doesn't guarantee that the backend service exists.
- `RouteConditionResolvedRefs`: all the references within the HTTPRoute have been resolved by the Gateway controller. This means that the HTTPRoute's object references are valid, exist, and the Gateway can use them. In our case it checks the RayCluster Serve service we use as a backend ref.

I can add comments explaining why we check these statuses here. I didn't see any utils where I could directly check whether an HTTPRoute is ready to serve traffic, but checking these two conditions seemed like they'd give reasonable confidence that the HTTPRoute is created and in a good state. Since we also check that the Serve service (the backend ref of the HTTPRoute) exists and that the Ray Serve deployment is healthy before migrating traffic with the HTTPRoute, I think we're sufficiently validating that the HTTPRoute can be used to serve traffic.
Added comments in 71f19a9
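As a rough illustration of the check described above (assuming readiness is evaluated per parent ref; a sketch, not the exact helper added in 71f19a9):

```go
package utils

import (
	"k8s.io/apimachinery/pkg/api/meta"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// IsHTTPRouteReady reports whether the HTTPRoute's parent status entry for the
// given Gateway has both Accepted (valid parent/listener match) and
// ResolvedRefs (all referenced backends exist and are usable) set to True.
func IsHTTPRouteReady(gw *gwv1.Gateway, route *gwv1.HTTPRoute) bool {
	for _, parent := range route.Status.Parents {
		// Match the parent ref by name; a fuller check would also compare
		// namespace, group, and kind.
		if string(parent.ParentRef.Name) != gw.Name {
			continue
		}
		accepted := meta.IsStatusConditionTrue(parent.Conditions, string(gwv1.RouteConditionAccepted))
		resolved := meta.IsStatusConditionTrue(parent.Conditions, string(gwv1.RouteConditionResolvedRefs))
		return accepted && resolved
	}
	return false
}
```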
```go
	return headlessService
}

// GetServePort finds the container port named "serve" in the RayCluster's head group spec.
```
Use utils.FindContainerPort instead of GetServePort?
Done in 71f19a9; we now call utils.FindContainerPort from the GetServePort helper. I left the helper function in place because it's still useful and encapsulates the container-port logic rather than copy-pasting this code multiple times in the createHTTPRoute function.
In 71f19a9 I also changed utils.FindContainerPort to return an int32, since that's what's required for the port number and int32->int conversions are safer.
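A sketch of the resulting shape, under the assumption that `FindContainerPort` takes (container, port name, default port) and now returns int32; the head container index (0) and the 8000 fallback are illustrative:

```go
package utils

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// GetServePort returns the head group's "serve" container port, delegating the
// lookup to the shared FindContainerPort utility so the logic isn't duplicated
// inside createHTTPRoute. 8000 is Ray Serve's conventional default port.
func GetServePort(cluster *rayv1.RayCluster) int32 {
	headContainer := &cluster.Spec.HeadGroupSpec.Template.Spec.Containers[0]
	return FindContainerPort(headContainer, "serve", 8000)
}
```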
```go
// getHTTPRouteTrafficWeights fetches the HTTPRoute associated with a RayService and returns
// the traffic weights for the active and pending clusters.
func (r *RayServiceReconciler) getHTTPRouteTrafficWeights(ctx context.Context, rayServiceInstance *rayv1.RayService) (activeWeight int32, pendingWeight int32, err error) {
```
It's better to pass the route instance you reconcile in this reconciliation into calculateStatus instead of associating it again; re-fetching may introduce inconsistency.
Changed it to follow this pattern in c23f901
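For illustration, a minimal sketch of deriving the weights from the already-reconciled HTTPRoute instance (the helper name and the two-backend layout are assumptions on my part):

```go
package ray

import (
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// getTrafficWeightsFromRoute reads the active and pending weights straight off
// the HTTPRoute reconciled in this pass, avoiding a second GET that could
// observe a different version of the object. It assumes rule 0 holds the
// active (index 0) and pending (index 1) backend refs.
func getTrafficWeightsFromRoute(route *gwv1.HTTPRoute) (activeWeight, pendingWeight int32) {
	if len(route.Spec.Rules) == 0 || len(route.Spec.Rules[0].BackendRefs) < 2 {
		return 0, 0
	}
	refs := route.Spec.Rules[0].BackendRefs
	if refs[0].Weight != nil {
		activeWeight = *refs[0].Weight
	}
	if refs[1].Weight != nil {
		pendingWeight = *refs[1].Weight
	}
	return activeWeight, pendingWeight
}
```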
```go
// reconcilePromotionAndServingStatus handles the promotion logic after an upgrade, returning
// isPendingClusterServing: True if the main Kubernetes services are pointing to the pending cluster.
func (r *RayServiceReconciler) reconcilePromotionAndServingStatus(ctx context.Context, headSvc, serveSvc *corev1.Service, rayServiceInstance *rayv1.RayService, pendingCluster *rayv1.RayCluster) (isPendingClusterServing bool) {
```
Does incremental upgrade rely on the logic between L279 - L297? If not, we should consider separating them into two different functions, like:
```go
should_promote = false
if incremental_upgrade_enabled {
  should_promote = should_promote_1(...)
} else {
  // zero downtime upgrade
  should_promote = should_promote_2(...)
}
if should_promote {
  promote ...
}
```
I wouldn't say the upgrade relies on that logic, but they're existing safety checks that I think are still relevant for both the upgrade and non-upgrade paths. We probably want to keep the checks in L279-297 for incremental upgrade, like this one:
```go
if clusterSvcPointsTo != utils.GetRayClusterNameFromService(serveSvc) {
	panic("headSvc and serveSvc are not pointing to the same cluster")
}
```
If the service that we're reconciling does not point to either RayCluster (during either an incremental upgrade or the regular, existing code path), this indicates a broken state in the controller that we should panic on.
Hi @ryanaoleary, I'm doing a load test with Ray Serve. To do this, I need to:
Here's the test script I'm trying to use; I am still working on the setup:
How to do a load test on Ray Serve with 1500+ RPS?

Note: this is the code we are going to run.

```python
import ray
from ray import serve
from starlette.requests import Request


@serve.deployment()
class SimpleDeployment:
    def __init__(self):
        self.counter = 0

    async def __call__(self, request: Request):
        self.counter += 1
        return {
            "status": "ok",
            "counter": self.counter,
            "message": "processed"
        }


app = SimpleDeployment.bind()
```

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 10
      maxSurgePercent: 5
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 3
              max_replicas: 4
              target_ongoing_requests: 300
              max_ongoing_requests: 1500
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
  rayClusterConfig:
    rayVersion: "2.46.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              resources:
                requests:
                  cpu: "1"
                  memory: "1Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 0
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "2"
                    memory: "1Gi"
```
```
kubectl port-forward svc/stress-test-serve-gateway-istio 8080:80 -n default
```
```python
from locust import HttpUser, task, constant


class AppUser(HttpUser):
    wait_time = constant(0)  # Each user has a constant wait time of 0

    def on_start(self):
        """Called when a user starts"""
        self.client.verify = False  # Disable SSL verification if needed

    @task
    def test_endpoint(self):
        """Test the main fruit endpoint"""
        response = self.client.get("/")
        if response.status_code != 200:
            print(f"Error: {response.status_code} - {response.text}")
```

```
locust -f ./locust_example.py --host http://localhost:8080/
```
```
kubectl port-forward stress-test-serve-45dn2-head 8265:8265 -n default
open http://localhost:8265/#/actors
```
(You should see that all 4 proxy actors are hitting the CPU usage limit.)
1st test (maxSurgePercent is large): looks good!
```yaml
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 20
```
2nd test (maxSurgePercent is small):
```yaml
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      intervalSeconds: 10
      maxSurgePercent: 5
      stepSizePercent: 20
```
cc @rueian @ryanaoleary @kevin85421 I believe this is the load test you want to see!
Hi, I think we can get this merged.
cc @ryanaoleary @andrewsykim @rueian @kevin85421
The following are tests I've done:
- extreme case (behavior should act as `type: NewCluster`): #3166 (comment)
- 1 min_replica, 3 max_replica in serveConfigV2: #3166 (comment)
- other cases + reproduce script: #3166 (comment)
How did I find the RPS limit?
I used binary search to find the RPS limit. Take #3166 (comment) as an example:
(1) with 30 users, the response time is stable
(2) with 50 users, the response time starts climbing
(3) with 40 users, the response time is stable
The reason we see RPS fluctuation is:
1. we change the `target_capacity` in Ray Serve
2. we keep sending lots of requests to the Ray cluster (which triggers autoscaling)
3. the Ray autoscaler then tries to:
   a. in the old cluster, delete a worker pod (maps to a Ray Serve replica in our example)
   b. in the new cluster, create a worker pod (maps to a Ray Serve replica in our example)
4. while step 3 is happening but the new cluster's worker pod is not yet ready:
   a. we already route requests to the new cluster's serve svc (which doesn't have enough capacity)
   b. we need to wait for the new worker pod to start before we get the expected RPS
Testing 1 more case, similar to #3166 (comment):
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 10
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 1
              max_replicas: 3
              target_ongoing_requests: 1
              max_ongoing_requests: 2
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
```
My logs:
```
INFO 2025-10-24 23:41:23,125 controller 940 -- Controller starting (version='2.47.0').
INFO 2025-10-24 23:41:23,133 controller 940 -- Starting proxy on node '1ab47ad6bc01fa77162ef59a45fc70027f38a1424843a5a9762c1299' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:41:23,935 controller 940 -- Target capacity scaling up from None to 0.0.
INFO 2025-10-24 23:41:23,935 controller 940 -- Deploying new app 'stress_test_app'.
INFO 2025-10-24 23:41:23,936 controller 940 -- Importing and building app 'stress_test_app'.
INFO 2025-10-24 23:41:23,963 controller 940 -- Target capacity scaling up from 0.0 to 10.0.
INFO 2025-10-24 23:41:23,964 controller 940 -- Received new config for application 'stress_test_app'. Cancelling previous request.
INFO 2025-10-24 23:41:23,965 controller 940 -- Importing and building app 'stress_test_app'.
INFO 2025-10-24 23:41:26,120 controller 940 -- Imported and built app 'stress_test_app' successfully.
INFO 2025-10-24 23:41:26,122 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:41:26,228 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:41:26,228 controller 940 -- Starting Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:41:42,972 controller 940 -- Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') started successfully on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c' after 16.7s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:41:42,977 controller 940 -- Starting proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:41:48,666 controller 940 -- Target capacity scaling up from 10.0 to 20.0.
INFO 2025-10-24 23:41:48,745 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:42:19,552 controller 940 -- Target capacity scaling up from 20.0 to 30.0.
INFO 2025-10-24 23:42:19,613 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:42:50,264 controller 940 -- Target capacity scaling up from 30.0 to 40.0.
INFO 2025-10-24 23:42:50,339 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:43:21,000 controller 940 -- Target capacity scaling up from 40.0 to 50.0.
INFO 2025-10-24 23:43:21,054 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:43:21,684 controller 940 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 1 to 2 replicas. Current ongoing requests: 55.97, current running replicas: 1.
INFO 2025-10-24 23:43:21,685 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:43:21,685 controller 940 -- Starting Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:43:39,655 controller 940 -- Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63' after 18.0s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:43:39,659 controller 940 -- Starting proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:43:49,717 controller 940 -- Target capacity scaling up from 50.0 to 60.0.
INFO 2025-10-24 23:43:49,751 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:44:20,697 controller 940 -- Target capacity scaling up from 60.0 to 70.0.
INFO 2025-10-24 23:44:20,778 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:44:52,723 controller 940 -- Target capacity scaling up from 70.0 to 80.0.
INFO 2025-10-24 23:44:52,737 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:45:23,480 controller 940 -- Target capacity scaling up from 80.0 to 90.0.
INFO 2025-10-24 23:45:23,501 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:45:24,135 controller 940 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 3 replicas. Current ongoing requests: 58.21, current running replicas: 2.
INFO 2025-10-24 23:45:24,135 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:45:24,135 controller 940 -- Starting Replica(id='zrlqe7u5', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:45:39,408 controller 940 -- Replica(id='zrlqe7u5', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '5ec8a7acf6f2aecfeb0b8e773ab3ab723f6d80fee726aaa8adfd231d' after 15.3s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:45:39,412 controller 940 -- Starting proxy on node '5ec8a7acf6f2aecfeb0b8e773ab3ab723f6d80fee726aaa8adfd231d' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:45:53,896 controller 940 -- Target capacity scaling up from 90.0 to 100.0.
INFO 2025-10-24 23:45:53,917 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 3).
INFO 2025-10-24 23:57:29,069 controller 940 -- Downscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 3 to 2 replicas. Current ongoing requests: 1.29, current running replicas: 3.
INFO 2025-10-24 23:57:29,070 controller 940 -- Removing 1 replica from Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:57:29,070 controller 940 -- Stopping Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') (currently RUNNING).
INFO 2025-10-24 23:57:29,074 controller 940 -- Draining proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c'.
INFO 2025-10-24 23:57:31,128 controller 940 -- Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') is stopped.
INFO 2025-10-24 23:57:32,398 controller 940 -- Downscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 1 replicas. Current ongoing requests: 0.41, current running replicas: 2.
INFO 2025-10-24 23:57:32,399 controller 940 -- Removing 1 replica from Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:57:32,399 controller 940 -- Stopping Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') (currently RUNNING).
INFO 2025-10-24 23:57:32,403 controller 940 -- Draining proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63'.
INFO 2025-10-24 23:57:34,441 controller 940 -- Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') is stopped.
INFO 2025-10-24 23:58:00,379 controller 940 -- Removing drained proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c'.
INFO 2025-10-24 23:58:03,575 controller 940 -- Removing drained proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63'.
```
I believe we still need to copy the current num_replicas for each Ray Serve deployment from the active cluster to the pending cluster before applying target_capacity for traffic migration, to better smooth out those traffic drops when Ray Serve autoscaling is enabled. But this can be done in a follow-up PR.
Makes a lot of sense, thank you!
kevin85421 left a comment:
I only reviewed the APIs and discussed the tests with @Future-Outlier; I didn't review the implementation details. I will merge the PR because Han-Ru and Rueian have already approved the implementation.
We should be very careful about this. I strongly prefer not to do that until we receive enough signals. This should be Ray's responsibility, in my opinion. If KubeRay gets that deeply involved with the data plane logic, the behavior will be very hard to get right.
For zero-downtime upgrade, we also have the same issue. We don't create Ray Serve applications in the new RayCluster based on the current status of Ray Serve applications in the old RayCluster, and it still works well. In addition, this doesn't align with Kubernetes's philosophy. We should keep the logic as stateless as possible; KubeRay should reconcile CRs based on their YAML whenever possible.
…58293) This PR adds a guide for the new zero-downtime incremental upgrade feature in KubeRay v1.5. This feature was implemented in ray-project/kuberay#3166. ray-project/kuberay#3209
Docs link: https://anyscale-ray--58293.com.readthedocs.build/en/58293/serve/advanced-guides/incremental-upgrade.html#rayservice-zero-downtime-incremental-upgrades











Why are these changes needed?
This PR implements an alpha version of the RayService Incremental Upgrade REP.
The RayService controller logic to reconcile a RayService during an incremental upgrade is as follows:
1. Validate the `IncrementalUpgradeOptions` and accept/reject the RayService CR accordingly.
2. `reconcileGateway`: on the first call this should create a new Gateway CR, and subsequent calls will update the `Listeners` as necessary based on any changes to the RayService.Spec.
3. `reconcileHTTPRoute`: on the first call this should create an HTTPRoute CR with two `backendRefs`, one pointing to the old cluster and one to the pending cluster, with `weight`s 100 and 0 respectively. Every subsequent call to `reconcileHTTPRoute` will update the HTTPRoute by changing the `weight` of each `backendRef` by `StepSizePercent` until the `weight` associated with each cluster equals the `TargetCapacity` associated with that cluster. The `backendRef` `weight` is exposed through the RayService status field `TrafficRoutedPercent`. The `weight` is only changed if at least `IntervalSeconds` have passed since `RayService.Status.LastTrafficMigratedTime`; otherwise the controller waits until the next iteration and checks again.
4. Check whether `TrafficRoutedPercent == TargetCapacity`; if so, the `target_capacity` can be updated for one of the clusters in `reconcileServeTargetCapacity`. If the total `target_capacity` of both Serve configs is less than or equal to 100%, the pending cluster's `target_capacity` can be safely scaled up by `MaxSurgePercent`. If the total `target_capacity` is greater than 100%, the active cluster's `target_capacity` can be decreased by `MaxSurgePercent`. (A sketch of this rule follows after the list.)
5. Once the `TargetCapacity` and `TrafficRoutedPercent` of the pending RayService RayCluster equal 100%, the upgrade is complete.
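A sketch of the `MaxSurgePercent` rule from step 4 above (hypothetical helper, not the PR's exact code; uses Go 1.21's built-in min/max):

```go
// nextTargetCapacities applies one reconcile step of the MaxSurgePercent rule:
// while the combined target_capacity is at or under 100%, grow the pending
// cluster; otherwise shrink the active cluster, each by at most maxSurgePercent.
func nextTargetCapacities(active, pending, maxSurgePercent int32) (newActive, newPending int32) {
	if active+pending <= 100 {
		pending = min(pending+maxSurgePercent, 100) // scale pending up toward 100%
	} else {
		active = max(active-maxSurgePercent, 0) // scale active down toward 0%
	}
	return active, pending
}
```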
Related issue number
#3209