Conversation

@ryanaoleary
Collaborator

@ryanaoleary ryanaoleary commented Mar 7, 2025

Why are these changes needed?

This PR implements an alpha version of the RayService Incremental Upgrade REP.

The RayService controller logic to reconcile a RayService during an incremental upgrade is as follows:

  1. Validate the IncrementalUpgradeOptions and accept or reject the RayService CR accordingly.
  2. Call reconcileGateway - on the first call this creates a new Gateway CR, and subsequent calls update the Listeners as necessary based on any changes to the RayService.Spec.
  3. Call reconcileHTTPRoute - on the first call this creates an HTTPRoute CR with two backendRefs, one pointing to the old cluster and one to the pending cluster, with weights 100 and 0 respectively. Every subsequent call to reconcileHTTPRoute updates the HTTPRoute by changing the weight of each backendRef by StepSizePercent until the weight associated with each cluster equals the TargetCapacity associated with that cluster. The backendRef weight is exposed through the RayService Status field TrafficRoutedPercent. The weight is only changed if it has been at least IntervalSeconds since RayService.Status.LastTrafficMigratedTime; otherwise the controller waits until the next iteration and checks again (see the Go sketch after this list).
  4. The controller then checks whether TrafficRoutedPercent == TargetCapacity; if so, the target_capacity can be updated for one of the clusters.
  5. The controller then calls reconcileServeTargetCapacity. If the total target_capacity of both Serve configs is less than or equal to 100%, the pending cluster's target_capacity can be safely scaled up by MaxSurgePercent. If the total target_capacity is greater than 100%, the active cluster's target_capacity can be decreased by MaxSurgePercent.
  6. The controller then continues with the reconciliation logic as normal. Once the TargetCapacity and TrafficRoutedPercent of the pending RayCluster equal 100%, the upgrade is complete.
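
A minimal, self-contained Go sketch of the stepping logic described in steps 3 and 5 above. The function names (stepTrafficWeight, nextTargetCapacities) are hypothetical illustrations, not the controller's actual helpers, and the example assumes Go 1.21+ for the built-in min/max.

package main

import (
	"fmt"
	"time"
)

// stepTrafficWeight sketches step 3: move a cluster's HTTPRoute weight toward its
// target by at most stepSizePercent, but only if intervalSeconds have elapsed
// since the last traffic migration.
func stepTrafficWeight(current, target, stepSizePercent int32, lastMigrated time.Time, intervalSeconds int32, now time.Time) int32 {
	if now.Sub(lastMigrated) < time.Duration(intervalSeconds)*time.Second {
		return current // too soon; check again on the next reconcile iteration
	}
	if current < target {
		return min(current+stepSizePercent, target)
	}
	if current > target {
		return max(current-stepSizePercent, target)
	}
	return current
}

// nextTargetCapacities sketches step 5: while the combined target_capacity is at
// or below 100, surge the pending cluster up by maxSurgePercent; once it exceeds
// 100, scale the active cluster down by maxSurgePercent.
func nextTargetCapacities(active, pending, maxSurgePercent int32) (int32, int32) {
	if active+pending <= 100 {
		pending = min(pending+maxSurgePercent, 100)
	} else {
		active = max(active-maxSurgePercent, 0)
	}
	return active, pending
}

func main() {
	lastMigrated := time.Now().Add(-45 * time.Second)
	// Pending cluster weight 40 -> 60 with StepSizePercent=20, IntervalSeconds=30.
	fmt.Println(stepTrafficWeight(40, 100, 20, lastMigrated, 30, time.Now()))
	// Active 100 + pending 80 exceeds 100, so the active cluster steps down to 20.
	fmt.Println(nextTargetCapacities(100, 80, 80))
}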

Related issue number

#3209

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness MortalHappiness self-assigned this Mar 11, 2025
@ryanaoleary ryanaoleary marked this pull request as ready for review March 25, 2025 11:26
@ryanaoleary ryanaoleary force-pushed the incremental-upgrade branch from 8f9a396 to 486f98b Compare May 15, 2025 00:35
@ryanaoleary
Collaborator Author

ryanaoleary commented May 15, 2025

I've now added unit tests and one basic e2e test for the incremental upgrade feature, so this should be good to start reviewing. In addition to the unit tests, here are some instructions for manually testing this feature in your cluster.

  1. Create a cloud provider cluster with a Gateway controller installed. I used a GKE cluster with Gateway API enabled, which installs the GKE gateway controller in your cluster.

  2. Retrieve the name of the Gateway class to use:

kubectl get gatewayclass
NAME                                       CONTROLLER                                   ACCEPTED   AGE
gke-l7-global-external-managed             networking.gke.io/gateway                    True       10h
gke-l7-gxlb                                networking.gke.io/gateway                    True       10h
gke-l7-regional-external-managed           networking.gke.io/gateway                    True       10h
gke-l7-rilb                                networking.gke.io/gateway                    True       10h
gke-persistent-regional-external-managed   networking.gke.io/persistent-ip-controller   True       10h
gke-persistent-regional-internal-managed   networking.gke.io/persistent-ip-controller   True       10h
  3. Install the KubeRay operator in your cluster with a dev image built with these changes:
# from helm-chart/kuberay-operator
# replace image in values.yaml with: `us-docker.pkg.dev/ryanaoleary-gke-dev/kuberay/kuberay` and tag with `latest`, or use your own image

helm install kuberay-operator .
  4. Create a RayService CR named ray-service-incremental-upgrade.yaml with the following spec; feel free to edit the IncrementalUpgradeOptions to test different upgrade behaviors.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-incremental-upgrade
spec:
  upgradeStrategy:
    type: IncrementalUpgrade
    incrementalUpgradeOptions:
      gatewayClassName: gke-l7-rilb
      stepSizePercent: 20
      intervalSeconds: 30
      maxSurgePercent: 80
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 0.1
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
  rayClusterConfig:
    rayVersion: "2.44.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.44.0
              resources:
                requests:
                  cpu: "2"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.44.0
                resources:
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
                  limits:
                    cpu: "1"
                    memory: "2Gi"
kubectl apply -f ray-service-incremental-upgrade.yaml
  5. Trigger an incremental upgrade by editing the above serveConfigV2 or a field in the PodSpec and re-applying the RayService yaml.

I'll put more comments with manual test results and add more e2e test cases, but this should be good to start reviewing/iterating on to get merge-ready before the v1.4 release.

@andrewsykim
Member

andrewsykim commented May 16, 2025

Tried to test this manually and not seeing Gateway reconcile with this log line:

{"level":"info","ts":"2025-05-16T00:39:32.181Z","logger":"controllers.RayService","msg":"checkIfNeedIncrementalUpgradeUpdate","RayService":{"name":"deepseek-r1-distill-qwen-32b","namespace":"default"},"reconcileID":"ea9e1839-24fd-464a-9f59-49e917f6c495","incrementalUpgradeUpdate":false,"reason":"Gateway for RayService IncrementalUpgrade is not ready."}

Do I need to set spec.gateway for the gateway reconcile to trigger? I didn't think it was needed since the example you shared didn't have it

@ryanaoleary
Collaborator Author

Tried to test this manually and not seeing Gateway reconcile with this log line:

{"level":"info","ts":"2025-05-16T00:39:32.181Z","logger":"controllers.RayService","msg":"checkIfNeedIncrementalUpgradeUpdate","RayService":{"name":"deepseek-r1-distill-qwen-32b","namespace":"default"},"reconcileID":"ea9e1839-24fd-464a-9f59-49e917f6c495","incrementalUpgradeUpdate":false,"reason":"Gateway for RayService IncrementalUpgrade is not ready."}

Do I need to set spec.gateway for the gateway reconcile to trigger? I didn't think it was needed since the example you shared didn't have it

No, it should be called automatically when IncrementalUpgrade is enabled and there are non-nil pending and active RayClusters. I was thinking that spec.Gateway should be set by the controller if it doesn't exist, but the reconcile appears to be failing for some reason, so I'm debugging it now. For the stale cached value issue - I'll change the other places in the code to return the actual Gateway object (rather than just rayServiceInstance.Spec.Gateway) from context, similar to how getRayServiceInstance works.
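
For context, fetching the live Gateway object from the API server (instead of relying on a value cached on the RayService spec) could look roughly like the sketch below. This is only an illustration with a controller-runtime client; getGatewayInstance is a hypothetical helper name, not necessarily what this PR implements.

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// getGatewayInstance fetches the Gateway from the API server so the controller
// operates on the live resource rather than a potentially stale cached value.
func getGatewayInstance(ctx context.Context, c client.Client, namespace, name string) (*gwv1.Gateway, error) {
	gateway := &gwv1.Gateway{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, gateway); err != nil {
		return nil, err
	}
	return gateway, nil
}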

@ryanaoleary ryanaoleary force-pushed the incremental-upgrade branch from 83f265f to 17f7aa4 Compare May 21, 2025 11:56
@ryanaoleary
Collaborator Author

I'm running into some issues now with the allowed ports/protocols for listeners with different Gateway controllers (e.g. the GKE controller is pretty restrictive). I'm working now to figure out how to send traffic from the Serve service -> to the Gateway -> to the active and pending RayCluster head services through the HTTPRoute. An alternative would be just to have users directly send traffic to the Gateway which would be set to HTTP and port 80, but I don't really want users to have to change the path/endpoint they send traffic to for the Serve applications during an upgrade.

@andrewsykim
Member

Discussed with Ryan offline, there's a validation in the GKE gateway controller that disallows port 8000 for Serve. But this validation will be removed soon. For now we will test with allowed ports like port 80 and change it back to 8000 before merging

@ryanaoleary ryanaoleary requested a review from andrewsykim May 24, 2025 02:20
@ryanaoleary
Collaborator Author

I moved the e2e test to its own folder since it's an experimental feature and shouldn't be part of the pre-submit tests yet.

@andrewsykim
Member

@ryanaoleary can you resolve all the merge conflicts? I can do some testing on this branch once the conflicts are resolved.

@ryanaoleary ryanaoleary force-pushed the incremental-upgrade branch from adc6236 to 7694114 Compare June 4, 2025 04:20
@ryanaoleary
Collaborator Author

ryanaoleary commented Jun 4, 2025

@ryanaoleary can you resolve all the merge conflicts? I can do some testing on this branch once the conflicts are resolved.

All the conflicts have been resolved. This is the image I'm currently using for testing: us-docker.pkg.dev/ryanaoleary-gke-dev/kuberay/kuberay:latest

@ryanaoleary
Collaborator Author

ryanaoleary commented Jun 4, 2025

Fixed a bug with target_capacity getting set in 5f46a1b. I've now tested it fully e2e with Istio as follows:

  1. Install Istio
istioctl install --set profile=test -y
  2. Create an Istio GatewayClass with the following contents:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller
kubectl apply -f istio-gateway-class.yaml
  3. Create a RayService CR with the following spec:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-incremental-upgrade
spec:
  upgradeStrategy:
    type: IncrementalUpgrade
    incrementalUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 30
      maxSurgePercent: 80
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 1
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 1
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 1
  rayClusterConfig:
    rayVersion: "2.44.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.44.0
              resources:
                requests:
                  cpu: "2"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.44.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
                  limits:
                    cpu: "1"
                    memory: "2Gi"
kubectl apply -f ray-service-incremental-upgrade.yaml 
  4. Validate RayService Pods are created:
kubectl get pods
NAME                                                            READY   STATUS    RESTARTS   AGE
kuberay-operator-86b59b85f5-7kpwl                               1/1     Running   0          4m17s
rayservice-incremental-upgrade-gateway-istio-7c9447d7c9-dxcgj   1/1     Running   0          100s
rayservice-incremental-upgrade-qfmsg-head                       2/2     Running   0          98s
rayservice-incremental-upgrade-qfmsg-small-group-worker-7dnn5   1/1     Running   0          76s
rayservice-incremental-upgrade-qfmsg-small-group-worker-jsbjs   1/1     Running   0          98s
  5. Validate Gateway and HTTPRoute are created

Gateway:

kubectl describe gateways
Name:         rayservice-incremental-upgrade-gateway
Namespace:    default
Labels:       <none>
Annotations:  networking.gke.io/addresses: /projects/624496840364/global/addresses/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--0zyq5o7gnhde
              networking.gke.io/backend-services:
                /projects/624496840364/global/backendServices/gkegw1-9ok6-default-gw-serve404-80-8zjp3d8cqfsu, /projects/624496840364/global/backendServic...
              networking.gke.io/firewalls: /projects/624496840364/global/firewalls/gkegw1-9ok6-l7-default-global
              networking.gke.io/forwarding-rules:
                /projects/624496840364/global/forwardingRules/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--ux5k3yt9mxdb
              networking.gke.io/health-checks:
                /projects/624496840364/global/healthChecks/gkegw1-9ok6-default-gw-serve404-80-8zjp3d8cqfsu, /projects/624496840364/global/healthChecks/gke...
              networking.gke.io/last-reconcile-time: 2025-06-04T10:59:25Z
              networking.gke.io/lb-route-extensions: 
              networking.gke.io/lb-traffic-extensions: 
              networking.gke.io/ssl-certificates: 
              networking.gke.io/target-http-proxies:
                /projects/624496840364/global/targetHttpProxies/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--v14nm0nv81bk
              networking.gke.io/target-https-proxies: 
              networking.gke.io/url-maps: /projects/624496840364/global/urlMaps/gkegw1-9ok6-defaul-rayservice-incremental-upgrade--v14nm0nv81bk
API Version:  gateway.networking.k8s.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:             2025-06-04T10:58:28Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2025-06-04T10:59:22Z
  Finalizers:
    gateway.finalizer.networking.gke.io
  Generation:  3
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  rayservice-incremental-upgrade
    UID:                   945c03ba-6551-4928-b294-6522e13e91af
  Resource Version:        1749034812216063007
  UID:                     f2273e8f-cf19-4cae-9508-9cebe76b21aa
Spec:
  Gateway Class Name:  istio
  Listeners:
    Allowed Routes:
      Namespaces:
        From:  Same
    Name:      rayservice-incremental-upgrade-listener
    Port:      80
    Protocol:  HTTP
Status:
  Addresses:
    Type:   IPAddress
    Value:  173.255.121.76
  Conditions:
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               The OSS Gateway API has deprecated this condition, do not depend on it.
    Observed Generation:   1
    Reason:                Scheduled
    Status:                True
    Type:                  Scheduled
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               Resource accepted
    Observed Generation:   3
    Reason:                Accepted
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2025-06-04T11:00:03Z
    Message:               Resource programmed, assigned to service(s) rayservice-incremental-upgrade-gateway-istio.default.svc.cluster.local:80
    Observed Generation:   3
    Reason:                Programmed
    Status:                True
    Type:                  Programmed
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               The OSS Gateway API has altered the "Ready" condition semantics and reserved it for future use.  GKE Gateway will stop emitting it in a future update, use "Programmed" instead.
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
    Last Transition Time:  2025-06-04T10:59:26Z
    Message:               
    Observed Generation:   1
    Reason:                Healthy
    Status:                True
    Type:                  networking.gke.io/GatewayHealthy
  Listeners:
    Attached Routes:  1
    Conditions:
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                ResolvedRefs
      Status:                True
      Type:                  ResolvedRefs
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                Programmed
      Status:                True
      Type:                  Programmed
      Last Transition Time:  2025-06-04T10:59:26Z
      Message:               The OSS Gateway API has altered the "Ready" condition semantics and reserved it for future use.  GKE Gateway will stop emitting it in a future update, use "Programmed" instead.
      Observed Generation:   1
      Reason:                Ready
      Status:                True
      Type:                  Ready
      Last Transition Time:  2025-06-04T10:59:34Z
      Message:               No errors found
      Observed Generation:   3
      Reason:                NoConflicts
      Status:                False
      Type:                  Conflicted
    Name:                    rayservice-incremental-upgrade-listener
    Supported Kinds:
      Group:  gateway.networking.k8s.io
      Kind:   HTTPRoute
      Group:  gateway.networking.k8s.io
      Kind:   GRPCRoute
Events:
  Type     Reason  Age                  From                   Message
  ----     ------  ----                 ----                   -------
  Normal   ADD     2m10s                sc-gateway-controller  default/rayservice-incremental-upgrade-gateway
  Normal   SYNC    76s (x8 over 90s)    sc-gateway-controller  default/rayservice-incremental-upgrade-gateway
  Normal   UPDATE  72s (x4 over 2m10s)  sc-gateway-controller  default/rayservice-incremental-upgrade-gateway
  Normal   SYNC    72s                  sc-gateway-controller  SYNC on default/rayservice-incremental-upgrade-gateway was a success

HTTPRoute:

kubectl describe httproutes
Name:         httproute-rayservice-incremental-upgrade
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  gateway.networking.k8s.io/v1
Kind:         HTTPRoute
Metadata:
  Creation Timestamp:  2025-06-04T11:00:12Z
  Generation:          1
  Owner References:
    API Version:           ray.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RayService
    Name:                  rayservice-incremental-upgrade
    UID:                   2b74618c-a884-4b3d-a960-27b32d6d7c50
  Resource Version:        1749034812216031016
  UID:                     f0906529-40ea-4e46-ad50-31bfb239acb0
Spec:
  Parent Refs:
    Group:      gateway.networking.k8s.io
    Kind:       Gateway
    Name:       rayservice-incremental-upgrade-gateway
    Namespace:  default
  Rules:
    Backend Refs:
      Group:      
      Kind:       Service
      Name:       rayservice-incremental-upgrade-qfmsg-head-svc
      Namespace:  default
      Port:       8000
      Weight:     100
    Matches:
      Path:
        Type:   PathPrefix
        Value:  /
Status:
  Parents:
    Conditions:
      Last Transition Time:  2025-06-04T11:00:12Z
      Message:               Route was valid
      Observed Generation:   1
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2025-06-04T11:00:12Z
      Message:               All references resolved
      Observed Generation:   1
      Reason:                ResolvedRefs
      Status:                True
      Type:                  ResolvedRefs
    Controller Name:         istio.io/gateway-controller
    Parent Ref:
      Group:      gateway.networking.k8s.io
      Kind:       Gateway
      Name:       rayservice-incremental-upgrade-gateway
      Namespace:  default
Events:
  Type    Reason  Age   From                   Message
  ----    ------  ----  ----                   -------
  Normal  ADD     107s  sc-gateway-controller  default/httproute-rayservice-incremental-upgrade
  6. Send a request to the RayService using the Gateway
kubectl run curl --image=radial/busyboxplus:curl -i --tty

curl -X POST -H 'Content-Type: application/json' http://173.255.121.76/fruit/ -d '["MANGO", 2]'
6
  7. Modify the RayService spec as follows and initiate an upgrade by re-applying the spec:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-incremental-upgrade
spec:
  upgradeStrategy:
    type: IncrementalUpgrade
    incrementalUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 30
      maxSurgePercent: 80
  serveConfigV2: |
    applications:
      - name: fruit_app_updated
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.5
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 0.5
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 0.5
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.5
  rayClusterConfig:
    rayVersion: "2.44.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.44.0
              resources:
                requests:
                  cpu: "2"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.44.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "1"
                    memory: "2Gi"
                  limits:
                    cpu: "1"
                    memory: "2Gi"
kubectl apply -f ray-service-incremental-upgrade.yaml 
  8. Verify IncrementalUpgrade is triggered:
kubectl get pods
NAME                                                            READY   STATUS              RESTARTS        AGE
curl                                                            1/1     Running             1 (2m10s ago)   3m17s
kuberay-operator-86b59b85f5-7kpwl                               1/1     Running             0               8m40s
rayservice-incremental-upgrade-44r5r-head                       0/2     ContainerCreating   0               1s
rayservice-incremental-upgrade-44r5r-small-group-worker-p6kmz   0/1     Init:0/1            0               1s
rayservice-incremental-upgrade-gateway-istio-7c9447d7c9-dxcgj   1/1     Running             0               6m3s
rayservice-incremental-upgrade-qfmsg-head                       2/2     Running             0               6m1s
rayservice-incremental-upgrade-qfmsg-small-group-worker-7dnn5   1/1     Running             0               5m39s
rayservice-incremental-upgrade-qfmsg-small-group-worker-jsbjs   1/1     Running             0               6m1s
  9. Use serve status to validate target_capacity is updated on both the active and pending RayClusters:
# pending cluster Head pod
kubectl exec -it rayservice-incremental-upgrade-44r5r-head -- /bin/bash
Defaulted container "ray-head" out of: ray-head, autoscaler
(base) ray@rayservice-incremental-upgrade-44r5r-head:~$ serve status
proxies:
  b4bfd9d59aa2fdc8a3c464268f83a124f178b3f59c8291b7ac72b4eb: HEALTHY
  98e99c593f7c0ab6546faf7ffd4279e52bfdecb589862df29a84a285: HEALTHY
applications:
  fruit_app_updated:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1749035152.4872093
    deployments:
      MangoStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      OrangeStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      PearStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      FruitMarket:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
target_capacity: 80.0

# active cluster Head pod - after one iteration
serve status
proxies:
  c52833bdc0af5dedaa852dbdc78c9f76bd940dd144a7032f795f51cf: HEALTHY
  3c6f771276464144b506ab5c2e5ac98e7529c67790df9db98e464e52: HEALTHY
  efd31857830f53a8d39c21c404e823d2e5d0a3a8dc7ddc9d4af7e1bc: HEALTHY
applications:
  fruit_app_updated_again:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1749039739.5597744
    deployments:
      MangoStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      OrangeStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      PearStand:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
      FruitMarket:
        status: HEALTHY
        status_trigger: CONFIG_UPDATE_COMPLETED
        replica_states:
          RUNNING: 1
        message: ''
target_capacity: 20.0
  10. Validate the incremental upgrade completes successfully:
kubectl get pods
NAME                                                            READY   STATUS    RESTARTS        AGE
curl                                                            1/1     Running   1 (6m43s ago)   7m50s
kuberay-operator-86b59b85f5-7kpwl                               1/1     Running   0               13m
rayservice-incremental-upgrade-44r5r-head                       2/2     Running   0               4m34s
rayservice-incremental-upgrade-gateway-istio-7c9447d7c9-dxcgj   1/1     Running   0               10m

The behavior that the TrafficRoutedPercent (i.e. the weights on the HTTPRoute) for the pending and active RayClusters is incrementally migrated by StepSizePercent every IntervalSeconds is validated through the e2e test, which can be run with make test-e2e-incremental-upgrade.

const (
// During upgrade, IncrementalUpgrade strategy will create an upgraded cluster to gradually scale
// and migrate traffic to using Gateway API.
IncrementalUpgrade RayServiceUpgradeType = "IncrementalUpgrade"
Member

Maybe too late to change this, but wondering if RollingUpgrade would be a more appropriate name? I assume most people are more familiar with this term. WDYT @ryanaoleary @kevin85421 @MortalHappiness

Member

(not blocking this PR, we can change it during the alpha phase)

Collaborator Author

Late to reply to this, but I have no strong preference either way. IncrementalUpgrade is what was used in the feature request and REP so that's why I stuck with it, but if there's a preference from any KubeRay maintainers or users I'm down to go through and change the feature name / all the related variable names.

Member

cc @rueian for sharing your opinion.
I think RollingUpgrade is a more straightforward name for me too.

Collaborator Author

cc: @kevin85421 since from offline discussion you seemed to have a preference against using RollingUpgrade here

Member

@kevin85421 what do you think about ClusterUpgrade and ClusterUpgradeOptions? I prefer to keep the upgrade term generic as the exact behavior could be changed in the future.

@Future-Outlier was also wondering about the history of why we called it "incremental" upgrades.

@ryanaoleary ryanaoleary requested a review from andrewsykim June 4, 2025 19:28
@joshdevins

Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?

@ryanaoleary
Collaborator Author

Just wondering if there is an update on this PR. Will it make it into a KubeRay 1.4.x or 1.5 release?

This PR is targeted for KubeRay v1.5, it still needs review and I'll prioritize iterating and resolving comments to get it merged.

@ryanaoleary
Collaborator Author

@andrewsykim Fixed all the merge conflicts and updated the PR, this should be good to re-review.

@ryanaoleary
Collaborator Author

The failing RayJob CI test seems unrelated

@Future-Outlier
Member

Last week, @kevin85421, @rueian, and I were thinking about the load test, and Kai-Hsun pointed out that the RPS was too small. I did some experiments to validate that the RPS is a normal number.

In the video, you can find that:

  1. ServeReplica:fruit_app_updated:FruitMarket uses more than 100% CPU
  2. the proxy actor's CPU usage is around 50%
  3. I set wait_time = constant(0.01) in locust. (I found that wait_time = constant(0) is actually slower.)

Based on the first two metrics, I think the RPS is correct.

2025-10-20.23-49-56.mp4

@andrewsykim
Member

I think RayServiceNewClusterWithIncrementalUpgrade seems kind of long for a feature flag, and either way it's clear to users that the feature is for the NewClusterWithIncrementalUpgrade type. I slightly prefer not to change it but am okay with either too.

Agree I think RayServiceIncrementalUpgrade for a feature flag is fine

hasAccepted := false
hasProgrammed := false

for _, condition := range gatewayInstance.Status.Conditions {
Member

Can you add some comments about GatewayConditionAccepted and GatewayConditionProgrammed? In addition, add comments to explain what "ready" in IsGatewayReady means.

From the GEP: https://gateway-api.sigs.k8s.io/geps/gep-1364/

To capture the behavior that Ready currently captures, Programmed will be introduced. This means that the implementation has seen the config, has everything it needs, parsed it, and sent configuration off to the data plane. The configuration should be available "soon". We'll leave "soon" undefined for now.

This condition alone doesn't seem to be enough to determine whether the gateway is "ready" or not.

In addition, if the Gateway API has a related public API, we should consider using it instead of implementing this ourselves.

Collaborator Author

@ryanaoleary ryanaoleary Oct 22, 2025

I don't see any utils in the public API to check the Gateway readiness status besides the existing fields we're checking here. If I'm missing them I can add them instead of this logic, but I didn't see one I can use.

Added comments explaining this helper and the status conditions we check in 71f19a9
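
For readers following along, a readiness check based on those two conditions could look roughly like the sketch below, using the apimachinery condition helpers; this is an illustration, not necessarily the exact code added in 71f19a9.

package utils

import (
	"k8s.io/apimachinery/pkg/api/meta"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// isGatewayReady treats a Gateway as "ready" when the controller has accepted it
// (Accepted=True) and has programmed its configuration into the data plane
// (Programmed=True).
func isGatewayReady(gw *gwv1.Gateway) bool {
	return meta.IsStatusConditionTrue(gw.Status.Conditions, string(gwv1.GatewayConditionAccepted)) &&
		meta.IsStatusConditionTrue(gw.Status.Conditions, string(gwv1.GatewayConditionProgrammed))
}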

}

// IsHTTPRouteReady returns whether the HTTPRoute associated with a given Gateway has a ready condition
func IsHTTPRouteReady(gatewayInstance *gwv1.Gateway, httpRouteInstance *gwv1.HTTPRoute) bool {
Member

what does "ready" refer to here and explain the logic RouteConditionAccepted and RouteConditionResolvedRefs.

Collaborator Author

@ryanaoleary ryanaoleary Oct 21, 2025

"Ready" means the HTTPRoute has a parent ref for the Gateway object and that the parent has accepted and resolved the refs of the HTTPRoute:

  • RouteConditionAccepted: the reason this can be set varies across Gateway controllers, but generally it means the HTTPRoute has a valid Gateway object as its parent and the route is allowed by the Gateway's listener. This condition mainly checks that the syntax of the rules is valid, but it doesn't guarantee that the backend service exists.
  • RouteConditionResolvedRefs: All the references within the HTTPRoute have been resolved by the Gateway controller. This means that the HTTPRoute's object references are valid, exist, and the Gateway can use them. In our case it's checking the RayCluster Serve service we use as a backend ref.

I can add comments explaining why we check these statuses here. I didn't see any utils where I could directly check if an HTTPRoute is ready to serve traffic, but checking these two conditions seemed like it would give reasonable confidence that the HTTPRoute is created and in a good state. Since we also check that the Serve service (the backend ref of the HTTPRoute) exists and that the Ray Serve deployment is healthy before migrating traffic with the HTTPRoute, I think we're sufficiently validating that the HTTPRoute can be used to serve traffic.
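
To make that concrete, a check over the HTTPRoute's parent statuses could look roughly like the sketch below (an illustration, not necessarily the PR's exact implementation):

package utils

import (
	"k8s.io/apimachinery/pkg/api/meta"
	gwv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// isHTTPRouteReadyFor returns true if the HTTPRoute has a parent status entry for
// the given Gateway with both Accepted=True and ResolvedRefs=True.
func isHTTPRouteReadyFor(gw *gwv1.Gateway, route *gwv1.HTTPRoute) bool {
	for _, parent := range route.Status.Parents {
		if string(parent.ParentRef.Name) != gw.Name {
			continue
		}
		if meta.IsStatusConditionTrue(parent.Conditions, string(gwv1.RouteConditionAccepted)) &&
			meta.IsStatusConditionTrue(parent.Conditions, string(gwv1.RouteConditionResolvedRefs)) {
			return true
		}
	}
	return false
}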

Collaborator Author

Added comments in 71f19a9

return headlessService
}

// GetServePort finds the container port named "serve" in the RayCluster's head group spec.
Member

Use utils.FindContainerPort instead of GetServePort?

Collaborator Author

Done in 71f19a9; we now call utils.FindContainerPort from the GetServePort helper. I kept the helper function because it's still useful and encapsulates the container port logic rather than copying and pasting this code multiple times in the createHTTPRoute function.

In 71f19a9 I also changed utils.FindContainerPort to return an int32 since this is required for the port number and int32->int conversions are safer.
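
For reference, the lookup GetServePort encapsulates amounts to something like the following sketch; findServePort and the 8000 fallback are illustrative assumptions, not the exact helper in this PR.

package utils

import corev1 "k8s.io/api/core/v1"

// findServePort returns the container port named "serve" from the given Ray
// container, falling back to 8000 (Serve's default HTTP port) if none is named.
func findServePort(container *corev1.Container) int32 {
	for _, p := range container.Ports {
		if p.Name == "serve" {
			return p.ContainerPort
		}
	}
	return 8000
}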


// getHTTPRouteTrafficWeights fetches the HTTPRoute associated with a RayService and returns
// the traffic weights for the active and pending clusters.
func (r *RayServiceReconciler) getHTTPRouteTrafficWeights(ctx context.Context, rayServiceInstance *rayv1.RayService) (activeWeight int32, pendingWeight int32, err error) {
Member

It's better to pass the route instance you reconcile in this reconciliation into calculateStatus instead of looking it up again; otherwise this may have some inconsistency.

Collaborator Author

@ryanaoleary ryanaoleary Oct 22, 2025

Changed it to follow this pattern in c23f901
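
For illustration, reading the weights back off a reconciled HTTPRoute amounts to something like the sketch below; the assumption that the first backendRef is the active cluster and the second is the pending cluster is purely an illustrative convention.

package controllers

import gwv1 "sigs.k8s.io/gateway-api/apis/v1"

// routeWeights reads the active and pending traffic weights from the first rule
// of an HTTPRoute, assuming two weighted backendRefs ordered active then pending.
func routeWeights(route *gwv1.HTTPRoute) (active, pending int32) {
	if len(route.Spec.Rules) == 0 {
		return 0, 0
	}
	refs := route.Spec.Rules[0].BackendRefs
	if len(refs) > 0 && refs[0].Weight != nil {
		active = *refs[0].Weight
	}
	if len(refs) > 1 && refs[1].Weight != nil {
		pending = *refs[1].Weight
	}
	return active, pending
}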


// reconcilePromotionAndServingStatus handles the promotion logic after an upgrade, returning
// isPendingClusterServing: True if the main Kubernetes services are pointing to the pending cluster.
func (r *RayServiceReconciler) reconcilePromotionAndServingStatus(ctx context.Context, headSvc, serveSvc *corev1.Service, rayServiceInstance *rayv1.RayService, pendingCluster *rayv1.RayCluster) (isPendingClusterServing bool) {
Member

Does incremental upgrade rely on the logic between L279 - L297? If not, we should consider separating them into two different functions like

should_promote = false
if incremental_upgrade_enabled {
   should_promote = should_promote_1(...)
} else {
   // zero downtime upgrade
   should_promote = should_promote_2(...)
}

if should_promote {
   promote ...
}

Collaborator Author

@ryanaoleary ryanaoleary Oct 21, 2025

I wouldn't say the upgrade relies on that logic, but they're existing safety checks that I think are still relevant for both the upgrade and non-upgrade path. We probably want to keep the checks in L279-297 for incremental upgrade, like this one:

if clusterSvcPointsTo != utils.GetRayClusterNameFromService(serveSvc) {
  panic("headSvc and serveSvc are not pointing to the same cluster")
}

because if the service that we're reconciling does not point to either RayCluster (during either an incremental upgrade or the regular, existing code path) this indicates a broken state in the controller that we should panic on.

ryanaoleary and others added 2 commits October 23, 2025 00:56
Signed-off-by: Ryan O'Leary <[email protected]>
@Future-Outlier
Member

Hi @ryanaoleary,

I’m doing a load test with Ray Serve.
I’m trying to run 1,000–2,000+ requests/second on the service with a proxy actor hitting the limit, and also perform an incremental upgrade at the same time.

To do this, I need to:

  1. Use an async script to run Ray Serve
  2. Check Ray’s dashboard to make sure I hit the limit

Here’s the test script I’m trying to use, and I am still working on the setup:
https://github.com/ray-project/ray/blob/3cca5415c3a74ebc46db4bef9572f0048b29c3cf/release/serve_tests/workloads/multi_deployment_1k_noop_replica.py#L2-L201

@Future-Outlier
Member

Future-Outlier commented Oct 23, 2025

How to do a load test on Ray Serve with 1500+ RPS?

  1. apply this yaml

note: this is the code we are going to run.

import ray
from ray import serve
from starlette.requests import Request

@serve.deployment()
class SimpleDeployment:
    def __init__(self):
        self.counter = 0
    
    async def __call__(self, request: Request):
        self.counter += 1
        return {
            "status": "ok",
            "counter": self.counter,
            "message": "processed"
        }

app = SimpleDeployment.bind()
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 20
      intervalSeconds: 10
      maxSurgePercent: 5
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 3
              max_replicas: 4
              target_ongoing_requests: 300
              max_ongoing_requests: 1500
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
  rayClusterConfig:
    rayVersion: "2.46.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              resources:
                requests:
                  cpu: "1"
                  memory: "1Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 0
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "2"
                    memory: "1Gi"
  2. Use kubectl port-forward to connect to the gateway:
kubectl port-forward svc/stress-test-serve-gateway-istio 8080:80 -n default
  3. Use locust with 17 users (based on my personal experience):
from locust import HttpUser, task, constant

class AppUser(HttpUser):
    wait_time = constant(0)  # Each user has a constant wait time of 0
    
    def on_start(self):
        """Called when a user starts"""
        self.client.verify = False  # Disable SSL verification if needed
    
    @task
    def test_endpoint(self):
        """Test the main fruit endpoint"""
        response = self.client.get("/")
        if response.status_code != 200:
            print(f"Error: {response.status_code} - {response.text}")
locust -f ./locust_example.py --host http://localhost:8080/
image
  4. Go to the Ray dashboard and select Actors:
kubectl port-forward stress-test-serve-45dn2-head 8265:8265 -n default
open http://localhost:8265/#/actors

(You should see that all 4 proxy actors are hitting their CPU usage limit.)

image
  5. Do an incremental upgrade and check whether there are any request drops.

1st test (maxSurgePercent is large)

looks good!

spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 20
image

2nd test (maxSurgePercent is small)

spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      intervalSeconds: 10
      maxSurgePercent: 5
      stepSizePercent: 20
image

cc @rueian @ryanaoleary @kevin85421

I believe this is the load test you want to see!
thank you

ryanaoleary and others added 2 commits October 23, 2025 12:52
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@Future-Outlier
Member

Future-Outlier commented Oct 25, 2025

Extreme case

  1. Find the RPS limit (observe CPU saturation).
    (1 replica with 2 CPUs can serve 500 RPS on my laptop; in the following example, I use 2 replicas, which implies 1000 RPS.)
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 100
      intervalSeconds: 1
      maxSurgePercent: 100
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 1
              max_replicas: 2
              target_ongoing_requests: 1
              max_ongoing_requests: 2
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
  rayClusterConfig:
    rayVersion: "2.47.0"
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.47.0
              resources:
                requests:
                  cpu: "1"
                  memory: "1Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - groupName: small-group
        replicas: 0
        minReplicas: 0
        maxReplicas: 5
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.47.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  requests:
                    cpu: "2"
                    memory: "1Gi"

(the wrong screenshot, ignore this)
image

(the right screenshot; the RPS dip happens when we are scaling the 2nd replica)

image
  2. Upgrade with the extreme case (should have the same behavior as type: NewCluster).
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 100
      intervalSeconds: 1
      maxSurgePercent: 100
  3. Check whether there are any request drops in locust.
    (SUCCEEDED!)
image

@Future-Outlier
Member

Future-Outlier commented Oct 25, 2025

  1. Use min_replicas: 1 and max_replicas: 3 in my Serve application config:
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 1
              max_replicas: 3
              target_ongoing_requests: 1
              max_ongoing_requests: 2
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
  2. Find the RPS limit (hit CPU saturation).
    I use binary search to find the RPS limit:
    (1) with 30 users, the response time is stable
    (2) with 50 users, the response time starts climbing
    (3) with 40 users, the response time is stable
image
  3. Start the upgrade:
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 20
  4. Succeeded.
image

Member

@Future-Outlier Future-Outlier left a comment

Hi, I think we can get this merged.
cc @ryanaoleary @andrewsykim @rueian @kevin85421

The following are tests I've done:

  1. extreme case (behavior should act as type: NewCluster)
    #3166 (comment)
  2. 1 min_replica, 3 max_replica in serveConfigV2
    #3166 (comment)
  3. other cases + reproduce script
    #3166 (comment)

How do I find the RPS limit?

I use binary search to find the RPS limit.
Take #3166 (comment) as an example:
(1) with 30 users, the response time is stable
(2) with 50 users, the response time starts climbing
(3) with 40 users, the response time is stable

The reason we see RPS fluctuation is:

  1. We change the target_capacity in Ray Serve.
  2. We keep sending lots of requests to the Ray cluster (which triggers autoscaling).
  3. The Ray autoscaler tries to:
    a. delete a worker pod in the old cluster (which maps to a Ray Serve replica in our example)
    b. create a worker pod in the new cluster (which maps to a Ray Serve replica in our example)
  4. While step 3 is happening but the new cluster's worker pod is not yet ready:
    a. we already route requests to the new cluster's serve svc (which doesn't have enough capacity)
    b. we need to wait for the new worker pod to start before we get the expected RPS

@Future-Outlier
Member

Future-Outlier commented Oct 25, 2025

testing 1 more case, similar to #3166 (comment)

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 10
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 1
              max_replicas: 3
              target_ongoing_requests: 1
              max_ongoing_requests: 2
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
image

my logs

INFO 2025-10-24 23:41:23,125 controller 940 -- Controller starting (version='2.47.0').
INFO 2025-10-24 23:41:23,133 controller 940 -- Starting proxy on node '1ab47ad6bc01fa77162ef59a45fc70027f38a1424843a5a9762c1299' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:41:23,935 controller 940 -- Target capacity scaling up from None to 0.0.
INFO 2025-10-24 23:41:23,935 controller 940 -- Deploying new app 'stress_test_app'.
INFO 2025-10-24 23:41:23,936 controller 940 -- Importing and building app 'stress_test_app'.
INFO 2025-10-24 23:41:23,963 controller 940 -- Target capacity scaling up from 0.0 to 10.0.
INFO 2025-10-24 23:41:23,964 controller 940 -- Received new config for application 'stress_test_app'. Cancelling previous request.
INFO 2025-10-24 23:41:23,965 controller 940 -- Importing and building app 'stress_test_app'.
INFO 2025-10-24 23:41:26,120 controller 940 -- Imported and built app 'stress_test_app' successfully.
INFO 2025-10-24 23:41:26,122 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:41:26,228 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:41:26,228 controller 940 -- Starting Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:41:42,972 controller 940 -- Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') started successfully on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c' after 16.7s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:41:42,977 controller 940 -- Starting proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:41:48,666 controller 940 -- Target capacity scaling up from 10.0 to 20.0.
INFO 2025-10-24 23:41:48,745 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:42:19,552 controller 940 -- Target capacity scaling up from 20.0 to 30.0.
INFO 2025-10-24 23:42:19,613 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:42:50,264 controller 940 -- Target capacity scaling up from 30.0 to 40.0.
INFO 2025-10-24 23:42:50,339 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:43:21,000 controller 940 -- Target capacity scaling up from 40.0 to 50.0.
INFO 2025-10-24 23:43:21,054 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-24 23:43:21,684 controller 940 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 1 to 2 replicas. Current ongoing requests: 55.97, current running replicas: 1.
INFO 2025-10-24 23:43:21,685 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:43:21,685 controller 940 -- Starting Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:43:39,655 controller 940 -- Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63' after 18.0s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:43:39,659 controller 940 -- Starting proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:43:49,717 controller 940 -- Target capacity scaling up from 50.0 to 60.0.
INFO 2025-10-24 23:43:49,751 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:44:20,697 controller 940 -- Target capacity scaling up from 60.0 to 70.0.
INFO 2025-10-24 23:44:20,778 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:44:52,723 controller 940 -- Target capacity scaling up from 70.0 to 80.0.
INFO 2025-10-24 23:44:52,737 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:45:23,480 controller 940 -- Target capacity scaling up from 80.0 to 90.0.
INFO 2025-10-24 23:45:23,501 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-24 23:45:24,135 controller 940 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 3 replicas. Current ongoing requests: 58.21, current running replicas: 2.
INFO 2025-10-24 23:45:24,135 controller 940 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:45:24,135 controller 940 -- Starting Replica(id='zrlqe7u5', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:45:39,408 controller 940 -- Replica(id='zrlqe7u5', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '5ec8a7acf6f2aecfeb0b8e773ab3ab723f6d80fee726aaa8adfd231d' after 15.3s (PID: 243). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-24 23:45:39,412 controller 940 -- Starting proxy on node '5ec8a7acf6f2aecfeb0b8e773ab3ab723f6d80fee726aaa8adfd231d' listening on '0.0.0.0:8000'.
INFO 2025-10-24 23:45:53,896 controller 940 -- Target capacity scaling up from 90.0 to 100.0.
INFO 2025-10-24 23:45:53,917 controller 940 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 3).
INFO 2025-10-24 23:57:29,069 controller 940 -- Downscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 3 to 2 replicas. Current ongoing requests: 1.29, current running replicas: 3.
INFO 2025-10-24 23:57:29,070 controller 940 -- Removing 1 replica from Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:57:29,070 controller 940 -- Stopping Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') (currently RUNNING).
INFO 2025-10-24 23:57:29,074 controller 940 -- Draining proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c'.
INFO 2025-10-24 23:57:31,128 controller 940 -- Replica(id='ebfluuj5', deployment='SimpleDeployment', app='stress_test_app') is stopped.
INFO 2025-10-24 23:57:32,398 controller 940 -- Downscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 1 replicas. Current ongoing requests: 0.41, current running replicas: 2.
INFO 2025-10-24 23:57:32,399 controller 940 -- Removing 1 replica from Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-24 23:57:32,399 controller 940 -- Stopping Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') (currently RUNNING).
INFO 2025-10-24 23:57:32,403 controller 940 -- Draining proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63'.
INFO 2025-10-24 23:57:34,441 controller 940 -- Replica(id='j4916bks', deployment='SimpleDeployment', app='stress_test_app') is stopped.
INFO 2025-10-24 23:58:00,379 controller 940 -- Removing drained proxy on node 'cba3d05ac4ca1c09a44150b41d3e1db5c42ed45f37226aee76e2753c'.
INFO 2025-10-24 23:58:03,575 controller 940 -- Removing drained proxy on node '45af68f2bc199f4e3d8ac7f6ad51d54e983cf3892f8b077aee466d63'.

@Future-Outlier
Copy link
Member

Future-Outlier commented Oct 25, 2025

Testing a Ray image upgrade from 2.47.0 to 2.49.2 with the following RayService:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stress-test-serve
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      gatewayClassName: istio
      stepSizePercent: 10
      intervalSeconds: 30
      maxSurgePercent: 10
  serveConfigV2: |
    applications:
      - name: stress_test_app
        import_path: stress_test_serve:app
        route_prefix: /
        runtime_env:
          working_dir: "https://github.com/future-outlier/ray-serve-load-test/archive/main.zip"
        deployments:
          - name: SimpleDeployment
            autoscaling_config:
              min_replicas: 1
              max_replicas: 3
              target_ongoing_requests: 1
              max_ongoing_requests: 2
              metrics_interval_s: 0.1
              upscale_delay_s: 0.5
              downscale_delay_s: 3
            ray_actor_options:
              num_cpus: 2
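
With stepSizePercent: 10 and intervalSeconds: 30, the controller shifts traffic toward the new cluster in 10% increments, at most once every 30 seconds, by rewriting the weights on the HTTPRoute it manages. As a rough sketch (not controller output), the backendRefs might look like the following partway through the upgrade; the route, Gateway, and Service names here are hypothetical:

# Illustrative sketch only: resource names are hypothetical, and the weights are
# shown after three 10% traffic steps toward the new cluster.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: stress-test-serve-route        # hypothetical name
spec:
  parentRefs:
    - name: stress-test-serve-gateway  # hypothetical Gateway managed by the controller
  rules:
    - backendRefs:
        - name: stress-test-serve-active-serve-svc   # old cluster's serve service (hypothetical)
          port: 8000
          weight: 70
        - name: stress-test-serve-pending-serve-svc  # new cluster's serve service (hypothetical)
          port: 8000
          weight: 30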

Logs:

INFO 2025-10-25 00:10:25,008 controller 944 -- Controller starting (version='2.49.2').
INFO 2025-10-25 00:10:25,019 controller 944 -- Starting proxy on node 'cf17fc5070219b83d84fe1f9231ce2dd15da6dc7a84cdbcb2bfb6702' listening on '0.0.0.0:8000'.
INFO 2025-10-25 00:10:25,847 controller 944 -- Target capacity scaling up from None to 0.0.
INFO 2025-10-25 00:10:25,848 controller 944 -- Deploying new app 'stress_test_app'.
INFO 2025-10-25 00:10:25,848 controller 944 -- Importing and building app 'stress_test_app'.
INFO 2025-10-25 00:10:25,872 controller 944 -- Target capacity scaling up from 0.0 to 10.0.
INFO 2025-10-25 00:10:25,873 controller 944 -- Received new config for application 'stress_test_app'. Cancelling previous request.
INFO 2025-10-25 00:10:25,874 controller 944 -- Importing and building app 'stress_test_app'.
INFO 2025-10-25 00:10:25,935 controller 944 -- Received new config for application 'stress_test_app'. Cancelling previous request.
INFO 2025-10-25 00:10:25,935 controller 944 -- Importing and building app 'stress_test_app'.
INFO 2025-10-25 00:10:26,001 controller 944 -- Received new config for application 'stress_test_app'. Cancelling previous request.
INFO 2025-10-25 00:10:26,001 controller 944 -- Importing and building app 'stress_test_app'.
INFO 2025-10-25 00:10:28,231 controller 944 -- Imported and built app 'stress_test_app' successfully.
INFO 2025-10-25 00:10:28,235 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-25 00:10:28,340 controller 944 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-25 00:10:28,340 controller 944 -- Starting Replica(id='hrkes5kr', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-25 00:10:45,242 controller 944 -- Replica(id='hrkes5kr', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '8d604ba793edf91e949e3d73cbcd2c5ade2cacccd0d501a2ce37911d' after 16.9s (PID: 272). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-25 00:10:45,246 controller 944 -- Starting proxy on node '8d604ba793edf91e949e3d73cbcd2c5ade2cacccd0d501a2ce37911d' listening on '0.0.0.0:8000'.
INFO 2025-10-25 00:10:50,771 controller 944 -- Target capacity scaling up from 10.0 to 20.0.
INFO 2025-10-25 00:10:50,791 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-25 00:11:21,596 controller 944 -- Target capacity scaling up from 20.0 to 30.0.
INFO 2025-10-25 00:11:21,651 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-25 00:11:52,299 controller 944 -- Target capacity scaling up from 30.0 to 40.0.
INFO 2025-10-25 00:11:52,370 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-25 00:12:23,039 controller 944 -- Target capacity scaling up from 40.0 to 50.0.
INFO 2025-10-25 00:12:23,140 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 1).
INFO 2025-10-25 00:12:23,774 controller 944 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 1 to 2 replicas. Current ongoing requests: 4.42, current running replicas: 1.
INFO 2025-10-25 00:12:23,774 controller 944 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-25 00:12:23,775 controller 944 -- Starting Replica(id='m0h4ueba', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-25 00:12:41,624 controller 944 -- Replica(id='m0h4ueba', deployment='SimpleDeployment', app='stress_test_app') started successfully on node '6e1c7550d39e4596ab4f355426c303d902a5e6cab854cd8c176c5ebb' after 17.8s (PID: 273). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-25 00:12:41,628 controller 944 -- Starting proxy on node '6e1c7550d39e4596ab4f355426c303d902a5e6cab854cd8c176c5ebb' listening on '0.0.0.0:8000'.
INFO 2025-10-25 00:12:51,704 controller 944 -- Target capacity scaling up from 50.0 to 60.0.
INFO 2025-10-25 00:12:51,720 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-25 00:13:23,727 controller 944 -- Target capacity scaling up from 60.0 to 70.0.
INFO 2025-10-25 00:13:23,773 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-25 00:13:54,497 controller 944 -- Target capacity scaling up from 70.0 to 80.0.
INFO 2025-10-25 00:13:54,544 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-25 00:14:25,275 controller 944 -- Target capacity scaling up from 80.0 to 90.0.
INFO 2025-10-25 00:14:25,327 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 2).
INFO 2025-10-25 00:14:25,964 controller 944 -- Upscaling Deployment(name='SimpleDeployment', app='stress_test_app') from 2 to 3 replicas. Current ongoing requests: 25.14, current running replicas: 2.
INFO 2025-10-25 00:14:25,965 controller 944 -- Adding 1 replica to Deployment(name='SimpleDeployment', app='stress_test_app').
INFO 2025-10-25 00:14:25,965 controller 944 -- Starting Replica(id='9tqncl6o', deployment='SimpleDeployment', app='stress_test_app').
INFO 2025-10-25 00:14:42,782 controller 944 -- Replica(id='9tqncl6o', deployment='SimpleDeployment', app='stress_test_app') started successfully on node 'f986d50bce2a9383204975a6dc1167df21256a3e6825ce3709c107f5' after 16.8s (PID: 273). Replica constructor, reconfigure method, and initial health check took 0.0s.
INFO 2025-10-25 00:14:42,786 controller 944 -- Starting proxy on node 'f986d50bce2a9383204975a6dc1167df21256a3e6825ce3709c107f5' listening on '0.0.0.0:8000'.
INFO 2025-10-25 00:14:55,797 controller 944 -- Target capacity scaling up from 90.0 to 100.0.
INFO 2025-10-25 00:14:55,881 controller 944 -- Deploying new version of Deployment(name='SimpleDeployment', app='stress_test_app') (initial target replicas: 3).

@rueian
Copy link
Collaborator

rueian commented Oct 25, 2025

I believe we still need to copy the current num_replicas for each Ray Serve deployment from the active cluster to the pending cluster before applying target_capacity for traffic migration, to better smooth out traffic drops when Ray Serve autoscaling is enabled. But this can be in a follow-up PR.
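
As a rough illustration of this idea (not implemented in this PR), one way to seed the pending cluster with the active cluster's current capacity would be to carry the current replica count into the new cluster's serveConfigV2 before traffic migration begins, for example through autoscaling_config.initial_replicas; the value 3 below is hypothetical:

# Illustrative sketch only, not part of this PR: start the pending cluster's
# deployment at the replica count currently running on the active cluster
# (3 is a hypothetical value) so traffic migration doesn't have to wait for
# autoscaling to catch up.
serveConfigV2: |
  applications:
    - name: stress_test_app
      import_path: stress_test_serve:app
      route_prefix: /
      deployments:
        - name: SimpleDeployment
          autoscaling_config:
            min_replicas: 1
            max_replicas: 3
            initial_replicas: 3   # copied from the active cluster's current replica count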

@Future-Outlier
Copy link
Member

I believe we still need to copy the current num_replicas for each Ray Serve deployment from the active cluster to the pending cluster before applying target_capacity for traffic migration, to better smooth out traffic drops when Ray Serve autoscaling is enabled. But this can be in a follow-up PR.

Makes a lot of sense, thank you!

Copy link
Member

@kevin85421 kevin85421 left a comment

I only reviewed the APIs and discussed the tests with @Future-Outlier; I didn't review the implementation details. I will merge the PR because Han-Ru and Rueian have already approved the implementation.

@kevin85421 kevin85421 merged commit 2acc219 into ray-project:master Oct 25, 2025
27 checks passed
@kevin85421
Copy link
Member

I believe we still need to copy the current num_replicas for each Ray Serve deployment from the active cluster to the pending cluster before applying target_capacity for traffic migration, to better smooth out traffic drops when Ray Serve autoscaling is enabled. But this can be in a follow-up PR.

We should be very careful about this. I strongly prefer not to do that until we receive enough signals; in my opinion, this should be Ray's responsibility. If KubeRay gets that deeply involved with the data-plane logic, the behavior will be very hard to get right, for example:

  • An in-place update and a cluster upgrade happening at the same time.
  • Ray Serve behavior across multiple Ray versions.

For zero-downtime upgrades, we have the same issue: we don't create Ray Serve applications in the new RayCluster based on the current status of the Ray Serve applications in the old RayCluster, and it still works well.

In addition, this doesn't align with Kubernetes's philosophy. We should keep the logic as stateless as possible; KubeRay should reconcile CRs based on their YAML whenever possible.

edoakes pushed a commit to ray-project/ray that referenced this pull request Dec 9, 2025
…58293)

This PR adds a guide for the new zero-downtime incremental upgrade
feature in KubeRay v1.5. This feature was implemented in this PR:
ray-project/kuberay#3166.

ray-project/kuberay#3209

## Docs link

https://anyscale-ray--58293.com.readthedocs.build/en/58293/serve/advanced-guides/incremental-upgrade.html#rayservice-zero-downtime-incremental-upgrades

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Future-Outlier <[email protected]>