Skip to content

Conversation

@400Ping
Copy link
Contributor

@400Ping 400Ping commented Oct 24, 2025

Why are these changes needed?

When using sidecar mode, the head pod should not be recreated after it is deleted. The RayJob should be marked as Failed.

Related issue number

Closes #4130

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@400Ping 400Ping marked this pull request as draft October 24, 2025 16:51
@400Ping 400Ping marked this pull request as ready for review October 24, 2025 17:35
Signed-off-by: 400Ping <[email protected]>
@rueian
Copy link
Collaborator

rueian commented Oct 24, 2025

@400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.

@400Ping 400Ping marked this pull request as draft October 25, 2025 00:36
@400Ping
Copy link
Contributor Author

400Ping commented Oct 25, 2025

@400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.

ok, thanks

originatedFrom := utils.GetCRDType(instance.Labels[utils.RayOriginatedFromCRDLabelKey])
if originatedFrom == utils.RayJobCRD {
logger.Info(
"reconcilePods: Found 0 head Pods for a RayJob-managed RayCluster; skipping head creation to let RayJob controller handle the failure",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this cause no head pod to be created at all? We still need to create the first head pod. I think you can check the RayClusterProvisioned condition to decide whether to create one or not.

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's try to fix this, super important.

400Ping and others added 2 commits October 29, 2025 10:25
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>
@400Ping
Copy link
Contributor Author

400Ping commented Oct 29, 2025

sorry, I am a bit busy taking midterm this week so I am a bit slow to respond

@400Ping 400Ping marked this pull request as ready for review October 29, 2025 02:27
@Future-Outlier
Copy link
Member

sorry, I am a bit busy taking midterm this week so I am a bit slow to respond

good luck for your midterm exam, thank you!

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this in my local laptop, and @machichima did it too.

Image Image

reproduce step

  1. create a RayJob
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sidecar-mode-flaky
spec:
  # In SidecarMode, the KubeRay operator injects a container into the Ray head Pod to submit the Ray job and tail logs.
  # This will avoid inter-Pod communication, which may cause network issues. For example, some users face WebSocket hangs.
  # For more details, see https://github.com/ray-project/kuberay/issues/3928#issuecomment-3187164736.
  submissionMode: "SidecarMode"
  entrypoint: python -c  "import time; time.sleep(60)"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  rayClusterSpec:
    rayVersion: '2.46.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.46.0
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          - name: code-sample
            configMap:
              name: ray-job-code-sample
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.46.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
  1. delete the head pod once the head container and submitter container are both running

  2. check the rayjob CR's status (should be failure both)

  3. you might wonder should we delete the worker pod? the answer is no, since we can't know if we use cluster selector from rayjob in raycluster.
    but 1 thing we can do is we can use deletion policy to delete the whole raycluster if needed.

@Future-Outlier
Copy link
Member

cc @rueian @andrewsykim for merge, thank you!

@rueian
Copy link
Collaborator

rueian commented Oct 29, 2025

@400Ping, The RayJob_fails_when_head_Pod_is_deleted_when_job_is_running test is failing

Signed-off-by: Future-Outlier <[email protected]>
Please enter a commit message to explain why this merge is necessary,
Signed-off-by: Future-Outlier <[email protected]>
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest change works in my local environment.
I downloaded the log from buildkite and found that k8s job mode's e2e test failed.
based on the log, I think this is a temporary fix.
however, I do found that I can refactor rayjob controller's logic better, I will open a follow-up PR recently.

Image

cc @rueian @andrewsykim

Signed-off-by: Future-Outlier <[email protected]>
@rueian rueian merged commit c21938d into ray-project:master Oct 30, 2025
32 of 38 checks passed
rueian pushed a commit to rueian/kuberay that referenced this pull request Oct 30, 2025
ray-project#4141)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

Signed-off-by: 400Ping <[email protected]>

* [Fix] Fix e2e error

Signed-off-by: 400Ping <[email protected]>

* [Fix] fix according to rueian's comment

Signed-off-by: 400Ping <[email protected]>

* [Chore] fix ci error

Signed-off-by: 400Ping <[email protected]>

* Update ray-operator/controllers/ray/raycluster_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* Update ray-operator/controllers/ray/rayjob_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Trigger CI

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Rueian <[email protected]>
andrewsykim pushed a commit that referenced this pull request Oct 30, 2025
#4141) (#4156)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted



* [Fix] Fix e2e error



* [Fix] fix according to rueian's comment



* [Chore] fix ci error



* Update ray-operator/controllers/ray/raycluster_controller.go




* Update ray-operator/controllers/ray/rayjob_controller.go




* update



* update



* Trigger CI



---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Ping <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
andrewsykim added a commit that referenced this pull request Nov 21, 2025
* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted (#4141)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

Signed-off-by: 400Ping <[email protected]>

* [Fix] Fix e2e error

Signed-off-by: 400Ping <[email protected]>

* [Fix] fix according to rueian's comment

Signed-off-by: 400Ping <[email protected]>

* [Chore] fix ci error

Signed-off-by: 400Ping <[email protected]>

* Update ray-operator/controllers/ray/raycluster_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* Update ray-operator/controllers/ray/rayjob_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Trigger CI

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>

* fix: dashboard build for kuberay 1.5.0 (#4161)

Signed-off-by: Future-Outlier <[email protected]>

* [Feature Enhancement] Set ordered replica index label to support multi-slice (#4163)

* [Feature Enhancement] Set ordered replica index label to support multi-slice

Signed-off-by: Ryan O'Leary <[email protected]>

* rename replica-id -> replica-name

Signed-off-by: Ryan O'Leary <[email protected]>

* Separate replica index feature gate logic

Signed-off-by: Ryan O'Leary <[email protected]>

* remove index arg in createWorkerPod

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>

* update stale feature gate comments (#4174)

Signed-off-by: Andrew Sy Kim <[email protected]>

* [RayCluster] Add more context why we don't recreate head Pod for RayJob (#4175)

Signed-off-by: Kai-Hsun Chen <[email protected]>

* feature: Remove empty resource list initialization. (#4168)

Fixes #4142.

* [Dockerfile] [KubeRay Dashboard]: Fix Dockerfile warnings (ENV format, CMD JSON args) (#4167)

* [#4166] improvement: Fix Dockerfile warnings (ENV format, CMD JSON args)

* extract the hostname from CMD

Signed-off-by: Neo Chien <[email protected]>

---------

Signed-off-by: Neo Chien <[email protected]>
Co-authored-by: cchung100m <[email protected]>

* [Fix] Resolve int32 overflow by having the calculation in int64 and c… (#4158)

* [Fix] Resolve int32 overflow by having the calculation in int64 and cap it if the count is over math.MaxInt32

Signed-off-by: justinyeh1995 <[email protected]>

* [Test] Add unit tests for CalculateReadyReplicas

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] Add a nosec comment to pass the Lint (pre-commit) test

Signed-off-by: justinyeh1995 <[email protected]>

* [Refactor] Add CapInt64ToInt32 to replace #nosec directives

Signed-off-by: justinyeh1995 <[email protected]>

* [Refactor] Rename function to SafeInt64ToInt32 and add a underflowing prevention (it also help pass the lint test)

Signed-off-by: justinyeh1995 <[email protected]>

* [Refactor] Remove the early return as SafeInt64ToInt32 handles the int32 overflow and underflow checking.

Signed-off-by: justinyeh1995 <[email protected]>

---------

Signed-off-by: justinyeh1995 <[email protected]>

* Add RayService incremental upgrade sample for guide (#4164)

Signed-off-by: Ryan O'Leary <[email protected]>

* Edit RayCluster example config for label selectors (#4151)

Signed-off-by: Ryan O'Leary <[email protected]>

* [RayJob] update light weight submitter image from quay.io (#4181)

Signed-off-by: Future-Outlier <[email protected]>

* [flaky] RayJob fails when head Pod is deleted when job is running (#4182)

Signed-off-by: Future-Outlier <[email protected]>

* [CI] Pin Docker api version to avoid API version mismatch (#4188)

Signed-off-by: win5923 <[email protected]>

* Make replicas configurable for kuberay-operator #4180 (#4195)

* Make replicas configurable for kuberay-operator #4180

* Make replicas configurable for kuberay-operator #4180

* [Fix] rayjob update raycluster status (#4192)

* feat: check if raycluster status update in rayjob

* test: e2e test to check the rayjob raycluster status update

* fix: dashboard http client tests discovered and passing (#4173)

Signed-off-by: alimaazamat <[email protected]>

* [RayJob] Lift cluster status while initializing (#4191)

Signed-off-by: Spencer Peterson <[email protected]>

* [RayJob] Remove updateJobStatus call (#4198)

Fast follow to #4191

Signed-off-by: Spencer Peterson <[email protected]>

* Add support for Ray token auth (#4179)

* Add support for Ray token auth

Signed-off-by: Andrew Sy Kim <[email protected]>

* add e2e test for Ray cluster auth

Signed-off-by: Andrew Sy Kim <[email protected]>

* address nits from Ruiean

Signed-off-by: Andrew Sy Kim <[email protected]>

* update RAY_auth_mode -> RAY_AUTH_MODE

Signed-off-by: Andrew Sy Kim <[email protected]>

* configure auth for Ray autoscaler

Signed-off-by: Andrew Sy Kim <[email protected]>

---------

Signed-off-by: Andrew Sy Kim <[email protected]>

* Bump js-yaml from 4.1.0 to 4.1.1 in /dashboard (#4194)

Bumps [js-yaml](https://github.com/nodeca/js-yaml) from 4.1.0 to 4.1.1.
- [Changelog](https://github.com/nodeca/js-yaml/blob/master/CHANGELOG.md)
- [Commits](nodeca/js-yaml@4.1.0...4.1.1)

---
updated-dependencies:
- dependency-name: js-yaml
  dependency-version: 4.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* update minimum Ray version required for token authentication to 2.52.0 (#4201)

* update minimum Ray version required for token authentication to 2.52.0

Signed-off-by: Andrew Sy Kim <[email protected]>

* update RayCluster auth e2e test to use Ray v2.52

Signed-off-by: Andrew Sy Kim <[email protected]>

---------

Signed-off-by: Andrew Sy Kim <[email protected]>

* add samples for RayCluster token auth (#4200)

Signed-off-by: Andrew Sy Kim <[email protected]>

* update (#4208)

Signed-off-by: Future-Outlier <[email protected]>

* [RayJob] Add token authentication support for All mode (#4210)

* dashboard client authentication support

Signed-off-by: Future-Outlier <[email protected]>

* support rayjob

Signed-off-by: Future-Outlier <[email protected]>

* update to fix api serverr err

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* updarte

Signed-off-by: Future-Outlier <[email protected]>

* Rayjob sidecar mode auth token mode support

Signed-off-by: Future-Outlier <[email protected]>

* RayJob support k8s job mode

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Address Andrew's advice

Signed-off-by: Future-Outlier <[email protected]>

* add todo x-ray-authorization comments

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>

* [RayCluster] Enable Secret informer watch/list and remove unused RBAC verbs (#4202)

* Add authentication secret reconciliation support

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* fix flaky test

Signed-off-by: Future-Outlier <[email protected]>

* remove test fix

Signed-off-by: Rueian <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Rueian <[email protected]>

* [APIServer][Docs] Add user guide for retry behavior & configuration (#4144)

* [Docs] Add the draft description about feature intro, configurations, and usecases

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] Update the retry walk-through

Signed-off-by: justinyeh1995 <[email protected]>

* [Doc] rewrite the first 2 sections

Signed-off-by: justinyeh1995 <[email protected]>

* [Doc] Revise documentation wording and add Observing Retry Behavior section

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] fix linting issue by running pre-commit run berfore commiting

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] fix linting errors in the Markdown linting

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] Clean up the math equation

Signed-off-by: justinyeh1995 <[email protected]>

* Update the math formula of Backoff calculation.

Co-authored-by: Nary Yeh <[email protected]>
Signed-off-by: JustinYeh <[email protected]>

* [Fix] Explicitly mentioned exponential backoff and removed the customization parts

Signed-off-by: justinyeh1995 <[email protected]>

* [Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer”

Co-authored-by: Cheng-Yeh Chung <[email protected]>
Signed-off-by: JustinYeh <[email protected]>

* [Docs] Rename retry-configuration.md to retry-behavior.md for accuracy

Signed-off-by: justinyeh1995 <[email protected]>

* Update Title to KubeRay APIServer Retry Behavior

Co-authored-by: Cheng-Yeh Chung <[email protected]>
Signed-off-by: JustinYeh <[email protected]>

* [Docs] Add a note about the limitation of retry configuration

Signed-off-by: justinyeh1995 <[email protected]>

---------

Signed-off-by: justinyeh1995 <[email protected]>
Signed-off-by: JustinYeh <[email protected]>
Co-authored-by: Nary Yeh <[email protected]>
Co-authored-by: Cheng-Yeh Chung <[email protected]>

* Support X-Ray-Authorization fallback header for accepting auth token via proxy (#4213)

* Support X-Ray-Authorization fallback header for accepting auth token in dashboard

Signed-off-by: Future-Outlier <[email protected]>

* remove todo comment

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>

* [RayCluster] make auth token secret name consistency (#4216)

Signed-off-by: fscnick <[email protected]>

* [RayCluster] Status includes head containter status message (#4196)

* [RayCluster] Status includes head containter status message

Signed-off-by: Spencer Peterson <[email protected]>

* lint

Signed-off-by: Spencer Peterson <[email protected]>

* [RayCluster] Containers not ready status reflects structured reason

Signed-off-by: Spencer Peterson <[email protected]>

* nit

Signed-off-by: Spencer Peterson <[email protected]>

---------

Signed-off-by: Spencer Peterson <[email protected]>

* Remove erroneous  call in applyServeTargetCapacity (#4212)

Signed-off-by: Ryan O'Leary <[email protected]>

* [RayJob] Add token authentication support for light weight job submitter (#4215)

* [RayJob] light weight job submitter auth token support

Signed-off-by: Future-Outlier <[email protected]>

* X-Ray-Authorization

Signed-off-by: Rueian <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Rueian <[email protected]>

* feat: kubectl ray get token command (#4218)

* feat: kubectl ray get token command

Signed-off-by: Rueian <[email protected]>

* Update kubectl-plugin/pkg/cmd/get/get_token_test.go

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Rueian <[email protected]>

* Update kubectl-plugin/pkg/cmd/get/get_token.go

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Rueian <[email protected]>

* make sure the raycluster exists before getting the secret

Signed-off-by: Rueian <[email protected]>

* better ux

Signed-off-by: Rueian <[email protected]>

* Update kubectl-plugin/pkg/cmd/get/get_token.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Rueian <[email protected]>

---------

Signed-off-by: Rueian <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Neo Chien <[email protected]>
Signed-off-by: justinyeh1995 <[email protected]>
Signed-off-by: win5923 <[email protected]>
Signed-off-by: alimaazamat <[email protected]>
Signed-off-by: Spencer Peterson <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: JustinYeh <[email protected]>
Signed-off-by: fscnick <[email protected]>
Co-authored-by: Ping <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Kavish <[email protected]>
Co-authored-by: Neo Chien <[email protected]>
Co-authored-by: cchung100m <[email protected]>
Co-authored-by: JustinYeh <[email protected]>
Co-authored-by: Jun-Hao Wan <[email protected]>
Co-authored-by: Divyam Raj <[email protected]>
Co-authored-by: Nary Yeh <[email protected]>
Co-authored-by: Alima Azamat <[email protected]>
Co-authored-by: Spencer Peterson <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Rueian <[email protected]>
Co-authored-by: Cheng-Yeh Chung <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Copilot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

3 participants