
Conversation

@JiangJiaWei1103
Contributor

Why are these changes needed?

The current deletionStrategy relies exclusively on the terminal states of JobStatus (SUCCEEDED or FAILED). However, there are several scenarios in which a user-deployed RayJob ends up with JobStatus == "" (JobStatusNew) while JobDeploymentStatus == "Failed". In these cases, the associated resources (e.g., RayJob, RayCluster, etc.) remain stuck and are never cleaned up, resulting in indefinite resource consumption.

Changes

  • Add the JobDeploymentStatus field to DeletionCondition
    • Currently supports Failed only
  • Enforce mutual exclusivity between JobStatus and JobDeploymentStatus within DeletionCondition

Implementation Details

To determine which field the user specifies, we use pointers instead of raw values. Both JobStatus and JobDeploymentStatus have empty strings as their zero values, which correspond to a "new" state. Using nil allows us to reliably distinguish between "unspecified" and "explicitly set," avoiding unintended ambiguity.
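The pointer-based approach can be sketched as follows. This is a simplified illustration, not the actual KubeRay code: the type and field names mirror the PR description, but the `validate` helper and its name are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the rayv1 status types; both have "" as their
// zero value, which corresponds to the "new" state.
type JobStatus string
type JobDeploymentStatus string

// DeletionCondition uses pointers so that nil ("unspecified") is
// distinguishable from the zero value "" (an explicitly set "new" state).
type DeletionCondition struct {
	JobStatus           *JobStatus
	JobDeploymentStatus *JobDeploymentStatus
	TTLSeconds          int32
}

// validate is a hypothetical helper enforcing the mutual exclusivity
// described above: exactly one of the two status fields must be set.
func (c DeletionCondition) validate() error {
	if (c.JobStatus == nil) == (c.JobDeploymentStatus == nil) {
		return errors.New("exactly one of jobStatus or jobDeploymentStatus must be set")
	}
	return nil
}

func main() {
	failed := JobStatus("FAILED")
	ok := DeletionCondition{JobStatus: &failed, TTLSeconds: 30}
	bad := DeletionCondition{TTLSeconds: 30} // neither field set
	fmt.Println(ok.validate())         // <nil>
	fmt.Println(bad.validate() != nil) // true
}
```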

Related issue number

Closes #4233.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian
Collaborator

rueian commented Dec 8, 2025

The helm lint is failing.

@pickymodel

The helm lint is failing.

Will fix after getting off work, thanks for reviewing!

@Future-Outlier
Member

cc @seanlaii and @win5923 for help. Note that we need to wait until @andrewsykim is back to discuss the API change.

Collaborator

@win5923 win5923 left a comment

Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is also supported?

# DeletionStrategy defines the deletion policies for a RayJob.
# It allows for fine-grained control over resource cleanup after a job finishes.
# DeletionRules is a list of deletion rules, processed based on their trigger conditions.
# While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),
# the most impactful rule (e.g., DeleteCluster) will be executed first to prioritize resource cleanup and cost savings.
deletionStrategy:
# This sample demonstrates a staged cleanup process for a RayJob.
# Regardless of whether the job succeeds or fails, the cleanup follows these steps:
# 1. After 30 seconds, the worker pods are deleted. This allows for quick resource release while keeping the head pod for debugging.
# 2. After 60 seconds, the entire RayCluster (including the head pod) is deleted.
# 3. After 90 seconds, the RayJob custom resource itself is deleted, removing it from the Kubernetes API server.

@JiangJiaWei1103
Contributor Author

Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is also supported?

# DeletionStrategy defines the deletion policies for a RayJob.
# It allows for fine-grained control over resource cleanup after a job finishes.
# DeletionRules is a list of deletion rules, processed based on their trigger conditions.
# While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),
# the most impactful rule (e.g., DeleteCluster) will be executed first to prioritize resource cleanup and cost savings.
deletionStrategy:
# This sample demonstrates a staged cleanup process for a RayJob.
# Regardless of whether the job succeeds or fails, the cleanup follows these steps:
# 1. After 30 seconds, the worker pods are deleted. This allows for quick resource release while keeping the head pod for debugging.
# 2. After 60 seconds, the entire RayCluster (including the head pod) is deleted.
# 3. After 90 seconds, the RayJob custom resource itself is deleted, removing it from the Kubernetes API server.

Hi @win5923, nice suggestion. I'm considering adding one more sample demonstrating JobDeploymentStatus-based deletion rules, wdyt?

Contributor

@seanlaii seanlaii left a comment

Thanks! Overall LGTM. Only some reminders:

  1. We might need to either move the e2e tests for DeletionStrategy to a separate action in the CI pipeline or increase the e2e test timeout, as the tests exceed the current 40-minute limit. cc @rueian
  2. It might be good to clarify that we evaluate the rules in order, so if a user specifies a different deletionPolicy for a similar status, the first deletionRule will be used, for example:
deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 30
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: Failed
        ttlSeconds: 30
      policy: DeleteCluster

Thanks!

@JiangJiaWei1103
Contributor Author

JiangJiaWei1103 commented Dec 14, 2025

Thanks! Overall LGTM. Only some reminders:

  1. We might need to either move the e2e tests for DeletionStrategy to a separate action in the CI pipeline or increase the e2e test timeout, as the tests exceed the current 40-minute limit. cc @rueian
  2. It might be good to clarify that we evaluate the rules in order, so if a user specifies a different deletionPolicy for a similar status, the first deletionRule will be used, for example:
deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 30
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: Failed
        ttlSeconds: 30
      policy: DeleteCluster

Thanks!

Thanks for reviewing! For the first one, let's wait for rueian's reply.

As for the second, since both rules match the corresponding status, both will be added to overdueRules. selectMostImpactfulRule then prioritizes the most impactful rule (DeleteCluster 3 > DeleteWorkers 2 in this case), so I think DeleteCluster will be executed first. The following demonstrates an example:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  namespace: default
  name: demo-del-rules
spec:
  submissionMode: "K8sJobMode"
  entrypoint: "python -c 'import sys, time; time.sleep(45); sys.exit(1)'"

  deletionStrategy:
    deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 10
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: Failed
        ttlSeconds: 10
      policy: DeleteCluster

    # ...

The following logs show that the most impactful policy, DeleteCluster, is executed:

{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing the most impactful overdue deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}},"overdueRulesCount":2}
{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing deletion policy: DeleteCluster","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":"del-seq-gcz64"}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"The associated RayCluster for RayJob is deleted","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":{"name":"del-seq-gcz64","namespace":"default"}}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"deleteClusterResources","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","isClusterDeleted":false}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"All applicable deletion rules have been processed.","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","JobStatus":"FAILED","JobDeploymentStatus":"Failed","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteWorkers","condition":{"jobStatus":"FAILED","ttlSeconds":10}}}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}}}

If I'm mistaken, please let me know. Thanks a lot!
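The selection behavior described above can be sketched as follows. The impact ranks for DeleteWorkers (2) and DeleteCluster (3) come from this thread; the DeletionRule shape, the DeleteNone rank, and the function body are illustrative assumptions, not the controller's actual code.

```go
package main

import "fmt"

// DeletionRule is a pared-down stand-in for the controller's rule type.
type DeletionRule struct {
	Policy string
}

// impact ranks policies by how much they clean up; a higher rank wins.
var impact = map[string]int{
	"DeleteNone":    0, // assumed rank
	"DeleteWorkers": 2,
	"DeleteCluster": 3,
}

// selectMostImpactfulRule picks the overdue rule with the highest impact.
// It assumes overdue is non-empty.
func selectMostImpactfulRule(overdue []DeletionRule) DeletionRule {
	best := overdue[0]
	for _, r := range overdue[1:] {
		if impact[r.Policy] > impact[best.Policy] {
			best = r
		}
	}
	return best
}

func main() {
	// Both rules from the sample above matched, so both are overdue.
	overdue := []DeletionRule{{Policy: "DeleteWorkers"}, {Policy: "DeleteCluster"}}
	fmt.Println(selectMostImpactfulRule(overdue).Policy) // DeleteCluster
}
```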

@seanlaii
Contributor

seanlaii commented Dec 14, 2025

As for the second, since both rules match the corresponding status, they will be added to overdueRules. selectMostImpactfulRule then prioritizes the most impactful rule (DeleteCluster 3 > DeleteWorkers 2 in this case) to execute, so I think DeleteCluster will be executed first.

You are right. My mistake. I forgot that it first fetches all the matching rules. Thanks for the explanation!

Contributor

@seanlaii seanlaii left a comment

Thanks for reviewing! For the first one, let's wait for Rueian's reply.

Actually, the current total time for the RayJob e2e tests is around 38 mins, which doesn't exceed 40 mins. We can increase it if needed in the future. It should be fine for now.

Thanks! LGTM.

Tag @rueian for the second pass. Thanks!

WithPolicy(rayv1.DeleteNone).
WithCondition(rayv1ac.DeletionCondition().
WithJobDeploymentStatus(rayv1.JobDeploymentStatusFailed).
WithTTLSeconds(10)), // 10 second TTL for testing
Collaborator

Do you think we can set the TTL to a shorter time here to speed things up? We wait until after the TTL to do the assertion.

Contributor Author

This works for me. Using a shorter TTL would allow a faster requeue with a smaller nextRequeueTime. How about setting it to 2?

Contributor Author

Hi Nary,

I reverted the commit that tested a shorter TTL, since we need to check resource preservation for a short period after the condition is matched. For example:

g.Consistently(func(gg Gomega) {
    cluster, err := GetRayCluster(test, namespace.Name, rayClusterName)
    gg.Expect(err).NotTo(HaveOccurred())
    gg.Expect(cluster).NotTo(BeNil())
    workerPods, err := GetWorkerPods(test, cluster)
    gg.Expect(err).NotTo(HaveOccurred())
    gg.Expect(workerPods).ToNot(BeEmpty())
    headPod, err := GetHeadPod(test, cluster)
    gg.Expect(err).NotTo(HaveOccurred())
    gg.Expect(headPod).NotTo(BeNil())
    jobObj, err := GetRayJob(test, rayJob.Namespace, rayJob.Name)
    gg.Expect(err).NotTo(HaveOccurred())
    gg.Expect(jobObj).NotTo(BeNil())
}, 8*time.Second, 2*time.Second).Should(Succeed()) // Check every 2s for 8s

Comment on lines +772 to +780
jobObj, err := GetRayJob(test, rayJob.Namespace, rayJob.Name)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(jobObj).NotTo(BeNil())
cluster, err := GetRayCluster(test, namespace.Name, rayClusterName)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(cluster).NotTo(BeNil())
workerPods, err := GetWorkerPods(test, cluster)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(workerPods).ToNot(BeEmpty())
Collaborator

nit: I've seen this check used in multiple places. Maybe we can extract it into a verifyAllResourcesExist function for simplification? WDYT?
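For illustration, such a helper might look like the sketch below, with plain getter functions standing in for the suite's GetRayJob/GetRayCluster/GetWorkerPods helpers and Gomega assertions; verifyAllResourcesExist is the hypothetical name suggested here.

```go
package main

import (
	"errors"
	"fmt"
)

// verifyAllResourcesExist checks that every named getter returns a non-nil
// object without error. In the real suite each getter would wrap a call such
// as GetRayCluster plus its Gomega expectations.
func verifyAllResourcesExist(getters map[string]func() (any, error)) error {
	for name, get := range getters {
		obj, err := get()
		if err != nil {
			return fmt.Errorf("%s: %w", name, err)
		}
		if obj == nil {
			return errors.New(name + " not found")
		}
	}
	return nil
}

func main() {
	getters := map[string]func() (any, error){
		"rayJob":     func() (any, error) { return struct{}{}, nil },
		"rayCluster": func() (any, error) { return struct{}{}, nil },
	}
	fmt.Println(verifyAllResourcesExist(getters)) // <nil>
}
```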

Contributor Author

Nice suggestion. How about we refactor this part and handle the cleanup logic in a follow-up PR? That way, we can keep this PR more focused and avoid introducing too many changes at once.

Collaborator

SG!

Comment on lines +784 to +794
// Cleanup: delete RayJob to release cluster and pods.
LogWithTimestamp(test.T(), "Cleaning up RayJob %s/%s after DeleteNone scenario", rayJob.Namespace, rayJob.Name)
err = test.Client().Ray().RayV1().RayJobs(rayJob.Namespace).Delete(test.Ctx(), rayJob.Name, metav1.DeleteOptions{})
g.Expect(err).NotTo(HaveOccurred())
g.Eventually(func() error {
    _, err := GetRayJob(test, rayJob.Namespace, rayJob.Name)
    return err
}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))
g.Eventually(func() error {
    _, err := GetRayCluster(test, namespace.Name, rayClusterName)
    return err
}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))
LogWithTimestamp(test.T(), "Cleanup after DeleteNone scenario complete")
})
Collaborator

nit: no need to do this in this PR, but I think we can extract the cleanup into a helper function and put it in a defer block right after the RayJob creation to make the test easier to follow

cc @rueian to confirm if we want to do this
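A minimal sketch of the pattern being suggested, assuming a hypothetical runScenario with an event log standing in for the real test-client calls: the cleanup is registered with defer immediately after creation, so it always runs last.

```go
package main

import "fmt"

// runScenario records the order of operations; deleteRayJob stands in for the
// real client deletion call. The named return lets the deferred cleanup
// append its event before the function finishes.
func runScenario() (events []string) {
	deleteRayJob := func(name string) { events = append(events, "deleted "+name) }

	name := "demo-del-rules" // hypothetical RayJob name
	events = append(events, "created "+name)
	defer deleteRayJob(name) // cleanup runs last, even if the body fails

	events = append(events, "asserted "+name)
	return
}

func main() {
	for _, e := range runScenario() {
		fmt.Println(e)
	}
}
```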

Collaborator

@win5923 win5923 left a comment

LGTM! Thank you.

win5923
win5923 approved these changes Dec 14, 2025
Comment on lines +156 to +160
// JobDeploymentStatus is the terminal status of the RayJob deployment that triggers this condition.
// For the initial implementation, only "Failed" is supported.
// +kubebuilder:validation:Enum=Failed
// +optional
JobDeploymentStatus *JobDeploymentStatus `json:"jobDeploymentStatus,omitempty"`
Contributor Author

For now, we only support the Failed option.



Collaborator

@machichima machichima left a comment

Overall LGTM! Thank you

This reverts commit 0588f35.

We need to pass consistency checks for resource preservation.

Signed-off-by: JiangJiaWei1103 <[email protected]>
@rueian rueian merged commit 66744cb into ray-project:master Dec 19, 2025
27 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature] Automatic Cleanup for RayJobs that exceed activeDeadlineSeconds
