[Feature] Support JobDeploymentStatus as the deletion condition #4262
Conversation
The helm lint is failing.

Will fix after getting off work, thanks for reviewing!

cc @seanlaii and @win5923 for help. Note that we need to wait until @andrewsykim is back to discuss the API change.
win5923
left a comment
Hi @JiangJiaWei1103, can you also update the comment to mention that JobDeploymentStatus is also supported?
kuberay/ray-operator/config/samples/ray-job.deletion-rules.yaml
Lines 12 to 22 in e32405e
```yaml
# DeletionStrategy defines the deletion policies for a RayJob.
# It allows for fine-grained control over resource cleanup after a job finishes.
# DeletionRules is a list of deletion rules, processed based on their trigger conditions.
# While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),
# the most impactful rule (e.g., DeleteCluster) will be executed first to prioritize resource cleanup and cost savings.
deletionStrategy:
  # This sample demonstrates a staged cleanup process for a RayJob.
  # Regardless of whether the job succeeds or fails, the cleanup follows these steps:
  # 1. After 30 seconds, the worker pods are deleted. This allows for quick resource release while keeping the head pod for debugging.
  # 2. After 60 seconds, the entire RayCluster (including the head pod) is deleted.
  # 3. After 90 seconds, the RayJob custom resource itself is deleted, removing it from the Kubernetes API server.
```
Hi @win5923, nice suggestion. I'm considering adding one more sample demonstrating JobDeploymentStatus-based deletion rules, wdyt?
Thanks! Overall LGTM. Only some reminders:
- We might need to either move the e2e tests for DeletionStrategy to a separate action in the CI pipeline or increase the timeout for the e2e tests, as they exceed the current timeout of 40 mins. cc @rueian
- It might be good to clarify that we evaluate the rules in order, so if a user specifies a different deletionPolicy for a similar status, the first deletionRule will be used, for example:
```yaml
deletionRules:
- condition:
    jobStatus: FAILED
    ttlSeconds: 30
  policy: DeleteWorkers
- condition:
    jobDeploymentStatus: FAILED
    ttlSeconds: 30
  policy: DeleteCluster
```
Thanks!
Thanks for reviewing! For the first one, let's wait for Rueian's reply. As for the second, since both rules match the corresponding status, they will both be added to the list of overdue rules, and the most impactful one will be executed first. For example, given the following RayJob:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  namespace: default
  name: demo-del-rules
spec:
  submissionMode: "K8sJobMode"
  entrypoint: "python -c 'import sys, time; time.sleep(45); sys.exit(1)'"
  deletionStrategy:
    deletionRules:
    - condition:
        jobStatus: FAILED
        ttlSeconds: 10
      policy: DeleteWorkers
    - condition:
        jobDeploymentStatus: Failed
        ttlSeconds: 10
      policy: DeleteCluster
  # ...
```

The following logs show the most impactful policy being executed first:

```
{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing the most impactful overdue deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}},"overdueRulesCount":2}
{"level":"info","ts":"2025-12-14T09:42:29.001+0800","logger":"controllers.RayJob","msg":"Executing deletion policy: DeleteCluster","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":"del-seq-gcz64"}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"The associated RayCluster for RayJob is deleted","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","RayCluster":{"name":"del-seq-gcz64","namespace":"default"}}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"deleteClusterResources","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","isClusterDeleted":false}
{"level":"info","ts":"2025-12-14T09:42:29.013+0800","logger":"controllers.RayJob","msg":"All applicable deletion rules have been processed.","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"a9595cd2-df62-4291-b8b7-3e47a772ae6e","deletionMechanism":"DeletionRules"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","JobStatus":"FAILED","JobDeploymentStatus":"Failed","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteWorkers","condition":{"jobStatus":"FAILED","ttlSeconds":10}}}
{"level":"info","ts":"2025-12-14T09:42:29.015+0800","logger":"controllers.RayJob","msg":"Skipping completed deletion rule","RayJob":{"name":"del-seq","namespace":"default"},"reconcileID":"7ea2a0cf-1f04-4978-b819-ff7f4784b50e","deletionMechanism":"DeletionRules","rule":{"policy":"DeleteCluster","condition":{"jobDeploymentStatus":"Failed","ttlSeconds":10}}}If I'm mistaken, please let me know. Thanks a lot! |
You are right. My mistake. I forgot that it first fetches all the matched rules. Thanks for the explanation!
Thanks for reviewing! For the first one, let's wait for Rueian's reply.
Actually, the current total time for the RayJob e2e tests is around 38 mins, which doesn't exceed 40 mins. We can increase it if needed in the future. It should be fine for now.
Thanks! LGTM.
Tag @rueian for the second pass. Thanks!
```go
WithPolicy(rayv1.DeleteNone).
	WithCondition(rayv1ac.DeletionCondition().
		WithJobDeploymentStatus(rayv1.JobDeploymentStatusFailed).
		WithTTLSeconds(10)), // 10 second TTL for testing
```
Do you think we can set the TTL to a shorter time here to speed things up? We wait until after the TTL to do the assertion.
This works for me. Using a shorter TTL would allow a faster requeue with a smaller nextRequeueTime. How about setting it to 2?
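For context, a hedged sketch of how the requeue delay could relate to the TTL; the function and parameter names are illustrative, not KubeRay's internals:

```go
import "time"

// nextRequeueAfter derives the requeue delay from the time the condition was
// met and the rule's TTL: a shorter TTL leaves a smaller remaining duration,
// hence a faster requeue. Hypothetical helper for illustration only.
func nextRequeueAfter(conditionMetAt time.Time, ttlSeconds int32, now time.Time) time.Duration {
	deadline := conditionMetAt.Add(time.Duration(ttlSeconds) * time.Second)
	if remaining := deadline.Sub(now); remaining > 0 {
		return remaining // requeue just in time for the TTL to expire
	}
	return 0 // already overdue; reconcile immediately
}
```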
Hi Nary,

I reverted the commit that tested a shorter TTL, since we need to check resource preservation for a short period of time after the condition is matched. For example:
kuberay/ray-operator/test/e2erayjob/rayjob_deletion_strategy_test.go
Lines 81 to 94 in 615fa88
```go
g.Consistently(func(gg Gomega) {
	cluster, err := GetRayCluster(test, namespace.Name, rayClusterName)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(cluster).NotTo(BeNil())
	workerPods, err := GetWorkerPods(test, cluster)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(workerPods).ToNot(BeEmpty())
	headPod, err := GetHeadPod(test, cluster)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(headPod).NotTo(BeNil())
	jobObj, err := GetRayJob(test, rayJob.Namespace, rayJob.Name)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(jobObj).NotTo(BeNil())
}, 8*time.Second, 2*time.Second).Should(Succeed()) // Check every 2s for 8s
```
```go
jobObj, err := GetRayJob(test, rayJob.Namespace, rayJob.Name)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(jobObj).NotTo(BeNil())
cluster, err := GetRayCluster(test, namespace.Name, rayClusterName)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(cluster).NotTo(BeNil())
workerPods, err := GetWorkerPods(test, cluster)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(workerPods).ToNot(BeEmpty())
```
nit: I saw this check used in multiple places. Maybe we can extract this into a verifyAllResourcesExist function for simplification? WDYT?
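For illustration, a minimal sketch of what such a helper could look like, reusing the Get* helpers and types from the snippets above (the name and signature are only a suggestion):

```go
// verifyAllResourcesExist consolidates the repeated existence checks for the
// RayCluster, its pods, and the RayJob into one helper.
func verifyAllResourcesExist(gg Gomega, test Test, namespace, rayClusterName, rayJobName string) {
	cluster, err := GetRayCluster(test, namespace, rayClusterName)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(cluster).NotTo(BeNil())

	workerPods, err := GetWorkerPods(test, cluster)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(workerPods).ToNot(BeEmpty())

	headPod, err := GetHeadPod(test, cluster)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(headPod).NotTo(BeNil())

	jobObj, err := GetRayJob(test, namespace, rayJobName)
	gg.Expect(err).NotTo(HaveOccurred())
	gg.Expect(jobObj).NotTo(BeNil())
}
```

The Consistently block above would then reduce to a single call inside the polling closure.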
Nice suggestion. How about we refactor this part and handle the cleanup logic in a follow-up PR? That way, we can keep this PR more focused and avoid introducing too many changes at once.
SG!
```go
// Cleanup: delete RayJob to release cluster and pods.
LogWithTimestamp(test.T(), "Cleaning up RayJob %s/%s after DeleteNone scenario", rayJob.Namespace, rayJob.Name)
err = test.Client().Ray().RayV1().RayJobs(rayJob.Namespace).Delete(test.Ctx(), rayJob.Name, metav1.DeleteOptions{})
g.Expect(err).NotTo(HaveOccurred())
g.Eventually(func() error {
	_, err := GetRayJob(test, rayJob.Namespace, rayJob.Name)
	return err
}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))
g.Eventually(func() error {
	_, err := GetRayCluster(test, namespace.Name, rayClusterName)
	return err
}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))
LogWithTimestamp(test.T(), "Cleanup after DeleteNone scenario complete")
})
```
nit: No need to be in this PR, but I think we can probably extract the cleanup into a helper function, then put the cleanup in a defer block right after the RayJob creation to make the test easier to follow (see the sketch below).

cc @rueian to confirm if we want to do this
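A hypothetical sketch of that helper, assuming the client, helpers, and timeouts already used in the snippet above:

```go
// deleteRayJobAndWait deletes the RayJob and waits until it is gone; intended
// to be registered with defer right after the RayJob is created.
func deleteRayJobAndWait(test Test, g Gomega, namespace, name string) {
	err := test.Client().Ray().RayV1().RayJobs(namespace).Delete(test.Ctx(), name, metav1.DeleteOptions{})
	g.Expect(err).NotTo(HaveOccurred())
	g.Eventually(func() error {
		_, err := GetRayJob(test, namespace, name)
		return err
	}, TestTimeoutMedium).Should(WithTransform(k8serrors.IsNotFound, BeTrue()))
}

// Usage right after creation:
//   defer deleteRayJobAndWait(test, g, rayJob.Namespace, rayJob.Name)
```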
win5923
left a comment
LGTM! Thank you.
```go
// JobDeploymentStatus is the terminal status of the RayJob deployment that triggers this condition.
// For the initial implementation, only "Failed" is supported.
// +kubebuilder:validation:Enum=Failed
// +optional
JobDeploymentStatus *JobDeploymentStatus `json:"jobDeploymentStatus,omitempty"`
```
For now, we only support the Failed option.
machichima
left a comment
Overall LGTM! Thank you
This reverts commit 0588f35. We need to pass consistency checks for resource preservation.
Why are these changes needed?
The current deletionStrategy relies exclusively on the terminal states of JobStatus (SUCCEEDED or FAILED). However, there are several scenarios in which a user-deployed RayJob ends up with JobStatus == "" (JobStatusNew) while JobDeploymentStatus == "Failed". In these cases, the associated resources (e.g., RayJob, RayCluster, etc.) remain stuck and are never cleaned up, resulting in indefinite resource consumption.

Changes
- Add a JobDeploymentStatus field to DeletionCondition, supporting Failed only for the initial implementation
- Validate the mutual exclusivity of JobStatus and JobDeploymentStatus within DeletionCondition (a sketch follows this list)
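A hedged sketch of what the mutual-exclusivity check could look like; the types below are simplified stand-ins, not the PR's exact code:

```go
import "fmt"

// DeletionCondition is an illustrative stand-in for the API type; the real
// definition lives in the rayv1 package. Pointers let "unset" (nil) differ
// from the empty-string zero value.
type DeletionCondition struct {
	JobStatus           *string // e.g. "SUCCEEDED" or "FAILED"
	JobDeploymentStatus *string // only "Failed" in the initial implementation
}

// validateDeletionCondition enforces that exactly one trigger status is set.
func validateDeletionCondition(c DeletionCondition) error {
	if (c.JobStatus == nil) == (c.JobDeploymentStatus == nil) {
		return fmt.Errorf("exactly one of jobStatus or jobDeploymentStatus must be set")
	}
	return nil
}
```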
Implementation Details

To determine which field the user specifies, we use pointers instead of raw values. Both JobStatus and JobDeploymentStatus have empty strings as their zero values, which correspond to a "new" state. Using nil allows us to reliably distinguish between "unspecified" and "explicitly set," avoiding unintended ambiguity.
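To make the pointer rationale concrete, a minimal sketch of condition matching, assuming the stand-in types from the sketch above: with raw string values, an unset JobStatus ("") would wrongly match JobStatusNew (""), which is exactly the ambiguity nil avoids.

```go
// conditionMatches compares only the field the user explicitly set; a nil
// pointer means "unspecified", so it is never compared against a zero value.
func conditionMatches(c DeletionCondition, jobStatus, jobDeploymentStatus string) bool {
	if c.JobStatus != nil {
		return *c.JobStatus == jobStatus
	}
	if c.JobDeploymentStatus != nil {
		return *c.JobDeploymentStatus == jobDeploymentStatus
	}
	return false // neither set; validation rejects this case
}
```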
Related issue number

Closes #4233.
Checks