[E2E] [RayCronJob] add e2e test for suspend behavior#4349
rueian merged 10 commits into ray-project:master
| //+kubebuilder:printcolumn:name="age",type="date",JSONPath=".metadata.creationTimestamp",priority=0 | ||
| //+kubebuilder:printcolumn:name="suspend",type=boolean,JSONPath=".spec.suspend",priority=0 | ||
|
|
||
| // +genclient |
Good catch!
Maybe we can also add the following annotations:
// +kubebuilder:storageversion
// +kubebuilder:resource:categories=all
- // +kubebuilder:storageversion: marks the GVK that the API server should use to store data.
- // +kubebuilder:resource:categories=all: puts the CRD into the "all" category, so it shows up in kubectl get all.
Ref:
https://book.kubebuilder.io/reference/markers/crd
https://book.kubebuilder.io/reference/generating-crd#multiple-versions
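For illustration, here is a minimal sketch of where these markers would sit on the type. The struct below is a simplified, hypothetical stand-in (the real RayCronJob type embeds Kubernetes metadata and has a fuller spec); only the marker comments are taken from this thread.

```go
package main

import "fmt"

// RayCronJobSpec is a simplified, hypothetical stand-in for the real spec.
type RayCronJobSpec struct {
	// Schedule is a cron expression such as "*/1 * * * *".
	Schedule string `json:"schedule"`
	// Suspend stops new RayJobs from being created when true.
	Suspend bool `json:"suspend,omitempty"`
}

// +kubebuilder:storageversion
// +kubebuilder:resource:categories=all
// +kubebuilder:printcolumn:name="age",type="date",JSONPath=".metadata.creationTimestamp",priority=0
// +kubebuilder:printcolumn:name="suspend",type=boolean,JSONPath=".spec.suspend",priority=0
// +genclient
// RayCronJob is a simplified, hypothetical version of the CRD type. The
// markers above are plain comments to the Go compiler; code generators read them.
type RayCronJob struct {
	Name string
	Spec RayCronJobSpec
}

func main() {
	rcj := RayCronJob{
		Name: "demo",
		Spec: RayCronJobSpec{Schedule: "*/1 * * * *", Suspend: true},
	}
	fmt.Printf("%s suspended=%t\n", rcj.Name, rcj.Spec.Suspend)
}
```

The markers only take effect after the CRD manifests and clients are regenerated.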
Cool!
thanks for the suggestion~
| g.Expect(err).NotTo(HaveOccurred()) | ||
| ownerUID := rcj.UID | ||
|
|
||
| // No RayJob should be created |
nit:
| // No RayJob should be created | |
| // No RayJob should be created | |
| LogWithTimestamp(test.T(), "Waiting to ensure no RayJobs are created while suspended") |
| return n | ||
| }, 130*time.Second, 5*time.Second).Should(Equal(0)) | ||
|
|
||
| // Resume |
nit:
| // Resume | |
| LogWithTimestamp(test.T(), "Resuming RayCronJob %s/%s", rayCronJob.Namespace, rayCronJob.Name) |
| if err != nil { | ||
| return -1 | ||
| } | ||
| return n |
To preserve the error:
| if err != nil { | |
| return -1 | |
| } | |
| return n | |
| g.Expect(err).NotTo(HaveOccurred()) | |
| return n |
|
|
||
| // Spec.suspend should be false | ||
| g.Eventually(func() bool { | ||
| rcj, err := test.Client().Ray().RayV1().RayCronJobs(namespace.Name).Get(test.Ctx(), rayCronJob.Name, metav1.GetOptions{}) |
Maybe we can replace this with a util function similar to GetRayJob.
|
cursor review |
Future-Outlier
left a comment
cc @machichima @CheyuWu @justinyeh1995 @fscnick to take a look, thank you!
| func rayCronJobACTemplate(name, namespace, schedule string) *rayv1ac.RayCronJobApplyConfiguration { | ||
| return rayv1ac.RayCronJob(name, namespace). | ||
| WithSpec( | ||
| rayv1ac.RayCronJobSpec(). | ||
| WithSchedule(schedule). | ||
| WithJobTemplate( | ||
| rayv1ac.RayJobSpec(). | ||
| WithEntrypoint("sleep 1"). | ||
| WithRayClusterSpec( | ||
| rayv1ac.RayClusterSpec(). | ||
| WithHeadGroupSpec( | ||
| rayv1ac.HeadGroupSpec(). | ||
| WithTemplate( | ||
| corev1ac.PodTemplateSpec(). | ||
| WithSpec( | ||
| corev1ac.PodSpec(). | ||
| WithContainers( | ||
| corev1ac.Container(). | ||
| WithName("ray-head"). | ||
| WithImage(GetRayImage()). | ||
| WithResources( | ||
| corev1ac.ResourceRequirements(). | ||
| WithRequests(corev1.ResourceList{ | ||
| corev1.ResourceCPU: resource.MustParse("500m"), | ||
| corev1.ResourceMemory: resource.MustParse("500Mi"), | ||
| }), | ||
| ), | ||
| ), | ||
| ), | ||
| ), | ||
| ), | ||
| ), | ||
| ), | ||
| ) | ||
| } |
nit: for readability, would it make sense to construct it bottom-up using intermediate variables?
For example
func rayCronJobACTemplate(name, namespace, schedule string) *rayv1ac.RayCronJobApplyConfiguration {
	headContainer := corev1ac.Container().
		WithName("ray-head").
		WithImage(GetRayImage()).
		WithResources(
			corev1ac.ResourceRequirements().
				WithRequests(corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("500m"),
					corev1.ResourceMemory: resource.MustParse("500Mi"),
				}),
		)
	podTmpl := corev1ac.PodTemplateSpec().
		WithSpec(corev1ac.PodSpec().WithContainers(headContainer))
	cluster := ...
	job := ...
	spec := ...
	return rayv1ac.RayCronJob(name, namespace).WithSpec(spec)
}
I think we can simply use:
func rayCronJobACTemplate(name, namespace, schedule string) *rayv1ac.RayCronJobApplyConfiguration {
	return rayv1ac.RayCronJob(name, namespace).
		WithSpec(
			rayv1ac.RayCronJobSpec().
				WithSchedule(schedule).
				WithJobTemplate(
					rayv1ac.RayJobSpec().
						WithEntrypoint("sleep 1").
						WithRayClusterSpec(NewRayClusterSpec()),
				),
		)
}
That's a great suggestion, thanks!
thanks for the suggestion~
| return -1 | ||
| } | ||
| return n | ||
| }, 130*time.Second, 5*time.Second).Should(Equal(0)) |
nit:
| }, 130*time.Second, 5*time.Second).Should(Equal(0)) | |
| }, 130*time.Second, 5*time.Second).Should(BeZero()) |
| err = test.Client().Ray().RayV1().RayCronJobs(namespace.Name).Delete(test.Ctx(), rayCronJob.Name, metav1.DeleteOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| LogWithTimestamp(test.T(), "Deleted RayCronJob %s/%s successfully", rayCronJob.Namespace, rayCronJob.Name) | ||
| }) |
Should we assert that the related resource has been deleted correctly?
In this test Delete() is only cleanup; the assertions focus on suspend/resume affecting RayJob creation. The RayJob suspend e2e asserts cascade deletion because that’s part of its feature contract. Happy to add an Eventually(Get).Should(BeNotFound()) after Delete for explicitness if you prefer.
What do you think?
It would also make sure this test exits cleanly without leaving anything behind that could interfere with other tests.
However, some tests do it and some don't, so I guess it is not mandatory. Feel free to do it or not.
|
We also need to add it to https://github.com/ray-project/kuberay/blob/master/.buildkite/test-e2e.yml so that the CI can run it. Could you please add |
Future-Outlier
left a comment
cc @JiangJiaWei1103 @win5923 @fsNick @machichima to do a final pass
machichima
left a comment
LGTM! Just a small nit
…guration Signed-off-by: AndySung320 <andysung0320@gmail.com>
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: AndySung320 <71032763+AndySung320@users.noreply.github.com>
91df5ac to 9cc2e90
Future-Outlier
left a comment
I'm OK with the current implementation, but would it be better to use a label selector to filter RayJobs?
https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering
A label selector could work, but I chose ownerRef UID filtering for correctness.
But we will have a unique namespace, right? So this will still work.
I'm wondering whether we should optimize for the current test scenario (where a label selector is simpler) or consider potential future extensions.
Makes sense.
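To make the trade-off concrete, here is a small standard-library sketch of the two filtering approaches. The types are simplified, hypothetical stand-ins for metav1.OwnerReference and a RayJob, and the label key is invented for illustration. The scenario (two jobs sharing a label but owned by different RayCronJob UIDs) is an assumption about when the two filters could diverge; it is not taken from the thread.

```go
package main

import "fmt"

// OwnerRef and Job are simplified, hypothetical stand-ins for
// metav1.OwnerReference and a RayJob object.
type OwnerRef struct {
	Kind string
	UID  string
}

type Job struct {
	Name   string
	Labels map[string]string
	Owners []OwnerRef
}

// ownedBy keeps only the jobs that have an owner reference with the given UID.
func ownedBy(jobs []Job, uid string) []Job {
	var out []Job
	for _, j := range jobs {
		for _, o := range j.Owners {
			if o.UID == uid {
				out = append(out, j)
				break
			}
		}
	}
	return out
}

// withLabel keeps only the jobs whose labels contain key=value.
func withLabel(jobs []Job, key, value string) []Job {
	var out []Job
	for _, j := range jobs {
		if j.Labels[key] == value {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	const label = "example.com/cron-name" // hypothetical label key
	jobs := []Job{
		{Name: "job-old", Labels: map[string]string{label: "nightly"}, Owners: []OwnerRef{{Kind: "RayCronJob", UID: "uid-old"}}},
		{Name: "job-new", Labels: map[string]string{label: "nightly"}, Owners: []OwnerRef{{Kind: "RayCronJob", UID: "uid-new"}}},
	}
	fmt.Println(len(withLabel(jobs, label, "nightly"))) // prints 2: both jobs share the label
	fmt.Println(len(ownedBy(jobs, "uid-new")))          // prints 1: only one job has this owner UID
}
```

In a test that runs in its own namespace with a single RayCronJob, both filters would select the same jobs, which matches the point made above.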
85d3b26 to eff3830
Why are these changes needed?
This PR adds an e2e test covering RayCronJob suspend/resume behavior:
- When spec.suspend=true, the RayCronJob controller should not create any new RayJobs even after the scheduled time passes.
- When spec.suspend=false, the controller should resume scheduling and start creating RayJobs again.
To run this e2e test in CI, we also enable the RayCronJob feature gate in the test/CI-only operator overrides (Helm values override and kustomize test override).
Note: RayCronJob previously did not have // +genclient, so the typed client/applyconfiguration for RayCronJob was not generated. The PR adds +genclient and commits the resulting regenerated clientset/applyconfiguration/informer/lister code so the test can compile and run consistently.
Related issue number
Closes #4323
Checks