Fix: Move replica validation logic to right place.#4307
Fix: Move replica validation logic to right place.#4307rueian merged 3 commits intoray-project:masterfrom
Conversation
d1d7b57 to
eb654d7
Compare
|
Thanks for your effort. cc @Future-Outlier |
|
I have gone through #4116 but it was inactive and no changes were being made after the merging of volcano pr. Many changes are required since volcano pr's merging, that's why I opened up this PR. |
|
Gotcha. Let's wait for maintainers' reply. I'll help review, thank you. |
|
@Future-Outlier Just wanted to follow up - do you prefer closing #4116 and continuing the work in this PR? |
Hi, @kash2104 I just left a comment to #4116 and ask her if she have time to finish the work. |
529b021 to
bc5decd
Compare
Future-Outlier
left a comment
There was a problem hiding this comment.
cc @400Ping @justinyeh1995 @machichima to review
|
cursor review |
| if *workerGroup.MinReplicas > *workerGroup.MaxReplicas { | ||
| return fmt.Errorf("worker group %s has minReplicas %d greater than maxReplicas %d", workerGroup.GroupName, *workerGroup.MinReplicas, *workerGroup.MaxReplicas) | ||
| } | ||
| } |
There was a problem hiding this comment.
Should we also check if workerGroup.Replicas lies within the workerGroup.MinReplicas and workerGroup.MaxReplicas`?
There was a problem hiding this comment.
@machichima Here in thevalidation.go, we are just checking whether the values for min and max replicas aren't incorrect or impossible before pod creation happens but the actual logic of number of replicas is moved to util.go
So I think that here in validation.go, we won't be needing this check.
There was a problem hiding this comment.
Just to confirm: validation.go is strictly for ensuring the min and max replica configurations are valid, while the actual clipping logic resides in util.go. Therefore, we don't need to re-verify if the replicas fall within the [min, max] range here. Please let me know if I've misunderstood anything. Thanks!
| if nodeGroup.MinReplicas != nil { | ||
| minReplicas = *nodeGroup.MinReplicas | ||
| } |
There was a problem hiding this comment.
Maybe we can use ptr.Deref, which is used in other places as well
kuberay/ray-operator/controllers/ray/utils/util.go
Lines 830 to 831 in 270a143
ad6fd8a to
7d73dd1
Compare
| // Test 3: `WorkerGroupSpec.Replicas` is not nil but is more than maxReplicas. | ||
| replicas = int32(6) | ||
| workerGroupSpec.Replicas = &replicas | ||
| assert.Equal(t, GetWorkerGroupDesiredReplicas(ctx, workerGroupSpec), maxReplicas) | ||
|
|
||
| // Test 4: `WorkerGroupSpec.Replicas` is not nil but is less than minReplicas. | ||
| replicas = int32(0) | ||
| workerGroupSpec.Replicas = &replicas | ||
| assert.Equal(t, GetWorkerGroupDesiredReplicas(ctx, workerGroupSpec), minReplicas) |
There was a problem hiding this comment.
Should we keep those checks? I think this part is not moved to validation
There was a problem hiding this comment.
The part that we have moved to the validation keeps in check that the desired state of the replicas are within the correct bounds or not after the pods have been created. So these tests were checking that logic by using the GetWorkerGroupDesiredReplicas and we have moved this logic to validation.go and thus have added the tests for them in validation_test.go as well.
Thus these checks have removed from util_test.go.
There was a problem hiding this comment.
Please correct me if I'm wrong, I think in validation we only check if maxReplicas and minReplicas are set correctly (comment: #4307 (comment)). The test here is for checking if the replicas will be clamped to max or min replica, so I think we should include them?
There was a problem hiding this comment.
I agree on including them back.
| { | ||
| name: "replicas smaller than minReplicas when autoscaling disabled", | ||
| spec: func() rayv1.RayClusterSpec { | ||
| s := createSpec() | ||
| s.WorkerGroupSpecs = []rayv1.WorkerGroupSpec{ | ||
| { | ||
| GroupName: "worker-group-3", | ||
| Template: podTemplateSpec(nil, nil), | ||
| Replicas: ptr.To(int32(1)), | ||
| MinReplicas: ptr.To(int32(2)), | ||
| MaxReplicas: ptr.To(int32(5)), | ||
| }, | ||
| } | ||
| return s | ||
| }(), | ||
| expectError: false, | ||
| }, |
There was a problem hiding this comment.
I think this will not raise error because we did the clamp it to minReplicas in util.go? Could we please add comment or update the test name to briefly explain why this will not raise error? As usually we would expect error being raised in this case
There was a problem hiding this comment.
@machichima Should I add the comment for this test case or something else is recommended?
There was a problem hiding this comment.
Adding a comment here sounds good to me!
|
For the failed rayjob e2e test in CI, please try to merge master. The error should go away |
559f0be to
f4261cf
Compare
|
@machichima Have fixed the tests. |
|
cc @JiangJiaWei1103 to also take a look |
JiangJiaWei1103
left a comment
There was a problem hiding this comment.
LGTM, thanks for your effort!
| if *workerGroup.MinReplicas > *workerGroup.MaxReplicas { | ||
| return fmt.Errorf("worker group %s has minReplicas %d greater than maxReplicas %d", workerGroup.GroupName, *workerGroup.MinReplicas, *workerGroup.MaxReplicas) | ||
| } | ||
| } |
There was a problem hiding this comment.
Just to confirm: validation.go is strictly for ensuring the min and max replica configurations are valid, while the actual clipping logic resides in util.go. Therefore, we don't need to re-verify if the replicas fall within the [min, max] range here. Please let me know if I've misunderstood anything. Thanks!
|
@JiangJiaWei1103 You have understood it correctly. |
|
cc @Future-Outlier for final round |
Future-Outlier
left a comment
There was a problem hiding this comment.
Can we add these cases?
| Test Case | Condition | Expected Result |
|---|---|---|
| 1 | autoscaling=false, minReplicas=nil, maxReplicas=5 |
Error: must set both |
| 2 | autoscaling=false, minReplicas=1, maxReplicas=nil |
Error: must set both |
| 3 | autoscaling=false, minReplicas=nil, maxReplicas=nil |
Error: must set both |
| 4 | minReplicas=-1 |
Error: negative minReplicas |
| 5 | maxReplicas=-1 |
Error: negative maxReplicas |
1. Remove validation logic from GetWorkerGroupDesiredReplicas (utils.go) and add this logic to ValidateRayClusterSpec (validation.go). 2. Remove unnecessary tests from TestGetWorkerGroupDesiredReplicas. 3. Remove the unused ctx.
This is added since we moved the validation logic.
|
@Future-Outlier Have added the test cases. |
Future-Outlier
left a comment
There was a problem hiding this comment.
LGTM, cc @rueian @andrewsykim to merge.
Signed-off-by: Future-Outlier <[email protected]>
Why are these changes needed?
These changes are done to check the validation logic before rayCluster pod creation. It moves the replica validation logic as well as removes the redundant tests. Along with this, unit test are added since we moved the logic from
utils.gotovalidation.go.Related issue number
Closes #4101
Checks