RayJob Volcano Integration #3972
Pull Request Overview
This PR adds Volcano scheduler support for RayJob CRD, enabling gang scheduling to ensure Ray pods and submitter pods are scheduled together as a unit. This prevents partial scheduling issues where only some pods of a RayJob get scheduled.
- Extends the existing Volcano batch scheduler to support RayJob objects in addition to RayCluster
- Implements PodGroup creation for RayJob resources with proper resource calculation including submitter pod resources
- Adds comprehensive test coverage for RayJob Volcano integration with different submission modes
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ray-operator/controllers/ray/utils/util.go | Exports SumResourceList function for use in Volcano scheduler |
| ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler.go | Adds RayJob support to Volcano scheduler with gang scheduling logic |
| ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler_test.go | Adds comprehensive test coverage for RayJob Volcano integration |
| ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml | Provides sample configuration for testing RayJob with Volcano scheduler |
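For orientation, the exported helper in the first row presumably folds a slice of ResourceLists into a single total. A minimal sketch of that idea (the name SumResourceList comes from the table above; the body is illustrative, not the actual util.go code):

package utils

import (
	corev1 "k8s.io/api/core/v1"
)

// SumResourceList (sketch): fold a slice of ResourceLists into a single total.
// The exported name comes from the review table above; this body is illustrative only.
func SumResourceList(lists []corev1.ResourceList) corev1.ResourceList {
	total := corev1.ResourceList{}
	for _, list := range lists {
		for name, quantity := range list {
			if sum, ok := total[name]; ok {
				// resource.Quantity.Add mutates the copy taken out of the map,
				// so write it back afterwards.
				sum.Add(quantity)
				total[name] = sum
			} else {
				total[name] = quantity.DeepCopy()
			}
		}
	}
	return total
}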
// MinMember intentionally excludes the submitter pod to avoid a startup deadlock
// (submitter waits for cluster; gang would wait for submitter). We still add the
// submitter's resource requests into MinResources so capacity is reserved.
Copilot AI (Oct 7, 2025):
The comment explains the design decision well, but could be clearer about what 'gang would wait for submitter' means. Consider expanding to explain that the gang scheduler would wait for all pods including the submitter to be schedulable before scheduling any, creating a circular dependency.
Suggested change:
// MinMember intentionally excludes the submitter pod to avoid a startup deadlock.
// If the submitter pod were included in MinMember, the gang scheduler would wait for
// all pods—including the submitter—to be schedulable before scheduling any of them.
// This creates a circular dependency: the submitter pod waits for the cluster to be ready,
// but the cluster cannot be scheduled until the submitter is also schedulable. To avoid this,
// we exclude the submitter from MinMember, but still add its resource requests into MinResources
// so that capacity is reserved for it.
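To make the MinMember/MinResources split concrete, here is a minimal sketch of how the PodGroup parameters could be derived for a RayJob. Function and helper names (calculatePodGroupParams, CalculatePodResource, GetSubmitterTemplate, SumResourceList) follow the PR discussion and the flowchart below; the exact signatures are assumptions, not the PR's code.

package volcano

import (
	corev1 "k8s.io/api/core/v1"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"github.com/ray-project/kuberay/ray-operator/controllers/ray/common"
	"github.com/ray-project/kuberay/ray-operator/controllers/ray/utils"
)

// calculatePodGroupParams (sketch): MinMember counts head + worker pods only, while
// MinResources also reserves the submitter pod's requests, so the gang can be admitted
// without waiting on the submitter. Not the PR's actual implementation.
func calculatePodGroupParams(rayJob *rayv1.RayJob) (int32, corev1.ResourceList) {
	clusterSpec := rayJob.Spec.RayClusterSpec

	// Head pod.
	minMember := int32(1)
	resources := []corev1.ResourceList{utils.CalculatePodResource(clusterSpec.HeadGroupSpec.Template.Spec)}

	// Worker pods.
	for _, workerGroup := range clusterSpec.WorkerGroupSpecs {
		replicas := int32(1)
		if workerGroup.Replicas != nil {
			replicas = *workerGroup.Replicas
		}
		minMember += replicas
		for i := int32(0); i < replicas; i++ {
			resources = append(resources, utils.CalculatePodResource(workerGroup.Template.Spec))
		}
	}

	// Submitter pod: excluded from MinMember (it waits for the cluster), but its
	// requests still count toward MinResources so capacity is reserved for it.
	if rayJob.Spec.SubmissionMode == rayv1.K8sJobMode {
		submitterTemplate := common.GetSubmitterTemplate(&rayJob.Spec, rayJob.Spec.RayClusterSpec)
		resources = append(resources, utils.CalculatePodResource(submitterTemplate.Spec))
	}

	return minMember, utils.SumResourceList(resources)
}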
podGroup := volcanoschedulingv1beta1.PodGroup{}
if err := v.cli.Get(ctx, types.NamespacedName{Namespace: owner.GetNamespace(), Name: podGroupName}, &podGroup); err != nil {
	if !errors.IsNotFound(err) {
		logger.Error(err, "failed to get PodGroup", "podGroupName", podGroupName, "ownerKind", utils.GetCRDType(owner.GetLabels()[utils.RayOriginatedFromCRDLabelKey]), "ownerName", owner.GetName(), "ownerNamespace", owner.GetNamespace())
Copilot AI (Oct 7, 2025):
Potential nil pointer dereference if owner.GetLabels() returns nil. The code should check if labels exist before accessing the map.
	}
-	logger.Error(err, "Pod group CREATE error!", "PodGroup.Error", err)
+	logger.Error(err, "failed to create PodGroup", "name", podGroupName, "ownerKind", utils.GetCRDType(owner.GetLabels()[utils.RayOriginatedFromCRDLabelKey]), "ownerName", owner.GetName(), "ownerNamespace", owner.GetNamespace())
Copilot AI (Oct 7, 2025):
Potential nil pointer dereference if owner.GetLabels() returns nil. The code should check if labels exist before accessing the map.
	podGroup.Spec.MinResources = &totalResource
	if err := v.cli.Update(ctx, &podGroup); err != nil {
-		logger.Error(err, "Pod group UPDATE error!", "podGroup", podGroupName)
+		logger.Error(err, "failed to update PodGroup", "name", podGroupName, "ownerKind", utils.GetCRDType(owner.GetLabels()[utils.RayOriginatedFromCRDLabelKey]), "ownerName", owner.GetName(), "ownerNamespace", owner.GetNamespace())
Copilot AI (Oct 7, 2025):
Potential nil pointer dereference if owner.GetLabels() returns nil. The code should check if labels exist before accessing the map.
// MinMember intentionally excludes the submitter pod to avoid a startup deadlock
// (submitter waits for cluster; gang would wait for submitter). We still add the
// submitter's resource requests into MinResources so capacity is reserved.
if rayJob.Spec.SubmissionMode == rayv1.K8sJobMode {
Do we need to do this for SidecarMode?
I think in SidecarMode, when calculating the head pod's resources, we call the function CalculatePodResource, and that function iterates over all containers in the head pod spec, like this: for _, container := range podSpec.Containers (sketched below).
So I think sidecar mode will work, but if we can get a screenshot of a test for sidecar mode I would appreciate it.
cc @win5923
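For reference, that iteration is roughly the following shape (a sketch of the idea only, reusing the SumResourceList sketch above; not the actual util.go implementation), which is also why a submitter sidecar injected into the head pod's container list is counted automatically:

// Sketch of the idea behind utils.CalculatePodResource: gather every container's
// resource requests from the pod spec and sum them. A submitter sidecar appended to
// podSpec.Containers is therefore included automatically. Not the actual util.go code.
func calculatePodResource(podSpec corev1.PodSpec) corev1.ResourceList {
	requests := make([]corev1.ResourceList, 0, len(podSpec.Containers))
	for _, container := range podSpec.Containers {
		requests = append(requests, container.Resources.Requests)
	}
	return SumResourceList(requests) // summing helper as sketched earlier
}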
Wait, @rueian is right.
When using sidecar mode, the MinResources CPU should be 3.5, but this is not correct here.
Looks nice!
But I think we have to:
- calculate the correct resources in sidecar mode
- test the gang scheduling behavior before this PR gets merged
- add the gang scheduling label in the example, and test K8s job mode, HTTP mode, and sidecar mode
flowchart TD
%% User creates RayJob
A[User creates RayJob] --> B{Check RayJob Labels}
B --> C["ray.io/gang-scheduling-enabled = \"true\""]
B --> D["volcano.sh/queue-name = \"default\""]
B --> E["ray.io/priority-class-name = \"high\""]
%% RayJob Controller processes RayJob
C --> F[RayJob Controller]
D --> F
E --> F
F --> G[constructRayClusterForRayJob]
G --> H[Copy RayJob Labels to RayCluster]
G --> I[Copy RayJob Annotations to RayCluster]
%% Batch Scheduler Manager intervention
H --> J{BatchSchedulerManager Check}
I --> J
J --> K[VolcanoBatchScheduler.DoBatchSchedulingOnSubmission]
%% Volcano scheduler handles RayJob
K --> L[handleRayJob]
L --> M[calculatePodGroupParams]
M --> N[Calculate MinMember and MinResources]
N --> O[MinMember = Head + Workers]
N --> P[MinResources = Head + Workers + Submitter]
%% Create PodGroup
O --> Q[syncPodGroup]
P --> Q
Q --> R[Create PodGroup CRD]
R --> S["PodGroup Name: ray-jobname-pg"]
R --> T["MinMember: Head + Workers"]
R --> U["MinResources: Total Resources"]
R --> V["Queue: volcano.sh/queue-name"]
R --> W["PriorityClassName: ray.io/priority-class-name"]
%% RayCluster creation
S --> X[RayCluster Created]
T --> X
U --> X
V --> X
W --> X
%% RayCluster Controller handles Pods
X --> Y[RayCluster Controller]
Y --> Z[buildHeadPod]
Y --> AA[buildWorkerPod]
%% Add Volcano metadata to each Pod
Z --> BB[VolcanoBatchScheduler.AddMetadataToPod]
AA --> BB
BB --> CC[Set Pod Annotations]
CC --> DD["scheduling.volcano.sh/group-name = ray-jobname-pg"]
CC --> EE["batch.volcano.sh/task-spec = groupName"]
BB --> FF[Set Pod Labels]
FF --> GG["volcano.sh/queue-name = \"default\""]
BB --> HH[Set Pod Spec]
HH --> II["schedulerName = \"volcano\""]
HH --> JJ["priorityClassName = \"high\""]
%% RayJob Controller creates Kubernetes Job
F --> KK[Create Kubernetes Job]
KK --> LL[Kubernetes Job Controller]
LL --> MM[Create Job Pod]
%% Add Volcano metadata to Job Pod
MM --> NN[VolcanoBatchScheduler.AddMetadataToPod]
NN --> OO[Set Job Pod Annotations]
OO --> PP["scheduling.volcano.sh/group-name = ray-jobname-pg"]
OO --> QQ["batch.volcano.sh/task-spec = \"submitter\""]
NN --> RR[Set Job Pod Labels]
RR --> SS["volcano.sh/queue-name = \"default\""]
NN --> TT[Set Job Pod Spec]
TT --> UU["schedulerName = \"volcano\""]
TT --> VV["priorityClassName = \"high\""]
%% Pod scheduling
DD --> WW[All Pods submitted to Volcano]
EE --> WW
GG --> WW
II --> WW
JJ --> WW
PP --> WW
QQ --> WW
SS --> WW
UU --> WW
VV --> WW
WW --> XX[Volcano Gang Scheduler]
XX --> YY[Check PodGroup status]
YY --> ZZ[Wait for PodGroup resources]
ZZ --> AAA[Check MinMember and MinResources]
AAA --> BBB[Schedule all Pods simultaneously]
%% Final state
BBB --> CCC[Ray Cluster Running]
CCC --> DDD[Job Pod Executing]
DDD --> EEE[Submit Ray Job to Ray Cluster]
EEE --> FFF[Execute Ray Job]
%% Style definitions
classDef userAction fill:#e1f5fe
classDef controller fill:#f3e5f5
classDef scheduler fill:#e8f5e8
classDef pod fill:#fff3e0
classDef volcano fill:#ffebee
classDef podgroup fill:#f0f8ff
classDef job fill:#f0f8ff
class A userAction
class F,G,Y,LL controller
class K,L,M,N,Q,XX scheduler
class Z,AA,MM pod
class BB,CC,DD,EE,FF,GG,HH,II,JJ,NN,OO,PP,QQ,RR,SS,TT,UU,VV volcano
class R,S,T,U,V,W podgroup
class KK,DDD,EEE job
labels:
  ray.io/scheduler-name: volcano
  volcano.sh/queue-name: kuberay-test-queue
Hi, @win5923
Can we add ray.io/gang-scheduling-enabled: "true" in the example and test it?
Currently, adding ray.io/gang-scheduling-enabled: "true" does not have any effect. This only works with YuniKorn or the Scheduler plugin.
kuberay/ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Lines 120 to 123 in c6bafa3
func (y *YuniKornScheduler) isGangSchedulingEnabled(obj metav1.Object) bool {
	_, exist := obj.GetLabels()[utils.RayGangSchedulingEnabled]
	return exist
}
kuberay/ray-operator/controllers/ray/batchscheduler/scheduler-plugins/scheduler_plugins.go
Lines 109 to 112 in c6bafa3
func (k *KubeScheduler) isGangSchedulingEnabled(app *rayv1.RayCluster) bool {
	_, exist := app.Labels[utils.RayGangSchedulingEnabled]
	return exist
}
And I think this is a breaking change if we add this check. We should also update the doc to mention that starting from version 1.5.0, users need to add ray.io/gang-scheduling-enabled: "true" to enable gang scheduling for Volcano. https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html
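For illustration, the equivalent opt-in check on the Volcano side would presumably be a one-liner like the following (a sketch only; whether to add it, and the breaking change it implies, is exactly what is being discussed here):

// Sketch only: mirrors the YuniKorn / scheduler-plugins pattern, reusing the
// existing utils.RayGangSchedulingEnabled label key. Assumed to live alongside
// the other VolcanoBatchScheduler methods in volcano_scheduler.go; not part of this PR.
func (v *VolcanoBatchScheduler) isGangSchedulingEnabled(obj metav1.Object) bool {
	_, exist := obj.GetLabels()[utils.RayGangSchedulingEnabled]
	return exist
}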
I found that Volcano's default scheduler ConfigMap has gang scheduling enabled!
So in the future, if users want to disable it, we may need to tell them to edit the ConfigMap, or figure out some way to control it by adding more information to our CR.
Thank you!!
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
Future-Outlier left a comment:
will request changes until all comments are resolved
if rayJob.Spec.SubmissionMode == rayv1.K8sJobMode || rayJob.Spec.SubmissionMode == rayv1.SidecarMode {
	submitterTemplate := common.GetSubmitterTemplate(&rayJob.Spec, rayJob.Spec.RayClusterSpec)
	submitterResource := utils.CalculatePodResource(submitterTemplate.Spec)
	totalResourceList = append(totalResourceList, submitterResource)
}
I think the result of this code is correct, but the behavior is not.
For K8s job mode, we should get the submitter's information from submitterTemplate.
For sidecar mode, we should get the submitter's information from GetDefaultSubmitterContainer, since that is the function we currently use.
Update: I've discussed this offline with @win5923, and I am writing a commit to fix this!
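In code, that direction is roughly the following (a sketch; the signature and return type of GetDefaultSubmitterContainer are assumptions, and the follow-up commit is the source of truth):

// Sketch of the mode-dependent handling discussed above; not the final commit.
// Assumes GetDefaultSubmitterContainer returns the corev1.Container injected in SidecarMode.
func submitterResources(rayJob *rayv1.RayJob) corev1.ResourceList {
	switch rayJob.Spec.SubmissionMode {
	case rayv1.K8sJobMode:
		// Submitter runs as a separate Kubernetes Job pod: use its pod template.
		submitterTemplate := common.GetSubmitterTemplate(&rayJob.Spec, rayJob.Spec.RayClusterSpec)
		return utils.CalculatePodResource(submitterTemplate.Spec)
	case rayv1.SidecarMode:
		// Submitter runs as a sidecar container inside the head pod: use the
		// default submitter container's requests directly.
		submitterContainer := common.GetDefaultSubmitterContainer(rayJob.Spec.RayClusterSpec)
		return submitterContainer.Resources.Requests
	default:
		return corev1.ResourceList{}
	}
}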
YuniKorn also uses GetSubmitterTemplate for sidecar mode. Would you like to update the YuniKorn part as well (newTaskGroupsFromRayJobSpec)?
yes I can do it now
Hi, @rueian
I have tested the YuniKorn integration in both Kubernetes job mode and sidecar mode and can confirm that no code changes are required. The existing logic correctly handles both scenarios.
Kubernetes Job Mode
The function newTaskGroupsFromRayJobSpec is ultimately called by AddMetadataToChildResource. Within the RayJob controller, AddMetadataToChildResource is only invoked when the RayJob is configured for Kubernetes job mode, as seen in these two locations:
1st place:
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 582 to 606 in c6bafa3
func (r *RayJobReconciler) createK8sJobIfNeed(ctx context.Context, rayJobInstance *rayv1.RayJob, rayClusterInstance *rayv1.RayCluster) error {
	logger := ctrl.LoggerFrom(ctx)
	job := &batchv1.Job{}
	namespacedName := common.RayJobK8sJobNamespacedName(rayJobInstance)
	if err := r.Client.Get(ctx, namespacedName, job); err != nil {
		if errors.IsNotFound(err) {
			submitterTemplate, err := getSubmitterTemplate(rayJobInstance, rayClusterInstance)
			if err != nil {
				return err
			}
			if r.options.BatchSchedulerManager != nil {
				if scheduler, err := r.options.BatchSchedulerManager.GetScheduler(); err == nil {
					scheduler.AddMetadataToChildResource(ctx, rayJobInstance, &submitterTemplate, utils.RayNodeSubmitterGroupLabelValue)
				} else {
					return err
				}
			}
			return r.createNewK8sJob(ctx, rayJobInstance, submitterTemplate)
		}
		return err
	}
	logger.Info("The submitter Kubernetes Job for RayJob already exists", "Kubernetes Job", job.Name)
	return nil
}
2nd place:
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 949 to 956 in c6bafa3
if r.options.BatchSchedulerManager != nil && rayJobInstance.Spec.SubmissionMode == rayv1.K8sJobMode {
	if scheduler, err := r.options.BatchSchedulerManager.GetScheduler(); err == nil {
		// Group name is only used for individual pods to specify their task group ("headgroup", "worker-group-1", etc.).
		// RayCluster contains multiple groups, so we pass an empty string.
		scheduler.AddMetadataToChildResource(ctx, rayJobInstance, rayClusterInstance, "")
	} else {
		return nil, err
	}
Because of this, the YuniKorn-specific logic is correctly applied only when the RayJob creates a Kubernetes Job, and it behaves as expected.
Sidecar Mode
In sidecar mode, the submitter container is added to the Ray head pod, which is part of the RayCluster specification. When the RayCluster controller reconciles the RayCluster custom resource, it calculates the task groups for the head and worker pods. At that point, the head pod correctly contains both the Ray head container and the submitter sidecar container, ensuring their resources are accounted for in the task group calculation, as handled by the logic here:
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 978 to 1016 in c6bafa3
func (r *RayJobReconciler) constructRayClusterForRayJob(rayJobInstance *rayv1.RayJob, rayClusterName string) (*rayv1.RayCluster, error) {
	labels := make(map[string]string, len(rayJobInstance.Labels))
	for key, value := range rayJobInstance.Labels {
		labels[key] = value
	}
	labels[utils.RayOriginatedFromCRNameLabelKey] = rayJobInstance.Name
	labels[utils.RayOriginatedFromCRDLabelKey] = utils.RayOriginatedFromCRDLabelValue(utils.RayJobCRD)
	rayCluster := &rayv1.RayCluster{
		ObjectMeta: metav1.ObjectMeta{
			Labels:      labels,
			Annotations: rayJobInstance.Annotations,
			Name:        rayClusterName,
			Namespace:   rayJobInstance.Namespace,
		},
		Spec: *rayJobInstance.Spec.RayClusterSpec.DeepCopy(),
	}
	// Set the ownership in order to do the garbage collection by k8s.
	if err := ctrl.SetControllerReference(rayJobInstance, rayCluster, r.Scheme); err != nil {
		return nil, err
	}
	// Inject a submitter container into the head Pod in SidecarMode.
	if rayJobInstance.Spec.SubmissionMode == rayv1.SidecarMode {
		sidecar, err := getSubmitterContainer(rayJobInstance, rayCluster)
		if err != nil {
			return nil, err
		}
		rayCluster.Spec.HeadGroupSpec.Template.Spec.Containers = append(
			rayCluster.Spec.HeadGroupSpec.Template.Spec.Containers, sidecar)
		// In K8sJobMode, the submitter Job relies on the K8s Job backoffLimit API to restart if it fails.
		// This mainly handles WebSocket connection failures caused by transient network issues.
		// In SidecarMode, however, the submitter container shares the same network namespace as the Ray dashboard,
		// so restarts are no longer needed.
		rayCluster.Spec.HeadGroupSpec.Template.Spec.RestartPolicy = corev1.RestartPolicyNever
	}
	return rayCluster, nil
}
corev1.ResourceCPU:    submitterContainer.Resources.Requests[corev1.ResourceCPU],
corev1.ResourceMemory: submitterContainer.Resources.Requests[corev1.ResourceMemory],
Can we take all the resource types into account here? We'd better not assume there are only CPU and memory.
just fix it here, thank you!
cf6b48b
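The fix amounts to copying the submitter container's entire Requests map instead of picking out CPU and memory, roughly as follows (a sketch, not the exact diff in cf6b48b):

// Sketch: carry over every requested resource type (GPUs, ephemeral storage,
// extended resources, and so on) rather than hard-coding CPU and memory.
submitterResource := corev1.ResourceList{}
for name, quantity := range submitterContainer.Resources.Requests {
	submitterResource[name] = quantity.DeepCopy()
}
totalResourceList = append(totalResourceList, submitterResource)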
Why are these changes needed?
- RayJob Volcano support: Adds Volcano scheduler support for the RayJob CRD.
- Gang scheduling: Ensures Ray pods and the submitter pod are scheduled together as a unit, preventing partial scheduling issues.
- E2E testing with Volcano: PodGroup, Queue, and testing RayJob HTTPMode.
Related issue number
Closes #1580
Checks