
Conversation

@win5923
Collaborator

@win5923 win5923 commented Aug 18, 2025

Why are these changes needed?

  • RayJob Volcano support: Adds Volcano scheduler support for the RayJob CRD.
  • Gang scheduling: Ensures the Ray pods and the submitter pod are scheduled together as a unit, preventing partial scheduling issues (a sketch of the intended PodGroup sizing follows below).
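
For context, a minimal sketch of the intended PodGroup sizing for a RayJob. Helper names and signatures (utils.CalculatePodResource, utils.SumResourceList, common.GetSubmitterTemplate) are taken from the diff context; the function below is illustrative, not the PR's exact code, and assumes the kuberay packages are imported as in volcano_scheduler.go:

// Sketch: MinMember counts only the head and worker pods, while MinResources also
// reserves capacity for the submitter pod in K8sJobMode.
// (SidecarMode handling is discussed later in this review thread.)
func podGroupParamsForRayJob(rayJob *rayv1.RayJob) (int32, corev1.ResourceList) {
    minMember := int32(1) // head pod
    resources := []corev1.ResourceList{
        utils.CalculatePodResource(rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec),
    }
    for _, wg := range rayJob.Spec.RayClusterSpec.WorkerGroupSpecs {
        replicas := int32(1)
        if wg.Replicas != nil {
            replicas = *wg.Replicas
        }
        minMember += replicas
        for i := int32(0); i < replicas; i++ {
            resources = append(resources, utils.CalculatePodResource(wg.Template.Spec))
        }
    }
    // The submitter pod is excluded from MinMember (to avoid a startup deadlock)
    // but its resource requests are still added to MinResources.
    if rayJob.Spec.SubmissionMode == rayv1.K8sJobMode {
        submitter := common.GetSubmitterTemplate(&rayJob.Spec, rayJob.Spec.RayClusterSpec)
        resources = append(resources, utils.CalculatePodResource(submitter.Spec))
    }
    return minMember, utils.SumResourceList(resources)
}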

E2E

  1. Deploy the KubeRay operator with the Volcano batch scheduler:
./ray-operator/bin/manager -leader-election-namespace default -use-kubernetes-proxy -batch-scheduler=volcano
  2. Create a RayJob with a head node (1 CPU + 2Gi of RAM), two workers (1 CPU + 1Gi of RAM each), and one submitter pod (0.5 CPU + 200Mi of RAM), for a total of 3500m CPU and 4296Mi of RAM:
kubectl apply -f ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml
  3. Add an additional RayJob with the same configuration but with a different name:
sed 's/rayjob-sample-0/rayjob-sample-1/' ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml | kubectl apply -f-
  4. All pods of the new RayJob remain stuck in Pending; the sketch below shows one way to verify this.
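
As an optional cross-check of step 4 (not part of this PR; the namespace is an assumption), a small client-go sketch that prints each pod's phase in the default namespace; the second RayJob's pods should all report Pending:

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (assumes ~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    pending := 0
    for _, p := range pods.Items {
        if p.Status.Phase == corev1.PodPending {
            pending++
        }
        fmt.Printf("%-60s %s\n", p.Name, p.Status.Phase)
    }
    fmt.Printf("%d/%d pods Pending\n", pending, len(pods.Items))
}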

PodGroup

  • ray-rayjob-sample-0-pg:
$ k get podgroup ray-rayjob-sample-0-pg  -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:16:14Z"
  generation: 3
  name: ray-rayjob-sample-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-0
    uid: e7652cc7-7593-4bd1-8ab1-bc043e62d7e5
  resourceVersion: "8779"
  uid: 84247ace-fcb5-4bce-9e18-b33e3769b941
spec:
  minMember: 3
  minResources:
    cpu: 3500m
    memory: 4296Mi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:16:15Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 6ccaf1db-e4f6-4cfa-ad71-f3abf039e03c
    type: Scheduled
  phase: Running
  running: 1
  • ray-rayjob-sample-1-pg:
$ k get podgroup ray-rayjob-sample-1-pg  -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:17:54Z"
  generation: 2
  name: ray-rayjob-sample-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-1
    uid: 3a98a4fe-19a5-4f36-9ba3-ebd252c5a267
  resourceVersion: "9080"
  uid: 0dde7617-6fa7-4867-97f6-2deb965170a1
spec:
  minMember: 3
  minResources:
    cpu: 3500m
    memory: 4296Mi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:17:55Z"
    message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
      3 minAvailable; Pending: 3 Unschedulable'
    reason: NotEnoughResources
    status: "True"
    transitionID: cfb01bbc-c53b-42e4-9b02-8e56c46b8e6c
    type: Unschedulable
  phase: Pending

Queue

$ k get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2025-09-25T15:17:54Z"
  generation: 2
  name: kuberay-test-queue
  resourceVersion: "9089"
  uid: 2690f4ca-aa29-4812-aa21-3d0228dfa271
spec:
  capability:
    cpu: 4
    memory: 6Gi
  parent: root
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "3"
    memory: 4Gi
    pods: "3"
  reservation: {}
  state: Open

Testing RayJob HTTPMode

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-2
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  submissionMode: HTTPMode
$ k get podgroup ray-rayjob-sample-2-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-25T15:20:15Z"
  generation: 2
  name: ray-rayjob-sample-2-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-2
    uid: cc971b5a-347f-4e12-bc24-e6a65710f8d8
  resourceVersion: "9342"
  uid: e87b6399-74fa-4b42-9d88-3413c5b84865
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-25T15:20:16Z"
    message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
      3 minAvailable; Pending: 3 Unschedulable'
    reason: NotEnoughResources
    status: "True"
    transitionID: c5c9e5c2-1bd4-4b0e-b405-78331ea6caf1
    type: Unschedulable
  phase: Pending

Related issue number

Closes #1580

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 changed the title [POC] RayJob Volcano Integration RayJob Volcano Integration Aug 18, 2025
@win5923 win5923 force-pushed the rayjob-volcano branch 8 times, most recently from bc3811c to d591688 on September 11, 2025 16:10
@win5923 win5923 marked this pull request as ready for review September 11, 2025 16:23
@win5923 win5923 marked this pull request as draft September 22, 2025 13:59
@win5923 win5923 force-pushed the rayjob-volcano branch 7 times, most recently from 26af624 to ace94b2 on September 23, 2025 15:53
@win5923 win5923 marked this pull request as ready for review September 23, 2025 15:54
@win5923 win5923 force-pushed the rayjob-volcano branch 3 times, most recently from c10c53e to 9cd200c on September 23, 2025 16:37
Signed-off-by: win5923 <[email protected]>
@win5923 win5923 requested a review from troychiu October 4, 2025 17:46
@troychiu
Collaborator

troychiu commented Oct 5, 2025

cc @Future-Outlier @rueian

Signed-off-by: win5923 <[email protected]>
@rueian rueian requested a review from Copilot October 7, 2025 20:43
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds Volcano scheduler support for RayJob CRD, enabling gang scheduling to ensure Ray pods and submitter pods are scheduled together as a unit. This prevents partial scheduling issues where only some pods of a RayJob get scheduled.

  • Extends the existing Volcano batch scheduler to support RayJob objects in addition to RayCluster
  • Implements PodGroup creation for RayJob resources with proper resource calculation including submitter pod resources
  • Adds comprehensive test coverage for RayJob Volcano integration with different submission modes

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

  • ray-operator/controllers/ray/utils/util.go: Exports SumResourceList function for use in Volcano scheduler
  • ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler.go: Adds RayJob support to Volcano scheduler with gang scheduling logic
  • ray-operator/controllers/ray/batchscheduler/volcano/volcano_scheduler_test.go: Adds comprehensive test coverage for RayJob Volcano integration
  • ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml: Provides sample configuration for testing RayJob with Volcano scheduler


Comment on lines +78 to +80
// MinMember intentionally excludes the submitter pod to avoid a startup deadlock
// (submitter waits for cluster; gang would wait for submitter). We still add the
// submitter's resource requests into MinResources so capacity is reserved.

Copilot AI Oct 7, 2025


The comment explains the design decision well, but could be clearer about what 'gang would wait for submitter' means. Consider expanding to explain that the gang scheduler would wait for all pods including the submitter to be schedulable before scheduling any, creating a circular dependency.

Suggested change
// MinMember intentionally excludes the submitter pod to avoid a startup deadlock
// (submitter waits for cluster; gang would wait for submitter). We still add the
// submitter's resource requests into MinResources so capacity is reserved.
// MinMember intentionally excludes the submitter pod to avoid a startup deadlock.
// If the submitter pod were included in MinMember, the gang scheduler would wait for
// all pods—including the submitter—to be schedulable before scheduling any of them.
// This creates a circular dependency: the submitter pod waits for the cluster to be ready,
// but the cluster cannot be scheduled until the submitter is also schedulable. To avoid this,
// we exclude the submitter from MinMember, but still add its resource requests into MinResources
// so that capacity is reserved for it.

podGroup := volcanoschedulingv1beta1.PodGroup{}
if err := v.cli.Get(ctx, types.NamespacedName{Namespace: owner.GetNamespace(), Name: podGroupName}, &podGroup); err != nil {
if !errors.IsNotFound(err) {
logger.Error(err, "failed to get PodGroup", "podGroupName", podGroupName, "ownerKind", utils.GetCRDType(owner.GetLabels()[utils.RayOriginatedFromCRDLabelKey]), "ownerName", owner.GetName(), "ownerNamespace", owner.GetNamespace())

Copilot AI Oct 7, 2025


Potential nil pointer dereference if owner.GetLabels() returns nil. The code should check if labels exist before accessing the map.

}

logger.Error(err, "Pod group CREATE error!", "PodGroup.Error", err)
logger.Error(err, "failed to create PodGroup", "name", podGroupName, "ownerKind", utils.GetCRDType(owner.GetLabels()[utils.RayOriginatedFromCRDLabelKey]), "ownerName", owner.GetName(), "ownerNamespace", owner.GetNamespace())

Copilot AI Oct 7, 2025


Potential nil pointer dereference if owner.GetLabels() returns nil. The code should check if labels exist before accessing the map.

podGroup.Spec.MinResources = &totalResource
if err := v.cli.Update(ctx, &podGroup); err != nil {
logger.Error(err, "Pod group UPDATE error!", "podGroup", podGroupName)
logger.Error(err, "failed to update PodGroup", "name", podGroupName, "ownerKind", utils.GetCRDType(owner.GetLabels()[utils.RayOriginatedFromCRDLabelKey]), "ownerName", owner.GetName(), "ownerNamespace", owner.GetNamespace())

Copilot AI Oct 7, 2025


Potential nil pointer dereference if owner.GetLabels() returns nil. The code should check if labels exist before accessing the map.

// MinMember intentionally excludes the submitter pod to avoid a startup deadlock
// (submitter waits for cluster; gang would wait for submitter). We still add the
// submitter's resource requests into MinResources so capacity is reserved.
if rayJob.Spec.SubmissionMode == rayv1.K8sJobMode {
Collaborator


Do we need to do this for SidecarMode?

Member

@Future-Outlier Future-Outlier Oct 8, 2025


I think in SidecarMode, when calculating the head pod's resource, we will call the function CalculatePodResource, and this func will iterate over all containers in the head pod spec, like this: for _, container := range podSpec.Containers.
So I think sidecar mode will work, but if we can get a screenshot of a test for sidecar mode I would appreciate it.

cc @win5923
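
For reference, a simplified version of that per-container iteration (the real utils.CalculatePodResource may also consider limits or defaults; this sketch only sums requests):

// Simplified sketch of summing a pod's container requests. Because it walks every
// container in the PodSpec, a submitter sidecar injected into the head pod is
// counted automatically.
func calculatePodResource(podSpec corev1.PodSpec) corev1.ResourceList {
    total := corev1.ResourceList{}
    for _, container := range podSpec.Containers {
        for name, quantity := range container.Resources.Requests {
            if current, ok := total[name]; ok {
                current.Add(quantity)
                total[name] = current
            } else {
                total[name] = quantity.DeepCopy()
            }
        }
    }
    return total
}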

Member


Wait, @rueian is right.
When using sidecar mode, MinResources' CPU should be 3.5, but that is not what we get here.


Collaborator Author


Thanks @rueian!
I’ve already added the check for SidecarMode. ae2941f


Member

@Future-Outlier Future-Outlier left a comment


Looks nice!
But I think we have to:

  1. calculate the correct resource in sidecar mode
  2. test gang scheduling behavior before this PR gets merged
  3. add the gang scheduling label in the example, and test K8s job mode, HTTP mode, and sidecar mode
flowchart TD
    %% User creates RayJob
    A[User creates RayJob] --> B{Check RayJob Labels}
    B --> C["ray.io/gang-scheduling-enabled = \"true\""]
    B --> D["volcano.sh/queue-name = \"default\""]
    B --> E["ray.io/priority-class-name = \"high\""]
    
    %% RayJob Controller processes RayJob
    C --> F[RayJob Controller]
    D --> F
    E --> F
    
    F --> G[constructRayClusterForRayJob]
    G --> H[Copy RayJob Labels to RayCluster]
    G --> I[Copy RayJob Annotations to RayCluster]
    
    %% Batch Scheduler Manager intervention
    H --> J{BatchSchedulerManager Check}
    I --> J
    J --> K[VolcanoBatchScheduler.DoBatchSchedulingOnSubmission]
    
    %% Volcano scheduler handles RayJob
    K --> L[handleRayJob]
    L --> M[calculatePodGroupParams]
    M --> N[Calculate MinMember and MinResources]
    N --> O[MinMember = Head + Workers]
    N --> P[MinResources = Head + Workers + Submitter]
    
    %% Create PodGroup
    O --> Q[syncPodGroup]
    P --> Q
    Q --> R[Create PodGroup CRD]
    R --> S["PodGroup Name: ray-jobname-pg"]
    R --> T["MinMember: Head + Workers"]
    R --> U["MinResources: Total Resources"]
    R --> V["Queue: volcano.sh/queue-name"]
    R --> W["PriorityClassName: ray.io/priority-class-name"]
    
    %% RayCluster creation
    S --> X[RayCluster Created]
    T --> X
    U --> X
    V --> X
    W --> X
    
    %% RayCluster Controller handles Pods
    X --> Y[RayCluster Controller]
    Y --> Z[buildHeadPod]
    Y --> AA[buildWorkerPod]
    
    %% Add Volcano metadata to each Pod
    Z --> BB[VolcanoBatchScheduler.AddMetadataToPod]
    AA --> BB
    
    BB --> CC[Set Pod Annotations]
    CC --> DD["scheduling.volcano.sh/group-name = ray-jobname-pg"]
    CC --> EE["batch.volcano.sh/task-spec = groupName"]
    
    BB --> FF[Set Pod Labels]
    FF --> GG["volcano.sh/queue-name = \"default\""]
    
    BB --> HH[Set Pod Spec]
    HH --> II["schedulerName = \"volcano\""]
    HH --> JJ["priorityClassName = \"high\""]
    
    %% RayJob Controller creates Kubernetes Job
    F --> KK[Create Kubernetes Job]
    KK --> LL[Kubernetes Job Controller]
    LL --> MM[Create Job Pod]
    
    %% Add Volcano metadata to Job Pod
    MM --> NN[VolcanoBatchScheduler.AddMetadataToPod]
    NN --> OO[Set Job Pod Annotations]
    OO --> PP["scheduling.volcano.sh/group-name = ray-jobname-pg"]
    OO --> QQ["batch.volcano.sh/task-spec = \"submitter\""]
    
    NN --> RR[Set Job Pod Labels]
    RR --> SS["volcano.sh/queue-name = \"default\""]
    
    NN --> TT[Set Job Pod Spec]
    TT --> UU["schedulerName = \"volcano\""]
    TT --> VV["priorityClassName = \"high\""]
    
    %% Pod scheduling
    DD --> WW[All Pods submitted to Volcano]
    EE --> WW
    GG --> WW
    II --> WW
    JJ --> WW
    PP --> WW
    QQ --> WW
    SS --> WW
    UU --> WW
    VV --> WW
    
    WW --> XX[Volcano Gang Scheduler]
    XX --> YY[Check PodGroup status]
    YY --> ZZ[Wait for PodGroup resources]
    ZZ --> AAA[Check MinMember and MinResources]
    AAA --> BBB[Schedule all Pods simultaneously]
    
    %% Final state
    BBB --> CCC[Ray Cluster Running]
    CCC --> DDD[Job Pod Executing]
    DDD --> EEE[Submit Ray Job to Ray Cluster]
    EEE --> FFF[Execute Ray Job]
    
    %% Style definitions
    classDef userAction fill:#e1f5fe
    classDef controller fill:#f3e5f5
    classDef scheduler fill:#e8f5e8
    classDef pod fill:#fff3e0
    classDef volcano fill:#ffebee
    classDef podgroup fill:#f0f8ff
    classDef job fill:#f0f8ff
    
    class A userAction
    class F,G,Y,LL controller
    class K,L,M,N,Q,XX scheduler
    class Z,AA,MM pod
    class BB,CC,DD,EE,FF,GG,HH,II,JJ,NN,OO,PP,QQ,RR,SS,TT,UU,VV volcano
    class R,S,T,U,V,W podgroup
    class KK,DDD,EEE job

Comment on lines +15 to +17
labels:
  ray.io/scheduler-name: volcano
  volcano.sh/queue-name: kuberay-test-queue
Member


Hi, @win5923
can we add ray.io/gang-scheduling-enabled: "true" in the example and test them?

Collaborator Author

@win5923 win5923 Oct 8, 2025


Currently, adding ray.io/gang-scheduling-enabled: "true" does not have any effect. This only works with YuniKorn or the Scheduler plugin.

func (y *YuniKornScheduler) isGangSchedulingEnabled(obj metav1.Object) bool {
    _, exist := obj.GetLabels()[utils.RayGangSchedulingEnabled]
    return exist
}

func (k *KubeScheduler) isGangSchedulingEnabled(app *rayv1.RayCluster) bool {
    _, exist := app.Labels[utils.RayGangSchedulingEnabled]
    return exist
}

And I think this is a breaking change if we add this check. We should also update the doc to mention that starting from version 1.5.0, users need to add ray.io/gang-scheduling-enabled: "true" to enable gang scheduling for Volcano. https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html

Member


I found that Volcano's default scheduler configmap has gang scheduling enabled!
So in the future, if users want to disable it, we might need to tell them to edit the configmap, or figure out some way to control it by adding more information to our CR.

thank you!!

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Member

@Future-Outlier Future-Outlier left a comment


will request changes until all comments are resolved

Signed-off-by: win5923 <[email protected]>

Comment on lines 81 to 85
if rayJob.Spec.SubmissionMode == rayv1.K8sJobMode || rayJob.Spec.SubmissionMode == rayv1.SidecarMode {
    submitterTemplate := common.GetSubmitterTemplate(&rayJob.Spec, rayJob.Spec.RayClusterSpec)
    submitterResource := utils.CalculatePodResource(submitterTemplate.Spec)
    totalResourceList = append(totalResourceList, submitterResource)
}
Member


I think the result of this code is correct, but the behavior is not.
For K8s mode, we should get the submitter's information from submitterTemplate.
For sidecar mode, we should get the submitter's information from GetDefaultSubmitterContainer, since that is the function we currently use.

update:
I've discussed this offline with @win5923.
I am writing a commit to fix this!
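
For clarity, a sketch of the split being described (GetDefaultSubmitterContainer's package and argument list are assumptions; this is not the merged commit, and the other names come from the surrounding diff):

// Sketch: pick the submitter's resource source based on the submission mode.
switch rayJob.Spec.SubmissionMode {
case rayv1.K8sJobMode:
    // Submitter runs as a separate Kubernetes Job pod, so use the submitter template.
    submitterTemplate := common.GetSubmitterTemplate(&rayJob.Spec, rayJob.Spec.RayClusterSpec)
    totalResourceList = append(totalResourceList, utils.CalculatePodResource(submitterTemplate.Spec))
case rayv1.SidecarMode:
    // Submitter runs as a container in the head pod, so use the default submitter container.
    submitterContainer := common.GetDefaultSubmitterContainer(rayJob.Spec.RayClusterSpec) // signature assumed
    totalResourceList = append(totalResourceList, submitterContainer.Resources.Requests)
}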

Collaborator


YuniKorn also uses GetSubmitterTemplate for sidecar mode. Would you like to update the YuniKorn part as well (newTaskGroupsFromRayJobSpec)?

Member


yes I can do it now

Member


sidecar mode (only 2 items in task-groups; the head pod resource = head container + submitter container)

k8s mode (3 items in task-groups: head pod, worker pod, and submitter pod)

Member

@Future-Outlier Future-Outlier Oct 9, 2025


Hi, @rueian
I have tested the YuniKorn integration in both Kubernetes job mode and sidecar mode and can confirm that no code changes are required. The existing logic correctly handles both scenarios.

Kubernetes Job Mode

The function newTaskGroupsFromRayJobSpec is ultimately called by AddMetadataToChildResource. Within the RayJob controller, AddMetadataToChildResource is only invoked when the RayJob is configured for Kubernetes job mode, as seen in these two locations:

1st place:

func (r *RayJobReconciler) createK8sJobIfNeed(ctx context.Context, rayJobInstance *rayv1.RayJob, rayClusterInstance *rayv1.RayCluster) error {
    logger := ctrl.LoggerFrom(ctx)
    job := &batchv1.Job{}
    namespacedName := common.RayJobK8sJobNamespacedName(rayJobInstance)
    if err := r.Client.Get(ctx, namespacedName, job); err != nil {
        if errors.IsNotFound(err) {
            submitterTemplate, err := getSubmitterTemplate(rayJobInstance, rayClusterInstance)
            if err != nil {
                return err
            }
            if r.options.BatchSchedulerManager != nil {
                if scheduler, err := r.options.BatchSchedulerManager.GetScheduler(); err == nil {
                    scheduler.AddMetadataToChildResource(ctx, rayJobInstance, &submitterTemplate, utils.RayNodeSubmitterGroupLabelValue)
                } else {
                    return err
                }
            }
            return r.createNewK8sJob(ctx, rayJobInstance, submitterTemplate)
        }
        return err
    }
    logger.Info("The submitter Kubernetes Job for RayJob already exists", "Kubernetes Job", job.Name)
    return nil
}

2nd place:
if r.options.BatchSchedulerManager != nil && rayJobInstance.Spec.SubmissionMode == rayv1.K8sJobMode {
    if scheduler, err := r.options.BatchSchedulerManager.GetScheduler(); err == nil {
        // Group name is only used for individual pods to specify their task group ("headgroup", "worker-group-1", etc.).
        // RayCluster contains multiple groups, so we pass an empty string.
        scheduler.AddMetadataToChildResource(ctx, rayJobInstance, rayClusterInstance, "")
    } else {
        return nil, err
    }

That's why it behaves correctly.

Because of this, the YuniKorn-specific logic is correctly applied only when the RayJob creates a Kubernetes Job, and it behaves as expected.

Sidecar Mode

In sidecar mode, the submitter container is added to the Ray head pod, which is part of the RayCluster specification. When the RayCluster controller reconciles the RayCluster custom resource, it calculates the task groups for the head and worker pods. At that point, the head pod correctly contains both the Ray head container and the submitter sidecar container, ensuring their resources are accounted for in the task group calculation, as handled by the logic here:

func (r *RayJobReconciler) constructRayClusterForRayJob(rayJobInstance *rayv1.RayJob, rayClusterName string) (*rayv1.RayCluster, error) {
    labels := make(map[string]string, len(rayJobInstance.Labels))
    for key, value := range rayJobInstance.Labels {
        labels[key] = value
    }
    labels[utils.RayOriginatedFromCRNameLabelKey] = rayJobInstance.Name
    labels[utils.RayOriginatedFromCRDLabelKey] = utils.RayOriginatedFromCRDLabelValue(utils.RayJobCRD)
    rayCluster := &rayv1.RayCluster{
        ObjectMeta: metav1.ObjectMeta{
            Labels:      labels,
            Annotations: rayJobInstance.Annotations,
            Name:        rayClusterName,
            Namespace:   rayJobInstance.Namespace,
        },
        Spec: *rayJobInstance.Spec.RayClusterSpec.DeepCopy(),
    }
    // Set the ownership in order to do the garbage collection by k8s.
    if err := ctrl.SetControllerReference(rayJobInstance, rayCluster, r.Scheme); err != nil {
        return nil, err
    }
    // Inject a submitter container into the head Pod in SidecarMode.
    if rayJobInstance.Spec.SubmissionMode == rayv1.SidecarMode {
        sidecar, err := getSubmitterContainer(rayJobInstance, rayCluster)
        if err != nil {
            return nil, err
        }
        rayCluster.Spec.HeadGroupSpec.Template.Spec.Containers = append(
            rayCluster.Spec.HeadGroupSpec.Template.Spec.Containers, sidecar)
        // In K8sJobMode, the submitter Job relies on the K8s Job backoffLimit API to restart if it fails.
        // This mainly handles WebSocket connection failures caused by transient network issues.
        // In SidecarMode, however, the submitter container shares the same network namespace as the Ray dashboard,
        // so restarts are no longer needed.
        rayCluster.Spec.HeadGroupSpec.Template.Spec.RestartPolicy = corev1.RestartPolicyNever
    }
    return rayCluster, nil
}

Signed-off-by: Future-Outlier <[email protected]>
Comment on lines 93 to 94
corev1.ResourceCPU: submitterContainer.Resources.Requests[corev1.ResourceCPU],
corev1.ResourceMemory: submitterContainer.Resources.Requests[corev1.ResourceMemory],
Collaborator


Can we take all the resource types into account here? We'd better not assume there are only CPU and memory.
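
For illustration, one way to follow that suggestion (variable names taken from the surrounding diff; this is a sketch, not the merged change) is to copy every declared request rather than only CPU and memory:

// Copy all resource types the submitter container requests (GPUs, ephemeral
// storage, extended resources, ...) instead of hardcoding CPU and memory.
submitterResources := corev1.ResourceList{}
for name, quantity := range submitterContainer.Resources.Requests {
    submitterResources[name] = quantity.DeepCopy()
}
totalResourceList = append(totalResourceList, submitterResources)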

Member


Just fixed it here, thank you!
cf6b48b

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Rueian <[email protected]>
@rueian rueian merged commit 362da3d into ray-project:master Oct 9, 2025
27 checks passed


Development

Successfully merging this pull request may close these issues.

[Bug] RayJob Volcano integration

6 participants