ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml (new file, 111 additions, 0 deletions)
@@ -0,0 +1,111 @@
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
Comment on lines +15 to +17
Member:

Hi @win5923,
can we add ray.io/gang-scheduling-enabled: "true" to the example and test it?

@win5923 (Collaborator, Author), Oct 8, 2025:

Currently, adding ray.io/gang-scheduling-enabled: "true" has no effect for Volcano. The label is only checked by YuniKorn and the scheduler plugins (KubeScheduler):

func (y *YuniKornScheduler) isGangSchedulingEnabled(obj metav1.Object) bool {
    _, exist := obj.GetLabels()[utils.RayGangSchedulingEnabled]
    return exist
}

func (k *KubeScheduler) isGangSchedulingEnabled(app *rayv1.RayCluster) bool {
    _, exist := app.Labels[utils.RayGangSchedulingEnabled]
    return exist
}

Also, adding that check to the Volcano integration would be a breaking change. We should also update the doc to mention that, starting from version 1.5.0, users need to add ray.io/gang-scheduling-enabled: "true" to enable gang scheduling for Volcano: https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html
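
For illustration, if the Volcano integration did start honoring that label, the sample's metadata would need something like the following (a sketch of the proposed behavior, not how the current release works):

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
    # Hypothetical: today only YuniKorn and the scheduler plugins check this label.
    ray.io/gang-scheduling-enabled: "true"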

Member:

I found that Volcano's default scheduler ConfigMap already has gang scheduling enabled!
So in the future, if users want to disable it, we might need to tell them to edit that ConfigMap, or find some way to control it by adding more information to our CR.

thank you!!

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
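
If a user did want to turn gang scheduling off, one option would presumably be to remove the gang plugin from that tier, roughly like this (a sketch only; check the Volcano docs before editing the live ConfigMap):

  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      # gang plugin removed here to disable gang scheduling
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack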

spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  rayClusterSpec:
    rayVersion: '2.46.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                  memory: "2Gi"
                requests:
                  cpu: "1"
                  memory: "2Gi"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0
                resources:
                  limits:
                    cpu: "1"
                    memory: "1Gi"
                  requests:
                    cpu: "1"
                    memory: "1Gi"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"