[RayJob] Yunikorn Integration #3948
Merged: rueian merged 60 commits into ray-project:master from owenowenisme:rayjob-yunikorn-integration on Sep 22, 2025
Commits (60 in total; the diff below shows changes from 6 of them)
6c0d860 troychiu: modify batch scheduler interface to support CRD other than RayCluster
ebf0cb3 owenowenisme: update kai-scheduler to fit batchscheduler interface
7d7823b owenowenisme: rename GetSchedulerForCluster to GetScheduler
9e12f53 owenowenisme: update
f5b7df1 owenowenisme: update
7d8c4e8 owenowenisme: rename funcs
a9b6b95 owenowenisme: update
0e9cbd8 owenowenisme: add unit test
4eb38e5 owenowenisme: update
27ccfa2 owenowenisme: update sample yaml
7bf11d1 owenowenisme: Update ray-operator/controllers/ray/rayjob_controller.go
1abda9c owenowenisme: Update ray-operator/controllers/ray/rayjob_controller.go
da6efc2 owenowenisme: remove redundant update rayjob
51a2d1e owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
0def3ae owenowenisme: update yaml
6c77276 owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
ce99f4e owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
4a89804 owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
469e4a8 owenowenisme: add logger back
07e1f0d owenowenisme: rename AddMetadataToChildResourceFromRayJob to AddMetadataToChildReso…
0bea9a1 owenowenisme: add more unit test
e7e7f20 owenowenisme: update unit test
d44b8be owenowenisme: Apply suggestion from @Future-Outlier
3393d9f owenowenisme: Apply suggestion from @Future-Outlier
e28caa5 owenowenisme: Apply suggestion from @Future-Outlier
22f2679 owenowenisme: Apply suggestion from @Future-Outlier
dcc6f1c owenowenisme: Apply suggestion from @Future-Outlier
668d6a2 owenowenisme: Apply suggestion from @Future-Outlier
f1e0c6b owenowenisme: remove comment in sample yaml
7a2d474 owenowenisme: update job controller to only use rayjob controller to set taskgroup …
c2f92a1 owenowenisme: Merge remote-tracking branch 'upstream/master' into rayjob-yunikorn-i…
ce9a513 owenowenisme: rename AddMetadataToChildResourceFromRayCluster to AddMetadataToPodFr…
0551ac1 owenowenisme: split AddMetadataToClildResources into 2 functions for RayCluster and…
8c6d088 owenowenisme: Update ray-operator/controllers/ray/batchscheduler/interface/interfac…
effd7bb owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
12a1b0c owenowenisme: update log
c7e88b4 owenowenisme: Merge remote-tracking branch 'upstream/master' into rayjob-yunikorn-i…
ee9a61b owenowenisme: Merge remote-tracking branch 'upstream/master' into rayjob-yunikorn-i…
2b39ede owenowenisme: simplify interface
1a414f0 owenowenisme: add comment
25eebf2 owenowenisme: rename func
110ba38 owenowenisme: rename func
83954d9 owenowenisme: rename AddMetadataToChildResources to AddMetadataToChildResource
adf84e6 owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
afa537f owenowenisme: remove redundant annotation existing check
68e6cbe owenowenisme: add check if label exist before populate to child
8f9aef5 owenowenisme: rename RayClusterGangSchedulingEnabled
74df79d owenowenisme: simplify propagateTaskGroupsAnnotation logic
0d03a21 owenowenisme: update
2fe091e owenowenisme: update comment
cd09b3e owenowenisme: simplfy annotation propagation logic
2c6e649 owenowenisme: Update ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_…
2646d20 owenowenisme: Merge remote-tracking branch 'upstream/master' into rayjob-yunikorn-i…
ff38824 owenowenisme: use-metav1-obj-instead-of-client-object
69a0531 owenowenisme: resolve comment of pattern consistency
9a6c707 owenowenisme: remove repetitive code in create task group
7cc7c2d owenowenisme: Merge remote-tracking branch 'upstream/master' into rayjob-yunikorn-i…
0646149 owenowenisme: fix unit test
3066c41 owenowenisme: add comment back
906ab83 owenowenisme: Refactor RayJob submitter template handling
157 changes: 157 additions & 0 deletions
ray-operator/config/samples/ray-job.yunikorn-scheduler.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,157 @@ | ||
| apiVersion: ray.io/v1 | ||
| kind: RayJob | ||
| metadata: | ||
| name: rayjob-yunikorn-scheduler | ||
| labels: | ||
| ray.io/gang-scheduling-enabled: "true" | ||
| yunikorn.apache.org/app-id: test-yunikorn-job-0 | ||
| yunikorn.apache.org/queue: root.test | ||
| spec: | ||
| # submissionMode specifies how RayJob submits the Ray job to the RayCluster. | ||
| # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job. | ||
| # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster. | ||
| # submissionMode: "K8sJobMode" | ||
| entrypoint: python /home/ray/samples/sample_code.py | ||
| # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false. | ||
| # shutdownAfterJobFinishes: false | ||
| | ||
| # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes. | ||
| # ttlSecondsAfterFinished: 10 | ||
| | ||
| # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before | ||
| # KubeRay actively tries to terminate the RayJob; value must be positive integer. | ||
| # activeDeadlineSeconds: 120 | ||
| | ||
| # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string. | ||
| # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details. | ||
| # (New in KubeRay version 1.0.) | ||
| runtimeEnvYAML: | | ||
| pip: | ||
| - requests==2.26.0 | ||
| - pendulum==2.1.2 | ||
| env_vars: | ||
| counter_name: "test_counter" | ||
| | ||
| # Suspend specifies whether the RayJob controller should create a RayCluster instance. | ||
| # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false. | ||
| # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created. | ||
| # suspend: false | ||
| | ||
| # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller. | ||
| rayClusterSpec: | ||
| rayVersion: '2.46.0' # should match the Ray version in the image of the containers | ||
| # Ray head pod template | ||
| headGroupSpec: | ||
| # The `rayStartParams` are used to configure the `ray start` command. | ||
| # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay. | ||
| # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`. | ||
| rayStartParams: {} | ||
| #pod template | ||
| template: | ||
| spec: | ||
| containers: | ||
| - name: ray-head | ||
| image: rayproject/ray:2.46.0 | ||
| ports: | ||
| - containerPort: 6379 | ||
| name: gcs-server | ||
| - containerPort: 8265 # Ray dashboard | ||
| name: dashboard | ||
| - containerPort: 10001 | ||
| name: client | ||
| resources: | ||
| limits: | ||
| cpu: "1" | ||
| requests: | ||
| cpu: "200m" | ||
| volumeMounts: | ||
| - mountPath: /home/ray/samples | ||
| name: code-sample | ||
| volumes: | ||
| # You set volumes at the Pod level, then mount them into containers inside that Pod | ||
| - name: code-sample | ||
| configMap: | ||
| # Provide the name of the ConfigMap you want to mount. | ||
| name: ray-job-code-sample | ||
| # An array of keys from the ConfigMap to create as files | ||
| items: | ||
| - key: sample_code.py | ||
| path: sample_code.py | ||
| workerGroupSpecs: | ||
| # the pod replicas in this group typed worker | ||
| - replicas: 1 | ||
| minReplicas: 1 | ||
| maxReplicas: 5 | ||
| # logical group name, for this called small-group, also can be functional | ||
| groupName: small-group | ||
| # The `rayStartParams` are used to configure the `ray start` command. | ||
| # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay. | ||
| # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`. | ||
| rayStartParams: {} | ||
| #pod template | ||
| template: | ||
| spec: | ||
| containers: | ||
| - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc') | ||
| image: rayproject/ray:2.46.0 | ||
| resources: | ||
| limits: | ||
| cpu: "1" | ||
| requests: | ||
| cpu: "200m" | ||
| | ||
| # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster. | ||
| # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container. | ||
| # submitterPodTemplate: | ||
| # spec: | ||
| # restartPolicy: Never | ||
| # containers: | ||
| # - name: my-custom-rayjob-submitter-pod | ||
| # image: rayproject/ray:2.46.0 | ||
| # # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"] | ||
| # resources: | ||
| # limits: | ||
| # cpu: "1" | ||
| # requests: | ||
| # cpu: "200m" | ||
| | ||
| | ||
| | ||
| ######################Ray code sample################################# | ||
| # this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example | ||
| # it is mounted into the container and executed to show the Ray job at work | ||
| --- | ||
| apiVersion: v1 | ||
| kind: ConfigMap | ||
| metadata: | ||
| name: ray-job-code-sample | ||
| data: | ||
| sample_code.py: | | ||
| import ray | ||
| import os | ||
| import requests | ||
| | ||
| ray.init() | ||
| | ||
| @ray.remote | ||
| class Counter: | ||
| def __init__(self): | ||
| # Used to verify runtimeEnv | ||
| self.name = os.getenv("counter_name") | ||
| assert self.name == "test_counter" | ||
| self.counter = 0 | ||
| | ||
| def inc(self): | ||
| self.counter += 1 | ||
| | ||
| def get_counter(self): | ||
| return "{} got {}".format(self.name, self.counter) | ||
| | ||
| counter = Counter.remote() | ||
| | ||
| for _ in range(5): | ||
| ray.get(counter.inc.remote()) | ||
| print(ray.get(counter.get_counter.remote())) | ||
| | ||
| # Verify that the correct runtime env was used for the job. | ||
| assert requests.__version__ == "2.26.0" | ||
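For readers new to gang scheduling, here is a rough sketch of how the pod groups in the sample above could map to YuniKorn-style task groups (`name`, `minMember`, `minResource`). This is not the operator's code. The helper names, the `head` group name, and the sizing rule (one head pod plus `minReplicas` workers per group, sized by CPU requests) are assumptions made for illustration, and the submitter pod is left out.

```python
# Trimmed, hypothetical view of the sample's rayClusterSpec: only the fields
# this sketch needs (group names, minReplicas, CPU requests).
SAMPLE_CLUSTER_SPEC = {
    "headGroupSpec": {"requests": {"cpu": "200m"}},
    "workerGroupSpecs": [
        {"groupName": "small-group", "minReplicas": 1, "requests": {"cpu": "200m"}},
    ],
}


def sketch_task_groups(cluster_spec):
    """Return one task-group dict per pod group (illustrative only).

    Assumption: the head group needs exactly one member and each worker
    group needs at least `minReplicas` members. The operator's real sizing
    rules may differ.
    """
    groups = [
        {
            "name": "head",  # hypothetical group name
            "minMember": 1,
            "minResource": dict(cluster_spec["headGroupSpec"]["requests"]),
        }
    ]
    for worker in cluster_spec["workerGroupSpecs"]:
        groups.append(
            {
                "name": worker["groupName"],
                "minMember": worker["minReplicas"],
                "minResource": dict(worker["requests"]),
            }
        )
    return groups


def gang_size(groups):
    """Total number of pods that must be schedulable together."""
    return sum(group["minMember"] for group in groups)


task_groups = sketch_task_groups(SAMPLE_CLUSTER_SPEC)
print(gang_size(task_groups))  # 2 under the assumptions above
```

Under these assumptions the sample needs room for two pods (one head, one worker) before anything starts, which is the all-or-nothing behavior gang scheduling is meant to provide.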
Need to set `yunikorn.apache.org/app-id` and `RayJob.Name` to the same name, so that when renaming, the `app-id` is updated together. This makes it easier for users to understand when a newly created `RayJob` is stuck in the Accepted state but not running yet.
Ref: https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/yunikorn.html#step-4-use-apache-yunikorn-for-gang-scheduling
Thanks, just updated.
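The point raised in this thread can be checked mechanically. Below is a minimal sketch, assuming the manifest has already been loaded into a plain dict; `app_id_matches_name` is a hypothetical helper, not part of KubeRay, and the `aligned` manifest is only one possible way to satisfy the request, not necessarily what the follow-up commit did.

```python
APP_ID_LABEL = "yunikorn.apache.org/app-id"


def app_id_matches_name(manifest):
    """Return True when the app-id label is set and equals metadata.name."""
    metadata = manifest.get("metadata", {})
    labels = metadata.get("labels", {})
    name = metadata.get("name")
    return name is not None and labels.get(APP_ID_LABEL) == name


# Metadata as it appears in the diff under review.
as_reviewed = {
    "metadata": {
        "name": "rayjob-yunikorn-scheduler",
        "labels": {APP_ID_LABEL: "test-yunikorn-job-0"},
    }
}

# Same manifest with the label aligned to the name, as the reviewer asked.
aligned = {
    "metadata": {
        "name": "rayjob-yunikorn-scheduler",
        "labels": {APP_ID_LABEL: "rayjob-yunikorn-scheduler"},
    }
}

print(app_id_matches_name(as_reviewed), app_id_matches_name(aligned))
```

A check like this could run in a sample-manifest lint step so the two values do not drift apart when a sample is renamed.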