[Doc] Adding docs for Kuberay KAI scheduler integration #54857
(kuberay-kai-scheduler)=
# Gang scheduling, queue priority, and GPU sharing for RayClusters using KAI Scheduler

This guide demonstrates how to use KAI Scheduler to set up hierarchical queues with quotas, gang scheduling, and GPU sharing for RayClusters.

## KAI Scheduler
[KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI Scheduler optimizes GPU allocation and supports the full AI lifecycle, from interactive development to large-scale distributed training and inference. Key features include:

- **Bin packing and spread scheduling**: Optimize node usage either by minimizing fragmentation (bin packing) or by increasing resiliency and load balancing (spread scheduling).
- **GPU sharing**: Pack multiple Ray workloads from different teams onto the same GPU, letting your organization fit more work onto existing hardware and reducing idle GPU time.
- **Workload autoscaling**: Scale Ray replicas or workers within min/max bounds while respecting gang constraints.
- **Cluster autoscaling**: Compatible with dynamic cloud infrastructure, including autoscalers such as Karpenter.
- **Workload priorities**: Prioritize Ray workloads within queues.
- **Hierarchical queues and fairness**: Two-level queues with quotas, over-quota weights, limits, and equitable resource distribution between queues using Dominant Resource Fairness (DRF).

For the full list of features, see [the documentation](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#key-features).
### Core components

1. **PodGroups**: A PodGroup is the atomic unit of scheduling and represents one or more interdependent pods that the scheduler runs as a single unit, also known as gang scheduling. PodGroups are essential for distributed workloads. KAI Scheduler includes a **PodGrouper** that handles gang scheduling automatically.

**How PodGrouper works:**
```
RayCluster "distributed-training":
├── Head Pod: 1 GPU
└── Worker Group: 4 × 0.5 GPU = 2 GPUs
Total Group Requirement: 3 GPUs

PodGrouper schedules all 5 pods (1 head + 4 workers) together or none at all.
```
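The 0.5-GPU workers in this illustration rely on KAI Scheduler's GPU sharing. The following is only a sketch of how a worker group pod template might request a fractional GPU: the `gpu-fraction` annotation key and the group name, replica count, and image tag are taken as assumptions from the upstream KAI Scheduler examples rather than from this guide, so verify them against the release you install (GPU sharing also requires installing KAI Scheduler with `global.gpuSharing=true`, as shown in Step 1 below).

```yaml
# Sketch only: a RayCluster worker group whose pods each request half a GPU
# through KAI Scheduler GPU sharing. The gpu-fraction annotation key follows
# the upstream KAI Scheduler examples; confirm it for your installed version.
workerGroupSpecs:
  - groupName: gpu-shared-workers   # illustrative group name
    replicas: 4
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gpu-fraction: "0.5"        # each worker pod gets half of a physical GPU
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.46.0   # illustrative image tag
```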
2. **Queues**: Queues enforce fairness in resource distribution using:

- Quota: The baseline amount of resources guaranteed to the queue. The scheduler allocates quotas first to ensure fairness.
- Queue priority: Determines the order in which queues receive resources beyond their quota. The scheduler serves higher-priority queues first.
- Over-quota weight: Controls how the scheduler divides surplus resources among queues at the same priority level. Queues with higher weights receive a larger share of the extra resources, as illustrated below.
- Limit: Defines the maximum resources that the queue can consume.

You can arrange queues hierarchically to mirror your organization, for example, departments that each contain multiple teams.
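The following illustration uses made-up numbers to show how over-quota weight splits surplus resources between two queues at the same priority level:

```
Queue "team-a" (over-quota weight: 2) and queue "team-b" (over-quota weight: 1)
compete for 3 surplus GPUs at the same priority level:
├── team-a receives 2 GPUs (weight 2 out of a total weight of 3)
└── team-b receives 1 GPU  (weight 1 out of a total weight of 3)
```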
## [Prerequisites](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#prerequisites)

* Kubernetes cluster with GPU nodes
* NVIDIA GPU Operator
* kubectl configured to access your cluster
## Step 1: Install KAI Scheduler

Install KAI Scheduler with GPU sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace `<KAI_SCHEDULER_VERSION>` in the following command:

```bash
# Install KAI Scheduler
helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version <KAI_SCHEDULER_VERSION> --set "global.gpuSharing=true"
```
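To confirm the installation succeeded, you can check that the KAI Scheduler components are running in the `kai-scheduler` namespace created by the command above:

```bash
# Verify that the KAI Scheduler pods are up and running.
kubectl get pods -n kai-scheduler
```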
## Step 2: Install the KubeRay operator with KAI Scheduler as the batch scheduler

Follow the official KubeRay operator [installation documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/kuberay-operator-installation.html#kuberay-operator-installation) and add the following Helm value to enable the KAI Scheduler integration:

```bash
--set batchScheduler.name=kai-scheduler
```
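For example, assuming you install the operator from the `kuberay` Helm chart repository as described in the linked guide, the full command looks roughly like the following. `<KUBERAY_VERSION>` is a placeholder; use the version the installation documentation recommends.

```bash
# Add the KubeRay Helm repository, as described in the KubeRay installation doc.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install the KubeRay operator and point its batch scheduler at KAI Scheduler.
helm install kuberay-operator kuberay/kuberay-operator \
  --version <KUBERAY_VERSION> \
  --set batchScheduler.name=kai-scheduler
```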
## Step 3: Create KAI Scheduler queues

Create a basic queue structure for `department-1` and its child queue `team-a`. To keep the demo simple, this example doesn't enforce any quota, overQuotaWeight, or limit (the -1 values mean the scheduler doesn't enforce them). You can configure these parameters depending on your needs:
```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-1
spec:
  # priority: 100 (optional)
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  # priority: 200 (optional)
  parentQueue: department-1
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```
Note: To make this demo easier to follow, the sample file in the next step combines these queue definitions with the RayCluster manifest, so you can apply both the queues and the workload at once.
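If you prefer to apply the queues on their own instead of using the combined file, a minimal sketch (assuming you saved the YAML above as `kai-queues.yaml`, a hypothetical filename):

```bash
# Apply the queue definitions and confirm that both queues exist.
kubectl apply -f kai-queues.yaml
kubectl get queues
```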
## Step 4: Gang scheduling with KAI Scheduler

The key pattern is to add the queue label to your RayCluster. [Here's a basic example](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml) from the KubeRay repository:

```yaml
metadata:
  name: raycluster-sample
  labels:
    kai.scheduler/queue: team-a # This is the essential configuration.
```
Apply this RayCluster along with the queues:

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml

kubectl apply -f ray-cluster.kai-scheduler.yaml

# Verify that the queues are created
kubectl get queues

# Watch the pods get scheduled
kubectl get pods -w
```
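The exact listings depend on your cluster. After a successful gang-scheduled start, `kubectl get queues` should list both `department-1` and `team-a`, and all Ray pods should move to `Running` together rather than one at a time. The following pod listing is illustrative only; the pod names, counts, and timings are hypothetical and depend on your cluster and worker group names:

```
NAME                                          READY   STATUS    RESTARTS   AGE
raycluster-sample-head-xxxxx                  1/1     Running   0          60s
raycluster-sample-worker-group-worker-xxxxx   1/1     Running   0          60s
raycluster-sample-worker-group-worker-yyyyy   1/1     Running   0          60s
```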
## Set priorities for workloads

In Kubernetes, assigning different priorities to workloads ensures efficient resource management, minimizes service disruption, and supports better scaling. KAI Scheduler schedules jobs according to their assigned priority. When sufficient resources aren't available for a workload, the scheduler can preempt lower-priority workloads to free up resources for higher-priority ones. This approach ensures that mission-critical services always take precedence in resource allocation.

The KAI Scheduler deployment comes with several predefined priority classes:

- train (50) - use for preemptible training workloads
- build-preemptible (75) - use for preemptible build/interactive workloads
- build (100) - use for build/interactive workloads (non-preemptible)
- inference (125) - use for inference workloads (non-preemptible)
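Assuming the KAI Scheduler deployment installs these as standard Kubernetes PriorityClass objects, which the numeric values above suggest, you can list them to confirm they exist in your cluster:

```bash
# List priority classes; the KAI-provided classes (train, build-preemptible,
# build, inference) should appear alongside the Kubernetes defaults.
kubectl get priorityclasses
```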
You can submit the same workload with a specific priority. For example, modify the preceding RayCluster into a `build`-class workload by adding a `priorityClassName` label:

```yaml
labels:
  kai.scheduler/queue: team-a # This is the essential configuration.
  priorityClassName: build # Specify the priority class (optional)
```

Note that KAI Scheduler reads the priority class from the workload's `metadata.labels`, not from the `priorityClassName` field in the pod template spec.
Review comments:

Bug: RayCluster Priority Class Misplacement
The priorityClassName field is placed in metadata.labels. For RayClusters, priorityClassName normally belongs in the pod template spec (for example, spec.headGroupSpec.template.spec and spec.workerGroupSpecs[].template.spec), so this placement looks like it would prevent the priority class from being applied to the pods.
rueian: Hi @EkinKarabulut, could you make priorityClassName spec level? Everything else looks good to me.
EkinKarabulut: Hi @rueian, KAI Scheduler reads priority classes from workload labels (metadata.labels.priorityClassName) rather than from pod specs, which allows it to assign priority to entire workloads. This is consistent with KAI Scheduler's official documentation and examples.
rueian: Oh nice! I will put a comment saying that it should not be priorityClassName in the pod spec.
Reviewer: It would be better to include the output of the commands so the reader can easily verify the expected results.