[Doc] Adding docs for Kuberay KAI scheduler integration #54857
Merged: jjyao merged 54 commits into ray-project:master from EkinKarabulut:docs/kai-scheduler-kuberay on Oct 23, 2025.
Commits (54):

ead34c7 docs: Adding docs for Kuberay KAI scheduler integration (EkinKarabulut)
7024252 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
156d5df Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
3a87992 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
8627c3b Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
0fdb725 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
970bc33 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
0f21e8d Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
bf27b38 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
91bf42f Updating the KAI explanation (EkinKarabulut)
f01878d Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
9794ed3 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
87ae6ca Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
dace493 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
5497faf Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
55105b4 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
fa6ad0d Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
4a91176 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
add44b1 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
03019ee Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
36e0327 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
ca1524b fix: added links to install (EkinKarabulut)
017a827 fix: wording (EkinKarabulut)
77ee7f5 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
62afa32 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
399552c Apply suggestion from @angelinalg (EkinKarabulut)
d40dedf Apply suggestion from @angelinalg (EkinKarabulut)
59135cd Apply suggestion from @angelinalg (EkinKarabulut)
1afba05 Apply suggestion from @angelinalg (EkinKarabulut)
57faa86 Apply suggestion from @angelinalg (EkinKarabulut)
75d2a09 Apply suggestion from @angelinalg (EkinKarabulut)
5cc1127 Apply suggestion from @angelinalg (EkinKarabulut)
31100eb Apply suggestion from @angelinalg (EkinKarabulut)
2892aa1 Apply suggestion from @angelinalg (EkinKarabulut)
f35b579 Apply suggestion from @angelinalg (EkinKarabulut)
7300e7d Apply suggestion from @angelinalg (EkinKarabulut)
f4dc60c Apply suggestion from @angelinalg (EkinKarabulut)
ba27377 Apply suggestion from @angelinalg (EkinKarabulut)
b08eaca Apply suggestion from @angelinalg (EkinKarabulut)
2df426f Apply suggestion from @angelinalg (EkinKarabulut)
156c996 Apply suggestion from @angelinalg (EkinKarabulut)
6f55166 Apply suggestion from @angelinalg (EkinKarabulut)
3c1cdda Apply suggestion from @angelinalg (EkinKarabulut)
48ba71e Apply suggestion from @angelinalg (EkinKarabulut)
3217d90 Edits on suggestions (EkinKarabulut)
dedab5c Making the suggested changes for the docs, adding curl commands for y… (EkinKarabulut)
edd90a0 Merge branch 'master' into docs/kai-scheduler-kuberay (jjyao)
e3d11e2 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
077c5b1 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
2447938 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
ed97293 Deleting the yaml example of gpu-sharing (EkinKarabulut)
28a1ef5 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (EkinKarabulut)
28b4d68 Merge branch 'master' into docs/kai-scheduler-kuberay (jjyao)
aad0131 Update doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (rueian)
doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md (212 additions & 0 deletions)

(kuberay-kai-scheduler)=
# Gang scheduling, queue priority, and GPU sharing for RayClusters using KAI Scheduler

This guide demonstrates how to use KAI Scheduler to set up hierarchical queues with quotas, gang scheduling, and GPU sharing for RayClusters.

## KAI Scheduler

[KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI optimizes GPU allocation and supports the full AI lifecycle, from interactive development to large distributed training and inference. Some of the key features are:

- **Bin packing and spread scheduling**: Optimize node usage either by minimizing fragmentation (bin packing) or increasing resiliency and load balancing (spread scheduling).
- **GPU sharing**: Allow KAI to pack multiple Ray workloads from across teams on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.
- **Workload autoscaling**: Scale Ray replicas or workers within min/max bounds while respecting gang constraints.
- **Cluster autoscaling**: Compatible with dynamic cloud infrastructures, including autoscalers like Karpenter.
- **Workload priorities**: Prioritize Ray workloads effectively within queues.
- **Hierarchical queues and fairness**: Two-level queues with quotas, over-quota weights, limits, and equitable resource distribution between queues using Dominant Resource Fairness (DRF).

For more details and the full feature list, see [the KAI Scheduler documentation](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#key-features).

### Core components

1. **PodGroups**: PodGroups are the atomic units of scheduling: a PodGroup represents one or more interdependent pods that the scheduler executes as a single unit, also known as gang scheduling. PodGroups are vital for distributed workloads. KAI Scheduler includes a **PodGrouper** that handles gang scheduling automatically; see the inspection command after this list.

   **How PodGrouper works:**

   ```
   RayCluster "distributed-training":
   ├── Head Pod: 1 GPU
   └── Worker Group: 4 × 0.5 GPU = 2 GPUs
   Total Group Requirement: 3 GPUs

   PodGrouper schedules all 5 pods (1 head + 4 workers) together or none at all.
   ```

2. **Queues**: Queues enforce fairness in resource distribution using:

   - Quota: The baseline amount of resources guaranteed to the queue. The scheduler allocates quotas first to ensure fairness.
   - Queue priority: Determines the order in which queues receive resources beyond their quota. The scheduler serves higher-priority queues first.
   - Over-quota weight: Controls how the scheduler divides surplus resources among queues within the same priority level. Queues with higher weights receive a larger share of the extra resources. For example, two queues with over-quota weights of 2 and 1 split 6 surplus GPUs as 4 and 2.
   - Limit: Defines the maximum resources that the queue can consume.

You can arrange queues hierarchically, for example, a department queue with a child queue for each of its teams.

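PodGrouper creates PodGroups for you, so no manual setup is needed. To inspect the PodGroups it created for a RayCluster, you can list them. This sketch assumes the KAI CRD is exposed under the `podgroups` resource name, which is worth verifying on your cluster:

```bash
# List PodGroups across all namespaces (resource name is an assumption).
kubectl get podgroups -A
```
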
## [Prerequisites](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#prerequisites)

* Kubernetes cluster with GPU nodes
* NVIDIA GPU Operator
* kubectl configured to access your cluster

## Step 1: Install KAI Scheduler

Install KAI Scheduler with GPU sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace `<KAI_SCHEDULER_VERSION>` in the following command:

```bash
# Install KAI Scheduler with GPU sharing enabled
helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version <KAI_SCHEDULER_VERSION> --set "global.gpuSharing=true"
```

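To confirm that the installation succeeded, check that the scheduler pods are running in the `kai-scheduler` namespace:

```bash
# Verify that the KAI Scheduler components are up.
kubectl get pods -n kai-scheduler
```
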
## Step 2: Install the KubeRay operator with KAI Scheduler as the batch scheduler

Follow the official KubeRay operator [installation documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/kuberay-operator-installation.html#kuberay-operator-installation) and add the following configuration to enable KAI Scheduler integration:

```bash
--set batchScheduler.name=kai-scheduler
```

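For example, with Helm the flag slots into the standard operator installation from the KubeRay docs; the chart version below is illustrative, so use the one from the installation guide:

```bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install the KubeRay operator with KAI Scheduler as its batch scheduler.
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0 \
  --set batchScheduler.name=kai-scheduler
```
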
## Step 3: Create KAI Scheduler queues

Create a basic queue structure for `department-1` and its child queue `team-a`. For demonstration purposes, this example doesn't enforce any quota, overQuotaWeight, or limit. You can configure these parameters depending on your needs:

```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-1
spec:
  # priority: 100 (optional)
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  # priority: 200 (optional)
  parentQueue: department-1
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```

Note: To make this demo easier to follow, the sample in the next step combines these queue definitions with the RayCluster manifest, so you can apply both queues and workloads with a single YAML file.

## Step 4: Gang scheduling with KAI Scheduler

The key pattern is to add the queue label to your RayCluster. [Here's a basic example](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml) from the KubeRay repository:

```yaml
metadata:
  name: raycluster-sample
  labels:
    kai.scheduler/queue: team-a  # This is the essential configuration.
```

Apply this RayCluster together with the queues:

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml

kubectl apply -f ray-cluster.kai-scheduler.yaml

# Verify that the queues are created
kubectl get queues

# Watch the pods get scheduled
kubectl get pods -w
```

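The exact output depends on your cluster. Something like the following, where pod names, suffixes, and ages are illustrative, indicates that the gang was scheduled as a unit:

```bash
# Illustrative output of `kubectl get pods`:
NAME                                          READY   STATUS    RESTARTS   AGE
raycluster-sample-head-xxxxx                  1/1     Running   0          40s
raycluster-sample-worker-group-worker-xxxxx   1/1     Running   0          40s
```
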
## Set priorities for workloads

In Kubernetes, assigning different priorities to workloads ensures efficient resource management, minimizes service disruption, and supports better scaling. By prioritizing workloads, KAI Scheduler schedules jobs according to their assigned priority. When sufficient resources aren't available for a workload, the scheduler can preempt lower-priority workloads to free up resources for higher-priority ones. This approach ensures that the scheduler always favors mission-critical services in resource allocation.

The KAI Scheduler deployment comes with several predefined priority classes:

- `train` (50): use for preemptible training workloads
- `build-preemptible` (75): use for preemptible build/interactive workloads
- `build` (100): use for non-preemptible build/interactive workloads
- `inference` (125): use for non-preemptible inference workloads

You can submit the same workload from the previous step with a specific priority. For example, modify the preceding example into a build-class workload:

```yaml
labels:
  kai.scheduler/queue: team-a  # This is the essential configuration.
  priorityClassName: build  # Optional: specify the priority class in metadata.labels.
```

See the [KAI Scheduler priority documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/priority) for more information.

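If the KAI installation registers these as standard Kubernetes PriorityClass objects, which this sketch assumes, you can confirm the names and values directly:

```bash
# List priority classes; train, build-preemptible, build, and inference
# should appear if the assumption above holds.
kubectl get priorityclasses
```
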
## Step 5: Submitting Ray workers with GPU sharing

This example creates two workers that share a single GPU (0.5 each, with time-slicing) within a RayCluster. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml):

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml

kubectl apply -f ray-cluster.kai-gpu-sharing.yaml

# Watch the pods get scheduled
kubectl get pods -w
```

Note: GPU sharing with time slicing in this example occurs only at the Kubernetes layer, allowing multiple pods to share a single GPU device. The scheduler doesn't enforce memory isolation, so applications must manage their own usage to prevent interference. For other GPU sharing approaches, for example MPS, see [the KAI documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/gpu-sharing).

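In the sample manifest, each shared worker requests half a GPU through the `gpu-fraction` pod annotation, the same annotation that the verification step below reads, rather than through a whole `nvidia.com/gpu` resource request. The following is a trimmed sketch of the relevant worker-group portion; the group name and replica count are inferred from the example, so check the downloaded YAML for the exact fields:

```yaml
workerGroupSpecs:
  - groupName: shared-gpu
    replicas: 2
    template:
      metadata:
        annotations:
          gpu-fraction: "0.5"  # Each worker pod receives half of one GPU.
```
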
### Verify GPU sharing is working

To confirm that GPU sharing is working correctly, use these commands:

```bash
# 1. Check GPU fraction annotations and shared GPU groups
kubectl get pods -l ray.io/cluster=raycluster-half-gpu -o custom-columns="NAME:.metadata.name,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
```

You should see both worker pods on the same node with `GPU-FRACTION: 0.5` and the same `GPU-GROUP` ID:

```bash
NAME                                          NODE               GPU-FRACTION   GPU-GROUP
raycluster-half-gpu-head                      ip-xxx-xx-xx-xxx   <none>         <none>
raycluster-half-gpu-shared-gpu-worker-67tvw   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
raycluster-half-gpu-shared-gpu-worker-v5tpp   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
```

Because both workers belong to the same `GPU-GROUP`, they share the same physical GPU, each with `GPU-FRACTION: 0.5`.

## Troubleshooting

### Check for missing queue labels

If pods remain in the `Pending` state, the most common cause is a missing queue label.

Check the KubeRay operator logs for KAI Scheduler errors and look for messages like:

```bash
"Queue label missing from RayCluster; pods will remain pending"
```

**Solution**: Ensure your RayCluster has a queue label that references a queue that exists in the cluster:

```yaml
metadata:
  labels:
    kai.scheduler/queue: default  # Add this label.
```

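A few commands that can help narrow the problem down; the operator deployment name assumes the default Helm installation:

```bash
# Check the KubeRay operator logs for scheduler-related errors
# (deployment name assumes the default Helm install).
kubectl logs deployment/kuberay-operator | grep -i kai

# Confirm that the queue referenced by the label exists.
kubectl get queues

# Inspect scheduling events on a stuck pod (pod name is illustrative).
kubectl describe pod raycluster-sample-head-xxxxx
```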