Commit 0b5b80d

[Doc] Adding docs for Kuberay KAI scheduler integration (#54857)

Authored by EkinKarabulut. Signed-off-by: EkinKarabulut <[email protected]>, Rueian <[email protected]>. Co-authored-by: gemini-code-assist[bot], angelinalg, fscnick, Jiajun Yao, Rueian.

1 parent 4f497a6 commit 0b5b80d

File tree: 2 files changed, +214 −0 lines

doc/source/cluster/kubernetes/k8s-ecosystem.md (2 additions, 0 deletions)
```diff
@@ -9,6 +9,7 @@ k8s-ecosystem/ingress
 k8s-ecosystem/metrics-references
 k8s-ecosystem/prometheus-grafana
 k8s-ecosystem/pyspy
+k8s-ecosystem/kai-scheduler
 k8s-ecosystem/volcano
 k8s-ecosystem/yunikorn
 k8s-ecosystem/kueue
@@ -20,6 +21,7 @@ k8s-ecosystem/scheduler-plugins
 * {ref}`kuberay-metrics-references`
 * {ref}`kuberay-prometheus-grafana`
 * {ref}`kuberay-pyspy-integration`
+* {ref}`kuberay-kai-scheduler`
 * {ref}`kuberay-volcano`
 * {ref}`kuberay-yunikorn`
 * {ref}`kuberay-kueue`
```
New file (212 additions, 0 deletions):
(kuberay-kai-scheduler)=
# Gang scheduling, queue priority, and GPU sharing for RayClusters using KAI Scheduler

This guide demonstrates how to use KAI Scheduler to set up hierarchical queues with quotas, gang scheduling, and GPU sharing for RayClusters.

## KAI Scheduler

[KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI Scheduler optimizes GPU allocation and supports the full AI lifecycle, from interactive development to large distributed training and inference. Key features include:

- **Bin packing and spread scheduling**: Optimize node usage either by minimizing fragmentation (bin packing) or by increasing resiliency and load balancing (spread scheduling).
- **GPU sharing**: Pack multiple Ray workloads from across teams onto the same GPU, letting your organization fit more work onto existing hardware and reducing idle GPU time.
- **Workload autoscaling**: Scale Ray replicas or workers within min/max bounds while respecting gang constraints.
- **Cluster autoscaling**: Compatible with dynamic cloud infrastructure, including autoscalers such as Karpenter.
- **Workload priorities**: Prioritize Ray workloads effectively within queues.
- **Hierarchical queues and fairness**: Two-level queues with quotas, over-quota weights, limits, and equitable resource distribution between queues using Dominant Resource Fairness (DRF).

For these and other features, see [the documentation](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#key-features).
### Core components

1. **PodGroups**: PodGroups are the atomic units of scheduling. A PodGroup represents one or more interdependent pods that the scheduler executes as a single unit, also known as gang scheduling. PodGroups are vital for distributed workloads. KAI Scheduler includes a **PodGrouper** component that handles gang scheduling automatically.

**How PodGrouper works:**

```
RayCluster "distributed-training":
├── Head Pod: 1 GPU
└── Worker Group: 4 × 0.5 GPU = 2 GPUs
Total Group Requirement: 3 GPUs

PodGrouper schedules all 5 pods (1 head + 4 workers) together or none at all.
```
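To make the gang requirement concrete, here's a small illustrative Python sketch (not KAI Scheduler code); the head GPU count, worker count, and per-worker fraction mirror the example above:

```python
# Illustrative sketch (not KAI Scheduler code): compute the total resource
# requirement of a gang, mirroring the "distributed-training" example above.

def gang_requirement(head_gpus: float, num_workers: int, gpus_per_worker: float):
    """Return (total pods, total GPUs) that must be schedulable together."""
    total_pods = 1 + num_workers                                # 1 head + N workers
    total_gpus = head_gpus + num_workers * gpus_per_worker
    return total_pods, total_gpus

pods, gpus = gang_requirement(head_gpus=1.0, num_workers=4, gpus_per_worker=0.5)
print(pods, gpus)  # 5 pods and 3.0 GPUs: scheduled together or not at all
```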
2. **Queues**: Queues enforce fairness in resource distribution using:

- Quota: The baseline amount of resources guaranteed to the queue. The scheduler allocates quotas first to ensure fairness.
- Queue priority: Determines the order in which queues receive resources beyond their quota. The scheduler serves higher-priority queues first.
- Over-quota weight: Controls how the scheduler divides surplus resources among queues at the same priority level. Queues with higher weights receive a larger share of the extra resources.
- Limit: The maximum amount of resources that the queue can consume.

You can arrange queues hierarchically, for example, a department queue with a child queue for each of its teams.
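As a rough illustration of over-quota weights, the following Python sketch splits a surplus proportionally to weights among same-priority queues. It's a simplification: the real scheduler also applies DRF, quotas, and limits, and the queue names and numbers here are made up:

```python
# Illustrative simplification (not KAI Scheduler code): divide surplus GPUs
# among queues of the same priority in proportion to their over-quota weights.

def split_surplus(surplus: float, weights: dict[str, float]) -> dict[str, float]:
    total_weight = sum(weights.values())
    return {queue: surplus * w / total_weight for queue, w in weights.items()}

# 8 surplus GPUs, with team-a weighted twice as heavily as team-b:
shares = split_surplus(8.0, {"team-a": 2.0, "team-b": 1.0})
print(shares)  # team-a receives twice team-b's share of the surplus
```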
## [Prerequisites](https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#prerequisites)

* Kubernetes cluster with GPU nodes
* NVIDIA GPU Operator
* kubectl configured to access your cluster
## Step 1: Install KAI Scheduler

Install KAI Scheduler with GPU sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace `<KAI_SCHEDULER_VERSION>` in the following command:

```bash
# Install KAI Scheduler with GPU sharing enabled
helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler \
  -n kai-scheduler --create-namespace \
  --version <KAI_SCHEDULER_VERSION> \
  --set "global.gpuSharing=true"
```

## Step 2: Install the KubeRay operator with KAI Scheduler as the batch scheduler

Follow the official KubeRay operator [installation documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/kuberay-operator-installation.html#kuberay-operator-installation) and add the following flag to the Helm command to enable the KAI Scheduler integration:

```bash
--set batchScheduler.name=kai-scheduler
```
## Step 3: Create KAI Scheduler queues

Create a basic queue structure with a parent queue, department-1, and its child queue, team-a. To keep the demo simple, this example doesn't enforce any quota, overQuotaWeight, or limit. You can configure these parameters depending on your needs:

```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-1
spec:
  # priority: 100  # optional
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  # priority: 200  # optional
  parentQueue: department-1
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```

Note: To make this demo easier to follow, the sample manifest in the next step combines these queue definitions with the RayCluster example, so you can apply both queues and workloads with a single YAML file.
## Step 4: Gang scheduling with KAI Scheduler

The key pattern is to add the queue label to your RayCluster. [Here's a basic example](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml) from the KubeRay repository:

```yaml
metadata:
  name: raycluster-sample
  labels:
    kai.scheduler/queue: team-a # This label is the essential configuration.
```

Apply this RayCluster together with the queues:

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml

kubectl apply -f ray-cluster.kai-scheduler.yaml

# Verify that the queues exist
kubectl get queues

# Watch the pods get scheduled
kubectl get pods -w
```
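The all-or-nothing behavior that gang scheduling provides can be sketched as a toy placement check. This is illustrative Python, not KAI Scheduler's actual algorithm, and it considers a single GPU resource only:

```python
# Toy all-or-nothing placement check (not KAI Scheduler code): a gang of pod
# GPU requests is admitted only if every pod fits on some node simultaneously.

def admit_gang(node_free_gpus: list[float], pod_requests: list[float]) -> bool:
    free = sorted(node_free_gpus, reverse=True)
    # Greedy first-fit-decreasing: place the largest requests first.
    for req in sorted(pod_requests, reverse=True):
        for i, capacity in enumerate(free):
            if capacity >= req:
                free[i] -= req
                break
        else:
            return False  # One pod can't be placed, so the whole gang waits.
    return True

# Two nodes with 2 free GPUs each; the sample gang needs 1 + 4 × 0.5 = 3 GPUs.
print(admit_gang([2.0, 2.0], [1.0, 0.5, 0.5, 0.5, 0.5]))  # True
# With only 1 free GPU per node, one worker can't fit, so nothing is placed.
print(admit_gang([1.0, 1.0], [1.0, 0.5, 0.5, 0.5, 0.5]))  # False
```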

## Set priorities for workloads

In Kubernetes, assigning different priorities to workloads ensures efficient resource management, minimizes service disruption, and supports better scaling. KAI Scheduler schedules jobs according to their assigned priority. When sufficient resources aren't available for a workload, the scheduler can preempt lower-priority workloads to free up resources for higher-priority ones. This approach ensures that mission-critical services always come first in resource allocation.

The KAI Scheduler deployment comes with several predefined priority classes:

- `train` (50): Use for preemptible training workloads.
- `build-preemptible` (75): Use for preemptible build/interactive workloads.
- `build` (100): Use for build/interactive workloads (non-preemptible).
- `inference` (125): Use for inference workloads (non-preemptible).

You can submit the same workload above with a specific priority. For example, modify the preceding example into a build class workload by adding the priority class as a label:

```yaml
labels:
  kai.scheduler/queue: team-a # This label is the essential configuration.
  priorityClassName: build # Optional: specify the priority class in metadata.labels.
```

See the [documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/priority) for more information.
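The preemption ordering that these classes imply can be sketched as follows. This toy Python model copies the predefined class values above; the workload names are hypothetical, and the real scheduler's victim selection is more involved:

```python
# Toy model (not KAI Scheduler code): pick preemption victims from the
# lowest-priority preemptible workloads first, using KAI's predefined values.

PRIORITY = {"train": 50, "build-preemptible": 75, "build": 100, "inference": 125}
PREEMPTIBLE = {"train", "build-preemptible"}

def preemption_victims(running: list[tuple[str, str]]) -> list[str]:
    """running: (workload name, priority class) pairs; victims sorted lowest first."""
    candidates = [(name, cls) for name, cls in running if cls in PREEMPTIBLE]
    return [name for name, cls in sorted(candidates, key=lambda nc: PRIORITY[nc[1]])]

# Hypothetical workloads: the inference service is never a victim.
running = [("serve-llm", "inference"), ("tune-job", "train"), ("dev-notebook", "build-preemptible")]
print(preemption_victims(running))  # ['tune-job', 'dev-notebook']
```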

## Step 5: Submit Ray workers with GPU sharing

This example creates two workers that share a single GPU (0.5 each, with time-slicing) within a RayCluster. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml):

```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml

kubectl apply -f ray-cluster.kai-gpu-sharing.yaml

# Watch the pods get scheduled
kubectl get pods -w
```

Note: GPU sharing with time-slicing in this example occurs only at the Kubernetes layer, allowing multiple pods to share a single GPU device. The scheduler doesn't enforce memory isolation, so applications must manage their own GPU memory usage to prevent interference. For other GPU sharing approaches, such as MPS, see [the KAI documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/gpu-sharing).

### Verify GPU sharing is working

To confirm that GPU sharing is working correctly, use this command:

```bash
# Check GPU fraction annotations and shared GPU groups
kubectl get pods -l ray.io/cluster=raycluster-half-gpu -o custom-columns="NAME:.metadata.name,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
```

You should see both worker pods on the same node with `GPU-FRACTION: 0.5` and the same `GPU-GROUP` ID:

```bash
NAME                                          NODE               GPU-FRACTION   GPU-GROUP
raycluster-half-gpu-head                      ip-xxx-xx-xx-xxx   <none>         <none>
raycluster-half-gpu-shared-gpu-worker-67tvw   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
raycluster-half-gpu-shared-gpu-worker-v5tpp   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
```

The matching `GPU-GROUP` ID shows that both workers run on the same physical GPU, each with `GPU-FRACTION: 0.5`.
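If you want to check this programmatically, the following Python sketch parses output in the shape shown above and confirms that all fractional-GPU pods share one GPU group. The embedded sample rows are copied from this example; in practice you would feed in the live kubectl output:

```python
# Sketch: parse `kubectl get pods -o custom-columns=...` output and confirm
# that all fractional-GPU workers landed in the same shared GPU group.

SAMPLE = """\
NAME NODE GPU-FRACTION GPU-GROUP
raycluster-half-gpu-head ip-xxx <none> <none>
raycluster-half-gpu-shared-gpu-worker-67tvw ip-xxx 0.5 3e456911-a6ea-4b1a-8f55-e90fba89ad76
raycluster-half-gpu-shared-gpu-worker-v5tpp ip-xxx 0.5 3e456911-a6ea-4b1a-8f55-e90fba89ad76
"""

def shared_gpu_groups(output: str) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = {}
    for line in output.splitlines()[1:]:           # skip the header row
        name, _node, fraction, group = line.split()
        if fraction != "<none>":                   # fractional-GPU pods only
            groups.setdefault(group, []).append(name)
    return groups

groups = shared_gpu_groups(SAMPLE)
assert all(len(pods) == 2 for pods in groups.values())  # both workers share one group
print(groups)
```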

## Troubleshooting

### Check for missing queue labels

If pods remain in the `Pending` state, the most common cause is a missing queue label.

Check the KubeRay operator logs for KAI Scheduler errors, such as:

```bash
"Queue label missing from RayCluster; pods will remain pending"
```

**Solution**: Ensure your RayCluster has a queue label that references a queue that exists in the cluster:

```yaml
metadata:
  labels:
    kai.scheduler/queue: default # Add this label
```
