Commit 693c021

[Doc][KubeRay] add minimum version requirement on kai-scheduler (#58161)
## Description

kai-scheduler supports gang scheduling at [v0.9.3](NVIDIA/KAI-Scheduler#500 (comment)), but gang scheduling doesn't work at v0.9.4. It works again at v0.10.0-rc1.

## Related issues

## Additional information

The reason appears to be as follows. `numOfHosts` is taken into consideration at v0.9.3:

https://github.com/NVIDIA/KAI-Scheduler/blob/0a680562b3cdbae7d81688a81ab4d829332abd0a/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

That snippet of code is missing at v0.9.4:

https://github.com/NVIDIA/KAI-Scheduler/blob/281f4269b37ad864cf7213f44c1d64217a31048f/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L131-L140

It shows up again at v0.10.0-rc1:

https://github.com/NVIDIA/KAI-Scheduler/blob/96b4d22c31d5ec2b7375b0de0e78e59a57baded6/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

---------

Signed-off-by: fscnick <[email protected]>
1 parent f229d86 commit 693c021
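
To make the `numOfHosts` explanation above concrete, here is a minimal, hypothetical Go sketch (not the actual KAI-Scheduler implementation; the `workerGroup` type and function names are invented for illustration) of how a pod grouper's gang size changes depending on whether `numOfHosts` is multiplied into the worker count. Undercounting the gang for multi-host worker groups is presumably why gang scheduling misbehaves at v0.9.4:

```go
// Illustrative sketch only -- NOT the actual KAI-Scheduler code.
package main

import "fmt"

// workerGroup is a hypothetical, simplified stand-in for a RayCluster
// worker group spec (replicas, each replica spanning numOfHosts hosts).
type workerGroup struct {
	Replicas   int32
	NumOfHosts int32
}

// gangSizeWithHosts sketches the v0.9.3 / v0.10.0 behavior described in the
// commit message: each worker group contributes replicas * numOfHosts pods,
// plus one head pod.
func gangSizeWithHosts(groups []workerGroup) int32 {
	total := int32(1) // head pod
	for _, g := range groups {
		hosts := g.NumOfHosts
		if hosts < 1 {
			hosts = 1 // treat an unset numOfHosts as a single host
		}
		total += g.Replicas * hosts
	}
	return total
}

// gangSizeWithoutHosts sketches the v0.9.4 behavior: replicas only, so
// multi-host worker groups are undercounted.
func gangSizeWithoutHosts(groups []workerGroup) int32 {
	total := int32(1) // head pod
	for _, g := range groups {
		total += g.Replicas
	}
	return total
}

func main() {
	// A worker group with 2 replicas, each spanning 4 hosts, actually needs
	// 2*4 = 8 worker pods plus the head pod.
	groups := []workerGroup{{Replicas: 2, NumOfHosts: 4}}
	fmt.Println("with numOfHosts:   ", gangSizeWithHosts(groups))    // 9
	fmt.Println("without numOfHosts:", gangSizeWithoutHosts(groups)) // 3
}
```

With `replicas: 2` and `numOfHosts: 4`, the two computations disagree (9 vs. 3 pods), so only a version that accounts for `numOfHosts` can hold back the whole multi-host group as one gang.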

1 file changed (+5, -8)

doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md

Lines changed: 5 additions & 8 deletions
````diff
@@ -44,25 +44,22 @@ You can arrange queues hierarchically for organizations with multiple teams, for
 * Kubernetes cluster with GPU nodes
 * NVIDIA GPU Operator
 * kubectl configured to access your cluster
-
-## Step 1: Install KAI Scheduler
-
-Install KAI Scheduler with gpu-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `<KAI_SCHEDULER_VERSION>` in the following command.
+* Install KAI Scheduler with gpu-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `<KAI_SCHEDULER_VERSION>` in the following command. It's recommended to choose v0.10.0 or higher version.
 
 ```bash
 # Install KAI Scheduler
 helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version <KAI_SCHEDULER_VERSION> --set "global.gpuSharing=true"
 ```
 
-## Step 2: Install the KubeRay operator with KAI Scheduler as the batch scheduler
+## Step 1: Install the KubeRay operator with KAI Scheduler as the batch scheduler
 
 Follow the official KubeRay operator [installation documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/kuberay-operator-installation.html#kuberay-operator-installation) and add the following configuration to enable KAI Scheduler integration:
 
 ```bash
 --set batchScheduler.name=kai-scheduler
 ```
 
-## Step 3: Create KAI Scheduler Queues
+## Step 2: Create KAI Scheduler Queues
 
 Create a basic queue structure for department-1 and its child team-a. For demo reasons, this example doesn't enforce any quota, overQuotaWeight, or limit. You can configure these parameters depending on your needs:
 
@@ -112,7 +109,7 @@ spec:
 
 Note: To make this demo easier to follow, we combined these queue definitions with the RayCluster example in the next step. You can use the single combined YAML file and apply both queues and workloads at once.
 
-## Step 4: Gang scheduling with KAI Scheduler
+## Step 3: Gang scheduling with KAI Scheduler
 
 The key pattern is to add the queue label to your RayCluster. [Here's a basic example](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml) from the KubeRay repository:
 
@@ -175,7 +172,7 @@ You can submit the same workload above with a specific priority. Modify the abov
 ```
 See the [documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/priority) for more information.
 
-## Step 5: Submitting Ray workers with GPU sharing
+## Step 4: Submitting Ray workers with GPU sharing
 
 This example creates two workers that share a single GPU (0.5 each, with time-slicing) within a RayCluster. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml)):
 
````