Commit 693c021

[Doc][KubeRay] add minimum version requirement on kai-scheduler (#58161)
## Description

kai-scheduler supports gang scheduling at [v0.9.3](NVIDIA/KAI-Scheduler#500 (comment)), but gang scheduling doesn't work at v0.9.4. It works again at v0.10.0-rc1.

## Related issues

## Additional information

The reason appears to be as follows. `numOfHosts` is taken into consideration at v0.9.3:

https://github.com/NVIDIA/KAI-Scheduler/blob/0a680562b3cdbae7d81688a81ab4d829332abd0a/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

That snippet of code is missing at v0.9.4:

https://github.com/NVIDIA/KAI-Scheduler/blob/281f4269b37ad864cf7213f44c1d64217a31048f/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L131-L140

It shows up again at v0.10.0-rc1:

https://github.com/NVIDIA/KAI-Scheduler/blob/96b4d22c31d5ec2b7375b0de0e78e59a57baded6/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

---------

Signed-off-by: fscnick <[email protected]>
1 parent f229d86 commit 693c021
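
To make the `numOfHosts` explanation above concrete, here is a minimal, hypothetical Go sketch (not the actual KAI-Scheduler implementation; the `workerGroup` type and function names are invented for illustration) of how a pod grouper's gang size changes depending on whether `numOfHosts` is multiplied into the worker count. Undercounting the gang for multi-host worker groups is presumably why gang scheduling misbehaves at v0.9.4:

```go
// Illustrative sketch only -- NOT the actual KAI-Scheduler code.
package main

import "fmt"

// workerGroup is a hypothetical, simplified stand-in for a RayCluster
// worker group spec (replicas, each replica spanning numOfHosts hosts).
type workerGroup struct {
	Replicas   int32
	NumOfHosts int32
}

// gangSizeWithHosts sketches the v0.9.3 / v0.10.0 behavior described in the
// commit message: each worker group contributes replicas * numOfHosts pods,
// plus one head pod.
func gangSizeWithHosts(groups []workerGroup) int32 {
	total := int32(1) // head pod
	for _, g := range groups {
		hosts := g.NumOfHosts
		if hosts < 1 {
			hosts = 1 // treat an unset numOfHosts as a single host
		}
		total += g.Replicas * hosts
	}
	return total
}

// gangSizeWithoutHosts sketches the v0.9.4 behavior: replicas only, so
// multi-host worker groups are undercounted.
func gangSizeWithoutHosts(groups []workerGroup) int32 {
	total := int32(1) // head pod
	for _, g := range groups {
		total += g.Replicas
	}
	return total
}

func main() {
	// A worker group with 2 replicas, each spanning 4 hosts, actually needs
	// 2*4 = 8 worker pods plus the head pod.
	groups := []workerGroup{{Replicas: 2, NumOfHosts: 4}}
	fmt.Println("with numOfHosts:   ", gangSizeWithHosts(groups))    // 9
	fmt.Println("without numOfHosts:", gangSizeWithoutHosts(groups)) // 3
}
```

With `replicas: 2` and `numOfHosts: 4`, the two computations disagree (9 vs. 3 pods), so only a version that accounts for `numOfHosts` can hold back the whole multi-host group as one gang.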

1 file changed (+5, -8)

doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md

Lines changed: 5 additions & 8 deletions
````diff
@@ -44,25 +44,22 @@ You can arrange queues hierarchically for organizations with multiple teams, for
 * Kubernetes cluster with GPU nodes
 * NVIDIA GPU Operator
 * kubectl configured to access your cluster
-
-## Step 1: Install KAI Scheduler
-
-Install KAI Scheduler with gpu-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `<KAI_SCHEDULER_VERSION>` in the following command.
+* Install KAI Scheduler with gpu-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `<KAI_SCHEDULER_VERSION>` in the following command. It's recommended to choose v0.10.0 or higher version.
 
 ```bash
 # Install KAI Scheduler
 helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version <KAI_SCHEDULER_VERSION> --set "global.gpuSharing=true"
 ```
 
-## Step 2: Install the KubeRay operator with KAI Scheduler as the batch scheduler
+## Step 1: Install the KubeRay operator with KAI Scheduler as the batch scheduler
 
 Follow the official KubeRay operator [installation documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/kuberay-operator-installation.html#kuberay-operator-installation) and add the following configuration to enable KAI Scheduler integration:
 
 ```bash
 --set batchScheduler.name=kai-scheduler
 ```
 
-## Step 3: Create KAI Scheduler Queues
+## Step 2: Create KAI Scheduler Queues
 
 Create a basic queue structure for department-1 and its child team-a. For demo reasons, this example doesn't enforce any quota, overQuotaWeight, or limit. You can configure these parameters depending on your needs:
 
@@ -112,7 +109,7 @@ spec:
 
 Note: To make this demo easier to follow, we combined these queue definitions with the RayCluster example in the next step. You can use the single combined YAML file and apply both queues and workloads at once.
 
-## Step 4: Gang scheduling with KAI Scheduler
+## Step 3: Gang scheduling with KAI Scheduler
 
 The key pattern is to add the queue label to your RayCluster. [Here's a basic example](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml) from the KubeRay repository:
 
@@ -175,7 +172,7 @@ You can submit the same workload above with a specific priority. Modify the abov
 ```
 See the [documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/priority) for more information.
 
-## Step 5: Submitting Ray workers with GPU sharing
+## Step 4: Submitting Ray workers with GPU sharing
 
 This example creates two workers that share a single GPU (0.5 each, with time-slicing) within a RayCluster. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml)):
 
````