@EkinKarabulut
Contributor

Why are these changes needed?

Adding the docs for kuberay KAI-Scheduler integration PR.
ray-project/kuberay#3886

Related issue number

ray-project/kuberay#3886

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • N/A I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • N/A I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Doc tests

@EkinKarabulut EkinKarabulut requested review from a team as code owners July 23, 2025 13:10
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @EkinKarabulut, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new documentation to guide users through integrating KubeRay with the KAI Scheduler. The aim is to empower users with advanced Kubernetes scheduling capabilities for their Ray clusters, focusing on efficient resource management, workload prioritization, and optimized GPU utilization.

Highlights

  • New Documentation Added: I've added comprehensive documentation detailing the integration of KubeRay with NVIDIA's KAI Scheduler. This new guide provides users with instructions on how to leverage KAI Scheduler's advanced features for Ray clusters on Kubernetes.
  • KAI Scheduler Capabilities: The documentation covers key KAI Scheduler functionalities, including gang scheduling (ensuring all Ray cluster components are scheduled together), hierarchical queue management with quotas and priorities for resource allocation, and fractional GPU sharing to maximize GPU utilization.
  • Practical Implementation Guide: The new guide includes step-by-step instructions for installing KAI Scheduler and configuring the KubeRay operator to use it. It also provides practical YAML examples for creating KAI Scheduler queues, applying gang scheduling to RayClusters, setting workload priorities, and demonstrating GPU sharing for Ray workers.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds new documentation for integrating KubeRay with the KAI Scheduler. The new page provides a good overview and examples. My review focuses on improving the correctness and clarity of the code snippets and instructions to ensure users can follow them without issues. I've identified some critical and high-severity issues where commands would fail or configurations are incorrect, along with several medium-severity suggestions to improve the overall quality of the documentation.


```

Apply this RayCluster:

medium

The user is instructed to run kubectl apply -f ray-cluster.kai-scheduler.yaml below, but there's no instruction to save the preceding YAML definition into that file. This can be confusing for users. Please add a note to save the YAML content to a file.

memory: "2Gi"
```

```bash

medium

Similar to a previous comment, the user is instructed to run kubectl apply below, but there's no instruction to save the YAML definition into a file. Please add a note to save the YAML content to a file (e.g., ray-cluster.kai-gpu-sharing.yaml).
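
A minimal sketch of what the requested "save the YAML to a file" step could look like, assuming the file name mentioned earlier in this review (ray-cluster.kai-scheduler.yaml). The manifest body is a placeholder comment, not the guide's actual YAML, and the script only prints the apply command instead of running it, because a reachable cluster is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# File name taken from the review comment; the apply step expects this name.
manifest="ray-cluster.kai-scheduler.yaml"

# Save the YAML to disk first. The body is a placeholder (assumption):
# replace it with the RayCluster definition shown in the guide.
cat > "$manifest" <<'EOF'
# RayCluster definition from the guide goes here.
EOF

# Print the follow-up command rather than running it, because applying
# needs a cluster and a real manifest.
echo "kubectl apply -f $manifest"
```

Writing the file with a quoted heredoc (`<<'EOF'`) keeps the shell from expanding anything inside the YAML, so the saved content matches what the guide shows.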

@ray-gardener ray-gardener bot added community-contribution Contributed by the community docs An issue or change related to documentation core Issues that should be addressed in Ray Core labels Jul 23, 2025
EkinKarabulut and others added 6 commits July 28, 2025 09:53
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
@EkinKarabulut EkinKarabulut force-pushed the docs/kai-scheduler-kuberay branch from 882fc90 to 0fdb725 Compare July 28, 2025 07:54
EkinKarabulut and others added 3 commits July 28, 2025 09:57
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: EkinKarabulut <[email protected]>
@jjyao
Collaborator

jjyao commented Jul 28, 2025

Will review and merge this after the kuberay PR is merged.

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Aug 12, 2025
Signed-off-by: EkinKarabulut <[email protected]>
@EkinKarabulut EkinKarabulut force-pushed the docs/kai-scheduler-kuberay branch from b6da9d6 to 91bf42f Compare August 18, 2025 11:15
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Aug 18, 2025
Contributor

@angelinalg angelinalg left a comment

Some style nits that we would appreciate you addressing. Generally we like to avoid using passive voice for clarity. Thank you for adding to the documentation and apologies for the delay.


[KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI optimizes GPU allocation and supports the full AI lifecycle - from interactive development to large distributed training and inference. Some of the key features are:
- **Bin-packing & Spread Scheduling**: Optimize node usage either by minimizing fragmentation (bin-packing) or increasing resiliency and load balancing (spread scheduling)
- **GPU Sharing**: Allow multiple Ray workloads from across teams to be packed on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.

Suggested change
- **GPU Sharing**: Allow multiple Ray workloads from across teams to be packed on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.
- **GPU sharing**: Allow Ray to pack multiple workloads from across teams on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.

Member

@Future-Outlier Future-Outlier left a comment

Hi, @EkinKarabulut
do you have time to contribute to this?

@Future-Outlier Future-Outlier self-assigned this Sep 26, 2025
Member

@Future-Outlier Future-Outlier left a comment

cc @fscnick for review together, thank you!

@EkinKarabulut EkinKarabulut force-pushed the docs/kai-scheduler-kuberay branch 3 times, most recently from 9b6173d to cf0e524 Compare October 19, 2025 11:58
@EkinKarabulut EkinKarabulut force-pushed the docs/kai-scheduler-kuberay branch from cf0e524 to 2447938 Compare October 19, 2025 12:05
@fscnick
Contributor

fscnick commented Oct 20, 2025

Thanks for addressing the feedback. LGTM

cc @Future-Outlier

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Oct 20, 2025
@jjyao jjyao changed the title docs: Adding docs for Kuberay KAI scheduler integration [Doc] Adding docs for Kuberay KAI scheduler integration Oct 23, 2025
@jjyao jjyao merged commit 0b5b80d into ray-project:master Oct 23, 2025
6 checks passed
iamjustinhsu pushed a commit to iamjustinhsu/ray that referenced this pull request Oct 24, 2025
…54857)

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
Signed-off-by: iamjustinhsu <[email protected]>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…54857)

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
Signed-off-by: xgui <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…54857)

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…54857)

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core docs An issue or change related to documentation go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

7 participants