[Doc] Adding docs for Kuberay KAI scheduler integration #54857
Conversation
Summary of Changes
Hello @EkinKarabulut, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces new documentation to guide users through integrating KubeRay with the KAI Scheduler. The aim is to empower users with advanced Kubernetes scheduling capabilities for their Ray clusters, focusing on efficient resource management, workload prioritization, and optimized GPU utilization.
Highlights
- New Documentation Added: I've added comprehensive documentation detailing the integration of KubeRay with NVIDIA's KAI Scheduler. This new guide provides users with instructions on how to leverage KAI Scheduler's advanced features for Ray clusters on Kubernetes.
- KAI Scheduler Capabilities: The documentation covers key KAI Scheduler functionalities, including gang scheduling (ensuring all Ray cluster components are scheduled together), hierarchical queue management with quotas and priorities for resource allocation, and fractional GPU sharing to maximize GPU utilization.
- Practical Implementation Guide: The new guide includes step-by-step instructions for installing KAI Scheduler and configuring the KubeRay operator to use it. It also provides practical YAML examples for creating KAI Scheduler queues, applying gang scheduling to RayClusters, setting workload priorities, and demonstrating GPU sharing for Ray workers (see the sketch after this list).
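As a rough illustration of what those examples look like, the sketch below shows a RayCluster assigned to a KAI Scheduler queue so that its head and worker pods are gang-scheduled as a single unit. The queue label key (`kai.scheduler/queue`), the queue name `team-a`, and the image tag are assumptions based on the KAI Scheduler quickstart rather than the exact manifests in this PR; the new docs page is the authoritative reference.

```yaml
# Sketch only: assumes KAI Scheduler is installed, the KubeRay operator is
# configured to use it as the batch scheduler, and a queue named "team-a" exists.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kai
  labels:
    kai.scheduler/queue: team-a   # assumed queue label key; verify against the docs page
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.41.0   # illustrative image tag
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
  workerGroupSpecs:
  - groupName: workers
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.41.0
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
```

With gang scheduling, KAI Scheduler admits the cluster only when the head pod and both worker pods can be placed together, which avoids partially started Ray clusters holding on to resources.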
Code Review
This pull request adds new documentation for integrating KubeRay with the KAI Scheduler. The new page provides a good overview and examples. My review focuses on improving the correctness and clarity of the code snippets and instructions to ensure users can follow them without issues. I've identified some critical and high-severity issues where commands would fail or configurations are incorrect, along with several medium-severity suggestions to improve the overall quality of the documentation.
Will review and merge this after the kuberay PR is merged.

This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
angelinalg left a comment
Some style nits that we would appreciate you addressing. Generally we like to avoid using passive voice for clarity. Thank you for adding to the documentation and apologies for the delay.
Quoted from the new doc page:

> [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI optimizes GPU allocation and supports the full AI lifecycle - from interactive development to large distributed training and inference. Some of the key features are:
> - **Bin-packing & Spread Scheduling**: Optimize node usage either by minimizing fragmentation (bin-packing) or increasing resiliency and load balancing (spread scheduling)
> - **GPU Sharing**: Allow multiple Ray workloads from across teams to be packed on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.

Suggested change:

- Current: **GPU Sharing**: Allow multiple Ray workloads from across teams to be packed on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.
- Suggested: **GPU sharing**: Allow Ray to pack multiple workloads from across teams on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.
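To make the GPU sharing feature concrete, a fractional-GPU worker group would look roughly like the sketch below. The `gpu-fraction` annotation comes from the KAI Scheduler GPU-sharing examples and is an assumption here, as are the group name and image tag; defer to the docs page added by this PR for the exact syntax.

```yaml
# Sketch only: a worker group whose pods each request half a GPU through KAI
# Scheduler's GPU-sharing annotation (assumed key: gpu-fraction) instead of a
# whole-GPU nvidia.com/gpu resource request.
workerGroupSpecs:
- groupName: gpu-workers
  replicas: 2
  rayStartParams: {}
  template:
    metadata:
      annotations:
        gpu-fraction: "0.5"   # assumed annotation; two such pods can share one physical GPU
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:2.41.0-gpu   # illustrative image tag
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
```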
Future-Outlier left a comment
Hi @EkinKarabulut, do you have time to contribute this?
Future-Outlier left a comment
cc @fscnick for review together, thank you!
Thanks for addressing the feedback. LGTM
Why are these changes needed?
Adding the docs for the KubeRay KAI-Scheduler integration PR.
ray-project/kuberay#3886
Related issue number
ray-project/kuberay#3886
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.