
Conversation

@EduardDurech
Contributor

platform.machine() isn't accurate for sbsa-linux; see my comments and review #5746 and fzyzcjy/torch_memory_saver#19.

```shell
$ python -c "import platform; print(platform.machine())"
aarch64
```

But `$CUDA_HOME/lib64` is a symlink to `targets/sbsa-linux/lib`, NOT `targets/aarch64-linux/lib`.
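The mismatch can be reproduced without a real CUDA install. This sketch recreates the sbsa-linux layout in a temporary directory (the directory names mirror the comment above; the temp-dir setup itself is purely illustrative):

```python
import os
import platform
import tempfile
from pathlib import Path

# Recreate the sbsa-linux layout: lib64 is a symlink into
# targets/sbsa-linux/lib, not targets/<platform.machine()>-linux/lib.
cuda_home = Path(tempfile.mkdtemp())
real_lib = cuda_home / "targets" / "sbsa-linux" / "lib"
real_lib.mkdir(parents=True)
(cuda_home / "lib64").symlink_to(real_lib)

# A path built from platform.machine() (e.g. "aarch64") does not exist in
# this layout, while following the lib64 symlink finds the real lib dir.
machine_lib = cuda_home / "targets" / f"{platform.machine()}-linux" / "lib"
print(machine_lib.exists())                   # False
print(os.path.realpath(cuda_home / "lib64"))  # ends in targets/sbsa-linux/lib
```

This is why a loader that hardcodes `targets/{platform.machine()}-linux/lib` misses the library on such systems.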

@gemini-code-assist (bot) left a comment
Summary of Changes

Hello @EduardDurech, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness and portability of the CUDA runtime library loading process. It moves away from a potentially inaccurate architecture-dependent path to a flexible, search-based approach, ensuring that the libcudart.so.12 library can be correctly located and loaded across various system configurations, particularly on ARM-based Linux systems where platform.machine() might not align with CUDA's target directory structure.

Highlights

  • Improved CUDA Library Discovery: Replaced the static CUDA library path construction, which relied on platform.machine(), with a dynamic and more robust discovery mechanism. This addresses issues where platform.machine() might not accurately reflect the necessary CUDA target architecture (e.g., aarch64 vs. sbsa-linux).
  • Robust CUDA Home Detection: Introduced a _find_cuda_home utility function, adapted from torch/utils/cpp_extension.py, to reliably locate the CUDA installation directory. This function checks environment variables (CUDA_HOME, CUDA_PATH), the location of nvcc, and falls back to a default path (/usr/local/cuda).
  • Flexible Library Path Resolution: Enhanced the search for libcudart.so.12 by first checking standard lib and lib64 directories within the identified CUDA home. If not found there, it performs a recursive search (rglob) to locate the library in subdirectories, accommodating diverse CUDA installation layouts (e.g., targets/sbsa-linux/lib).
  • Error Handling: Added a RuntimeError to be raised if libcudart.so.12 cannot be found after all attempts to locate the CUDA library directory, providing clearer feedback on missing dependencies.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request improves the CUDA library discovery mechanism by making it more robust and independent of the system architecture reported by platform.machine(). This is a valuable change. I've identified a couple of issues in the implementation: a critical bug that could cause a crash if the CUDA library isn't found in an expected path, and a logic flaw in how it searches for the library directory. My review comments include specific suggestions to fix these issues and enhance the code's reliability.

@EduardDurech
Contributor Author

@FlamingoPg
Collaborator

Fix AMD HW in sgl-kernel & fix lint.
Others LGTM. @EduardDurech

@EduardDurech
Contributor Author

> Fix AMD HW in sgl-kernel & fix lint. Others LGTM. @EduardDurech

Talking with Yusheng about this, I don't have a device to test on, will lint after

@yushengsu-thu
Contributor

@EduardDurech are there any compile commands and test cases? I can try to test it on an AMD GPU.

@hubertlu-tw
Collaborator

@EduardDurech
We also have a few kernels in sgl-kernel for AMD GPUs. The current code changes will break AMD's path: https://github.com/sgl-project/sglang/actions/runs/16751284526/job/48500120214?pr=8813

CC: @yushengsu-thu @HaiShaw

@EduardDurech force-pushed the fix/cuda_arch_independent branch from a98a922 to a17a6a4 on September 9, 2025 at 10:37
@EduardDurech
Contributor Author

@FlamingoPg can you run CI?

@hubertlu-tw
Collaborator

@EduardDurech can you do this to fix lint errors?

```shell
pip install pre-commit
pre-commit run --all-files
```

@EduardDurech
Contributor Author

@hubertlu-tw sorry forgot

@hubertlu-tw (Collaborator) left a comment
LGTM

@EduardDurech
Contributor Author

@FlamingoPg CI seems to have issues not related to this PR

@zhyncs self-assigned this Sep 17, 2025
@zhyncs merged commit a77564e into sgl-project:main Sep 17, 2025
56 of 100 checks passed
HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025