ci: run setup_ld_library_path before install_sglang_kernel #24141
install_sglang_kernel imports torch to detect the CUDA (cu) version. Without LD_LIBRARY_PATH pointing at the pip-managed nvidia/*/lib paths, the import fails with "libcusparseLt.so.0 not found" on hosts where cusparselt is only available via the nvidia-cusparselt-cu13 wheel.
Code Review
This pull request moves the setup_ld_library_path call to an earlier stage in the CI installation script to resolve a torch import issue. However, feedback indicates that removing the call from its original position might lead to an incomplete LD_LIBRARY_PATH for dependencies installed later in the process, such as NVIDIA packages. It is recommended to keep both calls to ensure all paths are correctly captured and exported to the environment.
fix_nvidia_deps
install_test_tools
prepare_runner
setup_ld_library_path
Moving setup_ld_library_path to run before install_sglang_kernel correctly fixes the import torch issue. However, removing the call from this location may introduce a new problem.
Functions that run after the new call location, such as install_extra_deps and fix_nvidia_deps, install additional NVIDIA packages (nvidia-cuda-nvrtc, nvidia-cudnn-cu*, etc.). If setup_ld_library_path is not run again after these packages are installed, their library paths will be missing from LD_LIBRARY_PATH. This will result in an incomplete library path being exported to GITHUB_ENV, potentially causing failures in verify_imports or subsequent CI steps.
To ensure LD_LIBRARY_PATH is always complete, please keep this call to setup_ld_library_path. The function is safe to call multiple times; it will simply prepend the full, updated set of library paths.
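The "safe to call multiple times" property can be illustrated with a small sketch. This is not the script's actual setup_ld_library_path: `collect_nvidia_lib_dirs` and `prepend_unique_paths` are hypothetical helper names, the site-packages directory is passed in as an argument, and the de-duplication step is an added assumption (the review only says the function prepends the full set of paths).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: rebuild LD_LIBRARY_PATH from pip-managed nvidia/*/lib dirs.

# Print every nvidia/*/lib directory found under the given site-packages root.
collect_nvidia_lib_dirs() {
    local site_packages="$1" dir
    for dir in "$site_packages"/nvidia/*/lib; do
        [ -d "$dir" ] && printf '%s\n' "$dir"
    done
    return 0
}

# Read directories from stdin and prepend them to the path list in $1,
# skipping any directory that is already listed (assumed de-duplication).
prepend_unique_paths() {
    local current="$1" dir result=""
    while IFS= read -r dir; do
        case ":$result:$current:" in
            *":$dir:"*) ;;                          # already present, skip
            *) result="${result:+$result:}$dir" ;;  # append to the new prefix
        esac
    done
    printf '%s\n' "${result}${current:+${result:+:}$current}"
}

# Usage (SITE_PACKAGES is an assumed variable; GITHUB_ENV is set by GitHub Actions):
#   LD_LIBRARY_PATH="$(collect_nvidia_lib_dirs "$SITE_PACKAGES" \
#       | prepend_unique_paths "${LD_LIBRARY_PATH:-}")"
#   export LD_LIBRARY_PATH
#   echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> "$GITHUB_ENV"
```

Running the pipeline a second time, after more NVIDIA wheels are installed, only adds the directories that were not there before, which is why a second call late in main() would be harmless in this sketch.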
…arselt-cu13
Reordering setup_ld_library_path didn't fix the underlying issue — on the
failing runners, nvidia-cusparselt-cu13 0.8.0 is registered in pip metadata
but libcusparseLt.so.0 is physically missing from
$site-packages/nvidia/cusparselt/lib/. No LD_LIBRARY_PATH adjustment finds
a file that's not on disk.
Two changes:
1. Read torch's CUDA tag from the local-version label (e.g. 2.9.1+cu130
→ cu130) via pip show, instead of `import torch` which dlopens
libcusparseLt and fails when the file is missing.
2. After install_sglang, if libcusparseLt.so.0 is missing but the wheel
metadata claims it exists, force-reinstall nvidia-cusparselt-cu13.
This restores the file before any later torch import.
Reverts the main() reorder from the first attempt — that wasn't the bug.
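Change 1 above amounts to parsing the local-version label instead of loading torch. A minimal sketch, assuming the version string has the form shown in the commit message; `cuda_tag_from_version` is a hypothetical name, not the helper used in the script:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the CUDA tag from a version string
# such as "2.9.1+cu130". Returns 1 when no "+cu..." label is present.
cuda_tag_from_version() {
    local version="$1"
    case "$version" in
        *+cu*) printf '%s\n' "${version##*+}" ;;  # keep everything after the "+"
        *) return 1 ;;
    esac
}

# Usage (assumes torch is installed and pip is on PATH):
#   version="$(pip show torch | awk '/^Version:/ {print $2}')"
#   cuda_tag_from_version "$version"
```

Because this only reads package metadata, it never triggers the dlopen of libcusparseLt that makes `import torch` fail on the broken runners.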
Summary
`nvidia-cusparselt-cu13` was bumped on PyPI from 0.9.0 → 0.9.1 on 2026-04-29 18:36 UTC. On the affected runners, uv's upgrade install partially failed: it wrote the new `0.9.1` dist-info (registering the package as installed) but did not extract the bundled `nvidia/cusparselt/lib/libcusparseLt.so.0`. Because the dist-info looks intact, every subsequent `uv pip install` skips it as "already satisfied," so the broken state is sticky and `import torch` fails with `libcusparseLt.so.0: cannot open shared object file`.
Fix: at the end of `install_sglang`, if the wheel metadata is present but the `.so` is missing on disk, force-reinstall the wheel.
Failure example: https://github.com/sgl-project/sglang/actions/runs/25158915002/job/73748785757
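The "metadata present, file missing" condition could be detected roughly as follows. This is a sketch under stated assumptions, not the code from this PR: `cusparselt_needs_repair` is a hypothetical name, the dist-info directory is assumed to follow the usual normalized naming (`nvidia_cusparselt_cu13-<version>.dist-info`), and the reinstall command in the usage comment assumes uv's `--reinstall` flag.

```shell
#!/usr/bin/env bash
# Hypothetical check: exit 0 when the wheel's dist-info directory exists
# but libcusparseLt.so.0 is absent from disk; exit 1 otherwise.
cusparselt_needs_repair() {
    local site_packages="$1"
    local lib="$site_packages/nvidia/cusparselt/lib/libcusparseLt.so.0"
    local meta
    for meta in "$site_packages"/nvidia_cusparselt_cu13-*.dist-info; do
        # An unmatched glob stays literal, so confirm the directory really exists.
        if [ -d "$meta" ] && [ ! -f "$lib" ]; then
            return 0
        fi
    done
    return 1
}

# Usage (SITE_PACKAGES is an assumed variable):
#   if cusparselt_needs_repair "$SITE_PACKAGES"; then
#       uv pip install --reinstall nvidia-cusparselt-cu13
#   fi
```

Checking the file on disk rather than asking the installer is the point of the design: the installer already believes the package is satisfied, so only a filesystem check can expose the broken state.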
Test plan
`libcusparseLt.so.0` ImportError