ci: run setup_ld_library_path before install_sglang_kernel#24141

Merged: Kangyan-Zhou merged 3 commits into main from alison/fix-libcusparselt-import-ordering on Apr 30, 2026

@alisonshao (Collaborator) commented Apr 30, 2026

Summary

nvidia-cusparselt-cu13 was bumped on PyPI from 0.9.0 → 0.9.1 on 2026-04-29 18:36 UTC. On the affected runners, uv's upgrade install partially failed: it wrote the new 0.9.1 dist-info (registering the package as installed) but did not extract the bundled nvidia/cusparselt/lib/libcusparseLt.so.0. Because the dist-info looks intact, every subsequent uv pip install skips it as "already satisfied," so the broken state is sticky and import torch fails with libcusparseLt.so.0: cannot open shared object file.

Fix: at the end of install_sglang, if the wheel metadata is present but the .so is missing on disk, force-reinstall the wheel.
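The check described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `cusparselt_is_broken` and `repair_cusparselt` are hypothetical, the site-packages directory is passed in explicitly, and the reinstall command is supplied by the caller rather than hard-coded.

```shell
# Hypothetical helper (not from the PR): succeeds when the wheel's dist-info
# directory exists under the given site-packages but the .so is missing.
cusparselt_is_broken() {
    local site_packages="$1"
    local lib="$site_packages/nvidia/cusparselt/lib/libcusparseLt.so.0"
    local registered=0 d
    # Any nvidia_cusparselt_cu13-<version>.dist-info counts as "registered".
    for d in "$site_packages"/nvidia_cusparselt_cu13-*.dist-info; do
        if [ -d "$d" ]; then
            registered=1
            break
        fi
    done
    [ "$registered" -eq 1 ] && [ ! -e "$lib" ]
}

# Hypothetical wrapper: warn and run the caller-supplied reinstall command
# only when the broken state is detected; otherwise stay silent.
repair_cusparselt() {
    local site_packages="$1"
    shift
    if cusparselt_is_broken "$site_packages"; then
        echo "WARNING: nvidia-cusparselt-cu13 is registered but libcusparseLt.so.0 is missing; reinstalling" >&2
        "$@"
    fi
}
```

A caller might invoke it as `repair_cusparselt "$SITE_PACKAGES" uv pip install --reinstall-package nvidia-cusparselt-cu13 nvidia-cusparselt-cu13`, where `SITE_PACKAGES` and the exact reinstall command are assumptions about the CI environment. Passing the command as arguments keeps the detection logic testable without touching the network.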

Failure example: https://github.com/sgl-project/sglang/actions/runs/25158915002/job/73748785757

Test plan

  • Trigger PR Test on a 1-GPU runner (5090 or h100); verify install dependency completes without libcusparseLt.so.0 ImportError
  • Confirm the WARNING fires on a runner where the file is missing, and is silent on a runner where it's present

install_sglang_kernel imports torch to detect cu version. Without
LD_LIBRARY_PATH pointing at the pip-managed nvidia/*/lib paths, the
import fails with libcusparseLt.so.0 not found on hosts where
cusparselt is only available via the nvidia-cusparselt-cu13 wheel.
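
To illustrate what "LD_LIBRARY_PATH pointing at the pip-managed nvidia/*/lib paths" means in practice, here is a minimal sketch. The body of the real `setup_ld_library_path` is not shown in this thread, so the helper name `collect_nvidia_lib_dirs` and its argument are assumptions made for illustration only.

```shell
# Hypothetical helper (not from the PR): print a colon-separated list of every
# existing <site-packages>/nvidia/<package>/lib directory.
collect_nvidia_lib_dirs() {
    local site_packages="$1"
    local dir joined=""
    for dir in "$site_packages"/nvidia/*/lib; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -d "$dir" ] || continue
        joined="${joined:+$joined:}$dir"
    done
    printf '%s\n' "$joined"
}
```

The result could then be prepended to `LD_LIBRARY_PATH` before anything imports torch. As the later commit in this PR notes, this only helps when the library file actually exists in one of those directories.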

@gemini-code-assist gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request moves the setup_ld_library_path call to an earlier stage in the CI installation script to resolve a torch import issue. However, feedback indicates that removing the call from its original position might lead to an incomplete LD_LIBRARY_PATH for dependencies installed later in the process, such as NVIDIA packages. It is recommended to keep both calls to ensure all paths are correctly captured and exported to the environment.

fix_nvidia_deps
install_test_tools
prepare_runner
setup_ld_library_path

high

Moving setup_ld_library_path to run before install_sglang_kernel correctly fixes the import torch issue. However, removing the call from this location may introduce a new problem.

Functions that run after the new call location, such as install_extra_deps and fix_nvidia_deps, install additional NVIDIA packages (nvidia-cuda-nvrtc, nvidia-cudnn-cu*, etc.). If setup_ld_library_path is not run again after these packages are installed, their library paths will be missing from LD_LIBRARY_PATH. This will result in an incomplete library path being exported to GITHUB_ENV, potentially causing failures in verify_imports or subsequent CI steps.

To ensure LD_LIBRARY_PATH is always complete, please keep this call to setup_ld_library_path. The function is safe to call multiple times; it will simply prepend the full, updated set of library paths.
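
One way to make repeated calls harmless is to skip entries that are already present when prepending. The sketch below is a variant under that assumption; whether the real `setup_ld_library_path` deduplicates is not shown in this thread, and `prepend_unique` is a hypothetical name.

```shell
# Hypothetical helper (not from the PR): prepend the colon-separated list in $1
# to the list in $2, dropping entries of $2 that are already present.
prepend_unique() {
    local new="$1" current="$2"
    local result="$new" entry
    local IFS=':'
    # Unquoted on purpose so the list is split on ':'; entries are assumed
    # not to contain glob characters.
    for entry in $current; do
        [ -n "$entry" ] || continue
        case ":$result:" in
            *":$entry:"*) ;;                          # already present, skip
            *) result="${result:+$result:}$entry" ;;  # keep the old entry
        esac
    done
    printf '%s\n' "$result"
}
```

Calling it twice with the same new paths yields the same value, so running the setup both before `install_sglang_kernel` and after the later dependency installs would not grow the variable. In GitHub Actions the final value can be persisted for later steps with `echo "LD_LIBRARY_PATH=$value" >> "$GITHUB_ENV"`.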

…arselt-cu13

Reordering setup_ld_library_path didn't fix the underlying issue — on the
failing runners, nvidia-cusparselt-cu13 0.8.0 is registered in pip metadata
but libcusparseLt.so.0 is physically missing from
$site-packages/nvidia/cusparselt/lib/. No LD_LIBRARY_PATH adjustment finds
a file that's not on disk.

Two changes:
  1. Read torch's CUDA tag from the local-version label (e.g. 2.9.1+cu130
     → cu130) via pip show, instead of `import torch` which dlopens
     libcusparseLt and fails when the file is missing.
  2. After install_sglang, if libcusparseLt.so.0 is missing but the wheel
     metadata claims it exists, force-reinstall nvidia-cusparselt-cu13.
     This restores the file before any later torch import.

Reverts the main() reorder from the first attempt — that wasn't the bug.
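
Change 1 above, reading the CUDA tag from the version's local label, can be sketched with plain parameter expansion. This is an illustration rather than the merged code; `cuda_tag_from_version` is a hypothetical name.

```shell
# Hypothetical helper (not from the PR): extract the CUDA tag from a version
# string such as 2.9.1+cu130. Fails when there is no "+cu..." local label.
cuda_tag_from_version() {
    local version="$1"
    case "$version" in
        *+cu*) printf '%s\n' "${version##*+}" ;;  # strip through the last '+'
        *) return 1 ;;
    esac
}
```

The version string itself could come from `pip show torch | awk '/^Version:/ {print $2}'`, which reads package metadata only and never loads torch's shared libraries, so it keeps working even when libcusparseLt.so.0 is missing.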
@Kangyan-Zhou Kangyan-Zhou merged commit dc395bc into main Apr 30, 2026
100 of 101 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the alison/fix-libcusparselt-import-ordering branch April 30, 2026 17:55
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026