
[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError #25958

Merged
Kangyan-Zhou merged 3 commits into sgl-project:main from Kangyan-Zhou:fix_cutlass_libs_install_order
May 21, 2026

@Kangyan-Zhou Kangyan-Zhou commented May 21, 2026

Collaborator

Root cause

nvidia-cutlass-dsl[cu13] has additive PyPI extras — installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| Path | -libs-base | -libs-cu13 |
|---|---|---|
| cutlass/_mlir/dialects/_gpu_ops_gen.py | calls super().__init__(self.build_generic(...)) (new-style single object) | calls super().__init__(OPERATION_NAME, REGIONS, ...) (old-style positional) |
| cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so | pybind11 binding only accepts (operation: object) | pybind11 binding only accepts positional args |

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI:

File ".../cutlass/_mlir/dialects/_gpu_ops_gen.py", line 1357, in __init__
    super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, ...)
TypeError: __init__(): incompatible function arguments. The following argument types are supported:
    1. __init__(self, operation: object) -> None

This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute → cute.compile.

Empirical evidence

Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| .py from | .so from | Smoke test (gpu.GPUModuleOp(StringAttr, loc=loc)) |
|---|---|---|
| -libs-base | -libs-base | ✅ PASS |
| -libs-cu13 | -libs-cu13 | ✅ PASS |
| -libs-cu13 | -libs-base | ❌ FAIL: exact CI TypeError, byte-for-byte |
| -libs-base | -libs-cu13 | ✅ PASS |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.
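One way to stage a single (.py, .so) combination for this kind of experiment is sketched below. It assumes both wheels have already been unpacked into separate directories. The function name stage_combo and its three arguments are hypothetical and not part of the PR; only the two file paths come from the description above.

```shell
# Sketch only: stage_combo is a hypothetical helper, not code from the PR.
# $1: unpacked wheel directory that supplies the .py wrapper
# $2: unpacked wheel directory that supplies the .so binding
# $3: target site-packages directory (must already contain the cutlass tree)
stage_combo() {
    local py_src="$1" so_src="$2" site="$3"
    # Overwrite the generated GPU dialect wrapper with the chosen wheel's copy.
    cp "$py_src/cutlass/_mlir/dialects/_gpu_ops_gen.py" \
       "$site/cutlass/_mlir/dialects/_gpu_ops_gen.py"
    # Overwrite the pybind11 binding; the glob covers the interpreter/platform tag.
    cp "$so_src"/cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so \
       "$site/cutlass/_mlir/_mlir_libs/"
}
```

For example, `stage_combo ./cu13 ./base "$SITE_PACKAGES"` would reproduce the failing row of the table, with the directory names and SITE_PACKAGES again being placeholders.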

Fix

After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state):

$PIP_CMD install --force-reinstall --no-deps \
  "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
  $PIP_INSTALL_SUFFIX

The version is parsed from pyproject.toml so it stays in sync with the project's pin. The step is skipped on non-CU13 runners (only -libs-base is installed there, so no conflict is possible).
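The guard logic around that command could look roughly like the sketch below. The function name and the variables PIP_CMD, PIP_INSTALL_SUFFIX and CUTLASS_DSL_VERSION are taken from the PR, but the body is a reconstruction and the merged script may differ. CUDA_VERSION_TAG is a hypothetical stand-in for however the CI script detects a CU13 runner.

```shell
# Sketch only: the body is an illustration, not the merged implementation.
# CUDA_VERSION_TAG is a hypothetical variable; the real runner check may differ.
force_reinstall_cutlass_dsl_libs_cu13() {
    # Non-CU13 runners only have -libs-base installed, so there is nothing to fix.
    if [ "${CUDA_VERSION_TAG:-}" != "cu13" ]; then
        echo "skip: not a CU13 runner"
        return 0
    fi
    # Without a pinned version a matching wheel cannot be guaranteed; do nothing.
    if [ -z "${CUTLASS_DSL_VERSION:-}" ]; then
        echo "skip: no nvidia-cutlass-dsl pin found"
        return 0
    fi
    # --no-deps keeps the installer from touching -libs-base or other packages.
    # PIP_CMD and PIP_INSTALL_SUFFIX are left unquoted so they may hold several words.
    ${PIP_CMD:-pip} install --force-reinstall --no-deps \
        "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
        ${PIP_INSTALL_SUFFIX:-}
}
```

Returning 0 on both skip paths keeps the step from failing a run that has nothing to repair.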

Validation on devbox

  1. TypeError fix: forced the BAD state on the H200 devbox with UV_LINK_MODE=copy (matching CI), then ran force_reinstall_cutlass_dsl_libs_cu13. The smoke test went FAIL → PASS, and the .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox. Both subtests passed with KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from #25743 (Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main).
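The md5 comparison in step 1 can be expressed as a small check like the one below. This is a sketch under assumptions: so_origin is a hypothetical helper name, and the caller supplies the installed .so plus reference copies extracted from each wheel.

```shell
# Sketch only: so_origin is a hypothetical helper, not code from the PR.
# $1: installed .so, $2: .so extracted from -libs-base, $3: .so extracted from -libs-cu13
so_origin() {
    local installed base cu13
    # Reading from stdin makes md5sum print "<hash>  -", so field 1 is the hash.
    installed="$(md5sum < "$1" | cut -d' ' -f1)"
    base="$(md5sum < "$2" | cut -d' ' -f1)"
    cu13="$(md5sum < "$3" | cut -d' ' -f1)"
    if [ "$installed" = "$cu13" ]; then
        echo "cu13"
    elif [ "$installed" = "$base" ]; then
        echo "base"
    else
        echo "unknown"
    fi
}
```

Run before and after the force-reinstall, the reported origin would be expected to move from base to cu13 when the bad state was present.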

Related PRs / supersedes

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321

@github-actions github-actions Bot added the dependencies label May 21, 2026
…-mix TypeError

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:

  cutlass/_mlir/dialects/_gpu_ops_gen.py
  cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so

Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:

  -libs-base: super().__init__(self.build_generic(...))  (new-style)
  -libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)

If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:

  TypeError: __init__(): incompatible function arguments.
    1. __init__(self, operation: object) -> None

surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.

Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.

Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).

Verified on an H200 devbox:
  1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
     -> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
     -> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
     The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
     regression from sgl-project#25743.
@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_cutlass_libs_install_order branch from de055b3 to 1a0dbf2 Compare May 21, 2026 07:14

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.

return
fi

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")

Contributor

medium

Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.

Suggested change
CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")
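To illustrate why anchoring the path matters, the sketch below wraps the same probe in a function that takes the repository root as an argument, so the result does not depend on the caller's working directory. The name probe_cutlass_dsl_version is hypothetical, and `grep -P` assumes GNU grep built with PCRE support.

```shell
# Sketch only: probe_cutlass_dsl_version is a hypothetical wrapper around the PR's grep.
# $1: repository root. Prints the pinned version, or an empty line if no pin is found.
probe_cutlass_dsl_version() {
    # \K drops everything matched so far, so only the version string is printed.
    # The optional group matches an extra such as [cu13]; a name like
    # nvidia-cutlass-dsl-libs-base==... does not match because "-libs" is neither
    # "[" nor "==".
    grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' \
        "$1/python/pyproject.toml" || echo ""
}
```

Calling it as `CUTLASS_DSL_VERSION=$(probe_cutlass_dsl_version "${REPO_ROOT}")` gives the same value from any directory, whereas the relative python/pyproject.toml form only works when the script is launched from the repository root.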

@Kangyan-Zhou

Collaborator Author

@mmangkad I think your suggestion is correct, thanks for sharing it!

@Kangyan-Zhou

Collaborator Author

/tag-and-rerun-ci

- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
  version probe doesn't depend on the working directory the script is
  launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
  TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
  re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the
  install-order bug.
@mmangkad

mmangkad commented May 21, 2026

Collaborator

> @mmangkad I think your suggestion is correct, thanks for sharing it!

Yeah that was the issue because the order of install matters, not the version. Could we include the upgrade back to 4.5.1 here? I just saw it

@Kangyan-Zhou Kangyan-Zhou merged commit caa9f08 into sgl-project:main May 21, 2026
253 of 332 checks passed
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026
…-mix TypeError (sgl-project#25958)

Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
alisonshao added a commit that referenced this pull request May 27, 2026
#25958's wheel-mix fix (force_reinstall_cutlass_dsl_libs_cu13) is what
solves the install-order TypeError; the accompanying 4.5.0->4.5.1 bump
isn't required for the fix and reintroduces a runtime regression.

py-spy on a hanging b200 test (DeepSeek-V3.2-NVFP4 + DSA + EAGLE) shows
the scheduler stuck in fp4_gemm autotune at:

  cutlass/cute/nvgpu/tcgen05/mma.py:557
    -> findsource (inspect.py:997)
    -> [flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py loop body]

The per-kernel-emission inspect.findsource walk is O(N) over loaded
modules and never finishes in 4.5.1 within the 30-min step budget.
4.5.0 doesn't hit this path (per main running this test cleanly).

Holding at 4.5.0 keeps us aligned with the prior team-wide revert
(#25938) while keeping the install-order safety helper from #25958.
mqhc2020 pushed a commit to mqhc2020/sglang that referenced this pull request Jun 2, 2026
…-mix TypeError (sgl-project#25958)

Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026
…-mix TypeError (sgl-project#25958)

Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

Labels

bypass-fastfail, dependencies, run-ci

3 participants