
Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743

Closed
fzyzcjy wants to merge 1 commit into main from tom/revert-25690-cutedsl


@fzyzcjy
Collaborator

fzyzcjy commented May 19, 2026

🤖 Opened autonomously by Claude Code acting on Tom's behalf. All the diagnostic work below — CI failure triage on #25647, the 9-probe bisect, this revert, and the /rerun-test request — was performed by the agent without human-in-the-loop edits. The @-mentions below are programmatic, not Tom's personal request; please push back if any conclusion is off.

This reverts #25690 ([Fix] Try to fix error caused by latest cutedsl packages — merged 2026-05-18 by @Fridge003 / @hnyls2002).

#25690 introduced a CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) regression in the LoRA Qwen3-8B forward path during CUDA graph capture. Bisect evidence (9 probes, narrowed ba214ef3d3..a7b3ced334 to a single commit): #25647 (comment).
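
For readers who want to see how a probe count relates to a commit range, here is a minimal sketch of a bisect loop. It is not the agent's actual tooling: `first_bad_commit` and the `is_bad` callback are hypothetical names, and in practice each probe would be a CI run of the failing test at that commit. The sketch assumes the oldest commit in the range is good, the newest is bad, and the history flips from good to bad exactly once.

```python
from typing import Callable, Sequence, Tuple


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return (first bad commit, number of probes used).

    Assumes `commits` is ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and good/bad flips only once in between.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    probes = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probes += 1  # one probe == one run of the failing test
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], probes


if __name__ == "__main__":
    history = [f"c{i}" for i in range(16)]  # hypothetical commit ids
    culprit, probes = first_bad_commit(history, lambda c: int(c[1:]) >= 11)
    print(culprit, probes)
```

Under strict halving, 9 probes are enough to isolate one commit in a range of up to 512 commits; whether the bisect linked above halved strictly is not stated here.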

Failing test on main: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py::TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy (extra-a-test-1-gpu-large lane).

This PR's purpose is diagnostic: re-running the failing test via /rerun-test here should now PASS, confirming the revert restores main. It's not a merge candidate yet — the underlying motivation of #25690 (cutedsl wrapper/binding ABI mismatch with GPUModuleOp signature TypeError) still needs a proper fix from @Fridge003 / @hnyls2002.

cc @Fridge003 @hnyls2002

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

github-actions Bot added the dependencies label (Pull requests that update a dependency file) May 19, 2026
@github-actions
Contributor

github-actions Bot commented May 19, 2026

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Result of the diagnostic /rerun-test on this revert PR.

Result: PASS ✅ — revert restores the LoRA Qwen3-8B CUDA path

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py run 26077407201: success.

Combined with the negative-control PR #25744 which FAILS with the same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint on plain upstream/main, this gives bidirectional evidence that b79e4b1e68 (#25690) is the root cause:

| PR | Branch | /rerun-test verdict |
| --- | --- | --- |
| #25743 (revert of #25690) | tom/revert-25690-cutedsl | PASS |
| #25744 (no-revert, plain main + 1-line touch) | tom/probe-lora-bug-25690 | FAIL ❌ (same fingerprint) |
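
The "same fingerprint" comparison can be made mechanical. The sketch below assumes the CI log contains the exception in the form quoted in this thread, `CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)`; the helper names `cuda_fingerprint` and `same_failure` are hypothetical and are not part of the repository.

```python
import re
from typing import Optional, Tuple

# Matches strings such as "CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)".
# Assumption: the log prints the exception name followed by its code in parentheses.
_CUDBG_RE = re.compile(r"(CUDBG_EXCEPTION_[A-Z_]+)\s*\((\d+)\)")


def cuda_fingerprint(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (exception name, code) for the first CUDBG exception in the log."""
    match = _CUDBG_RE.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def same_failure(log_a: str, log_b: str) -> bool:
    """True only if both logs contain a CUDBG exception and the two agree."""
    fp_a, fp_b = cuda_fingerprint(log_a), cuda_fingerprint(log_b)
    return fp_a is not None and fp_a == fp_b
```

A log with no CUDBG exception yields `None`, so two clean logs are never reported as the "same failure".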

Bisect thread: #25647 (comment).

cc @Fridge003 @hnyls2002 — leaving this revert open as a starting point if you want to rebase a follow-up fix onto it; closing otherwise once you've seen this.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Second-run double-confirmation on the revert branch.

Second run: PASS ✅ — revert is reproducibly good

/rerun-test run 2 on tom/revert-25690-cutedsl: 26078646438 (success). Matches the first run, 26077407201.

| Run | Result |
| --- | --- |
| #1 | PASS ✅ |
| #2 | PASS ✅ |

Two of two runs pass on the revert branch, so the pass is unlikely to be a flake.

Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request May 21, 2026
…l_dependency.sh

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base AND
-libs-cu13 are installed together, writing to the same site-packages
paths with conflicting content. This causes a GPUModuleOp TypeError at
kernel-compile time (vllm-project/vllm#40082).

The correct libs package depends on the GPU family, not just CUDA version:

  Blackwell (IS_BLACKWELL=1, CU13):
    -libs-cu13 must win. It carries the sm_110 arch alias that the
    CUDA-12.9-built -libs-base wheel lacks.
    Fix: purge -libs-base, force-reinstall -libs-cu13.

  Non-Blackwell CU13 (H100, H200):
    -libs-base must win. Forcing only -libs-cu13 introduces a
    CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression in LoRA CUDA-graph
    capture (sgl-project#25743).
    Fix: purge -libs-cu13, force-reinstall -libs-base.

  Non-CU13: only -libs-base installed (no [cu13] extra), no conflict.

Add fix_cutlass_dsl_libs() called from main() after download_flashinfer_cache,
mirroring the position of the original purge_cutlass_libs_base() from sgl-project#25690.
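
The per-GPU-family rule in this commit message can be read as a small decision table. The Python sketch below restates it for illustration only; the real `fix_cutlass_dsl_libs()` is a shell function in `ci_install_dependency.sh` and its exact commands are not shown in this thread. The full package names (expanded from the `-libs-base` / `-libs-cu13` shorthand), the `LibsPlan` type, and the pip invocations are assumptions.

```python
from typing import List, NamedTuple, Optional

# Assumed full names behind the "-libs-base" / "-libs-cu13" shorthand.
LIBS_BASE = "nvidia-cutlass-dsl-libs-base"
LIBS_CU13 = "nvidia-cutlass-dsl-libs-cu13"


class LibsPlan(NamedTuple):
    purge: Optional[str]      # package to uninstall, or None for a no-op
    reinstall: Optional[str]  # package to force-reinstall, or None for a no-op


def plan_cutlass_dsl_libs(is_cu13: bool, is_blackwell: bool) -> LibsPlan:
    """Pick which libs package should win, following the commit message."""
    if not is_cu13:
        # No [cu13] extra, so only -libs-base is installed: nothing to fix.
        return LibsPlan(None, None)
    if is_blackwell:
        # Blackwell needs the sm_110 arch alias that -libs-base lacks.
        return LibsPlan(purge=LIBS_BASE, reinstall=LIBS_CU13)
    # Non-Blackwell CU13 (H100, H200): -libs-base must win.
    return LibsPlan(purge=LIBS_CU13, reinstall=LIBS_BASE)


def pip_commands(plan: LibsPlan, version: str) -> List[List[str]]:
    """Hypothetical pip invocations for a plan; empty list for a no-op."""
    if plan.purge is None or plan.reinstall is None:
        return []
    return [
        ["pip", "uninstall", "-y", plan.purge],
        ["pip", "install", "--force-reinstall", "--no-deps",
         f"{plan.reinstall}=={version}"],
    ]
```

For example, `pip_commands(plan_cutlass_dsl_libs(True, False), "4.5.0")` returns the uninstall of `-libs-cu13` followed by the forced reinstall of `-libs-base` at that version.
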
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request May 21, 2026
PR sgl-project#25576 bumped nvidia-cutlass-dsl[cu13] from 4.5.0 to 4.5.1. The bump
exposed a latent file-level conflict between -libs-base and -libs-cu13
(both written by the additive [cu13] extra) as a hard GPUModuleOp
TypeError on H100: -libs-cu13's pybind11 binding changed to the new
MLIR-style ((operation: object)) without a matching bump to the Python
wrapper in nvidia-cutlass-dsl, so loading -libs-cu13's .so makes the
wrapper's old-style super().__init__() call fail.

Two changes:

1. Revert the version bump (4.5.1 -> 4.5.0). At 4.5.0 both .so files
   expose a compatible binding, so the same coexistence no longer crashes.
   This removes the active TypeError on H100 and on the CUDA-13 Docker
   image for non-Blackwell users.

2. Add fix_cutlass_dsl_libs() to ci_install_dependency.sh, called from
   main() after download_flashinfer_cache. The function picks the right
   libs package per GPU family even at 4.5.0 to avoid two independent
   regressions that the silent conflict could still hit:

     Blackwell (IS_BLACKWELL=1, CU13):
       Purge -libs-base, force-reinstall -libs-cu13 so its files take
       precedence. -libs-base is CUDA-12.9-built and lacks the sm_110
       arch alias that GB300/B200 need at cutlass import time.

     Non-Blackwell CU13 (H100, H200):
       Purge -libs-cu13, force-reinstall -libs-base. -libs-cu13 carries
       a CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression in LoRA CUDA-
       graph capture on sm_90 (sgl-project#25743 / reverted by sgl-project#25756).

     Non-CU13: no-op (only -libs-base ever installed).
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request May 21, 2026
…-mix TypeError

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:

  cutlass/_mlir/dialects/_gpu_ops_gen.py
  cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so

Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:

  -libs-base: super().__init__(self.build_generic(...))  (new-style)
  -libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)

If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:

  TypeError: __init__(): incompatible function arguments.
    1. __init__(self, operation: object) -> None

surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.
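
The mechanism can be reproduced in plain Python without any cutlass code. The sketch below is an analogy, not the generated `_gpu_ops_gen.py`: all class names are hypothetical stand-ins, and pure Python reports a differently worded `TypeError` than pybind11 does. It shows a wrapper written for a multi-argument base constructor failing once the loaded base class accepts only a single `operation` object.

```python
class NewStyleBinding:
    """Stand-in for a binding whose __init__ takes one operation object."""

    def __init__(self, operation: object) -> None:
        self.operation = operation


class MatchingWrapper(NewStyleBinding):
    """Wrapper written for the new-style binding: build the op, pass one arg."""

    def __init__(self, sym_name: str) -> None:
        super().__init__({"name": "gpu.module", "sym_name": sym_name})


class MismatchedWrapper(NewStyleBinding):
    """Wrapper written for an old-style binding: passes several positional args."""

    def __init__(self, sym_name: str) -> None:
        # The base class only accepts (self, operation), so this call raises.
        super().__init__("gpu.module", 1, sym_name)


def constructs_ok(wrapper_cls: type) -> bool:
    """Return True if the wrapper can be instantiated against its base class."""
    try:
        wrapper_cls("kernels")
    except TypeError:
        return False
    return True
```

Here `constructs_ok(MatchingWrapper)` is `True` and `constructs_ok(MismatchedWrapper)` is `False`, which is the shape of the failure described above: the wrapper and the binding each work with their own partner, but not when paired across wheels.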

Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.

Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).
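
As an illustration of "parsed from pyproject.toml", the sketch below extracts an exact pin with a regular expression. It assumes the dependency is written as an `==` pin such as `nvidia-cutlass-dsl[cu13]==4.5.0`; the actual parsing in the CI script is not shown in this thread, and `pinned_cutlass_dsl_version` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: the project pins the package exactly, optionally with extras,
# e.g. "nvidia-cutlass-dsl[cu13]==4.5.0". The pattern does not match the
# "-libs-base" / "-libs-cu13" packages, because "==" (or "[") must directly
# follow the package name.
_PIN_RE = re.compile(
    r"nvidia-cutlass-dsl(?:\[[^\]]*\])?\s*==\s*([0-9][0-9A-Za-z.+!-]*)"
)


def pinned_cutlass_dsl_version(pyproject_text: str) -> Optional[str]:
    """Return the pinned nvidia-cutlass-dsl version, or None if there is no == pin."""
    match = _PIN_RE.search(pyproject_text)
    return match.group(1) if match else None
```

Returning `None` when no exact pin is found lets the caller skip the reinstall instead of guessing a version.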

Verified on an H200 devbox:
  1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
     -> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
     -> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
     The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
     regression from sgl-project#25743.
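
The md5 check in step 1 can be sketched as follows. This is an illustration under assumptions: the reference digests would be taken from the `.so` inside each unpacked wheel, and `file_md5` and `matching_wheel` are hypothetical helpers, not functions from the repository.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matching_wheel(installed: Path, reference_md5: Dict[str, str]) -> Optional[str]:
    """Return the label of the wheel whose reference digest matches the file.

    `reference_md5` maps a label (for example "base" or "cu13") to the md5 of
    that wheel's copy of the file; returns None if nothing matches.
    """
    actual = file_md5(installed)
    for label, expected in reference_md5.items():
        if expected == actual:
            return label
    return None
```

Running `matching_wheel` on the installed `.so` before and after the reinstall would show the label changing from the base wheel's to the cu13 wheel's, matching the verification described in step 1.
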
alisonshao pushed a commit that referenced this pull request May 21, 2026
…-mix TypeError
