Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743
fzyzcjy wants to merge 1 commit into
This reverts commit b79e4b1.
/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
Result: PASS ✅ — revert restores the LoRA Qwen3-8B CUDA path
Combined with the negative-control PR #25744 which FAILS with the same
Bisect thread: #25647 (comment). cc @Fridge003 @hnyls2002: I'm leaving this revert open as a starting point in case you want to rebase a follow-up fix onto it; otherwise I'll close it once you've seen this.
Second run: PASS ✅, so the revert is reproducibly good.
Two passes out of two runs for the revert, so this does not look like a flake.
…l_dependency.sh

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base AND -libs-cu13 are installed together, writing to the same site-packages paths with conflicting content. This causes a GPUModuleOp TypeError at kernel-compile time (vllm-project/vllm#40082).

The correct libs package depends on the GPU family, not just the CUDA version:

  Blackwell (IS_BLACKWELL=1, CU13): -libs-cu13 must win. It carries the sm_110 arch alias that the CUDA-12.9-built -libs-base wheel lacks. Fix: purge -libs-base, force-reinstall -libs-cu13.

  Non-Blackwell CU13 (H100, H200): -libs-base must win. Forcing only -libs-cu13 introduces a CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression in LoRA CUDA-graph capture (sgl-project#25743). Fix: purge -libs-cu13, force-reinstall -libs-base.

  Non-CU13: only -libs-base installed (no [cu13] extra), no conflict.

Add fix_cutlass_dsl_libs() called from main() after download_flashinfer_cache, mirroring the position of the original purge_cutlass_libs_base() from sgl-project#25690.
PR sgl-project#25576 bumped nvidia-cutlass-dsl[cu13] from 4.5.0 to 4.5.1. The bump exposed a latent file-level conflict between -libs-base and -libs-cu13 (both written by the additive [cu13] extra) as a hard GPUModuleOp TypeError on H100: -libs-cu13's pybind11 binding changed to the new MLIR-style ((operation: object)) without a matching bump to the Python wrapper in nvidia-cutlass-dsl, so loading -libs-cu13's .so makes the wrapper's old-style super().__init__() call fail.

Two changes:

1. Revert the version bump (4.5.1 -> 4.5.0). At 4.5.0 both .so files expose a compatible binding, so the same coexistence no longer crashes. This removes the active TypeError on H100 and on the CUDA-13 Docker image for non-Blackwell users.

2. Add fix_cutlass_dsl_libs() to ci_install_dependency.sh, called from main() after download_flashinfer_cache. The function picks the right libs package per GPU family even at 4.5.0 to avoid two independent regressions that the silent conflict could still hit:

   Blackwell (IS_BLACKWELL=1, CU13): Purge -libs-base, force-reinstall -libs-cu13 so its files take precedence. -libs-base is CUDA-12.9-built and lacks the sm_110 arch alias that GB300/B200 need at cutlass import time.

   Non-Blackwell CU13 (H100, H200): Purge -libs-cu13, force-reinstall -libs-base. -libs-cu13 carries a CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression in LoRA CUDA-graph capture on sm_90 (sgl-project#25743 / reverted by sgl-project#25756).

   Non-CU13: no-op (only -libs-base ever installed).
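The per-family selection described in this commit message can be sketched as a small decision function. This is a Python illustration only, not the shell function in ci_install_dependency.sh. The full distribution names, the helper names pick_libs and pip_commands, the --no-deps flag, and the idea that the libs version equals the pinned nvidia-cutlass-dsl version are all assumptions made for the sketch.

```python
from typing import List, Optional, Tuple

# Assumed full distribution names; the commit message only uses the
# "-libs-base" / "-libs-cu13" shorthand.
LIBS_BASE = "nvidia-cutlass-dsl-libs-base"
LIBS_CU13 = "nvidia-cutlass-dsl-libs-cu13"


def pick_libs(is_cu13: bool, is_blackwell: bool) -> Optional[Tuple[str, str]]:
    """Return (package_to_purge, package_to_reinstall), or None for a no-op."""
    if not is_cu13:
        # Without the [cu13] extra only -libs-base is installed: no conflict.
        return None
    if is_blackwell:
        # Blackwell on CU13: -libs-cu13 must win.
        return (LIBS_BASE, LIBS_CU13)
    # Non-Blackwell CU13 (H100, H200): -libs-base must win.
    return (LIBS_CU13, LIBS_BASE)


def pip_commands(choice: Optional[Tuple[str, str]], version: str) -> List[List[str]]:
    """Build the pip invocations for a choice (hypothetical command layout)."""
    if choice is None:
        return []
    purge, keep = choice
    return [
        ["pip", "uninstall", "-y", purge],
        # --no-deps is an assumption: it avoids pulling the other libs back in.
        ["pip", "install", "--force-reinstall", "--no-deps", f"{keep}=={version}"],
    ]
```

For example, pip_commands(pick_libs(True, False), "4.5.0") yields an uninstall of the cu13 libs followed by a forced reinstall of the base libs, which matches the non-Blackwell CU13 branch above.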
…-mix TypeError
nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:
cutlass/_mlir/dialects/_gpu_ops_gen.py
cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so
Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:
-libs-base: super().__init__(self.build_generic(...)) (new-style)
-libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)
If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:
TypeError: __init__(): incompatible function arguments.
1. __init__(self, operation: object) -> None
surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.
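
The failure mode can be reproduced in miniature with plain Python classes. This is only an analogy: NativeOp, OldStyleWrapper, MatchingWrapper and try_construct are hypothetical stand-ins, and a pure-Python TypeError is worded differently from the pybind11 "incompatible function arguments" message quoted above.

```python
class NativeOp:
    """Stand-in for the binding class exported by the .so (hypothetical)."""

    def __init__(self, operation: object) -> None:
        self.operation = operation


class OldStyleWrapper(NativeOp):
    """Stand-in for a wrapper .py written against a different binding."""

    def __init__(self) -> None:
        # Passes (name, regions, ...) to a base that accepts only `operation`.
        super().__init__("OPERATION_NAME", 1, None)


class MatchingWrapper(NativeOp):
    """Stand-in for a wrapper .py that matches the loaded binding."""

    def __init__(self) -> None:
        super().__init__(object())


def try_construct(cls) -> str:
    """Construct cls() and report whether the call signature was accepted."""
    try:
        cls()
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"
```

Calling try_construct(MatchingWrapper) succeeds, while try_construct(OldStyleWrapper) reports a TypeError, mirroring the case where the .py and the .so come from different wheels.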
Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.
Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).
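
Reading the pinned version out of pyproject.toml could look roughly like the sketch below. It is a Python illustration, not the shell code in this commit; the helper name pinned_cutlass_dsl_version and the exact "nvidia-cutlass-dsl[cu13]==X.Y.Z" pin layout are assumptions.

```python
import re
from typing import Optional

# Matches "nvidia-cutlass-dsl==X" or "nvidia-cutlass-dsl[extra]==X".
# The "==" pin style is an assumption about how pyproject.toml is written.
_PIN = re.compile(r"nvidia-cutlass-dsl(?:\[[^\]]*\])?\s*==\s*([0-9][0-9A-Za-z.]*)")


def pinned_cutlass_dsl_version(pyproject_text: str) -> Optional[str]:
    """Return the pinned nvidia-cutlass-dsl version, or None if not pinned."""
    match = _PIN.search(pyproject_text)
    return match.group(1) if match else None
```

Returning None when no pin is found lets the caller skip the reinstall instead of guessing a version.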
Verified on an H200 devbox:
1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
-> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
-> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
regression from sgl-project#25743.
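
The md5 comparison used in step 1 can be sketched as follows. This is a hypothetical helper, not part of the commit: md5_of and which_wheel are illustrative names, and the reference digests would have to be taken from the two wheels beforehand.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def md5_of(path: Path) -> str:
    """Return the hex md5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def which_wheel(installed: Path, reference_md5: Dict[str, str]) -> Optional[str]:
    """Return the label whose reference digest matches the installed file.

    reference_md5 maps a label such as "base" or "cu13" to the md5 of that
    wheel's copy of the file. Returns None if nothing matches.
    """
    digest = md5_of(installed)
    for label, expected in reference_md5.items():
        if digest == expected:
            return label
    return None
```

Running which_wheel on the installed .so before and after the reinstall shows which wheel currently owns the file.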
This reverts #25690 ([Fix] Try to fix error caused by latest cutedsl packages, merged 2026-05-18 by @Fridge003 / @hnyls2002).

#25690 introduced a CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) regression in the LoRA Qwen3-8B forward path during CUDA graph capture. Bisect evidence (9 probes, narrowed ba214ef3d3..a7b3ced334 to a single commit): #25647 (comment).

Failing test on main: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py::TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy (extra-a-test-1-gpu-large lane).

This PR's purpose is diagnostic: re-running the failing test via /rerun-test here should now PASS, confirming the revert restores main. It's not a merge candidate yet: the underlying motivation of #25690 (cutedsl wrapper/binding ABI mismatch with GPUModuleOp signature TypeError) still needs a proper fix from @Fridge003 / @hnyls2002.

cc @Fridge003 @hnyls2002
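
For readers unfamiliar with how a commit range is narrowed to a single commit, the bisect idea can be sketched as a binary search. This is a generic illustration, not the procedure from the linked thread: first_bad and is_bad are hypothetical names, and the sketch assumes commits are ordered oldest to newest, the last one is bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of probes run)."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is at mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[hi], probes
```

Each probe corresponds to one test run on one commit, so the number of runs grows with the logarithm of the range size.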