Probe LoRA Qwen3-8B CUDA fail on plain main (negative control, NOT a fix) #25744

Closed
fzyzcjy wants to merge 1 commit into main from tom/probe-lora-bug-25690

@fzyzcjy

fzyzcjy commented May 19, 2026

Collaborator

🤖 Opened autonomously by Claude Code acting on Tom's behalf. This is the negative-control half of a paired diagnostic experiment (sibling: #25743, which reverts #25690). The @-mentions below are programmatic, not Tom's personal request; please push back if any conclusion is off.

This PR does not revert anything. It only adds a one-line sentinel comment to python/sglang/version.py so the GitHub Actions paths-filter triggers and the PR isn't auto-closed for a zero-diff.

Purpose: confirm that the LoRA Qwen3-8B CUDA-graph illegal-address regression bisected to #25690 is reproducible on plain upstream/main with no other changes; that is, the bug is not specific to Tom's #25703 → #25728 scheduler refactor chain on the main-CI sandbox (#25647).

Expected result: /rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py here should FAIL with the same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint, while the same /rerun-test on the sibling revert PR #25743 should PASS — together that's bidirectional evidence for b79e4b1e68 as the root cause.
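The decision rule behind this paired experiment can be sketched as below. This is only an illustration of the reasoning, not tooling that exists in the repository; `Verdict` and `interpret_pair` are hypothetical names introduced here.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


def interpret_pair(control: Verdict, revert: Verdict) -> str:
    """Interpret a (no-revert control, revert) pair for one suspect commit.

    control: result of the test on plain main (this PR).
    revert:  result of the same test on the branch that reverts the suspect.
    """
    if control is Verdict.FAIL and revert is Verdict.PASS:
        # Bug present with the commit, absent without it.
        return "suspect commit implicated"
    if control is Verdict.FAIL and revert is Verdict.FAIL:
        return "bug present, but the revert does not fix it"
    if control is Verdict.PASS and revert is Verdict.PASS:
        return "bug not reproduced on plain main"
    # control passes while the revert fails: the revert itself broke something.
    return "inconclusive"


# The outcome this PR expects: control fails, revert passes.
print(interpret_pair(Verdict.FAIL, Verdict.PASS))
```

Only the first branch supports the root-cause claim; the other three outcomes would each point to a different follow-up.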

Bisect evidence: #25647 (comment).

cc @Fridge003 @hnyls2002


CI States

Latest PR Test (Base): ❌ Run #26077811447
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@fzyzcjy

fzyzcjy commented May 19, 2026

Collaborator Author

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

@github-actions

github-actions Bot commented May 19, 2026

Contributor

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy

fzyzcjy commented May 19, 2026

Collaborator Author

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Result of the diagnostic /rerun-test on this no-revert negative-control PR.

Result: FAIL ❌ (expected) — bug reproduces on plain upstream/main

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py run: 26077826917 → failure. Same fingerprint as the original extra-a-test-1-gpu-large (0) failure on #25647:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).
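For comparing this fingerprint across runs, a minimal sketch along the following lines may help. It assumes the log wording matches the excerpt above exactly; `extract_fingerprint` and `same_fingerprint` are hypothetical helpers, not part of the sglang CI.

```python
import re

# Patterns copied from the log excerpt above; wording may differ across
# CUDA/driver versions, so treat these as assumptions.
ILLEGAL_ADDRESS = re.compile(r"CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS \((\d+)\)")
SCHEDULER_DIED = re.compile(
    r"Rank (\d+) scheduler died during initialization \(exit code: (-?\d+)\)"
)


def extract_fingerprint(log_text: str) -> dict:
    """Return the crash fingerprint found in a log; fields are None if absent."""
    exc = ILLEGAL_ADDRESS.search(log_text)
    died = SCHEDULER_DIED.search(log_text)
    return {
        "illegal_address_code": int(exc.group(1)) if exc else None,
        "rank": int(died.group(1)) if died else None,
        "exit_code": int(died.group(2)) if died else None,
    }


def same_fingerprint(log_a: str, log_b: str) -> bool:
    """True only if both logs contain the illegal-address crash with equal fields."""
    fp_a = extract_fingerprint(log_a)
    return fp_a["illegal_address_code"] is not None and fp_a == extract_fingerprint(log_b)


SAMPLE = """\
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).
"""

print(extract_fingerprint(SAMPLE))
# → {'illegal_address_code': 14, 'rank': 0, 'exit_code': -6}
```

An exit code of -6 is what Python typically reports for a child process killed by signal 6 (SIGABRT on Linux), which is consistent with the "Fatal Python error: Aborted" line.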

Pair-PR table:

| PR | Branch | /rerun-test verdict |
| --- | --- | --- |
| #25743 (revert of #25690) | tom/revert-25690-cutedsl | PASS ✅ |
| #25744 (this PR: no revert, plain main + 1-line touch) | tom/probe-lora-bug-25690 | FAIL ❌ |

Bidirectional confirmation that b79e4b1e68 (#25690) is the root cause: the bug reproduces on plain upstream/main and is not specific to the context of Tom's #25703 → #25728 refactor chain.

Closing this PR — diagnostic only, no merge value. Full bisect + repro: #25647 (comment).

cc @Fridge003 @hnyls2002

@fzyzcjy

fzyzcjy commented May 19, 2026

Collaborator Author

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Second-run double-confirmation on the no-revert (negative-control) branch.

Second run: FAIL ❌ — reproducibly broken on plain main

/rerun-test run 2 on tom/probe-lora-bug-25690: 26078647279 → failure, same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint.

| Run | Result |
| --- | --- |
| #1 | FAIL ❌ (same fingerprint) |
| #2 | FAIL ❌ (same fingerprint) |

Two failures in two runs for the no-revert branch on plain upstream/main, so the failure looks deterministic rather than flaky.
