
Sandbox: verify full main CI is green on latest main (do not merge) #25647

Open
fzyzcjy wants to merge 1 commit into sgl-project:main from fzyzcjy:tom/sandbox-verify-main-ci

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented May 18, 2026

Summary

Sandbox PR — do not merge. Touches python/sglang/version.py with a no-op comment so paths-filter flips main_package=true and the full PR Test Base + PR Test Extra matrix dispatches.

Carries three labels so the workflow gates all pass:

| Label | Effect |
| --- | --- |
| run-ci | Passes pr-gate.yml's require-run-ci gate |
| run-ci-extra | Allows pr-test-extra.yml to run on this pull_request event |
| bypass-fastfail | Makes the per-job check-pr-test-health action a no-op (no cascade fast-fail when a single sibling fails on an infra flake) |

Purpose: verify upstream/main (f04c522534) is green end-to-end with the full CI surface (base stages + extra stages, no fast-fail cascade). This is the PR-side equivalent of the dispatched main CI; cleaner than gh workflow run because the dispatch interface cannot pass skip_pr_test_health_check.

Close this PR after the run completes — no source change is intended to land.

Test plan

  • pre-commit run --files python/sglang/version.py
  • PR Test Base dispatches and runs to completion
  • PR Test Extra dispatches and runs to completion
  • No check-pr-test-health cascade failures

CI States

Latest PR Test (Base): ❌ Run #27088945685
Latest PR Test (Extra): ✅ Run #27088945624


@fzyzcjy
Collaborator Author

fzyzcjy commented May 18, 2026

/tag-and-rerun-ci

@fzyzcjy fzyzcjy closed this May 18, 2026
@fzyzcjy fzyzcjy reopened this May 19, 2026
@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

CI failure: base-b-test-1-gpu-large (1) (PR Test Base, B200, 80 GB)

Job log

Failing test: test/registered/spec/eagle/test_eagle_infer_b.py::TestEAGLEServerAdditional::test_radix_attention

Symptom: 11/12 EAGLE tests pass, then test_radix_attention fails with ConnectionRefusedError: [Errno 111] Connection refused on http://127.0.0.1:11000/generate. The server died — a 59 MB cuda-coredumps-run-1.zip artifact was produced (artifact 7073114703).

File ".../test/registered/spec/eagle/test_eagle_infer_b.py", line 104, in test_radix_attention
    run_radix_attention_test(self.base_url)
File ".../python/sglang/test/kits/radix_cache_server_kit.py", line 49, in run_radix_attention_test
    res = requests.post(base_url + "/generate", json=data)
...
urllib3.exceptions.NewConnectionError: ... [Errno 111] Connection refused
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure. 11/12 EAGLE tests on the same base_url passed and the server emitted a CUDA coredump during test_radix_attention — points to an EAGLE-specific server crash, almost certainly a flake unless it repeats.

Next step: leaving the run untouched to see whether other lanes hit the same EAGLE / coredump pattern. If this stays isolated, will classify as flake and /rerun-test test/registered/spec/eagle/test_eagle_infer_b.py.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

CI failure: extra-a-test-1-gpu-large (0) (PR Test Extra, NVIDIA)

Job log

Failing test: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py::TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy

Symptom: the scheduler dies during init with exit code -6 (SIGABRT, not SIGKILL/-9, so not the OS OOM-killer). Retries are exhausted after engine construction throws.

File ".../test/registered/lora/test_lora_qwen3_8b_logprob_diff.py", line 134, in test_lora_qwen3_8b_logprob_accuracy
    engine = sgl.Engine(...)
File ".../python/sglang/srt/entrypoints/engine.py", line 236, in __init__
    ) = self._launch_subprocesses(
File ".../python/sglang/srt/entrypoints/engine.py", line 856, in _launch_subprocesses
    scheduler_init_result.wait_for_ready()
File ".../python/sglang/srt/entrypoints/engine.py", line 651, in wait_for_ready
    infos = _wait_for_scheduler_ready(scheduler_pipe_readers, scheduler_procs)
File ".../python/sglang/srt/entrypoints/engine.py", line 1337, in _wait_for_scheduler_ready
    raise _scheduler_died_error(i, scheduler_procs[i])
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6). If exit code is -9 (SIGKILL), a common cause is the OS OOM killer. Run `dmesg -T | grep -i oom` to check.
...
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure on the NVIDIA extra-a-1-gpu-large lane. Exit -6 = SIGABRT during scheduler init — could be a CUDA-kernel crash, a model-loading assertion in the LoRA path, or transient infra. Posting a separate /rerun-test for this file to differentiate flake vs persistent.
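
The reading of exit code -6 above relies on the POSIX convention that Python's process APIs follow: a negative exit code -N means the child was terminated by signal N. A minimal sketch of that decoding with the standard library only; the helper name `describe_exit_code` is made up for illustration and is not part of sglang.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a child-process exit code to a short description.

    Hypothetical helper for illustration only. On POSIX, Python reports
    "terminated by signal N" as the negative exit code -N.
    """
    if code >= 0:
        return f"exited normally with status {code}"
    try:
        name = signal.Signals(-code).name
    except ValueError:
        name = f"unknown signal {-code}"
    return f"terminated by {name}"


print(describe_exit_code(-6))  # the code seen in the scheduler error
print(describe_exit_code(-9))  # the code the error message warns about
```

Decoding the signal first is a cheap way to decide which follow-up applies: -9 points toward `dmesg` and the OOM killer, while -6 points toward an abort raised inside the process itself.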

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

CI failure: base-b-test-1-gpu-small (5) (PR Test Base, NVIDIA, 32 GB)

Job log

Failing test: test/registered/core/test_srt_endpoint.py::TestSRTEndpoint::test_get_server_info_concurrent ("Make sure the concurrent get_server_info doesn't crash the server.")

Symptom: the server returns non-JSON on concurrent /server_info calls because the server-side handler hits an AssertionError inside communicator.queueing_call. The client then fails with JSONDecodeError: Expecting value: line 1 column 1 (char 0), retries are exhausted, and the test errors out.

Server-side traceback:

File ".../python/sglang/srt/entrypoints/http_server.py", line 635, in server_info
    await _global_state.tokenizer_manager.get_internal_state()
File ".../python/sglang/srt/managers/tokenizer_control_mixin.py", line 788, in get_internal_state
    await self.get_internal_state_communicator(req)
File ".../python/sglang/srt/managers/communicator.py", line 79, in __call__
    return await self.queueing_call(obj)
File ".../python/sglang/srt/managers/communicator.py", line 40, in queueing_call
    assert self._result_event is None
AssertionError

Client-side:

File ".../test/registered/core/test_srt_endpoint.py", line 635, in s
    server_info.json()
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure on the NVIDIA base-b-1-gpu-small (32 GB) lane. The assertion assert self._result_event is None in communicator.queueing_call is a concurrency race in the internal-state communicator — the test (test_get_server_info_concurrent) is specifically designed to catch exactly this class of bug. Smells like a real race, not a flake, but posting /rerun-test to confirm reproducibility before escalating.
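
To make the suspected failure mode concrete, here is a toy sketch of how a single in-flight slot guarded by an assertion breaks under concurrent callers. `ToyCommunicator`, `SerializedCommunicator`, and `fire_concurrently` are hypothetical names written for this illustration; they are not sglang's communicator, and the lock variant is only one possible mitigation, not a proposed fix.

```python
import asyncio


class ToyCommunicator:
    """Toy stand-in with one in-flight slot (NOT sglang's communicator)."""

    def __init__(self):
        self._result_event = None

    async def queueing_call(self, obj):
        # Same shape of guard as the failing assertion in the traceback.
        assert self._result_event is None
        self._result_event = asyncio.Event()
        # Pretend the reply arrives 10 ms later.
        asyncio.get_running_loop().call_later(0.01, self._result_event.set)
        await self._result_event.wait()
        self._result_event = None
        return obj


class SerializedCommunicator(ToyCommunicator):
    """Same toy, but callers take turns through a lock."""

    def __init__(self):
        super().__init__()
        self._lock = asyncio.Lock()

    async def queueing_call(self, obj):
        async with self._lock:
            return await super().queueing_call(obj)


async def fire_concurrently(comm_cls, n):
    comm = comm_cls()  # built inside the running event loop
    calls = (comm.queueing_call(i) for i in range(n))
    return await asyncio.gather(*calls, return_exceptions=True)


unguarded = asyncio.run(fire_concurrently(ToyCommunicator, 3))
serialized = asyncio.run(fire_concurrently(SerializedCommunicator, 3))
print([type(r).__name__ for r in unguarded])
print(serialized)
```

In the toy, the first caller occupies the slot and every caller that arrives before the reply trips the assertion, while the serialized variant returns all three results. The toy fails deterministically; the real failure would depend on request timing, which is consistent with a rerun being needed to judge reproducibility.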

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/spec/eagle/test_eagle_infer_b.py

@github-actions
Contributor

github-actions Bot commented May 19, 2026

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/spec/eagle/test_eagle_infer_b.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/core/test_srt_endpoint.py

@github-actions
Contributor

github-actions Bot commented May 19, 2026

🚀 1-gpu-5090 (1 test): ✅ View workflow run

cd test/ && python3 registered/core/test_srt_endpoint.py

@github-actions
Contributor

github-actions Bot commented May 19, 2026

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py result: ❌ FAIL (reproducible)

Rerun job log

The same test fails on rerun with the same stack as the original extra-a-test-1-gpu-large (0) failure → this is NOT a flake.

Actual root cause: the SIGABRT in extra-a-test-1-gpu-large (0) was only the post-mortem. The output printed just before the coredump shows:

coredump: Starting GPU coredump generation
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
coredump:   - Device: 0

Triggered during CUDA graph capture of TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy. The Python stack printed by faulthandler after the coredump (the abort thread):

File ".../python/sglang/srt/layers/quantization/unquant.py", line 161 in apply
File ".../python/sglang/srt/lora/layers.py", line 724 in forward
...
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1112 in run_once
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1134 in capture_one_batch_size
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 854 in _capture_one_stream
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 867 in capture
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 707 in __init__
File ".../python/sglang/srt/model_executor/model_runner.py", line 2776 in init_device_graphs

Classification: real bug on main HEAD 4a451128…, in the LoRA layer's forward path under CUDA graph capture. Likely a bad index / out-of-bounds memory access in lora/layers.py:724 (or in the unquant apply at unquant.py:161) when running Qwen3-8B with LoRA. Two-run reproducibility on the same commit confirms it's not a transient flake.

(This is the main-CI sandbox PR; the failing commit 4a451128… = latest upstream/main + a no-op python/sglang/version.py comment touch, so this bug is on main proper.)

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/core/test_srt_endpoint.py result: ✅ PASS (flake)

Rerun job → SUCCESS.

The original failure on base-b-test-1-gpu-small (5) (test_get_server_info_concurrent, AssertionError self._result_event is None in communicator.queueing_call) did not reproduce. Classifying as flake — likely a transient race in the internal-state communicator under concurrent server_info that didn't hit the timing window on rerun. Not pursuing further.

Final per-file rerun verdicts on this main-CI sandbox:

| File | Original lane | Rerun verdict |
| --- | --- | --- |
| test/registered/spec/eagle/test_eagle_infer_b.py (test_radix_attention) | base-b-test-1-gpu-large (1) | ✅ PASS (flake) |
| test/registered/core/test_srt_endpoint.py (test_get_server_info_concurrent) | base-b-test-1-gpu-small (5) | ✅ PASS (flake) |
| test/registered/lora/test_lora_qwen3_8b_logprob_diff.py (test_lora_qwen3_8b_logprob_accuracy) | extra-a-test-1-gpu-large (0) | ❌ FAIL: same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS during CUDA graph capture; real bug (bisecting next) |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probe: d90bc65e30 ([NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383) — pre-chain, HEAD-28)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26073712614 (FAIL)
  • Tree date: 2026-05-19 (the commit on main directly preceding Tom's 23-commit refactor chain)

Verbatim CUDA error fingerprint:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
coredump:   - Device: 0
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

Same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS during CUDA graph capture as on c2a212bf… / 4a451128… (current main HEAD).

→ The bug PRE-EXISTS Tom's chain. The 28 PRs between d90bc65e30 and current HEAD (PRs #25703 to #25728, Tom's scheduler refactor chain, plus #25282 DeepSeek V4 host pool, #25596 LTX2 diffusion fix, #25699 PD/NIXL aux, #25689 spec_verify metric, #24710 RMSNorm dispatch) are NOT the cause.

Bisect bound moves to last-good < d90bc65e30. Next probe: ba214ef3d3 (file-move point, 5 days ago) in flight; also dispatching 229cadec04 (midpoint of ba214ef3d3..d90bc65e30, 2026-05-16) to narrow in parallel.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probes: ba214ef3d3 + 229cadec04

PROBE B: ba214ef3d3 (ci: tag-gated nightly migration — foundation + 40 whole-file moves (#24725) — file-move point, 2026-05-14)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26073728329 (PASS)

PROBE C: 229cadec04 (Update logging for inplace setting in MoE layer (#25499) — midpoint of (ba214ef3d3..d90bc65e30), 2026-05-16)

→ Bisect bound collapses to bug introduced in 229cadec04..d90bc65e30 (92 commits, 2026-05-16 → 2026-05-19).

Next probe: c58b47bc86 (Move PoolStats dataclass to scheduler_components.pool_stats_observer (#25618) — midpoint of the new range, 2026-05-18) — in flight as run 26075022728.

Bisect state so far:

| SHA | Date | Subject | rerun-test verdict |
| --- | --- | --- | --- |
| ba214ef3d3 | 2026-05-14 | tag-gated nightly migration, 40 whole-file moves | PASS |
| 229cadec04 | 2026-05-16 | logging update for inplace setting in MoE layer | PASS |
| c58b47bc86 | 2026-05-18 | PoolStats dataclass move | (in flight) |
| d90bc65e30 | 2026-05-19 | [NPU] Fix TypeError in MLA index_head_dim | FAIL |
| current HEAD | 2026-05-19 | (Tom's chain + 5 unrelated) | FAIL |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probe: c58b47bc86 (Move PoolStats dataclass to scheduler_components.pool_stats_observer (#25618) — 2026-05-18)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26075022728 (PASS)

→ Bisect bound collapses to bug introduced in c58b47bc86..d90bc65e30 (46 commits, 2026-05-18 → 2026-05-19).

Next probe: f04c522534 ([PD] Add conclude_state to fake KV backend (#25599) — midpoint of the new range, 2026-05-18).

Bisect state:

| SHA | Date | Verdict |
| --- | --- | --- |
| ba214ef3d3 | 2026-05-14 | PASS |
| 229cadec04 | 2026-05-16 | PASS |
| c58b47bc86 | 2026-05-18 | PASS |
| f04c522534 | 2026-05-18 | (in flight) |
| d90bc65e30 | 2026-05-19 | FAIL |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probe: f04c522534 ([PD] Add conclude_state to fake KV backend (#25599) — 2026-05-18)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26075390772 (PASS)

→ Bisect bound collapses to bug introduced in f04c522534..d90bc65e30 (23 commits, 2026-05-18 → 2026-05-19).

Next probe: f5049709b3 (fix(eagle3): drop +1 offset on aux layer ids when first id != 1 (#25454) — midpoint, 2026-05-18).

Bisect state:

| SHA | Date | Verdict |
| --- | --- | --- |
| ba214ef3d3 | 2026-05-14 | PASS |
| 229cadec04 | 2026-05-16 | PASS |
| c58b47bc86 | 2026-05-18 | PASS |
| f04c522534 | 2026-05-18 | PASS |
| f5049709b3 | 2026-05-18 | (in flight) |
| d90bc65e30 | 2026-05-19 | FAIL |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probe: f5049709b3 (fix(eagle3): drop +1 offset on aux layer ids when first id != 1 (#25454) — 2026-05-18)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26075730388 (PASS)

→ Bisect bound: bug introduced in f5049709b3..d90bc65e30 (12 commits, 2026-05-18 → 2026-05-19).

Full range (suspicion-worthy commits highlighted):

1f185c6ba8 Support draft extend cuda graph for tokenspeed_mla attention backend (#25489)  ← CUDA graph
b7267e8fce [CI] Enable weight prefetch for 8-gpu-h200 basic tests (#25684)
9e3bb9a307 [Spec] fold can_run_cuda_graph into EagleVerifyOutput (#25566)                  ← CUDA graph
c904fdd20e ci: pr-states match renamed "PR Test Base" workflow_run (#25687)
6f892047ec [misc] Throw error when single batch overlap is enabled on Hopper (#25509)      ← Hopper
878e6b8886 [SP] Fix runtime_max_tokens_per_rank for sequence parallelism (#25685)          ← midpoint
745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)                ← cutedsl
dbac464726 [Spec]: Make Triton standalone spec test deterministic (#25303)
d028697d17 [NPU][Docs] Add Kimi-K2.5-W4A8 instance doc on NPU (#25269)
d90bc65e30 [NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)

Next probe: 878e6b8886 (midpoint).

Bisect state:

| SHA | Date | Verdict |
| --- | --- | --- |
| f5049709b3 | 2026-05-18 | PASS ✅ (last good lower bound) |
| 878e6b8886 | 2026-05-18 | (in flight) |
| d90bc65e30 | 2026-05-19 | FAIL (first bad upper bound) |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probe: 878e6b8886 ([SP] Fix runtime_max_tokens_per_rank for sequence parallelism (#25685) — 2026-05-18)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26076099919 (PASS)

→ Bisect bound: bug introduced in 878e6b8886..d90bc65e30 (6 commits).

Remaining range:

745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)  ← prime suspect (CUDA-DSL packages)
dbac464726 [Spec]: Make Triton standalone spec test deterministic (#25303)
d028697d17 [NPU][Docs] Add Kimi-K2.5-W4A8 instance doc on NPU (#25269)
d90bc65e30 [NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)

Next probe (also the midpoint): b79e4b1e68 — the cutedsl-packages fix. This was the most suspicious commit in the wider range too (touches CUDA-DSL builds; LoRA forward → quant unquant.apply → cuBLAS path is a plausible blast radius).

Bisect state:

| SHA | Date | Verdict |
| --- | --- | --- |
| 878e6b8886 | 2026-05-18 | PASS ✅ (last good) |
| b79e4b1e68 | 2026-05-18 | (in flight, prime suspect) |
| d90bc65e30 | 2026-05-19 | FAIL (first bad) |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

Bisect probe: b79e4b1e68 ([Fix] Try to fix error caused by latest cutedsl packages (#25690) — 2026-05-18)

  • File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
  • rerun-test run: 26076486815 (FAIL)

Same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint as on HEAD:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

→ Bisect bound: bug introduced in 878e6b8886..b79e4b1e68 (3 commits inclusive of b79e4b1e68):

745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)  ← FAIL ❌

Next probe: 314dedf7c6 (midpoint of the 3-commit range, 2026-05-18).

  • If PASS → offender is b79e4b1e68 itself (the cutedsl fix).
  • If FAIL → offender is 745abd6cc0 (cutlass_moe_fp4) or 314dedf7c6 (SGLANG_CACHE_DIR env path).

Bisect state:

| SHA | Date | Verdict |
| --- | --- | --- |
| 878e6b8886 | 2026-05-18 | PASS ✅ (last good) |
| 745abd6cc0 | 2026-05-18 | (untested) |
| 314dedf7c6 | 2026-05-18 | (in flight) |
| b79e4b1e68 | 2026-05-18 | FAIL ❌ (first bad upper bound) |
| d90bc65e30 | 2026-05-19 | FAIL |

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

🤖 Posted autonomously by Claude Code acting on Tom's behalf. The 9-probe bisect (PROBE_A..I) below was driven by the agent — each probe pushed a temp branch on upstream, dispatched rerun-test.yml against it, classified the result, and narrowed the range. The @-mentions are programmatic, not Tom's personal request; please push back if anything is off.

Bisect result: test_lora_qwen3_8b_logprob_diff.py regressed at b79e4b1e68 (PR #25690, [Fix] Try to fix error caused by latest cutedsl packages)

PROBE I (the deciding probe): 314dedf7c6 (Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)) → rerun-test 26076870779 (PASS)

With 314dedf7c6 PASS and b79e4b1e68 FAIL on the immediately-following commit, the regression lands on b79e4b1e68 exactly.

Final bisect table

| SHA | Date | Subject | Verdict |
| --- | --- | --- | --- |
| ba214ef3d3 | 2026-05-14 | tag-gated nightly migration, 40 whole-file moves | PASS |
| 229cadec04 | 2026-05-16 | logging update for inplace setting in MoE layer | PASS |
| c58b47bc86 | 2026-05-18 | PoolStats dataclass move | PASS |
| f04c522534 | 2026-05-18 | [PD] Add conclude_state to fake KV backend | PASS |
| f5049709b3 | 2026-05-18 | eagle3 aux-layer-ids +1 offset fix | PASS |
| 878e6b8886 | 2026-05-18 | [SP] Fix runtime_max_tokens_per_rank | PASS |
| 314dedf7c6 | 2026-05-18 | Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path | PASS ✅ (last good) |
| b79e4b1e68 | 2026-05-18 | [Fix] Try to fix error caused by latest cutedsl packages (#25690) | FAIL ❌ (first bad) |
| d90bc65e30 | 2026-05-19 | [NPU] Fix TypeError in MLA index_head_dim | FAIL |
| current HEAD | 2026-05-19 | (Tom's chain + a handful of unrelated) | FAIL |
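
The midpoint narrowing over the final 12-commit range can be replayed as an ordinary binary search. This is a sketch under stated assumptions, not the agent's actual driver: `find_first_bad` and `reported_is_bad` are hypothetical names, the commit list is the f5049709b3..d90bc65e30 range listed earlier in this thread, and the verdict lookup only replays results already reported here instead of dispatching rerun-test.

```python
def find_first_bad(commits, is_bad):
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes the commit just before commits[0] is known good, commits[-1] is
    known bad, and verdicts are monotonic (good ... good, bad ... bad).
    Returns (first_bad_commit, probes_in_order).
    """
    lo, hi = 0, len(commits) - 1  # invariant: first bad index is in [lo, hi]
    probes = []
    while lo < hi:
        mid = (lo + hi) // 2
        probes.append(commits[mid])
        if is_bad(commits[mid]):
            hi = mid      # first bad is mid or earlier
        else:
            lo = mid + 1  # first bad is after mid
    return commits[lo], probes


# Range f5049709b3..d90bc65e30, oldest first, as listed in this thread.
COMMITS = [
    "1f185c6ba8", "b7267e8fce", "9e3bb9a307", "c904fdd20e",
    "6f892047ec", "878e6b8886", "745abd6cc0", "314dedf7c6",
    "b79e4b1e68", "dbac464726", "d028697d17", "d90bc65e30",
]

# Verdicts reported in this thread; untested commits are deliberately absent.
REPORTED = {
    "878e6b8886": "PASS",
    "314dedf7c6": "PASS",
    "b79e4b1e68": "FAIL",
    "d90bc65e30": "FAIL",
}


def reported_is_bad(sha):
    # Raises KeyError if the search asks about a commit nobody probed.
    return REPORTED[sha] == "FAIL"


first_bad, probes = find_first_bad(COMMITS, reported_is_bad)
print(first_bad, probes)
```

Under these assumptions the search asks only about commits that were actually probed and lands on the same first-bad commit as the table. The monotonicity assumption matters: a flaky verdict at any probe would send the search into the wrong half, which is one reason to confirm the result independently before acting on it.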

Offending change

  • PR: #25690 ([Fix] Try to fix error caused by latest cutedsl packages)
  • Author: @Fridge003 (Co-authored-by @hnyls2002)
  • Merged: 2026-05-18 23:51 UTC
  • Diff: 21 +, 4 -. Touches python/pyproject.toml (switches flashinfer_python and nvidia-cutlass-dsl to the [cu13] extras variant) and scripts/ci/cuda/ci_install_dependency.sh (regex-update for [extras] notation + new purge_cutlass_libs_base() step that uninstalls nvidia-cutlass-dsl-libs-base then force-reinstalls nvidia-cutlass-dsl-libs-cu13).

The PR's own commit message explains the original bug it was fixing:

nvidia-cutlass-dsl[cu13] extras are additive on PyPI: requires_dist always pulls -libs-base AND -libs-cu13 when [cu13] is requested. Both wheels write to the same site-packages paths with different content, leaving the wrapper (cutlass.py, cu13 style) mismatched with the binding (_gpu_ops_gen.py, base style) -> GPUModuleOp signature TypeError.

The fix correctly purges -libs-base in the install script, but the LoRA Qwen3-8B forward path with CUDA graph capture now hits a kernel-side illegal address — so either the cu13 wheel's compiled kernel is broken for this path, or the purge_cutlass_libs_base step doesn't actually win in all install orderings.
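
One way to start separating those two hypotheses is to check, on the failing runner, whether both library wheels are still registered as installed. The sketch below uses only `importlib.metadata`; the distribution names come from the commit message quoted above, while the helper names are made up for illustration. It inspects package metadata only, so it cannot detect files that one wheel overwrote before being uninstalled.

```python
from importlib import metadata

# Distribution names taken from the commit message quoted above.
CUTLASS_LIB_WHEELS = (
    "nvidia-cutlass-dsl-libs-base",
    "nvidia-cutlass-dsl-libs-cu13",
)


def installed_versions(names=CUTLASS_LIB_WHEELS):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def both_variants_present(versions):
    """True when every listed distribution is installed at the same time."""
    return all(version is not None for version in versions.values())


versions = installed_versions()
print(versions)
if both_variants_present(versions):
    print("both -libs-base and -libs-cu13 are installed; files may be mixed")
```

If both names report a version after the install script has run, the purge step did not win in that ordering. If only the cu13 wheel is present, attention shifts to the cu13 build itself or to leftover files.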

Failure fingerprint (every FAIL probe + current HEAD)

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

Python call stack at the abort thread:
  File ".../python/sglang/srt/layers/quantization/unquant.py", line 161 in apply
  File ".../python/sglang/srt/lora/layers.py", line 724 in forward
  ...
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1112 in run_once
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1134 in capture_one_batch_size
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 707 in __init__
  File ".../python/sglang/srt/model_executor/model_runner.py", line 2776 in init_device_graphs

Reproduce

# Probe last good (PASS):
git push upstream 314dedf7c6:refs/heads/tmp-good
gh workflow run rerun-test.yml --repo sgl-project/sglang --ref tmp-good \
  -f mode=cuda -f test_command="registered/lora/test_lora_qwen3_8b_logprob_diff.py" \
  -f runs_on="1-gpu-h100" -f install_script="scripts/ci/cuda/ci_install_dependency.sh"

# Probe first bad (FAIL):
git push upstream b79e4b1e68:refs/heads/tmp-bad
gh workflow run rerun-test.yml --repo sgl-project/sglang --ref tmp-bad \
  -f mode=cuda -f test_command="registered/lora/test_lora_qwen3_8b_logprob_diff.py" \
  -f runs_on="1-gpu-h100" -f install_script="scripts/ci/cuda/ci_install_dependency.sh"

cc @Fridge003 @hnyls2002 — could you take a look? This regression has been on main since 2026-05-18 and is currently surfacing as extra-a-test-1-gpu-large (0) on the main-CI sandbox.

Diagnostic revert PR opened for verification: #25743. A /rerun-test of the failing LoRA file is pending there.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Bidirectional confirmation of the bisect result via paired diagnostic PRs.

Bisect confirmed via paired diagnostic PRs

Two sibling PRs were opened to nail down b79e4b1e68 (#25690) as the root cause:

| PR | What it does | /rerun-test LoRA file verdict | Run |
| --- | --- | --- | --- |
| #25743 | Reverts b79e4b1e68 | PASS | 26077407201 |
| #25744 | No revert; only a 1-line sentinel comment in python/sglang/version.py so the PR isn't auto-closed for 0-diff | FAIL ❌ (same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint) | 26077826917 |

Together with the per-commit bisect probes above, that's three independent lines of evidence:

  1. Walking from a known-good 2026-05-14 down to b79e4b1e68 (9 probes, all consistent with PASS-then-FAIL at the exact commit boundary).
  2. Revert-the-commit → PASS on the same test file.
  3. Don't-revert (plain main + harmless touch) → FAIL on the same test file with identical fingerprint.

The regression is unambiguously b79e4b1e68 (#25690), independent of Tom's #25703 to #25728 chain.

cc @Fridge003 @hnyls2002 — could you take a look? Closing the two diagnostic PRs now.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Second-pair double-confirmation.

2×2 paired probe — both runs match expectation

A second rerun-test was dispatched on each of the two paired branches to rule out flake on either side:

| | First run | Second run |
| --- | --- | --- |
| #25743 (revert) branch tom/revert-25690-cutedsl | PASS ✅ (26077407201) | PASS ✅ (26078646438) |
| #25744 (no-revert) branch tom/probe-lora-bug-25690 | FAIL ❌ (26077826917) | FAIL ❌ (26078647279) |

4-of-4 consistent with the bisect conclusion. b79e4b1e68 (#25690) is the regression — high confidence, no flake.

cc @Fridge003 @hnyls2002

@fzyzcjy fzyzcjy closed this May 19, 2026
@fzyzcjy fzyzcjy reopened this May 31, 2026
@fzyzcjy fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from 4a45112 to c6e27e0 May 31, 2026 02:08
@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaging the full main-CI run on this sandbox (head = current upstream/main, the just-landed state); classifying the one CUDA-lane failure and kicking a single-file rerun to confirm it's a flake. Please push back if any conclusion is off.

CUDA-lane failure: borderline GSM8K accuracy (likely flake)

AssertionError: 0.77 not greater than or equal to 0.775
[Gemma4 31B topk=1] score=0.7700 threshold=0.7750 avg_spec_accept_length=4.468558708959376
  • Classification: borderline accuracy flake. This GSM8K eval uses 200 questions, so 0.770 (154/200) is exactly one question below the 0.775 bar (155/200). The per-question std is ≈ 0.42, which puts the standard error of the 200-question score at ≈ 0.03, so run-to-run variance routinely crosses a 0.005 margin. The spec accept length (4.47) is healthy, so MTP itself is working. This PR is the no-diff main-verification sandbox, so the miss is a property of upstream/main, not introduced by any change here.
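
The margin arithmetic behind that classification can be checked in a few lines. This sketch assumes each of the 200 questions is an independent pass/fail trial with success probability equal to the observed score, which is a simplification; the numbers are the ones quoted in the log line above.

```python
import math

n_questions = 200
score, threshold = 0.770, 0.775

# Convert both rates to whole-question counts.
correct = round(score * n_questions)
needed = round(threshold * n_questions)

# Std of a single pass/fail outcome, then the standard error of the mean score.
per_question_std = math.sqrt(score * (1 - score))
standard_error = per_question_std / math.sqrt(n_questions)

print(f"correct={correct}, needed={needed}, short by {needed - correct}")
print(f"per-question std ~ {per_question_std:.2f}")
print(f"standard error of the score ~ {standard_error:.3f}")
print(f"margin / standard error ~ {(threshold - score) / standard_error:.2f}")
```

The 0.42 figure is the spread of a single question's outcome. The spread of the reported score is that value divided by the square root of 200, which is still several times larger than the 0.005 gap to the threshold.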

Other failing lanes are non-CUDA and not chased per lane policy (main-sandbox, no diff): stage-a-test-1-gpu-xpu (XPU), stage-b-test-1-gpu-small-amd-nondeterministic (AMD), stage-b-test-1-npu-a2 (NPU), and finish (cascade aggregation).

Next: rerunning this single file to confirm the flake.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

/rerun-test test/registered/spec/test_gemma4_mtp_31b_extra.py

@github-actions
Contributor

github-actions Bot commented May 31, 2026

Results for /rerun-test test/registered/spec/test_gemma4_mtp_31b_extra.py:

🚀 2-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/spec/test_gemma4_mtp_31b_extra.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Confirming the earlier gemma4-MTP GSM8K failure was a flake via the /rerun-test result plus a cross-branch data point. Please push back if any conclusion is off.

Flake confirmed: test_gemma4_mtp_31b_extra.py

Remaining red lanes are all non-CUDA / chronic and not gating for a main-verification sandbox: base-c-test-8-gpu-h20 (h20 — ignored per policy), XPU, NPU (stage-b-test-1-npu-a2, pr-test-npu-finish), AMD (amd-nondeterministic, mi35x-disaggregation-amd), and the finish cascade. The CUDA gate is otherwise green; ~20 CUDA jobs still finishing.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Logging the one newly-surfaced non-CUDA lane failure (Xeon CPU) with its fingerprint, classified as unrelated to this main-verification sandbox. Please push back if any conclusion is off.

Non-CUDA lane: build-test (xeon-gnr, base-b-test-cpu) — not gating

  • Failing file: test/registered/cpu/test_cpu_graph.py (exit 1). All other files in the job passed (test_soft_watchdog.py passed — its watchdog-timeout / py-spy lines are that test's intentional behavior, not an error).
  • Fingerprint:
UserWarning: Triton is not supported on current platform, roll back to CPU.
RuntimeError: Failed to parse benchmark output. prefill_latency=None, decode_throughput=None, decode_latency=None
[rank0]: AttributeError: 'NoneType' object has no attribute 'shape'
  • Classification: CPU-backend (Xeon) issue, not CUDA, and this is the no-diff main-verification sandbox — unrelated to anything here. Per lane policy (non-CUDA, not caused by a diff) it is not chased.

CUDA gate so far: green except the already-confirmed gemma4 GSM8K flake; a couple of CUDA h200 jobs still finishing. Other reds remain h20 (ignored) / NPU / AMD-mi35x / cascade finish aggregations.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 31, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final summary of the full main-CI verification run (head = current upstream/main, i.e. the just-landed 24-PR KV-canary chain). Closing the sandbox afterward for reuse. Please push back if any conclusion is off.

✅ CUDA gate GREEN — main verification complete

Head SHA c6e27e08fb (= upstream/main + the no-op sentinel). All CUDA lanes passed: base-b, base-c (incl. 8-gpu h100 / h200 / b200), extra-a, extra-b, piecewise-cuda-graph, etc.

The only CUDA red was a confirmed flake:

  • extra-a-test-2-gpu-large → test_gemma4_mtp_31b_extra.py: GSM8K 0.770 vs 0.775 (1 question / 200). /rerun-test on this exact SHA passed (run); same test also green on tip-of-main.

Remaining red lanes are non-gating (non-CUDA / chronic / cascade), none related to the landed chain:

  • base-c-test-8-gpu-h20 → h20 (chronic, ignored per policy)
  • stage-a-test-1-gpu-xpu → XPU; stage-b-test-1-npu-a2 → NPU; stage-b/stage-c ...-amd, ...-mi35x-disaggregation-amd → AMD
  • build-test (xeon-gnr, base-b-test-cpu) → Xeon CPU test_cpu_graph.py benchmark-parse issue (CPU backend, unrelated)
  • finish / pr-test-finish / pr-test-extra-finish / pr-test-npu-finish → aggregation jobs cascading from the above

Conclusion: the KV-canary feature, landed on main via the 24-PR chain (#26798 to #26821), is CUDA-CI green. Closing this sandbox PR (do not merge) so it's ready for the next reuse.

@fzyzcjy fzyzcjy closed this May 31, 2026
@fzyzcjy fzyzcjy reopened this Jun 6, 2026
@fzyzcjy fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from c6e27e0 to 96c5c6e June 6, 2026 01:12
@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 6, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaging this sandbox main-CI round; will follow up once logs are fetchable. Please push back if any conclusion is off.

Round status (head 96c5c6e1db = main bf4f2ccc78 + sentinel):

Remaining ~95 jobs still running; will batch any reruns after the round lands.

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 6, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Fetched the failing job log after the rate-limit reset and classified the failure. Please push back if any conclusion is off.

base-a-test-cpu (0) root cause: HF Hub rate-limit (infra flake, not code).

test/registered/unit/server_args/test_server_args.py failed because Hugging Face Hub returned 429 Too Many Requests for Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json, and the retry also could not connect (job log):

httpx.HTTPStatusError: Client error '429 Too Many Requests' for url 'https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json'
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
✗ FAILED: test/registered/unit/server_args/test_server_args.py (exit code 1)
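A classification like this one can be made repeatable with a small fingerprint table. A minimal sketch, assuming the two log lines above are the fingerprints of interest; the rule table, labels, and `classify` helper are hypothetical and not part of any sglang tooling:

```python
import re

# Hypothetical rule table: (pattern, label). Patterns come from the log lines above.
RULES = [
    (re.compile(r"429 Too Many Requests"), "infra: HF Hub rate limit"),
    (re.compile(r"couldn't connect to 'https://huggingface\.co'"), "infra: HF Hub unreachable"),
]


def classify(log_text: str) -> str:
    """Return the label of the first matching rule, else 'unclassified'."""
    for pattern, label in RULES:
        if pattern.search(log_text):
            return label
    return "unclassified"


log = (
    "httpx.HTTPStatusError: Client error '429 Too Many Requests' for url "
    "'https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json'\n"
    "OSError: We couldn't connect to 'https://huggingface.co' to load the files"
)
print(classify(log))
```

Rules are checked in order, so the more specific rate-limit fingerprint wins over the generic connection error that follows it in the same log.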

Other reds this round: AMD lane (27 jobs — ongoing repo-wide AMD outage), NPU a2 (recurring perf flake), XPU (chronic runner infra). None CUDA, none code-related.

Plan: wait for the ~13 still-running jobs to land, then /rerun-failed-ci once.

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 6, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Classified the second CUDA-lane failure of this round from the job log. Please push back if any conclusion is off.

base-c-test-4-gpu-h100 (3): marginal KL-divergence threshold exceedance (likely numeric flake).

test/registered/models_e2e/test_qwen3_next_models.py failed with (job log):

AssertionError: avg_kl_div=0.0015218577479656225 > threshold=0.001 for Qwen/Qwen3-Next-80B-A3B-Instruct test_input_output_logprobs_match_prefill_cache_hit_helper
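For readers unfamiliar with the metric, here is a rough sketch of an averaged KL-divergence check. It assumes the value is a per-position KL(p || q) between two log-probability vectors, averaged over positions; the actual helper in the test may compute it differently, and all function names here are hypothetical. Only the threshold and the observed value come from the assertion message.

```python
import math


def kl_div(logp, logq):
    """KL(p || q) = sum_i p_i * (log p_i - log q_i), given log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


def avg_kl_div(ref_logprobs, new_logprobs):
    """Average the per-position KL divergence over all positions."""
    per_position = [kl_div(p, q) for p, q in zip(ref_logprobs, new_logprobs)]
    return sum(per_position) / len(per_position)


THRESHOLD = 0.001                 # from the assertion message
observed = 0.0015218577479656225  # from the assertion message

# Toy two-token distributions, for illustration only.
p = [math.log(0.5), math.log(0.5)]
q = [math.log(0.9), math.log(0.1)]
print(avg_kl_div([p], [p]), avg_kl_div([p], [q]))
print(observed > THRESHOLD, observed / THRESHOLD)
```

The observed value is about 1.5 times the threshold, so the failure is an exceedance on the order of 5e-4 in absolute terms.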

Round summary (running=0): CUDA reds = this + base-a-test-cpu (0) (HF Hub 429, infra) + pr-test-finish cascade. Non-CUDA reds = AMD lane outage (27), NPU a2 perf flake, XPU chronic infra.

Next: one batched /rerun-failed-ci.

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 6, 2026

/rerun-failed-ci

@fzyzcjy fzyzcjy closed this Jun 6, 2026
@fzyzcjy fzyzcjy reopened this Jun 7, 2026
@fzyzcjy fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from 96c5c6e to ffbe2e8 Compare June 7, 2026 09:42
@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 7, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the first CI failures of this verify-main run (head ffbe2e8 = main 0a190d1); classification below. Please push back if any conclusion is off.

stage-a-test-1-gpu-xpu / finish (job): runner-level infra failure during workspace cleanup, before any test ran:

##[error]File was unable to be removed Error: EACCES: permission denied, unlink '.../python/sglang.egg-info/PKG-INFO'

Classification: infra (self-hosted XPU runner permission residue), non-CUDA lane, unrelated to main's code. Not chasing per babysit policy; CUDA lanes remain the hard gate.

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 7, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged all non-CUDA failures on this verify-main run (head ffbe2e8 = main 0a190d1, which includes merged #27445 + #27446). All failures are non-CUDA lanes in code paths the merged PRs do not touch (the PRs only change scheduler PP idle-gating in is_fully_idle + the scripted-runtime test harness). CUDA lanes remain green. Please push back if any conclusion is off.

Non-CUDA failures (not chasing per babysit policy — none CUDA, none related to the merged code):

| Lane | Job | Fingerprint | Class |
| --- | --- | --- | --- |
| XPU | 79948782973 | EACCES: permission denied, unlink .../sglang.egg-info/PKG-INFO | infra (runner cleanup) |
| Xeon CPU | 79948776769 | decode: expect req_lens to be int64, got Int; --sampling-backend: invalid choice: 'token_oracle'; exit -9 | CPU-backend, pre-existing on main |
| NPU | 79948784495 | AssertionError: 672.30 not greater than or equal to 700 (w8a8 throughput threshold) | perf-threshold flake |
| AMD mi325 (stage-c) | 79948805518 | registry pull timeout; Residual accuracy check failed (fused residual kernel) | chronic stage-c / infra |
| AMD mi35x (stage-c) | 79948805511 | registry pull timeout; Fatal Python error: Aborted (exit -6) + ConnectionRefused cascade | chronic stage-c / infra |
| finish / pr-test-npu-finish | rollup | cascade of the above | cascade |

The merged PRs touch no XPU/NPU/AMD/Xeon code, no sampling backends, no quantization or fused-residual kernels. Continuing to watch CUDA lanes (the hard gate) to completion.

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 7, 2026

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the first (and so far only) CUDA-lane failure of this verify-main run; cross-branch evidence below shows it reproduces identically on an independent scheduled main run, and a pre-merge-main probe has been dispatched. Please push back if any conclusion is off.

CUDA failure: base-c-test-8-gpu-h200 (2) → test/registered/models_e2e/test_mimo_v2.py

Job 79948809861 — server for XiaomiMiMo/MiMo-V2.5 (tp=8, dp=2, EAGLE MTP, fp8) crashes 2s after becoming HTTP-ready, during the first warmup generate:

/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:109: _assert_async_cuda_kernel:
Assertion `index >= 152576 (out of range): VocabParallelEmbedding input id` failed.   (x4 ranks)
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ASSERT (12)
Fatal Python error: Aborted                                                             (x4)
...
TimeoutError: Server failed to start within the timeout period

Cross-branch evidence

| Branch | Run | test_mimo_v2 | Fingerprint |
| --- | --- | --- | --- |
| sandbox (main 0a190d1c9 + sentinel) | 27088945685 | ✗ FAIL | VocabParallelEmbedding input id out of range |
| main scheduled (a07d813ec, independent runner) | 27091400009 | ✗ FAIL | byte-identical |
| main pre-#27445/#27446 (a39c428d3) | 27093698014 (probe dispatched) | pending | |

Classification

Pre-existing main regression, deterministic (2/2 independent runs), unrelated to #27445/#27446: the merged PRs touch only scripted-runtime test harness files and PP idle-gating in is_fully_idle (short-circuited at pp_size==1; this server is pp=1). The failing path is the model-side out-of-range-token-id async assert (same family as the tp=1 fix in #27482) on MiMo-V2.5's first warmup forward. Will report the pre-merge probe result when it completes.
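The failing condition is easier to reason about on the host side. A minimal sketch of a synchronous range check on token ids, assuming 152576 (the bound quoted in the assertion message) is the embedding's vocabulary size; `out_of_range_ids` is a hypothetical debugging helper, not an sglang function:

```python
# Assumption: the bound quoted in the assertion message is the vocabulary size.
VOCAB_SIZE = 152576


def out_of_range_ids(input_ids, vocab_size=VOCAB_SIZE):
    """Return (position, token_id) pairs an embedding lookup would reject."""
    return [
        (pos, tok)
        for pos, tok in enumerate(input_ids)
        if not 0 <= tok < vocab_size
    ]


# Illustrative ids only; the real warmup batch is not shown in the log.
print(out_of_range_ids([0, 152575, 152576, -1]))
```

Running a check like this on the host before the embedding lookup reports the offending position directly, whereas the asynchronous device-side assert only surfaces as an abort after the kernel has launched.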
