Sandbox: verify full main CI is green on latest main (do not merge) by fzyzcjy · Pull Request #25647 · sgl-project/sglang

fzyzcjy · 2026-05-18T12:20:46Z

Summary

Sandbox PR — do not merge. Touches python/sglang/version.py with a no-op comment so paths-filter flips main_package=true and the full PR Test Base + PR Test Extra matrix dispatches.

Carries three labels so the workflow gates all pass:

Label	Effect
`run-ci`	Passes `pr-gate.yml`'s `require-run-ci` gate
`run-ci-extra`	Allows `pr-test-extra.yml` to run on this `pull_request` event
`bypass-fastfail`	Makes the per-job `check-pr-test-health` action no-op (no cascade fast-fail when a single sibling fails on infra flake)

Purpose: verify upstream/main (f04c522534) is green end-to-end with the full CI surface (base stages + extra stages, no fast-fail cascade). This is the PR-side equivalent of the dispatched main CI, and it is cleaner than gh workflow run because the dispatch interface cannot pass skip_pr_test_health_check.

Close this PR after the run completes — no source change is intended to land.

Test plan

pre-commit run --files python/sglang/version.py
PR Test Base dispatches and runs to completion
PR Test Extra dispatches and runs to completion
No check-pr-test-health cascade failures

CI States

Latest PR Test (Base): ❌ Run #27088945685
Latest PR Test (Extra): ✅ Run #27088945624

fzyzcjy · 2026-05-18T12:20:57Z

/tag-and-rerun-ci

fzyzcjy · 2026-05-19T01:48:17Z

CI failure: `base-b-test-1-gpu-large (1)` (PR Test Base, B200, 80 GB)

Job log

Failing test: test/registered/spec/eagle/test_eagle_infer_b.py::TestEAGLEServerAdditional::test_radix_attention

Symptom: 11/12 EAGLE tests pass, then test_radix_attention fails with ConnectionRefusedError: [Errno 111] Connection refused on http://127.0.0.1:11000/generate. The server died — a 59 MB cuda-coredumps-run-1.zip artifact was produced (artifact 7073114703).

File ".../test/registered/spec/eagle/test_eagle_infer_b.py", line 104, in test_radix_attention
    run_radix_attention_test(self.base_url)
File ".../python/sglang/test/kits/radix_cache_server_kit.py", line 49, in run_radix_attention_test
    res = requests.post(base_url + "/generate", json=data)
...
urllib3.exceptions.NewConnectionError: ... [Errno 111] Connection refused
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure. 11/12 EAGLE tests on the same base_url passed and the server emitted a CUDA coredump during test_radix_attention, which points to a server crash in the EAGLE path. Provisionally treating it as a flake unless it repeats.

Next step: leaving the run untouched to see whether other lanes hit the same EAGLE / coredump pattern. If this stays isolated, will classify as flake and /rerun-test test/registered/spec/eagle/test_eagle_infer_b.py.

fzyzcjy · 2026-05-19T02:32:06Z

CI failure: `extra-a-test-1-gpu-large (0)` (PR Test Extra, NVIDIA)

Job log

Failing test: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py::TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy

Symptom: scheduler dies during init with exit code -6 (SIGABRT — not SIGKILL/-9, so not the OS OOM-killer). Retry exhausted after the engine construction throws.

File ".../test/registered/lora/test_lora_qwen3_8b_logprob_diff.py", line 134, in test_lora_qwen3_8b_logprob_accuracy
    engine = sgl.Engine(...)
File ".../python/sglang/srt/entrypoints/engine.py", line 236, in __init__
    ) = self._launch_subprocesses(
File ".../python/sglang/srt/entrypoints/engine.py", line 856, in _launch_subprocesses
    scheduler_init_result.wait_for_ready()
File ".../python/sglang/srt/entrypoints/engine.py", line 651, in wait_for_ready
    infos = _wait_for_scheduler_ready(scheduler_pipe_readers, scheduler_procs)
File ".../python/sglang/srt/entrypoints/engine.py", line 1337, in _wait_for_scheduler_ready
    raise _scheduler_died_error(i, scheduler_procs[i])
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6). If exit code is -9 (SIGKILL), a common cause is the OS OOM killer. Run `dmesg -T | grep -i oom` to check.
...
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure on the NVIDIA extra-a-1-gpu-large lane. Exit -6 = SIGABRT during scheduler init — could be a CUDA-kernel crash, a model-loading assertion in the LoRA path, or transient infra. Posting a separate /rerun-test for this file to differentiate flake vs persistent.
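
The exit-code reading above relies on the POSIX convention, used by Python's subprocess and multiprocessing modules, that a child terminated by signal N is reported with exit code -N. A minimal standard-library sketch of that decoding follows; the helper name `describe_exit_code` is made up for illustration and is not part of sglang.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a subprocess-style exit code into a readable label.

    Negative codes mean "terminated by signal -code" (POSIX convention).
    Hypothetical helper for illustration, not sglang code.
    """
    if code >= 0:
        return f"exited normally with status {code}"
    try:
        name = signal.Signals(-code).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"unknown signal {-code}"
    return f"killed by {name}"


print(describe_exit_code(-6))  # the scheduler exit code from the traceback
print(describe_exit_code(-9))  # the OOM-killer case the error message mentions
```

On Linux, signal 6 is SIGABRT and signal 9 is SIGKILL, which is why -6 rules out the OOM killer here.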

fzyzcjy · 2026-05-19T02:32:12Z

CI failure: `base-b-test-1-gpu-small (5)` (PR Test Base, NVIDIA, 32 GB)

Job log

Failing test: test/registered/core/test_srt_endpoint.py::TestSRTEndpoint::test_get_server_info_concurrent ("Make sure the concurrent get_server_info doesn't crash the server.")

Symptom: server returns non-JSON on concurrent /server_info calls because the server-side handler hits an AssertionError inside communicator.queueing_call. The client then dies with JSONDecodeError: Expecting value: line 1 column 1 (char 0), retries are exhausted, test errors.

Server-side traceback:

File ".../python/sglang/srt/entrypoints/http_server.py", line 635, in server_info
    await _global_state.tokenizer_manager.get_internal_state()
File ".../python/sglang/srt/managers/tokenizer_control_mixin.py", line 788, in get_internal_state
    await self.get_internal_state_communicator(req)
File ".../python/sglang/srt/managers/communicator.py", line 79, in __call__
    return await self.queueing_call(obj)
File ".../python/sglang/srt/managers/communicator.py", line 40, in queueing_call
    assert self._result_event is None
AssertionError

Client-side:

File ".../test/registered/core/test_srt_endpoint.py", line 635, in s
    server_info.json()
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Exception: retry() exceed maximum number of retries.

Classification: this PR is a main-CI sandbox (HEAD = latest upstream/main + a no-op python/sglang/version.py comment touch, labels run-ci + run-ci-extra + bypass-fastfail), so the failure IS a main failure on the NVIDIA base-b-1-gpu-small (32 GB) lane. The assertion assert self._result_event is None in communicator.queueing_call is a concurrency race in the internal-state communicator — the test (test_get_server_info_concurrent) is specifically designed to catch exactly this class of bug. Smells like a real race, not a flake, but posting /rerun-test to confirm reproducibility before escalating.
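
To illustrate the class of bug that an `assert self._result_event is None` guard can expose, here is a minimal asyncio sketch. `SingleSlotCommunicator` is a hypothetical stand-in written for this comment, not sglang's communicator, and it makes the overlap deterministic, whereas the real failure depends on timing.

```python
import asyncio


class SingleSlotCommunicator:
    """Hypothetical stand-in: supports only one in-flight request at a time."""

    def __init__(self):
        self._result_event = None

    async def call(self, obj):
        # Same shape of guard as in the traceback: fails if a call is in flight.
        assert self._result_event is None, "another call is already in flight"
        self._result_event = asyncio.Event()
        # Simulate the reply arriving a little later.
        asyncio.get_running_loop().call_later(0.01, self._result_event.set)
        await self._result_event.wait()
        self._result_event = None
        return obj


async def run_concurrent(n: int):
    comm = SingleSlotCommunicator()
    # return_exceptions=True lets us inspect which callers tripped the guard.
    return await asyncio.gather(
        *(comm.call(i) for i in range(n)), return_exceptions=True
    )


results = asyncio.run(run_concurrent(3))
for i, r in enumerate(results):
    print(i, "AssertionError" if isinstance(r, AssertionError) else r)
```

In this sketch the first caller completes and every overlapping caller hits the assertion. Whether the real handler serializes callers, and where that serialization can be bypassed, would need to be checked in the actual source.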

fzyzcjy · 2026-05-19T02:37:09Z

/rerun-test test/registered/spec/eagle/test_eagle_infer_b.py

github-actions · 2026-05-19T02:37:29Z

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/spec/eagle/test_eagle_infer_b.py

fzyzcjy · 2026-05-19T02:50:15Z

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T02:50:17Z

/rerun-test test/registered/core/test_srt_endpoint.py

github-actions · 2026-05-19T02:50:37Z

🚀 1-gpu-5090 (1 test): ✅ View workflow run

cd test/ && python3 registered/core/test_srt_endpoint.py

github-actions · 2026-05-19T02:50:45Z

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T02:57:02Z

`/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py` result: ❌ FAIL (reproducible)

Rerun job log

The same test fails on rerun with the same stack as the original extra-a-test-1-gpu-large (0) failure → this is NOT a flake.

Actual root cause: the SIGABRT in extra-a-test-1-gpu-large (0) was only the post-mortem symptom. The coredump output printed before the abort shows the underlying fault:

coredump: Starting GPU coredump generation
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
coredump:   - Device: 0

Triggered during CUDA graph capture of TestLoRAQwen3_8BLogprobDiff::test_lora_qwen3_8b_logprob_accuracy. The Python stack dumped by faulthandler after the coredump (the aborting thread):

File ".../python/sglang/srt/layers/quantization/unquant.py", line 161 in apply
File ".../python/sglang/srt/lora/layers.py", line 724 in forward
...
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1112 in run_once
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1134 in capture_one_batch_size
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 854 in _capture_one_stream
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 867 in capture
File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 707 in __init__
File ".../python/sglang/srt/model_executor/model_runner.py", line 2776 in init_device_graphs

Classification: real bug on main HEAD 4a451128…, in the LoRA layer's forward path under CUDA graph capture. Likely a bad index / out-of-bounds memory access in lora/layers.py:724 (or in the unquant apply at unquant.py:161) when running Qwen3-8B with LoRA. Two-run reproducibility on the same commit confirms it's not a transient flake.

(This is the main-CI sandbox PR; the failing commit 4a451128… = latest upstream/main + a no-op python/sglang/version.py comment touch, so this bug is on main proper.)

fzyzcjy · 2026-05-19T03:04:53Z

`/rerun-test test/registered/core/test_srt_endpoint.py` result: ✅ PASS (flake)

Rerun job → SUCCESS.

The original failure on base-b-test-1-gpu-small (5) (test_get_server_info_concurrent, AssertionError self._result_event is None in communicator.queueing_call) did not reproduce. Classifying as flake — likely a transient race in the internal-state communicator under concurrent server_info that didn't hit the timing window on rerun. Not pursuing further.

Final per-file rerun verdicts on this main-CI sandbox:

File	Original lane	Rerun verdict
`test/registered/spec/eagle/test_eagle_infer_b.py` (`test_radix_attention`)	`base-b-test-1-gpu-large (1)`	✅ PASS — flake
`test/registered/core/test_srt_endpoint.py` (`test_get_server_info_concurrent`)	`base-b-test-1-gpu-small (5)`	✅ PASS — flake
`test/registered/lora/test_lora_qwen3_8b_logprob_diff.py` (`test_lora_qwen3_8b_logprob_accuracy`)	`extra-a-test-1-gpu-large (0)`	❌ FAIL same `CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS` during CUDA graph capture — real bug (bisecting next)

fzyzcjy · 2026-05-19T03:19:03Z

Bisect probe: `d90bc65e30` (`[NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)` — pre-chain, HEAD-28)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26073712614 — FAIL
Tree date: 2026-05-19 (the commit on main directly preceding Tom's 23-commit refactor chain)

Verbatim CUDA error fingerprint:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
coredump:   - Device: 0
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

Same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS during CUDA graph capture as on c2a212bf… / 4a451128… (current main HEAD).

→ the bug predates Tom's chain. The 28 PRs between d90bc65e30 and current HEAD (Tom's scheduler refactor chain #25703–#25728, plus #25282 DeepSeek V4 host pool, #25596 LTX2 diffusion fix, #25699 PD/NIXL aux, #25689 spec_verify metric, #24710 RMSNorm dispatch) are NOT the cause.

Bisect bound moves to last-good < d90bc65e30. Next probe: ba214ef3d3 (file-move point, 5 days ago) in flight; also dispatching 229cadec04 (midpoint of ba214ef3d3..d90bc65e30, 2026-05-16) to narrow in parallel.

fzyzcjy · 2026-05-19T03:50:01Z

Bisect probes: `ba214ef3d3` + `229cadec04`

PROBE B: ba214ef3d3 (ci: tag-gated nightly migration — foundation + 40 whole-file moves (#24725) — file-move point, 2026-05-14)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26073728329 — PASS ✅

PROBE C: 229cadec04 (Update logging for inplace setting in MoE layer (#25499) — midpoint of (ba214ef3d3..d90bc65e30), 2026-05-16)

rerun-test run: 26074082226 — PASS ✅

→ Bisect bound collapses to bug introduced in 229cadec04..d90bc65e30 (92 commits, 2026-05-16 → 2026-05-19).

Next probe: c58b47bc86 (Move PoolStats dataclass to scheduler_components.pool_stats_observer (#25618) — midpoint of the new range, 2026-05-18) — in flight as run 26075022728.

Bisect state so far:

SHA	Date	Subject	rerun-test verdict
`ba214ef3d3`	2026-05-14	tag-gated nightly migration — 40 whole-file moves	PASS
`229cadec04`	2026-05-16	logging update for inplace setting in MoE layer	PASS
`c58b47bc86`	2026-05-18	PoolStats dataclass move	(in flight)
`d90bc65e30`	2026-05-19	[NPU] Fix TypeError in MLA `index_head_dim`	FAIL
current HEAD	2026-05-19	(Tom's chain + 5 unrelated)	FAIL
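
The narrowing above is a binary search over commit history. A minimal sketch of the procedure follows, assuming the verdict flips exactly once (PASS before, FAIL from the offending commit onward). The commit names are synthetic placeholders and the `is_bad` callback stands in for a real rerun-test dispatch.

```python
from typing import Callable, Sequence, Tuple


def bisect_first_bad(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return (first bad commit, number of probes used).

    Assumes commits are in history order, the commit just before commits[0]
    is known good, commits[-1] is known bad, and verdicts flip exactly once.
    """
    lo, hi = -1, len(commits) - 1  # lo: last known good, hi: first known bad
    probes = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probes += 1
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], probes


# Synthetic stand-in for a 92-commit range; real probes are CI runs.
commits = [f"c{i:02d}" for i in range(92)]
culprit_index = 60  # arbitrary choice for the demo
first_bad, probes = bisect_first_bad(
    commits, lambda c: int(c[1:]) >= culprit_index
)
print(first_bad, probes)
```

For a 92-commit range this needs at most 7 probes, since each probe halves the candidate range (92, 46, 23, 12, 6, 3, 2, 1).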

fzyzcjy · 2026-05-19T04:01:39Z

Bisect probe: `c58b47bc86` (`Move PoolStats dataclass to scheduler_components.pool_stats_observer (#25618)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26075022728 — PASS ✅

→ Bisect bound collapses to bug introduced in c58b47bc86..d90bc65e30 (46 commits, 2026-05-18 → 2026-05-19).

Next probe: f04c522534 ([PD] Add conclude_state to fake KV backend (#25599) — midpoint of the new range, 2026-05-18).

Bisect state:

SHA	Date	Verdict
`ba214ef3d3`	2026-05-14	PASS
`229cadec04`	2026-05-16	PASS
`c58b47bc86`	2026-05-18	PASS ✅
`f04c522534`	2026-05-18	(in flight)
`d90bc65e30`	2026-05-19	FAIL

fzyzcjy · 2026-05-19T04:12:53Z

Bisect probe: `f04c522534` (`[PD] Add conclude_state to fake KV backend (#25599)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26075390772 — PASS ✅

→ Bisect bound collapses to bug introduced in f04c522534..d90bc65e30 (23 commits, both same-day 2026-05-18 / 2026-05-19).

Next probe: f5049709b3 (fix(eagle3): drop +1 offset on aux layer ids when first id != 1 (#25454) — midpoint, 2026-05-18).

Bisect state:

SHA	Date	Verdict
`ba214ef3d3`	2026-05-14	PASS
`229cadec04`	2026-05-16	PASS
`c58b47bc86`	2026-05-18	PASS
`f04c522534`	2026-05-18	PASS ✅
`f5049709b3`	2026-05-18	(in flight)
`d90bc65e30`	2026-05-19	FAIL

fzyzcjy · 2026-05-19T04:24:56Z

Bisect probe: `f5049709b3` (`fix(eagle3): drop +1 offset on aux layer ids when first id != 1 (#25454)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26075730388 — PASS ✅

→ Bisect bound: bug introduced in f5049709b3..d90bc65e30 (12 commits, 2026-05-18 → 2026-05-19).

Full range (suspicion-worthy commits highlighted):

1f185c6ba8 Support draft extend cuda graph for tokenspeed_mla attention backend (#25489)  ← CUDA graph
b7267e8fce [CI] Enable weight prefetch for 8-gpu-h200 basic tests (#25684)
9e3bb9a307 [Spec] fold can_run_cuda_graph into EagleVerifyOutput (#25566)                  ← CUDA graph
c904fdd20e ci: pr-states match renamed "PR Test Base" workflow_run (#25687)
6f892047ec [misc] Throw error when single batch overlap is enabled on Hopper (#25509)      ← Hopper
878e6b8886 [SP] Fix runtime_max_tokens_per_rank for sequence parallelism (#25685)          ← midpoint
745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)                ← cutedsl
dbac464726 [Spec]: Make Triton standalone spec test deterministic (#25303)
d028697d17 [NPU][Docs] Add Kimi-K2.5-W4A8 instance doc on NPU (#25269)
d90bc65e30 [NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)

Next probe: 878e6b8886 (midpoint).

Bisect state:

SHA	Date	Verdict
`f5049709b3`	2026-05-18	PASS ✅ (last good lower bound)
`878e6b8886`	2026-05-18	(in flight)
`d90bc65e30`	2026-05-19	FAIL (first bad upper bound)

fzyzcjy · 2026-05-19T04:36:56Z

Bisect probe: `878e6b8886` (`[SP] Fix runtime_max_tokens_per_rank for sequence parallelism (#25685)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26076099919 — PASS ✅

→ Bisect bound: bug introduced in 878e6b8886..d90bc65e30 (6 commits).

Remaining range:

745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)  ← prime suspect (CUDA-DSL packages)
dbac464726 [Spec]: Make Triton standalone spec test deterministic (#25303)
d028697d17 [NPU][Docs] Add Kimi-K2.5-W4A8 instance doc on NPU (#25269)
d90bc65e30 [NPU] Fix TypeError in get_state_buf_infos when index_head_dim is None on MLA (#25383)

Next probe (also the midpoint): b79e4b1e68 — the cutedsl-packages fix. This was the most suspicious commit in the wider range too (touches CUDA-DSL builds; LoRA forward → quant unquant.apply → cuBLAS path is a plausible blast radius).

Bisect state:

SHA	Date	Verdict
`878e6b8886`	2026-05-18	PASS ✅ (last good)
`b79e4b1e68`	2026-05-18	(in flight — prime suspect)
`d90bc65e30`	2026-05-19	FAIL (first bad)

fzyzcjy · 2026-05-19T04:48:53Z

Bisect probe: `b79e4b1e68` (`[Fix] Try to fix error caused by latest cutedsl packages (#25690)` — 2026-05-18)

File: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py
rerun-test run: 26076486815 — FAIL ❌

Same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14) fingerprint as on HEAD:

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

→ Bisect bound: bug introduced in 878e6b8886..b79e4b1e68 (3 commits inclusive of b79e4b1e68):

745abd6cc0 Add no_combine support to cutlass_moe_fp4 (#25688)
314dedf7c6 Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)
b79e4b1e68 [Fix] Try to fix error caused by latest cutedsl packages (#25690)  ← FAIL ❌

Next probe: 314dedf7c6 (midpoint of the 3-commit range, 2026-05-18).

If PASS → offender is b79e4b1e68 itself (the cutedsl fix).
If FAIL → offender is 745abd6cc0 (cutlass_moe_fp4) or 314dedf7c6 (SGLANG_CACHE_DIR env path).

Bisect state:

SHA	Date	Verdict
`878e6b8886`	2026-05-18	PASS ✅ (last good)
`745abd6cc0`	2026-05-18	(untested)
`314dedf7c6`	2026-05-18	(in flight)
`b79e4b1e68`	2026-05-18	FAIL ❌ (first bad upper bound)
`d90bc65e30`	2026-05-19	FAIL

fzyzcjy · 2026-05-19T05:01:02Z

🤖 Posted autonomously by Claude Code acting on Tom's behalf. The 9-probe bisect (PROBE_A..I) below was driven by the agent — each probe pushed a temp branch on upstream, dispatched rerun-test.yml against it, classified the result, and narrowed the range. The @-mentions are programmatic, not Tom's personal request; please push back if anything is off.

Bisect result: `test_lora_qwen3_8b_logprob_diff.py` regressed at `b79e4b1e68` (PR #25690, `[Fix] Try to fix error caused by latest cutedsl packages`)

PROBE I (the deciding probe): 314dedf7c6 (Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path (#25686)) → rerun-test 26076870779 — PASS ✅

With 314dedf7c6 PASS and b79e4b1e68 FAIL on the immediately-following commit, the regression lands on b79e4b1e68 exactly.

Final bisect table

SHA	Date	Subject	Verdict
`ba214ef3d3`	2026-05-14	tag-gated nightly migration — 40 whole-file moves	PASS
`229cadec04`	2026-05-16	logging update for inplace setting in MoE layer	PASS
`c58b47bc86`	2026-05-18	PoolStats dataclass move	PASS
`f04c522534`	2026-05-18	[PD] Add conclude_state to fake KV backend	PASS
`f5049709b3`	2026-05-18	eagle3 aux-layer-ids +1 offset fix	PASS
`878e6b8886`	2026-05-18	[SP] Fix runtime_max_tokens_per_rank	PASS
`314dedf7c6`	2026-05-18	Use SGLANG_CACHE_DIR env for gpu_p2p_access_cache path	PASS ✅ (last good)
`b79e4b1e68`	2026-05-18	[Fix] Try to fix error caused by latest cutedsl packages (#25690)	FAIL ❌ (first bad)
`d90bc65e30`	2026-05-19	[NPU] Fix TypeError in MLA `index_head_dim`	FAIL
current HEAD	2026-05-19	(Tom's chain + a handful of unrelated)	FAIL
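
As a sanity check on the table above, the sketch below verifies that the recorded verdicts are monotone in commit order (no PASS after a FAIL) and extracts the boundary pair. The SHAs and verdicts are copied from the table; the function name `find_boundary` is illustrative.

```python
from typing import List, Tuple

# (short SHA, verdict) in commit order, copied from the final bisect table.
PROBES: List[Tuple[str, str]] = [
    ("ba214ef3d3", "PASS"),
    ("229cadec04", "PASS"),
    ("c58b47bc86", "PASS"),
    ("f04c522534", "PASS"),
    ("f5049709b3", "PASS"),
    ("878e6b8886", "PASS"),
    ("314dedf7c6", "PASS"),
    ("b79e4b1e68", "FAIL"),
    ("d90bc65e30", "FAIL"),
]


def find_boundary(probes: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (last_good, first_bad); raise if the verdicts are inconsistent."""
    verdicts = [verdict for _, verdict in probes]
    if "FAIL" not in verdicts:
        raise ValueError("no failing probe recorded")
    first_fail = verdicts.index("FAIL")
    if first_fail == 0:
        raise ValueError("no passing probe before the first failure")
    if any(v != "FAIL" for v in verdicts[first_fail:]):
        raise ValueError("a PASS follows a FAIL: flaky test or more than one flip")
    return probes[first_fail - 1][0], probes[first_fail][0]


last_good, first_bad = find_boundary(PROBES)
print(last_good, first_bad)
```

A non-monotone sequence would indicate either a flaky test or more than one behaviour change in the range, in which case a plain bisect result should not be trusted.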

Offending change

PR: #25690, [Fix] Try to fix error caused by latest cutedsl packages
Author: @Fridge003 (Co-authored-by @hnyls2002)
Merged: 2026-05-18 23:51 UTC
Diff: 21 +, 4 -. Touches python/pyproject.toml (switches flashinfer_python and nvidia-cutlass-dsl to the [cu13] extras variant) and scripts/ci/cuda/ci_install_dependency.sh (regex-update for [extras] notation + new purge_cutlass_libs_base() step that uninstalls nvidia-cutlass-dsl-libs-base then force-reinstalls nvidia-cutlass-dsl-libs-cu13).

The PR's own commit message explains the original bug it was fixing:

nvidia-cutlass-dsl[cu13] extras are additive on PyPI: requires_dist always pulls -libs-base AND -libs-cu13 when [cu13] is requested. Both wheels write to the same site-packages paths with different content, leaving the wrapper (cutlass.py, cu13 style) mismatched with the binding (_gpu_ops_gen.py, base style) -> GPUModuleOp signature TypeError.

The fix correctly purges -libs-base in the install script, but the LoRA Qwen3-8B forward path with CUDA graph capture now hits a kernel-side illegal address — so either the cu13 wheel's compiled kernel is broken for this path, or the purge_cutlass_libs_base step doesn't actually win in all install orderings.
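
One quick way to test the second hypothesis on a runner is to ask the Python environment which of the two library wheels it believes are installed. The sketch below uses only `importlib.metadata`; the distribution names are the ones quoted above. Note the limitation: this reads package metadata only, so it cannot tell which wheel's files actually won on disk if both wrote to the same paths.

```python
from importlib import metadata

# Distribution names as quoted in the PR's commit message.
CUTLASS_LIB_WHEELS = (
    "nvidia-cutlass-dsl-libs-base",
    "nvidia-cutlass-dsl-libs-cu13",
)


def installed_versions(names=CUTLASS_LIB_WHEELS):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
both_present = all(v is not None for v in versions.values())
print(versions)
print("conflicting wheels both installed:", both_present)
```

If both distributions show up after the install script has run, the purge step did not take effect in that install ordering.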

Failure fingerprint (every FAIL probe + current HEAD)

coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
Fatal Python error: Aborted
RuntimeError: Rank 0 scheduler died during initialization (exit code: -6).

Python call stack at the abort thread:
  File ".../python/sglang/srt/layers/quantization/unquant.py", line 161 in apply
  File ".../python/sglang/srt/lora/layers.py", line 724 in forward
  ...
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1112 in run_once
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 1134 in capture_one_batch_size
  File ".../python/sglang/srt/model_executor/cuda_graph_runner.py", line 707 in __init__
  File ".../python/sglang/srt/model_executor/model_runner.py", line 2776 in init_device_graphs

Reproduce

# Probe last good (PASS):
git push upstream 314dedf7c6:refs/heads/tmp-good
gh workflow run rerun-test.yml --repo sgl-project/sglang --ref tmp-good \
  -f mode=cuda -f test_command="registered/lora/test_lora_qwen3_8b_logprob_diff.py" \
  -f runs_on="1-gpu-h100" -f install_script="scripts/ci/cuda/ci_install_dependency.sh"

# Probe first bad (FAIL):
git push upstream b79e4b1e68:refs/heads/tmp-bad
gh workflow run rerun-test.yml --repo sgl-project/sglang --ref tmp-bad \
  -f mode=cuda -f test_command="registered/lora/test_lora_qwen3_8b_logprob_diff.py" \
  -f runs_on="1-gpu-h100" -f install_script="scripts/ci/cuda/ci_install_dependency.sh"
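
For probing other SHAs, the two commands above can be generated rather than hand-edited. The sketch below only builds and prints the argument lists, using the same flags and values shown above; it does not run anything, and the helper name `probe_commands` is made up for illustration.

```python
import shlex
from typing import List


def probe_commands(
    sha: str,
    branch: str,
    test_file: str = "registered/lora/test_lora_qwen3_8b_logprob_diff.py",
    runner: str = "1-gpu-h100",
) -> List[List[str]]:
    """Build the push + dispatch argv lists for one bisect probe (not executed)."""
    push = ["git", "push", "upstream", f"{sha}:refs/heads/{branch}"]
    dispatch = [
        "gh", "workflow", "run", "rerun-test.yml",
        "--repo", "sgl-project/sglang",
        "--ref", branch,
        "-f", "mode=cuda",
        "-f", f"test_command={test_file}",
        "-f", f"runs_on={runner}",
        "-f", "install_script=scripts/ci/cuda/ci_install_dependency.sh",
    ]
    return [push, dispatch]


for argv in probe_commands("314dedf7c6", "tmp-good"):
    print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell-quoting mistakes if they are later passed to `subprocess.run`.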

cc @Fridge003 @hnyls2002 — could you take a look? This regression has been on main since 2026-05-18 and is currently surfacing as extra-a-test-1-gpu-large (0) on the main-CI sandbox.

Diagnostic revert PR opened for verification: #25743 — /rerun-test of the failing LoRA file is pending there.

fzyzcjy · 2026-05-19T05:22:57Z

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Bidirectional confirmation of the bisect result via paired diagnostic PRs.

Bisect confirmed via paired diagnostic PRs

Two sibling PRs were opened to nail down b79e4b1e68 (#25690) as the root cause:

PR	What it does	`/rerun-test` LoRA file verdict	Run
#25743	Reverts `b79e4b1e68`	PASS ✅	26077407201
#25744	No revert; only a 1-line sentinel comment in `python/sglang/version.py` so the PR isn't auto-closed for 0-diff	FAIL ❌ (same `CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)` fingerprint)	26077826917

Together with the per-commit bisect probes above, that's three independent lines of evidence:

Per-commit bisect from the known-good ba214ef3d3 (2026-05-14) to b79e4b1e68: 9 probes, all consistent with a PASS-then-FAIL flip at that exact commit boundary.
Revert-the-commit → PASS on the same test file.
Don't-revert (plain main + harmless touch) → FAIL on the same test file with identical fingerprint.

The regression is unambiguously b79e4b1e68 (#25690) — independent of Tom's #25703–#25728 chain.

cc @Fridge003 @hnyls2002 — could you take a look? Closing the two diagnostic PRs now.

fzyzcjy · 2026-05-19T05:52:36Z

🤖 Posted autonomously by Claude Code acting on Tom's behalf. Second-pair double-confirmation.

2×2 paired probe — both runs match expectation

A second rerun-test was dispatched on each of the two paired branches to rule out flake on either side:

PR (branch)	First run	Second run
#25743 (revert) branch `tom/revert-25690-cutedsl`	PASS ✅ (26077407201)	PASS ✅ (26078646438)
#25744 (no-revert) branch `tom/probe-lora-bug-25690`	FAIL ❌ (26077826917)	FAIL ❌ (26078647279)

4-of-4 consistent with the bisect conclusion. b79e4b1e68 (#25690) is the regression — high confidence, no flake.

cc @Fridge003 @hnyls2002

fzyzcjy · 2026-05-31T02:41:59Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaging the full main-CI run on this sandbox (head = current upstream/main, the just-landed state); classifying the one CUDA-lane failure and kicking a single-file rerun to confirm it's a flake. Please push back if any conclusion is off.

CUDA-lane failure: borderline GSM8K accuracy (likely flake)

Job: extra-a-test-2-gpu-large (0) — https://github.com/sgl-project/sglang/actions/runs/26700710577/job/78693005184
Test: test/registered/spec/test_gemma4_mtp_31b_extra.py::TestGemma4MTP31B::test_gsm8k_mtp (topk=1)
Fingerprint:

AssertionError: 0.77 not greater than or equal to 0.775
[Gemma4 31B topk=1] score=0.7700 threshold=0.7750 avg_spec_accept_length=4.468558708959376

Classification: borderline accuracy flake. This GSM8K eval scores 200 questions, so 0.770 (154 correct) is exactly one question below the 0.775 bar (155 correct). With a per-question standard deviation of about 0.42, the standard error on the 200-question mean is about 0.03 as a rough scale, so run-to-run variance routinely crosses a 0.005 margin. The spec accept length (4.47) is healthy, so MTP itself is working. This PR is the no-diff main-verification sandbox, so the miss is a property of upstream/main, not introduced by any change here.
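
The arithmetic behind that margin can be checked directly. This sketch treats each question as an independent pass/fail trial, which is a simplifying assumption used only to get a rough scale for the noise.

```python
import math

n_questions = 200
score, threshold = 0.770, 0.775

# round() guards against floating-point error in the products.
correct = round(score * n_questions)
needed = round(threshold * n_questions)

# Binomial model: each question is an independent pass/fail trial.
per_question_std = math.sqrt(score * (1 - score))
standard_error = per_question_std / math.sqrt(n_questions)

print(f"correct={correct} needed={needed} short_by={needed - correct}")
print(f"per-question std={per_question_std:.3f}")
print(f"standard error of the mean={standard_error:.4f}")
print(f"margin / standard error={(threshold - score) / standard_error:.2f}")
```

The 0.005 gap is a small fraction of one standard error under this model, so a single run landing just under the bar carries little evidence of a regression.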

Other failing lanes are non-CUDA and not chased per lane policy (main-sandbox, no diff): stage-a-test-1-gpu-xpu (XPU), stage-b-test-1-gpu-small-amd-nondeterministic (AMD), stage-b-test-1-npu-a2 (NPU), and finish (cascade aggregation).

Next: rerunning this single file to confirm the flake.

fzyzcjy · 2026-05-31T02:42:00Z

/rerun-test test/registered/spec/test_gemma4_mtp_31b_extra.py

github-actions · 2026-05-31T02:42:24Z

Results for /rerun-test test/registered/spec/test_gemma4_mtp_31b_extra.py:

🚀 2-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/spec/test_gemma4_mtp_31b_extra.py

fzyzcjy · 2026-05-31T03:11:44Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Confirming the earlier gemma4-MTP GSM8K failure was a flake via the /rerun-test result plus a cross-branch data point. Please push back if any conclusion is off.

Flake confirmed: `test_gemma4_mtp_31b_extra.py`

/rerun-test on this sandbox head (c6e27e08fb): ✅ PASS — https://github.com/sgl-project/sglang/actions/runs/26701330481
Cross-branch: the same test also passed on a fresh tip-of-main rerun (~02:42 UTC) and on an unrelated PR — so the 0.770 vs 0.775 miss is GSM8K run-to-run variance, not a regression.

Remaining red lanes are all non-CUDA / chronic and not gating for a main-verification sandbox: base-c-test-8-gpu-h20 (h20 — ignored per policy), XPU, NPU (stage-b-test-1-npu-a2, pr-test-npu-finish), AMD (amd-nondeterministic, mi35x-disaggregation-amd), and the finish cascade. The CUDA gate is otherwise green; ~20 CUDA jobs still finishing.

fzyzcjy · 2026-05-31T04:11:59Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Logging the one newly-surfaced non-CUDA lane failure (Xeon CPU) with its fingerprint, classified as unrelated to this main-verification sandbox. Please push back if any conclusion is off.

Non-CUDA lane: `build-test (xeon-gnr, base-b-test-cpu)` — not gating

Failing file: test/registered/cpu/test_cpu_graph.py (exit 1). All other files in the job passed (test_soft_watchdog.py passed — its watchdog-timeout / py-spy lines are that test's intentional behavior, not an error).
Fingerprint:

UserWarning: Triton is not supported on current platform, roll back to CPU.
RuntimeError: Failed to parse benchmark output. prefill_latency=None, decode_throughput=None, decode_latency=None
[rank0]: AttributeError: 'NoneType' object has no attribute 'shape'

Classification: CPU-backend (Xeon) issue, not CUDA, and this is the no-diff main-verification sandbox — unrelated to anything here. Per lane policy (non-CUDA, not caused by a diff) it is not chased.

CUDA gate so far: green except the already-confirmed gemma4 GSM8K flake; a couple of CUDA h200 jobs still finishing. Other reds remain h20 (ignored) / NPU / AMD-mi35x / cascade finish aggregations.

fzyzcjy · 2026-05-31T05:41:07Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final summary of the full main-CI verification run (head = current upstream/main, i.e. the just-landed 24-PR KV-canary chain). Closing the sandbox afterward for reuse. Please push back if any conclusion is off.

✅ CUDA gate GREEN — main verification complete

Head SHA c6e27e08fb (= upstream/main + the no-op sentinel). All CUDA lanes passed: base-b, base-c (incl. 8-gpu h100 / h200 / b200), extra-a, extra-b, piecewise-cuda-graph, etc.

The only CUDA red was a confirmed flake:

extra-a-test-2-gpu-large → test_gemma4_mtp_31b_extra.py GSM8K 0.770 vs 0.775 (1 question / 200). /rerun-test on this exact SHA passed (run); same test also green on tip-of-main.

Remaining red lanes are non-gating (non-CUDA / chronic / cascade), none related to the landed chain:

base-c-test-8-gpu-h20 → h20 (chronic, ignored per policy)
stage-a-test-1-gpu-xpu → XPU; stage-b-test-1-npu-a2 → NPU; stage-b/stage-c ...-amd, ...-mi35x-disaggregation-amd → AMD
build-test (xeon-gnr, base-b-test-cpu) → Xeon CPU test_cpu_graph.py benchmark-parse issue (CPU backend, unrelated)
finish / pr-test-finish / pr-test-extra-finish / pr-test-npu-finish → aggregation jobs cascading from the above

Conclusion: the KV-canary feature, landed on main via the 24-PR chain (#26798–#26821), is CUDA-CI green. Closing this sandbox PR (do not merge) so it's ready for the next reuse.

fzyzcjy · 2026-06-06T01:20:01Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaging this sandbox main-CI round; will follow up once logs are fetchable. Please push back if any conclusion is off.

Round status (head 96c5c6e1db = main bf4f2ccc78 + sentinel):

base-a-test-cpu (0) (PR Test Base): FAILURE, annotation only says Process completed with exit code 255. Log fetch pending REST rate-limit reset (~50 min). Note: the same content (scripted-runtime chain #27410 "Add kv_canary PP self-test fixture and SWA divergence coverage" through #27413 "Add scripted-runtime unit, core integration, and chunked-prefill tests") passed this job on #26991 "DO NOT MERGE - scripted runtime", so a flake or a new-main interaction is suspected. Will classify from the log, then /rerun-failed-ci after the full round completes.
stage-a-test-1-gpu-xpu + XPU finish: chronic XPU runner infra (checkout EACCES leftover-file pattern seen on previous rounds), non-CUDA lane, not a gate.

Remaining ~95 jobs still running; will batch any reruns after the round lands.

fzyzcjy · 2026-06-06T03:22:23Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Fetched the failing job log after the rate-limit reset and classified the failure. Please push back if any conclusion is off.

base-a-test-cpu (0) root cause: HF Hub rate-limit (infra flake, not code).

test/registered/unit/server_args/test_server_args.py failed because Hugging Face Hub returned 429 Too Many Requests for Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json, and the retry also could not connect (job log):

httpx.HTTPStatusError: Client error '429 Too Many Requests' for url 'https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/resolve/main/config.json'
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
✗ FAILED: test/registered/unit/server_args/test_server_args.py (exit code 1)

Other reds this round: AMD lane (27 jobs — ongoing repo-wide AMD outage), NPU a2 (recurring perf flake), XPU (chronic runner infra). None CUDA, none code-related.

Plan: wait for the ~13 still-running jobs to land, then /rerun-failed-ci once.

fzyzcjy · 2026-06-06T06:57:09Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Classified the second CUDA-lane failure of this round from the job log. Please push back if any conclusion is off.

base-c-test-4-gpu-h100 (3): marginal KL-divergence threshold exceedance (likely numeric flake).

test/registered/models_e2e/test_qwen3_next_models.py failed with (job log):

AssertionError: avg_kl_div=0.0015218577479656225 > threshold=0.001 for Qwen/Qwen3-Next-80B-A3B-Instruct test_input_output_logprobs_match_prefill_cache_hit_helper

Marginal exceedance (1.5e-3 vs 1e-3 threshold) on a logprob-consistency check.
The content under test (scripted-runtime chain #27410 "Add kv_canary PP self-test fixture and SWA divergence coverage" through #27413 "Add scripted-runtime unit, core integration, and chunked-prefill tests") is test-only / env-gated (SGLANG_TEST_SCRIPTED_RUNTIME default off) and does not touch qwen3-next or logprob numerics; the same content passed this suite on #26991 (DO NOT MERGE - scripted runtime).
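
To make "marginal" concrete, here is a minimal sketch of the arithmetic. The helper `avg_kl_div` is a hypothetical illustration of one way to average KL divergence over token positions; it is not taken from the test suite, whose exact computation is not shown in this thread. Only the observed value and the threshold come from the assertion message above.

```python
import math


def avg_kl_div(ref_logprobs, new_logprobs):
    """Mean over positions of KL(ref || new).

    Hypothetical helper: each position is a full distribution given as a
    list of log-probabilities. The real test may compute this differently.
    """
    total = 0.0
    for ref, new in zip(ref_logprobs, new_logprobs):
        total += sum(math.exp(r) * (r - n) for r, n in zip(ref, new))
    return total / len(ref_logprobs)


# Values copied from the AssertionError above.
observed = 0.0015218577479656225
threshold = 0.001

# How far over the threshold the run landed, as a multiple of the threshold.
ratio = observed / threshold
print(f"observed/threshold = {ratio:.2f}x")
```

The run landed at roughly 1.5x the threshold, which is why it reads as a borderline numeric exceedance rather than a gross logprob mismatch.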

Round summary (running=0): CUDA reds = this + base-a-test-cpu (0) (HF Hub 429, infra) + pr-test-finish cascade. Non-CUDA reds = AMD lane outage (27), NPU a2 perf flake, XPU chronic infra.

Next: one batched /rerun-failed-ci.

fzyzcjy · 2026-06-06T06:57:15Z

/rerun-failed-ci

fzyzcjy · 2026-06-07T09:47:42Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the first CI failures of this verify-main run (head ffbe2e8 = main 0a190d1); classification below. Please push back if any conclusion is off.

stage-a-test-1-gpu-xpu / finish (job): runner-level infra failure during workspace cleanup, before any test ran:

##[error]File was unable to be removed Error: EACCES: permission denied, unlink '.../python/sglang.egg-info/PKG-INFO'

Classification: infra (self-hosted XPU runner permission residue), non-CUDA lane, unrelated to main's code. Not chasing per babysit policy; CUDA lanes remain the hard gate.

fzyzcjy · 2026-06-07T10:46:20Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged all non-CUDA failures on this verify-main run (head ffbe2e8 = main 0a190d1, which includes merged #27445 + #27446). All failures are non-CUDA lanes in code paths the merged PRs do not touch (the PRs only change scheduler PP idle-gating in is_fully_idle + the scripted-runtime test harness). CUDA lanes remain green. Please push back if any conclusion is off.

Non-CUDA failures (not chasing per babysit policy — none CUDA, none related to the merged code):

Lane	Job	Fingerprint	Class
XPU	79948782973	`EACCES: permission denied, unlink .../sglang.egg-info/PKG-INFO`	infra (runner cleanup)
Xeon CPU	79948776769	`decode: expect req_lens to be int64, got Int`; `--sampling-backend: invalid choice: 'token_oracle'`; exit -9	CPU-backend, pre-existing on main
NPU	79948784495	`AssertionError: 672.30 not greater than or equal to 700` (w8a8 throughput threshold)	perf-threshold flake
AMD mi325 (stage-c)	79948805518	registry pull timeout; `Residual accuracy check failed` (fused residual kernel)	chronic stage-c / infra
AMD mi35x (stage-c)	79948805511	registry pull timeout; `Fatal Python error: Aborted` (exit -6) + ConnectionRefused cascade	chronic stage-c / infra
finish / pr-test-npu-finish	—	rollup cascade of the above	cascade

The merged PRs touch no XPU/NPU/AMD/Xeon code, no sampling backends, no quantization or fused-residual kernels. Continuing to watch CUDA lanes (the hard gate) to completion.
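
The fingerprint-to-class mapping in the table above can be sketched as a small pattern matcher. This is an illustrative sketch only: `FINGERPRINTS` and `classify` are hypothetical names, not part of the sglang CI tooling, and the patterns are limited to strings quoted verbatim from logs in this thread.

```python
import re

# Hypothetical fingerprint table. Patterns are copied from log excerpts quoted
# in this thread; labels mirror the "Class" column of the table above.
FINGERPRINTS = [
    (re.compile(r"EACCES: permission denied, unlink"), "infra (runner cleanup)"),
    (re.compile(r"429 Too Many Requests"), "infra (HF Hub rate limit)"),
    (re.compile(r"not greater than or equal to \d+"), "perf-threshold flake"),
]


def classify(log_text: str) -> str:
    """Return the first matching class, or flag the log for manual triage."""
    for pattern, label in FINGERPRINTS:
        if pattern.search(log_text):
            return label
    return "unclassified (needs manual triage)"


print(classify("AssertionError: 672.30 not greater than or equal to 700"))
```

Anything that matches no known fingerprint falls through to manual triage, so a new failure mode is never silently labeled as infra.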

fzyzcjy · 2026-06-07T13:18:51Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the first (and so far only) CUDA-lane failure of this verify-main run; cross-branch evidence below shows it reproduces identically on an independent scheduled main run, and a pre-merge-main probe has been dispatched. Please push back if any conclusion is off.

CUDA failure: `base-c-test-8-gpu-h200 (2)` — `test/registered/models_e2e/test_mimo_v2.py`

Job 79948809861 — server for XiaomiMiMo/MiMo-V2.5 (tp=8, dp=2, EAGLE MTP, fp8) crashes 2s after becoming HTTP-ready, during the first warmup generate:

/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:109: _assert_async_cuda_kernel:
Assertion `index >= 152576 (out of range): VocabParallelEmbedding input id` failed.   (x4 ranks)
coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ASSERT (12)
Fatal Python error: Aborted                                                             (x4)
...
TimeoutError: Server failed to start within the timeout period

Cross-branch evidence

Branch	Run	test_mimo_v2	Fingerprint
sandbox (main `0a190d1c9` + sentinel)	27088945685	✗ FAIL	VocabParallelEmbedding input id out of range
`main` scheduled (`a07d813ec`, independent runner)	27091400009	✗ FAIL	byte-identical
`main` pre-#27445/#27446 (`a39c428d3`)	27093698014 (probe dispatched)	pending	—

Classification

Pre-existing main regression, deterministic (2/2 independent runs), unrelated to #27445/#27446: the merged PRs touch only scripted-runtime test harness files and PP idle-gating in is_fully_idle (short-circuited at pp_size==1; this server is pp=1). The failing path is the model-side out-of-range-token-id async assert (same family as the tp=1 fix in #27482) on MiMo-V2.5's first warmup forward. Will report the pre-merge probe result when it completes.
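
For readers unfamiliar with this failure family, the device-side assert is in effect a bounds check on token ids before the embedding lookup. A minimal host-side sketch of the same check follows, assuming the bound is the 152576 shown in the assert message; `out_of_range_ids` is a hypothetical helper, not sglang code, and the sample ids are made up for illustration.

```python
def out_of_range_ids(input_ids, vocab_size):
    """Return (position, token_id) pairs that fall outside [0, vocab_size)."""
    return [(pos, tok) for pos, tok in enumerate(input_ids)
            if not 0 <= tok < vocab_size]


# Assumption: 152576 is the embedding bound, taken from the assert message.
VOCAB_BOUND = 152576

# Made-up sample ids; the last two are deliberately invalid.
sample_ids = [0, 151643, 152575, 152576, -1]
print(out_of_range_ids(sample_ids, VOCAB_BOUND))
```

A check like this, run on the ids fed to the first warmup forward, would show which token ids trip the assert without needing the CUDA coredump.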

fzyzcjy added run-ci bypass-fastfail run-ci-extra labels May 18, 2026

fzyzcjy closed this May 18, 2026

fzyzcjy reopened this May 19, 2026

This was referenced May 19, 2026

Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743

Closed

Probe LoRA Qwen3-8B CUDA fail on plain main (negative control, NOT a fix) #25744

Closed

fzyzcjy mentioned this pull request May 19, 2026

[Fix] Try to fix error caused by latest cutedsl packages #25690

Merged

5 tasks

fzyzcjy closed this May 19, 2026

fzyzcjy reopened this May 31, 2026

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from 4a45112 to c6e27e0 Compare May 31, 2026 02:08

fzyzcjy closed this May 31, 2026

fzyzcjy reopened this Jun 6, 2026

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from c6e27e0 to 96c5c6e Compare June 6, 2026 01:12

fzyzcjy closed this Jun 6, 2026

Sandbox: verify full main CI on latest main (20260607T094129Z)

ffbe2e8

fzyzcjy reopened this Jun 7, 2026

fzyzcjy force-pushed the tom/sandbox-verify-main-ci branch from 96c5c6e to ffbe2e8 Compare June 7, 2026 09:42
