DO NOT MERGE - CI sandbox for stateless scheduler b by fzyzcjy · Pull Request #25172 · sgl-project/sglang

fzyzcjy · 2026-05-13T13:25:38Z

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ✅ Run #26388971766
Latest PR Test (Extra): ✅ Run #26388971706

fzyzcjy · 2026-05-13T13:25:46Z

/tag-and-rerun-ci

gemini-code-assist

Code Review

This pull request refactors the chunked prefill mechanism by replacing the global chunked_req pointer with per-request state flags (has_pending_chunk and pending_middle_outputs). This change enhances support for pipeline parallelism and ensures more robust state management across iterations. The review feedback identifies potential null pointer crashes, logic inconsistencies in request abort handling, and performance optimizations for hot-path queue scans, all of which include actionable code suggestions.

gemini-code-assist · 2026-05-13T13:28:46Z

+            for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
+                for mb in mb_list:
+                    if mb is not None and not mb.is_empty():
+                        batch_reqs.extend(mb.reqs)


The iteration over self.mbs, self.last_mbs, and self.running_mbs will crash if any of these attributes are None. While self.mbs is typically a list, last_mbs and running_mbs are often None in certain scheduler states or configurations.

Suggested change

-            for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
-                for mb in mb_list:
-                    if mb is not None and not mb.is_empty():
-                        batch_reqs.extend(mb.reqs)
+            if self.pp_size > 1 and hasattr(self, "mbs"):
+                for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
+                    if mb_list is not None:
+                        for mb in mb_list:
+                            if mb is not None and not mb.is_empty():
+                                batch_reqs.extend(mb.reqs)

gemini-code-assist · 2026-05-13T13:28:46Z

+            if (recv_req.abort_all or req.rid.startswith(recv_req.rid)) and (
+                req.rid not in batch_rids
+            ):


Aborted requests that are currently in a batch (e.g., chunked-resume requests) should still be removed from the waiting_queue list to maintain a consistent scheduler state. The current logic skips them entirely. To avoid double-releasing resources, you should remove them from the list but skip the release_kv_cache call inside the processing loop (by checking if req.rid in batch_rids: continue).

if (recv_req.abort_all or req.rid.startswith(recv_req.rid)):
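
The suggestion above could be sketched roughly as follows. This is a minimal standalone illustration, not the scheduler's actual code: the `Req` dataclass and the `abort_from_waiting_queue` helper are hypothetical names, and the real `release_kv_cache` call is represented only by the returned `to_release` list.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request; only `rid` matters here."""
    rid: str


def abort_from_waiting_queue(
    waiting_queue: List[Req],
    batch_rids: Set[str],
    abort_rid: str,
    abort_all: bool = False,
) -> Tuple[List[Req], List[Req]]:
    """Return (remaining_queue, reqs_to_release).

    Every matching request leaves the queue, but a matching request that is
    also part of a batch is not returned for release, because the batch path
    is assumed to release it later.
    """
    remaining: List[Req] = []
    to_release: List[Req] = []
    for req in waiting_queue:
        matches = abort_all or req.rid.startswith(abort_rid)
        if not matches:
            remaining.append(req)
        elif req.rid not in batch_rids:
            to_release.append(req)
        # Otherwise: matched and in a batch, so it is dropped from the queue
        # without being released here.
    return remaining, to_release


queue = [Req("a-1"), Req("a-2"), Req("b-1")]
remaining, to_release = abort_from_waiting_queue(queue, {"a-2"}, "a")
print([r.rid for r in remaining])   # ['b-1']
print([r.rid for r in to_release])  # ['a-1']
```

Here `a-2` is removed from the queue but left for the batch path, so it is released only once.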

gemini-code-assist · 2026-05-13T13:28:46Z

+        # priority + has_pending_chunk make it sit at the head, but its
+        # presence relaxes the "is queue empty / pool full" early exits below
+        # (we must keep scheduling it to make progress, or memory leaks).
+        has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue)


Performing an $O(N)$ scan of the waiting_queue using any() in the scheduler's hot path is inefficient. Since the scheduling policy ensures that has_pending_chunk requests are sorted to the front of the queue, you can optimize this by checking only the first element.

Suggested change

-        has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue)
+        has_chunked_resume = self.waiting_queue[0].has_pending_chunk if self.waiting_queue else False

gemini-code-assist · 2026-05-13T13:28:46Z

+            chunked_resume = next(
+                (r for r in self.waiting_queue if r.has_pending_chunk), None
+            )


This $O(N)$ scan can be optimized. Since has_pending_chunk requests are sorted to the front of the queue, the first element is the only one that needs to be checked.

chunked_resume = self.waiting_queue[0] if self.waiting_queue and self.waiting_queue[0].has_pending_chunk else None
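
The head-only check is equivalent to the full scan only when the sorting invariant actually holds. A small sketch of that dependency, using a hypothetical `Req` dataclass and helper names that are not part of the PR:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Req:
    """Hypothetical request with just the flag under discussion."""
    rid: str
    has_pending_chunk: bool = False


def find_resume_scan(queue: List[Req]) -> Optional[Req]:
    # O(N): look at every request until one with a pending chunk is found.
    return next((r for r in queue if r.has_pending_chunk), None)


def find_resume_head(queue: List[Req]) -> Optional[Req]:
    # O(1): only valid if pending-chunk requests are sorted to the front.
    return queue[0] if queue and queue[0].has_pending_chunk else None


sorted_queue = [Req("r0", True), Req("r1"), Req("r2")]
unsorted_queue = [Req("r1"), Req("r0", True), Req("r2")]

print(find_resume_scan(sorted_queue) is find_resume_head(sorted_queue))  # True
print(find_resume_scan(unsorted_queue).rid)                              # r0
print(find_resume_head(unsorted_queue))                                  # None
```

If the queue can be observed before the policy has sorted it, the head check silently misses the pending request, so the optimization is only safe at call sites that run after sorting.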

fzyzcjy · 2026-05-13T16:02:01Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T16:08:56Z

/rerun-test stage-b-test-2-gpu-large

ghost · 2026-05-13T16:09:25Z

⛔ stage-b-test-2-gpu-large: No test file found matching stage-b-test-2-gpu-large under test/registered/ or python/sglang/multimodal_gen/test/.

fzyzcjy · 2026-05-13T16:11:58Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T16:19:28Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T19:10:27Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T19:49:51Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T21:28:21Z

/rerun-failed-ci

fzyzcjy · 2026-05-14T01:14:51Z

/rerun-failed-ci

fzyzcjy · 2026-05-14T01:51:42Z

`stage-c-test-dsv4-8-gpu-h200` failures — infra flakes, not v2-related

The dsv4-8-gpu-h200 job has been failing on this PR's reruns. After digging into the logs, all three observed failure modes occur in setUpClass / server startup, before any v2 code path is exercised.

Attempt 3 — leftover process holds NCCL port

log

ValueError: nccl_port at 36919 is not available in 30 seconds.
nccl_port is used by a process already.
process.cmdline=['python3', '-m', 'sglang.launch_server',
                 '--disaggregation-mode', 'prefill', ...]
process.status()='running' pid=3168509

The PID is from a prior job on the same self-hosted runner that wasn't cleaned up between attempts.
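
For illustration, a generic way to check whether a TCP port is still held by another socket is to try binding to it. This is a minimal standard-library sketch, not the check that sglang performs; `is_port_free` is a hypothetical helper.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": something else owns the port.
            return False
    return True


# Simulate a leftover process by keeping a socket bound to a port.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
busy_port = holder.getsockname()[1]
print(is_port_free(busy_port))  # False while `holder` keeps the port bound
holder.close()
```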

Attempt 4 — two distinct distributed-init failures

log

test_disaggregation_dsv4.py — NVSHMEM UID bootstrap times out:

bootstrap.cpp:242: non-zero status: 7 bootstrap_loader_init returned error for mode UID
init.cu:1188: non-zero status: 7 nvshmem_bootstrap failed
nvshmemx_api.h:63: non-zero status: 7: Connection timed out, exiting...
[2026-05-13 22:48:20] Received sigquit from a child process. It usually means the child failed.
EOFError

test_deepseek_v4_flash_fp8_h200.py — server child SIGKILL during setUpClass:

File "test/registered/dsv4/test_deepseek_v4_flash_fp8_h200.py", line 39, in setUpClass
    cls.process = popen_launch_server(...)
Exception: Server process exited with code -9. Check server logs for errors.

Why this isn't v2

All failures occur in setUpClass (port acquisition / NVSHMEM bootstrap / server warmup). The test bodies never run, so no scheduler / chunked-prefill code is invoked.
This PR's changes are confined to chunked-prefill state-machine migration (per-Req fields, PP cross-microbatch finalize guard) and v1-test cleanup. None of it touches NCCL port allocation, NVSHMEM init, or server bootstrap.
The same NVSHMEM-UID-timeout / port-busy patterns recur across stage-c h100 / h200 / h20 lanes on unrelated PRs.

Cascade impact

dsv4-8-gpu-h200 is a fast-fail root, so each failure cascades to ~10 downstream jobs that skip with:

##[error]Fast-fail: skipping — root cause job(s): stage-c-test-dsv4-8-gpu-h200

Those downstream skips (deepep-8-gpu-h200, 4-gpu-h100, 4-gpu-b200 (0), etc.) are not independent failures.
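
One way to separate cascaded skips from independent failures is to group jobs by the root cause named in the fast-fail line. The sketch below assumes the log line format quoted above and, as a further assumption, that multiple root jobs would be comma-separated; `group_skips_by_root` and the sample job names are illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict, List

# \W+ tolerates whatever separator sits between "skipping" and "root cause".
FAST_FAIL_RE = re.compile(
    r"Fast-fail: skipping\W+root cause job\(s\):\s*(?P<roots>.+)$"
)


def group_skips_by_root(job_logs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each root-cause job to the downstream jobs that skipped because of it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for job, log in job_logs.items():
        for line in log.splitlines():
            match = FAST_FAIL_RE.search(line)
            if match:
                for root in match.group("roots").split(","):
                    grouped[root.strip()].append(job)
                break  # one fast-fail line per job is enough
    return dict(grouped)


skip_line = (
    "##[error]Fast-fail: skipping — root cause job(s): "
    "stage-c-test-dsv4-8-gpu-h200"
)
logs = {
    "deepep-8-gpu-h200": skip_line,
    "4-gpu-h100": skip_line,
    "some-real-failure": "AssertionError: something else",
}
print(group_skips_by_root(logs))
# {'stage-c-test-dsv4-8-gpu-h200': ['deepep-8-gpu-h200', '4-gpu-h100']}
```

Jobs that do not appear in any group (here `some-real-failure`) are the ones that need their own analysis.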

fzyzcjy · 2026-05-14T02:36:49Z

`stage-c-test-4-gpu-b200 (3)` failure — pre-existing NVFP4 weight-update bug, not v2-related

Job: stage-c-test-4-gpu-b200 (3), attempt 5 — log

Failing test: test/registered/rl/test_update_weights_from_disk_blackwell.py::test_parameterized_update_weights_from_disk

Error (server-side, during update_weights_from_disk RPC):

File "python/sglang/srt/layers/quantization/modelopt_quant.py", line 1361, in process_weights_after_loading
    input_scale_2 = layer.input_scale.max().to(torch.float32)
File "torch/nn/modules/module.py", line 1968, in __getattr__
    raise AttributeError(
AttributeError: 'QKVParallelLinear' object has no attribute 'input_scale'. Did you mean: 'input_size'?

Stack trace (server-side):

File "python/sglang/srt/managers/scheduler.py", line 4106, in run_scheduler_process
File "python/sglang/srt/managers/scheduler.py", line 1850, in process_input_requests
File "python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 56, in update_weights_from_disk
    success, message = self.tp_worker.update_weights_from_disk(recv_req)
…
File "python/sglang/srt/layers/quantization/modelopt_quant.py", line 1361, in process_weights_after_loading

The server then SIGQUITs and the test client sees Connection aborted / RemoteDisconnected while polling _post_json, which is the cascade — the underlying root cause is the AttributeError above.

Why this isn't v2:

The crash is in python/sglang/srt/layers/quantization/modelopt_quant.py:1361 (NVFP4 / modelopt quantization path on Blackwell). v2 doesn't touch quantization or weight-update flows.
This PR's diff in the scheduler is confined to the chunked-prefill state machine (per-Req fields, PP cross-microbatch finalize guard) and scheduler_update_weights_mixin.py is unmodified.
QKVParallelLinear is constructed without an input_scale attribute under this NVFP4 codepath; v2 doesn't change layer construction.

This looks like a pre-existing modelopt-quant bug in the NVFP4 weight-reload path on Blackwell — orthogonal to this PR.

fzyzcjy · 2026-05-14T02:45:35Z

`stage-c-test-4-gpu-h100 (1)` failure — fast-fail cascade, not a real run

Job: stage-c-test-4-gpu-h100 (1), attempt 5 — log

This job did not actually run any tests; it was skipped by the fast-fail gate:

##[error]Fast-fail: skipping — root cause job(s): stage-c-test-4-gpu-b200 (3)

The root cause is the pre-existing NVFP4 weight-update bug in b200 (3) — see the b200 (3) analysis above. No h100-specific signal here.

fzyzcjy · 2026-05-14T02:50:43Z

CI status snapshot (pre-rebase)

Current attempt 5 results: 107 success / 9 failure / 16 skipped / 2 still queued.

Per-failure classification (CUDA-lane only; AMD/NPU skipped per project policy):

| Job | Conclusion | Why | Comment |
| --- | --- | --- | --- |
| `stage-c-test-4-gpu-b200 (3)` | failure | Pre-existing NVFP4 weight-update bug (`QKVParallelLinear.input_scale` AttributeError in `modelopt_quant.py:1361`). Fixed on main by #25190 (commit `6c0633b0b1`, 2026-05-13), which is not in this branch's merge-base. | analysis |
| `stage-c-test-4-gpu-h100 (1)` | failure | Fast-fail cascade from `b200 (3)`. Did not run any test. | analysis |
| `stage-c-test-dsv4-8-gpu-h200` | success (this attempt) | Earlier attempts 3 & 4 hit chronic infra flakes (NCCL port leftover / NVSHMEM UID bootstrap timeout), both before any v2 code path is exercised. | analysis |

Plan: rebasing feat/stateless_scheduler_b onto latest upstream/main to pick up #25190 + other recent fixes, then triggering a fresh CI run. The rebase should resolve the only real failure (b200 (3)); the cascade and infra flakes will resolve as a side effect.

When chunked-resume reqs are held in both waiting_queue and batch.reqs (stateless-scheduler refactor), abort_request would otherwise process them twice (queue pop + to_finish), causing duplicate send_output and double release_kv_cache. Build batch_rids upfront and skip waiting_queue removal for reqs already in batch — let to_finish path handle them. Pre-flight for stateless-scheduler v2.

For chunked-resume reqs (after the upcoming stateless-scheduler switch) that live in waiting_queue with non-empty prefix_indices, summing req.seqlen overcounts the committed prefix. Switch to seqlen - prefix for waiting reqs; keep the chunked_req block until that field is removed. Today's behavior is unchanged for fresh waiting reqs whose prefix_indices is empty. Pre-flight for stateless-scheduler v2.
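
The accounting change described above can be sketched as follows. This is a simplified illustration with a hypothetical `Req` dataclass; only the `seqlen` and `prefix_indices` names come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    """Hypothetical request: total length plus the already-committed prefix."""
    seqlen: int
    prefix_indices: List[int] = field(default_factory=list)


def pending_tokens_naive(waiting_queue: List[Req]) -> int:
    # Counts the committed prefix of chunked-resume reqs again.
    return sum(r.seqlen for r in waiting_queue)


def pending_tokens(waiting_queue: List[Req]) -> int:
    # Counts only tokens that still need a prefill forward.
    return sum(r.seqlen - len(r.prefix_indices) for r in waiting_queue)


fresh = Req(seqlen=100)                                        # empty prefix
resumed = Req(seqlen=300, prefix_indices=list(range(128)))     # 128 tokens done
queue = [fresh, resumed]

print(pending_tokens_naive(queue))  # 400
print(pending_tokens(queue))        # 272
```

For the fresh request both formulas agree, which matches the statement that today's behavior is unchanged when `prefix_indices` is empty.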

Explicit comment that reqs still doing prefill (chunked-resume or DLLM staging) must not be merged into running_batch. Today enforced via chunked_req_to_exclude param; stateless-scheduler v2 will move to a per-req predicate. Pre-flight for v2.

Resolves rename collision: adopt upstream's `inflight_middle_chunks` (PR sgl-project#25720) and drop the local-only `pending_middle_outputs` rename.

Port mixin changes deleted by upstream into the new component classes:
- `scheduler_output_processor_mixin.py` -> port PP cross-mb finalize guard into `scheduler_components/batch_result_processor.py::_handle_finished_req`.
- `scheduler_runtime_checker_mixin.py` -> port chunked-resume tail accounting into `scheduler_components/invariant_checker.py::_get_total_uncached_sizes` and `scheduler_components/pool_stats_observer.py::active_pool_idxs`. Both classes now take a `get_waiting_queue` callable.
- `scheduler_metrics_mixin.py::_get_num_pending_tokens` -> adapted in `scheduler_components/load_inquirer.py` to read chunked tail from `waiting_queue` (chunked_req field was removed in this branch); dropped the `get_chunked_req` callable from `SchedulerLoadInquirer`.

Drop `test_scheduler_chunked_req_gate.py` (deleted in this branch's v2 refactor; upstream only renamed `is_chunked` in it).

Adopt upstream's component-style call sites for prebuilt batch processing in disaggregation/decode.py.

…eam rename) The name 'pending_middle_outputs' more precisely describes what the counter tracks: middle-block prefill forwards that are admitted but not yet output-processed (output processor uses it to decide whether this forward's sample is real (==0) or garbage (>0)). Restore the local-branch name across all call sites.

Upstream PR sgl-project#25444 moved Scheduler.pp_size onto a frozen ParallelState container (self.ps.pp_size). My branch's chunked-resume PP code still referenced the old direct attribute, causing AttributeError: 'Scheduler' object has no attribute 'pp_size' in _in_flight_other_mb_rids and abort_request.

fzyzcjy · 2026-05-19T15:12:22Z

CI status after merging `upstream/main` (commit `be72b26f7e`)

Fix pushed for both failing CUDA / NPU jobs:

base-a-test-1-gpu-small (0) — job log
stage-b-test-1-npu-a2 (1) — job log

Both crashed at scheduler startup with:

AttributeError: 'Scheduler' object has no attribute 'pp_size'
  File ".../scheduler.py", line 2281, in _in_flight_other_mb_rids
    if self.pp_size <= 1 or not hasattr(self, "mbs"):

Classification: real bug, ours. Upstream PR #25444 (Bundle Scheduler rank/size fields into a frozen ParallelState) moved Scheduler.pp_size to self.ps.pp_size. My branch's chunked-resume PP filter code (_in_flight_other_mb_rids and abort_request) still referenced the old attribute.

Fix in be72b26f7e: replace the two self.pp_size references with self.ps.pp_size. Other rank/size references in scheduler.py already use the new self.ps.* form.
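
The shape of the breakage can be reproduced in isolation. In this sketch, `ParallelState` and `Scheduler` are toy stand-ins: only the `ps.pp_size` access path is taken from the comment above, while the `tp_size` field and the `uses_pipeline_parallelism` method are assumptions for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ParallelState:
    """Toy frozen container for rank/size fields."""
    pp_size: int
    tp_size: int  # assumed field, included only to make the container plural


class Scheduler:
    """Toy scheduler that no longer carries `pp_size` directly."""

    def __init__(self, ps: ParallelState) -> None:
        self.ps = ps

    def uses_pipeline_parallelism(self) -> bool:
        # New-style access; `self.pp_size` would raise AttributeError here.
        return self.ps.pp_size > 1


sched = Scheduler(ParallelState(pp_size=2, tp_size=1))
print(sched.uses_pipeline_parallelism())  # True
print(hasattr(sched, "pp_size"))          # False

try:
    sched.ps.pp_size = 4
except FrozenInstanceError:
    print("ParallelState is immutable")
```

Because the old attribute simply no longer exists, any code path that still reads it fails at first use, which is consistent with the crash showing up at scheduler startup.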

wait-for-base-a is a cascade failure from base-a-test; expected to recover after the rerun.

fzyzcjy · 2026-05-19T16:09:50Z

AMD `stage-b-test-1-gpu-large-amd` failure — near-threshold perf flake

Job log

test_bench_serving_1gpu_part2.py failed with:

AssertionError: 70.74462937936187 not less than 70

Marginal perf overshoot (70.74 vs 70.00 threshold), AMD MI300 lane. Not CUDA. Already retried internally once (retry() failed once (0th try, maximum 1 retries)). My diff is scheduler-side chunked-resume bookkeeping with no AMD code paths and no perf-critical hot-loop changes; failure is not plausibly caused by this PR. Posting a /rerun-test to confirm flake.
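
For scale, the relative overshoot of the measured value against the threshold works out as below; `relative_overshoot` is just an illustrative helper.

```python
def relative_overshoot(measured: float, limit: float) -> float:
    """Fraction by which `measured` exceeds `limit`."""
    return (measured - limit) / limit


overshoot = relative_overshoot(70.74462937936187, 70)
print(f"{overshoot:.2%}")  # 1.06%
```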

fzyzcjy · 2026-05-19T16:09:57Z

/rerun-test test/registered/perf/test_bench_serving_1gpu_part2.py

ghost · 2026-05-19T16:10:22Z

🚀 1-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/perf/test_bench_serving_1gpu_part2.py

fzyzcjy · 2026-05-19T16:24:00Z

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T16:24:10Z

CUDA `extra-a-test-1-gpu-large (1)` failure — FlashAttention SM90 illegal address during CUDA graph capture

Job log

Failing test: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

Root cause:

[coredump] Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
  - Kernel: cutlass_kernel_flash_attncuteflash_fwd_sm90FlashAttentionForwardSm90...
  - Site: lora/layers.py:724 -> quantization/unquant.py:161
Fatal Python error: Aborted (during CUDA graph capture, bs=256, avail_mem=12.04 GB)

Crash is inside the CUTLASS-CUTE FlashAttention SM90 kernel during cuda-graph capture for LoRA + Qwen3-8B. Our diff is scheduler-side chunked-resume bookkeeping with no LoRA, no attention-kernel, and no cuda-graph path changes. Test file is pre-existing on main (last touched by #24725 / #25197), not introduced by our merge.

extra-a-test-1-gpu-large (2) is a fast-fail cascade post-cleanup shadow of the same workflow — no actual test ran.

Posting /rerun-test to confirm flake vs reproducible.

ghost · 2026-05-19T16:24:24Z

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T16:48:20Z

LoRA Qwen3-8B `extra-a-test-1-gpu-large` CUDA fail — pre-existing flake on main

Update: my /rerun-test reran the file and it failed again with the same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS in FlashAttention SM90 CUTE during CUDA graph capture.

Cross-branch evidence: this is a pre-existing high-rate flake on main, not introduced by this PR. Of the last 19 rerun-test runs of test_lora_qwen3_8b_logprob_diff.py on the repo (last ~14h):

9 failures
10 successes

That's a ~47% flake rate at the SM90 FlashAttention kernel layer, well before this PR's merge. My PR's diff is scheduler-side chunked-resume bookkeeping; no LoRA/attention/cuda-graph code paths touched.
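
The arithmetic behind that figure, plus a rough estimate of how often two reruns in a row would both fail at that rate. The second number assumes the runs are independent, which is an assumption and not something the CI data establishes.

```python
failures, successes = 9, 10

# Observed flake rate over the sampled rerun-test runs.
flake_rate = failures / (failures + successes)

# Chance of two consecutive failures, assuming independent runs.
two_in_a_row = flake_rate ** 2

print(f"{flake_rate:.1%}")    # 47.4%
print(f"{two_in_a_row:.1%}")  # 22.4%
```

Under that assumption, two back-to-back failures on this PR are not strong evidence of a regression.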

Posting one more /rerun-test in case it lands on the success side this cycle. If it still fails, this is environmental and not blocking for the merge.

fzyzcjy · 2026-05-19T16:48:21Z

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

ghost · 2026-05-19T16:48:57Z

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T16:53:59Z

AMD 2-GPU `stage-b-test-2-gpu-large-amd` failure — Mixtral/aiter MoE CUDA graph hang

Job log

Watchdog-triggered scheduler hang during CUDA graph capture for mistralai/Mixtral-8x7B-Instruct-v0.1 (tp_size=2, attention_backend=aiter, MoE via aiter/fused_moe.py:147):

[TP1] Pyspy failed (py-spy dump --native --pid 1464). Error: Failed to get stack traces
(repeats every 5 min, watchdog_timeout=300)

py-spy main thread stuck inside:
  fused_moe (aiter/fused_moe.py:147) -> ck_moe_stage1_fwd (aiter/ops/moe_op.py:555)
  -> run_moe_core -> mixtral.py:115 forward
  ... within cuda_graph_runner.capture

Hang in AMD aiter MoE kernel during CUDA graph capture. Not CUDA, AMD MI300 only. Our diff is scheduler-side chunked-resume bookkeeping with no aiter, no MoE, no AMD-specific paths. Cascaded fast-fails (wait-for-stage-b-amd, pr-test-amd-finish) trigger from this. Not blocking for merge.

fzyzcjy · 2026-05-20T00:11:24Z

AMD MI300 lane: two near-threshold perf assertion flakes — not ours, not blocking

Run on 5cc1a41d0d.

| Job | Test | Failure | Threshold |
| --- | --- | --- | --- |
| `stage-b-test-1-gpu-small-amd (1)` | `test/registered/moe/test_torch_compile_moe.py` | `235.7 < 240` (2% short) | "must be >= 240" |
| `stage-b-test-1-gpu-large-amd (1)` | `test/registered/perf/test_bench_serving_1gpu_part2.py` | `81.0 > 80` then `90.8 > 80` | "must be < 80" |

Both already retried internally once and failed twice; classic AMD MI300 hardware-noise perf threshold flake territory. Our diff is scheduler-side chunked-resume bookkeeping with no AMD / MoE / serving-perf code paths touched. wait-for-stage-b-amd cascade fails from these. Not blocking the CUDA gate.

fzyzcjy · 2026-05-20T02:29:05Z

Non-CUDA lane failures on `f531ac917b` — all hardware/perf, none ours

Run (AMD) + Run (NPU).

Classification (CUDA lane is still all-green; see rollup):

| Lane | Job | Cause |
| --- | --- | --- |
| AMD MI300 | `stage-b-test-1-gpu-small-amd (3)` | `HW Exception by GPU node-2 ... reason: GPU Hang` (hardware) |
| AMD MI300 | `stage-b-test-1-gpu-small-amd (6)`, `(9)` | likely same GPU-hang signature |
| AMD MI35x | `stage-b-test-1-gpu-small-amd-mi35x`, `stage-b-test-large-8-gpu-...-disaggregation-amd`, `stage-c-test-large-8-gpu-amd-mi35x (0)`, `(1)` | MI35x lane failures |
| AMD MI300 | `stage-c-test-4-gpu-amd (0)`, `stage-c-test-large-8-gpu-amd (1)` | non-blocking AMD stage-c |
| NPU | `stage-b-test-16-npu-a3` | `test_npu_deepep.py` failed in NPU-specific code path |
| NPU | `pr-test-npu-finish` | meta cascade |

None plausibly caused by this PR's diff (scheduler-side chunked-resume bookkeeping with no AMD, no NPU, no deep-EP, and no kernel-level paths). Per sglang-babysit-ci skill: non-CUDA lanes are only fixed when clearly ours AND easy. Not blocking the CUDA gate.

fzyzcjy · 2026-05-20T03:42:19Z

/rerun-test test/registered/8-gpu-models/test_deepseek_v32_indexcache.py

fzyzcjy · 2026-05-20T03:42:29Z

`extra-b-test-8-gpu-h200 (1)` CUDA OOM — DeepSeek-V3.2 server restart leak between two sub-tests

Job log.

Failing test: test/registered/8-gpu-models/test_deepseek_v32_indexcache.py (DeepSeek-V3.2, tp_size=8, NSA attention).

Pattern:

Test 1 (mem_fraction_static=0.897, index_topk_freq=4) ran -> server torn down
Test 2 (mem_fraction_static=0.881, index_topk_pattern=...) server init:
  RuntimeError: CUDA out of memory. Tried to allocate 5.74 GiB.
  GPU has 139.80 GiB capacity, 5.33 GiB free.
  Process 3553102 has 134.46 GiB memory in use.
  (PyTorch allocator: 124.84 GiB, plus 896 MiB in CUDA Graphs private pool,
   plus 6.32 GiB reserved-but-unallocated)

This is the well-known pattern of a CUDA allocation not being freed between sub-tests: the 134 GiB still held by sub-test 1's process means sub-test 2's init can't claim its mem_fraction_static budget.

Our diff is scheduler-side chunked-resume bookkeeping; no allocator, kv pool tear-down, multiprocessing, or DSv3.2 model code paths touched. Posting /rerun-test to confirm flake.

ghost · 2026-05-20T03:42:44Z

🚀 8-gpu-h200 (1 test): ✅ View workflow run

cd test/ && python3 registered/8-gpu-models/test_deepseek_v32_indexcache.py

fzyzcjy · 2026-05-20T05:39:32Z

/rerun-test test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

fzyzcjy · 2026-05-20T05:39:53Z

`base-c-test-dsv4-4-gpu-b200 (0)` CUDA fail — CUTLASS DSL / flashinfer binding TypeError, not ours

Job log.

Failing test: test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py (DeepSeek-V4, B200, FP4 mega-MoE).

Root cause: Python TypeError raised during cute.compile of flashinfer's rmsnorm kernel:

File ".../flashinfer/norm/kernels/rmsnorm.py", line 246, in kernel
    tidx, _, _ = cute.arch.thread_idx()
File ".../cutlass/_mlir/dialects/_nvvm_ops_gen.py", line 9830, in __init__
    super().__init__(self.OPERATION_NAME, ...)
TypeError: __init__(): incompatible function arguments.
  Supported: __init__(self, operation: object) -> None
  Invoked with: (ThreadIdXOp, str, tuple, NoneType, NoneType, kwargs={attributes:dict, results:list, operands:list, ...})

ThreadIdXOp constructor signature mismatch between the installed cutlass-dsl and what flashinfer.norm.kernels.rmsnorm is calling. Dependency binding bug, surfaces during CUDA-graph capture for DSv4 q_norm.

Our diff is scheduler-side chunked-resume bookkeeping; no flashinfer, no CUTLASS DSL, no rmsnorm, no DSv4 model code paths touched. Last successful rerun-test of this same file ~14h ago was on a different SHA, so it's plausibly a recent main env/dep drift.

Posting /rerun-test to confirm flake vs reproducible.

ghost · 2026-05-20T05:40:01Z

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

schedule_batch.py: drop self.maybe_wait_verify_done() call in merge_batch, since upstream removed verify_done.wait via FutureMap routing (sgl-project#25879); keep our branch's assert against chunked/dllm reqs in other.reqs.

test/registered/unit/managers/test_scheduler_chunked_req_gate.py: keep HEAD's deletion (v1 gate removed in v2); upstream's array.array migration is moot since the file goes away.

fzyzcjy requested review from ByronHsu, ClawSeven, ShangmingCai, Ying1123, hanming-lu, hnyls2002, hzh0425, ispobock, merrymercy, sufeng-buaa, xiezhq-hermann and yizhang2077 as code owners May 13, 2026 13:25

gemini-code-assist Bot reviewed May 13, 2026


fzyzcjy added 3 commits May 14, 2026 10:50

fzyzcjy closed this May 19, 2026

fzyzcjy reopened this May 19, 2026

fzyzcjy added the run-ci-extra label May 19, 2026

fzyzcjy and others added 4 commits May 19, 2026 23:00

Merge branch 'main' into feat/stateless_scheduler_b

f5363d4

Merge branch 'main' into feat/stateless_scheduler_b

5cc1a41

fzyzcjy added the bypass-fastfail label May 19, 2026

Merge branch 'main' into feat/stateless_scheduler_b

f531ac9

