DO NOT MERGE - CI sandbox for stateless scheduler b#25172
Conversation
|
/tag-and-rerun-ci |
There was a problem hiding this comment.
Code Review
This pull request refactors the chunked prefill mechanism by replacing the global chunked_req pointer with per-request state flags (has_pending_chunk and pending_middle_outputs). This change enhances support for pipeline parallelism and ensures more robust state management across iterations. The review feedback identifies potential null pointer crashes, logic inconsistencies in request abort handling, and performance optimizations for hot-path queue scans, all of which include actionable code suggestions.
| for mb_list in (self.mbs, self.last_mbs, self.running_mbs): | ||
| for mb in mb_list: | ||
| if mb is not None and not mb.is_empty(): | ||
| batch_reqs.extend(mb.reqs) |
There was a problem hiding this comment.
The iteration over self.mbs, self.last_mbs, and self.running_mbs will crash if any of these attributes are None. While self.mbs is typically a list, last_mbs and running_mbs are often None in certain scheduler states or configurations.
| for mb_list in (self.mbs, self.last_mbs, self.running_mbs): | |
| for mb in mb_list: | |
| if mb is not None and not mb.is_empty(): | |
| batch_reqs.extend(mb.reqs) | |
| if self.pp_size > 1 and hasattr(self, "mbs"): | |
| for mb_list in (self.mbs, self.last_mbs, self.running_mbs): | |
| if mb_list is not None: | |
| for mb in mb_list: | |
| if mb is not None and not mb.is_empty(): | |
| batch_reqs.extend(mb.reqs) |
| if (recv_req.abort_all or req.rid.startswith(recv_req.rid)) and ( | ||
| req.rid not in batch_rids | ||
| ): |
There was a problem hiding this comment.
Aborted requests that are currently in a batch (e.g., chunked-resume requests) should still be removed from the waiting_queue list to maintain a consistent scheduler state. The current logic skips them entirely. To avoid double-releasing resources, you should remove them from the list but skip the release_kv_cache call inside the processing loop (by checking if req.rid in batch_rids: continue).
if (recv_req.abort_all or req.rid.startswith(recv_req.rid)):| # priority + has_pending_chunk make it sit at the head, but its | ||
| # presence relaxes the "is queue empty / pool full" early exits below | ||
| # (we must keep scheduling it to make progress, or memory leaks). | ||
| has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue) |
There was a problem hiding this comment.
Performing an waiting_queue using any() in the scheduler's hot path is inefficient. Since the scheduling policy ensures that has_pending_chunk requests are sorted to the front of the queue, you can optimize this by checking only the first element.
| has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue) | |
| has_chunked_resume = self.waiting_queue[0].has_pending_chunk if self.waiting_queue else False |
| chunked_resume = next( | ||
| (r for r in self.waiting_queue if r.has_pending_chunk), None | ||
| ) |
There was a problem hiding this comment.
|
/rerun-failed-ci |
|
/rerun-test stage-b-test-2-gpu-large |
|
⛔ |
|
/rerun-failed-ci |
5 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
|
|
CI status snapshot (pre-rebase)Current attempt 5 results: 107 success / 9 failure / 16 skipped / 2 still queued. Per-failure classification (CUDA-lane only; AMD/NPU skipped per project policy):
Plan: rebasing |
When chunked-resume reqs are held in both waiting_queue and batch.reqs (stateless-scheduler refactor), abort_request would otherwise process them twice (queue pop + to_finish), causing duplicate send_output and double release_kv_cache. Build batch_rids upfront and skip waiting_queue removal for reqs already in batch — let to_finish path handle them. Pre-flight for stateless-scheduler v2.
For chunked-resume reqs (after the upcoming stateless-scheduler switch) that live in waiting_queue with non-empty prefix_indices, summing req.seqlen overcounts the committed prefix. Switch to seqlen - prefix for waiting reqs; keep the chunked_req block until that field is removed. Today's behavior is unchanged for fresh waiting reqs whose prefix_indices is empty. Pre-flight for stateless-scheduler v2.
Explicit comment that reqs still doing prefill (chunked-resume or DLLM staging) must not be merged into running_batch. Today enforced via chunked_req_to_exclude param; stateless-scheduler v2 will move to a per-req predicate. Pre-flight for v2.
Resolves rename collision: adopt upstream's `inflight_middle_chunks` (PR sgl-project#25720) and drop the local-only `pending_middle_outputs` rename. Port mixin changes deleted by upstream into the new component classes: - `scheduler_output_processor_mixin.py` -> port PP cross-mb finalize guard into `scheduler_components/batch_result_processor.py::_handle_finished_req`. - `scheduler_runtime_checker_mixin.py` -> port chunked-resume tail accounting into `scheduler_components/invariant_checker.py::_get_total_uncached_sizes` and `scheduler_components/pool_stats_observer.py::active_pool_idxs`. Both classes now take a `get_waiting_queue` callable. - `scheduler_metrics_mixin.py::_get_num_pending_tokens` -> adapted in `scheduler_components/load_inquirer.py` to read chunked tail from `waiting_queue` (chunked_req field was removed in this branch); dropped the `get_chunked_req` callable from `SchedulerLoadInquirer`. Drop `test_scheduler_chunked_req_gate.py` (deleted in this branch's v2 refactor; upstream only renamed `is_chunked` in it). Adopt upstream's component-style call sites for prebuilt batch processing in disaggregation/decode.py.
…eam rename) The name 'pending_middle_outputs' more precisely describes what the counter tracks: middle-block prefill forwards that are admitted but not yet output-processed (output processor uses it to decide whether this forward's sample is real (==0) or garbage (>0)). Restore the local-branch name across all call sites.
Upstream PR sgl-project#25444 moved Scheduler.pp_size onto a frozen ParallelState container (self.ps.pp_size). My branch's chunked-resume PP code still referenced the old direct attribute, causing AttributeError: 'Scheduler' object has no attribute 'pp_size' in _in_flight_other_mb_rids and abort_request.
CI status after merging
|
AMD
|
|
/rerun-test test/registered/perf/test_bench_serving_1gpu_part2.py |
|
🚀 |
|
/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py |
CUDA
|
|
🚀 |
LoRA Qwen3-8B
|
|
/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py |
|
🚀 |
AMD 2-GPU
|
AMD MI300 lane: two near-threshold perf assertion flakes — not ours, not blockingRun on
Both already retried internally once and failed twice; classic AMD MI300 hardware-noise perf threshold flake territory. Our diff is scheduler-side chunked-resume bookkeeping with no AMD / MoE / serving-perf code paths touched. |
Non-CUDA lane failures on
|
| Lane | Job | Cause |
|---|---|---|
| AMD MI300 | stage-b-test-1-gpu-small-amd (3) |
HW Exception by GPU node-2 ... reason: GPU Hang (hardware) |
| AMD MI300 | stage-b-test-1-gpu-small-amd (6), (9) |
likely same GPU-hang signature |
| AMD MI35x | stage-b-test-1-gpu-small-amd-mi35x, stage-b-test-large-8-gpu-...-disaggregation-amd, stage-c-test-large-8-gpu-amd-mi35x (0), (1) |
MI35x lane failures |
| AMD MI300 | stage-c-test-4-gpu-amd (0), stage-c-test-large-8-gpu-amd (1) |
non-blocking AMD stage-c |
| NPU | stage-b-test-16-npu-a3 |
test_npu_deepep.py failed in NPU-specific code path |
| NPU | pr-test-npu-finish |
meta cascade |
None plausibly caused by this PR's diff (scheduler-side chunked-resume bookkeeping with no AMD, no NPU, no deep-EP, and no kernel-level paths). Per sglang-babysit-ci skill: non-CUDA lanes are only fixed when clearly ours AND easy. Not blocking the CUDA gate.
|
/rerun-test test/registered/8-gpu-models/test_deepseek_v32_indexcache.py |
|
|
🚀 |
|
/rerun-test test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py |
|
|
🚀 |
schedule_batch.py: drop self.maybe_wait_verify_done() call in merge_batch — upstream removed verify_done.wait via FutureMap routing (sgl-project#25879); keep our branch's assert against chunked/dllm reqs in other.reqs. test/registered/unit/managers/test_scheduler_chunked_req_gate.py: keep HEAD's deletion (v1 gate removed in v2); upstream's array.array migration is moot since the file goes away.
Motivation
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ciCI States
Latest PR Test (Base): ✅ Run #26388971766
Latest PR Test (Extra): ✅ Run #26388971706