[Disagg] Finalize routed_experts_output in process_batch_result_disagg_prefill #23885
Merged
ByronHsu merged 1 commit on Apr 27, 2026
Conversation
…g_prefill PR sgl-project#22911 ("[perf] support return_routed_experts with overlap scheduling") introduced a deferred-D2H path for the captured routed-expert IDs. The agg-mode result handlers in scheduler_output_processor_mixin.py call RoutedExpertsOutput.finalize() after copy_done.synchronize() so the CPU-side tensor lands in host_cache.buffer. The PD-disagg prefill handler in disaggregation/prefill.py was missed, so its host buffer is never written and every prefill slot stays at the initial torch.zeros(...). Symptom: with PD-disagg + --enable-return-routed-experts + overlap scheduling, every prompt token's top_k row in the response is [0, 0, ..., 0]. Routing replay in trainers (e.g. Megatron) then asserts "Duplicate experts in routing! unique_counts=[1,1,...,1] expected=8". Fix: add the same finalize() call after copy_done.synchronize() in process_batch_result_disagg_prefill, matching the agg path. Verified locally on Qwen3-30B-A3B with 1 prefill + 1 decode + mini-lb + --enable-return-routed-experts + overlap scheduling. Pre-fix: 10 464 of 34 272 (token, layer) rows are all-zero. Post-fix: 0 bad rows across the same workload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qiaolin-Yu approved these changes on Apr 27, 2026
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026
…g_prefill (sgl-project#23885) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Motivation
PR #22911 ([perf] support return_routed_experts with overlap scheduling) introduced a deferred-D2H path for the captured routed-expert IDs. After `copy_done.synchronize()`, callers must invoke `RoutedExpertsOutput.finalize()` to write the CPU-side tensor into `host_cache.buffer`. The agg-mode handlers in `scheduler_output_processor_mixin.py` (`process_batch_result_prefill`, `process_batch_result_decode`) were updated to do this; the PD-disagg prefill handler in `disaggregation/prefill.py` was missed.

As a result, in PD-disagg mode the prefill worker's `host_cache.buffer` is never written for prefill slots; they stay at the initial `torch.zeros(...)`. `maybe_collect_routed_experts(req)` then reads zeros into `req.routed_experts` for every prompt position.

Symptom
PD-disagg + `--enable-return-routed-experts` + overlap scheduling: every prompt token's top_k row in the response is `[0, 0, ..., 0]`. Routing replay in downstream trainers (e.g. Megatron) then asserts `Duplicate experts in routing! unique_counts=[1,1,...,1] expected=8`.

This breaks RL workloads that rely on routing replay (e.g. `--enable-routing-replay` flows), since the prefill prompt tokens come back as all-zero top_k rows that the trainer interprets as 8 duplicate selections of expert 0.

Modifications
Add the same finalize call after `copy_done.synchronize()` in `process_batch_result_disagg_prefill`, mirroring the agg-mode handlers.

Is this also needed for `--disable-overlap-schedule`?

No: the added block is a no-op in that mode; overlap scheduling is the only mode where the finalize call is required.
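The difference between the two modes can be sketched with a minimal, self-contained mock. Class and function names here echo this PR's description; they are illustrative stand-ins, not sglang's real types or signatures:

```python
class HostCache:
    """Host-side buffer, pre-filled with zeros like the real cache."""
    def __init__(self, n):
        self.buffer = [0] * n  # stands in for torch.zeros(...)

class RoutedExpertsOutput:
    """Deferred-D2H handle: a result handler must call finalize() after
    copy_done.synchronize() to land the expert IDs in host_cache.buffer."""
    def __init__(self, expert_ids, host_cache):
        self.expert_ids = expert_ids
        self.host_cache = host_cache

    def finalize(self):
        self.host_cache.buffer[:] = self.expert_ids

def on_forward_end(no_copy_to_cpu, expert_ids, host_cache):
    """Mock of the capturer's two branches."""
    if no_copy_to_cpu:
        # Overlap scheduling: defer; some handler must finalize() later.
        return RoutedExpertsOutput(expert_ids, host_cache)
    # --disable-overlap-schedule: synchronous D2H, nothing deferred.
    host_cache.buffer[:] = expert_ids
    return None

# Overlap path: buffer stays zero (the bug) until finalize() runs (the fix).
cache = HostCache(4)
out = on_forward_end(True, [3, 1, 4, 1], cache)
assert cache.buffer == [0, 0, 0, 0]   # what the disagg prefill handler saw
if out is not None:                   # the added guard; fires in this mode
    out.finalize()
assert cache.buffer == [3, 1, 4, 1]

# Sync path: buffer is written inside the forward; the guard is a no-op.
cache2 = HostCache(4)
out2 = on_forward_end(False, [3, 1, 4, 1], cache2)
assert out2 is None and cache2.buffer == [3, 1, 4, 1]
```

The guard-then-finalize shape is why the same added block is safe in both modes: it only fires when a deferred handle exists.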
When `--disable-overlap-schedule` is set, `model_runner.py:2989` flips `no_copy_to_cpu = False`, and `on_forward_end` takes the `else` branch at `routed_experts_capturer.py:269`: it calls `_sync_fwd_experts_buffer_DtoH(...)`, which writes straight into `host_cache.buffer` synchronously and returns `None`. So `result.routed_experts_output is None`, and the new `if result.routed_experts_output is not None: ... finalize()` block is a no-op.

So, with `--disable-overlap-schedule`, the old synchronous `_sync_fwd_experts_buffer_DtoH` path writes the host buffer inside the forward pass; the added block doesn't fire and isn't needed.

Verification
Reproduced locally on Qwen3-30B-A3B with 1 prefill GPU + 1 decode GPU + mini-lb router + `--enable-return-routed-experts` + overlap scheduling on, mini-lb `_merge_routed_experts` enabled (PR #22916), 16 concurrent generation requests, max_new_tokens=32. Pre-fix: 10 464 of 34 272 (token, layer) rows are all-zero; post-fix: 0 bad rows across the same workload.

Pre-fix, all-zero rows are concentrated on prompt positions (the decode worker's host buffer never had them; the prefill worker's was the one that needed populating). Post-fix, all rows have the expected 8 unique expert IDs per token.
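The zero-row count can be reproduced with a small helper. This is a hypothetical check, not sglang's response schema: `rows` is assumed to be one top_k expert-ID list per (token, layer) pair:

```python
def count_all_zero_rows(rows):
    """Count (token, layer) top_k rows whose expert IDs are all zero,
    i.e. the bad rows in the pre-/post-fix comparison above."""
    return sum(1 for row in rows if all(expert_id == 0 for expert_id in row))

# Example: one all-zero prompt row (the bug) and one healthy row.
rows = [[0, 0, 0, 0, 0, 0, 0, 0], [12, 3, 47, 9, 21, 30, 5, 18]]
assert count_all_zero_rows(rows) == 1
```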
Checklist
🤖 Generated with Claude Code