[Session R3] Add routed_experts_start_len for absolute routing slice control #24851
Merged
ByronHsu merged 5 commits on May 10, 2026
Conversation
Add `routed_experts_start_len` parameter that lets callers specify an absolute start position for returned routed-expert data, covering `[start_len, seqlen-1)`. This gives RL rollout callers explicit control over which routing rows are returned, which is useful when the caller already knows the prompt prefix length and wants output-only or partial-prompt routings without relying on cache heuristics.

Motivation: In multi-turn RL rollouts, the accumulated routed-experts data grows with the full conversation length. Without slicing control, every request returns the full sequence's routing data, including the prefix-cached range, causing O(seqlen) host gathers and ZMQ payloads that produce ~1 s ITL spikes on long-context requests with high cache-hit ratios. With `routed_experts_start_len`, callers can request only the new tokens' routings, reducing the per-finish cost to O(seqlen - start_len).

Changes:

- Add `routed_experts_start_len: Optional[int] = None` field across the full request lifecycle: GenerateReqInput, TokenizedGenerateReqInput, OpenAI CompletionRequest/ChatCompletionRequest, Engine.generate/async_generate, Req, tokenizer_manager, scheduler, session_controller, encode_receiver, and serving_chat/serving_completions.
- Add validation in the scheduler: abort if start_len > prompt_tokens.
- Update BaseTopkCapturer.get_topk() with a start_len parameter and defensive clamping.
- Update maybe_collect_routed_experts to honor start_len, early-return when return_routed_experts is False, and log row-count mismatches.
- Add a comprehensive TestRoutedExpertsStartLen test class covering default behavior, row-count correctness, bounds checking, and cache-hit interaction.

Co-authored-by: Cursor <cursoragent@cursor.com>
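As a rough illustration of the collection-side behavior described above, here is a minimal sketch of the `start_len` handling in `maybe_collect_routed_experts`. The `req` fields (`routed_experts`, `seqlen`) and the logger setup are assumptions for the sketch, not the actual sglang implementation:

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)

def maybe_collect_routed_experts(req: Any) -> Optional[List[Any]]:
    # Early return when routing data was not requested at all.
    if not getattr(req, "return_routed_experts", False):
        return None

    rows: List[Any] = req.routed_experts  # one routing row per position (assumed field)
    start_len = req.routed_experts_start_len or 0

    # Keep only the slice [start_len, seqlen - 1).
    sliced = rows[start_len:]

    # Soft warning (not a hard failure) on row-count mismatch.
    expected = req.seqlen - 1 - start_len
    if len(sliced) != expected:
        logger.warning(
            "routed_experts row-count mismatch: got %d, expected %d",
            len(sliced), expected,
        )
    return sliced
```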
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
Author
cc @zyzshishui to take over
Collaborator
/tag-and-rerun-ci
zyzshishui reviewed on May 10, 2026
```python
return

if (
    recv_req.routed_experts_start_len is not None
```
Contributor
Maybe add `if recv_req.return_routed_experts and`
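A sketch of the guard the reviewer is suggesting; the surrounding scheduler code and the body of the conditional are assumed, only the two field names come from the PR:

```python
# Only run the start_len handling when routing data was actually requested.
if (
    recv_req.return_routed_experts
    and recv_req.routed_experts_start_len is not None
):
    # ...validate start_len against prompt_tokens here...
    pass
```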
Qiaolin-Yu approved these changes on May 10, 2026
Collaborator
Qiaolin-Yu left a comment
better to add some doc
Change default routed experts start len to 0 and add doc
Contributor
added, plz check again
ByronHsu added a commit that referenced this pull request on May 10, 2026
…bsolute routing slice control (#24904)

Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: zyzshishui <zyzshishui@gmail.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
ltcs11 added a commit to ltcs11/sglang that referenced this pull request on May 11, 2026
```
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU] Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py
```
Motivation
In multi-turn RL rollouts with MoE models (e.g. Kimi-K2, Qwen3-30B-A3B), `return_routed_experts` returns the full conversation-length routing data on every request, including the prefix-cached range. As conversations grow, the host gather + ZMQ payload scales as O(seqlen), producing ~1 s ITL spikes on long-context requests with high cache-hit ratios. This compounds with DP attention: since all DP ranks synchronize on every decode step, a single rank stalled on a long routed-experts gather blocks the entire batch across all ranks. This makes it the dominant decode bottleneck in our production RL training loop.

The problem
The caller (RL rollout client) only needs the new routing rows, not the full prefix it already accumulated from prior turns.
The solution
Add `routed_experts_start_len: Optional[int]`, which lets callers specify an absolute start position. The server returns routings covering `[start_len, seqlen - 1)` instead of the full `[0, seqlen - 1)`. The caller sets `start_len = len(accumulated_prompt)` and gets back only the output-token routings.
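As a usage illustration, here is a minimal caller-side sketch. The model path, engine setup, and token-id bookkeeping are placeholders, and passing `return_routed_experts` as a `generate` keyword argument is assumed from the request-lifecycle changes below; only `routed_experts_start_len` is introduced by this PR:

```python
import sglang as sgl

# Placeholder model path; not a tested configuration.
engine = sgl.Engine(model_path="Qwen/Qwen3-30B-A3B")
accumulated_prompt = list(range(128))  # token ids accumulated over prior turns

out = engine.generate(
    input_ids=accumulated_prompt,
    sampling_params={"max_new_tokens": 100},
    return_routed_experts=True,  # assumed keyword, per the PR's Changes list
    # Only routing rows for positions >= start_len come back:
    routed_experts_start_len=len(accumulated_prompt),
)
```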
Experiment results

Kimi-K2-Instruct, pure TP=8 (8× H200), `--enable-return-routed-experts --load-format dummy --moe-runner-backend triton --disable-piecewise-cuda-graph --cuda-graph-max-bs 4`. 95% prefix-cache hit, 100 output tokens. Timer placed after GPU sync.

Headline numbers
Per-component breakdown
Key findings:
- `routed_experts` host gather (`aten::index`): grows linearly in full mode (2.2 ms at 2k → 30 ms at 32k); flat ~0.1 ms with start_len at all sizes. This PR eliminates this cost.
- `stream_output` (ZMQ/pickle serialization): 5 ms → 90 ms in full mode (proportional to the 60 MB payload); flat ~0.4 ms with start_len (tiny payload).
- `release_kv`: grows linearly with seqlen in both modes (0.16 ms → 1.43 ms). This is the radix tree walk, unrelated to routed experts and the sole source of small residual growth.

Changes
Request lifecycle plumbing: `routed_experts_start_len: Optional[int] = None` added to the following (see the field sketch after this list):

- `GenerateReqInput`, `TokenizedGenerateReqInput` (io_struct.py)
- `CompletionRequest`, `ChatCompletionRequest` (protocol.py)
- `Engine.generate`, `Engine.async_generate` (engine.py)
- `Req` (schedule_batch.py)
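For concreteness, the plumbing amounts to one optional field on each of these structures. A minimal sketch; all fields except `routed_experts_start_len` are illustrative stand-ins, not the real `GenerateReqInput` definition:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GenerateReqInput:
    input_ids: List[int] = field(default_factory=list)  # stand-in field
    return_routed_experts: bool = False                  # stand-in field
    # Absolute start position for returned routed-expert rows; None keeps
    # the pre-existing full-sequence behavior.
    routed_experts_start_len: Optional[int] = None
```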
Server-side logic:

- `BaseTopkCapturer.get_topk()` gains a `start_len` parameter with defensive clamping (sketched below)
- `maybe_collect_routed_experts()` honors `req.routed_experts_start_len`, early-returns when `return_routed_experts` is False, and logs row-count mismatches as soft warnings
- Scheduler validates `start_len <= prompt_tokens` and aborts with a clear error otherwise
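A rough sketch of the clamping and the bounds check. The capturer's internals and the error path are assumptions; only the names, the `start_len` parameter, and the clamping/validation behavior come from the PR description:

```python
from typing import Any, List, Optional

class BaseTopkCapturer:
    """Sketch only: internals of the real capturer are assumed."""

    def __init__(self) -> None:
        self.rows: List[Any] = []  # captured top-k rows, one per position

    def get_topk(self, start_len: int = 0) -> List[Any]:
        # Defensive clamping: never slice outside the captured range.
        start_len = max(0, min(start_len, len(self.rows)))
        return self.rows[start_len:]

def validate_start_len(start_len: Optional[int], prompt_tokens: int) -> None:
    # Sketch of the scheduler check; the real code aborts the request
    # rather than raising a bare ValueError.
    if start_len is not None and start_len > prompt_tokens:
        raise ValueError(
            f"routed_experts_start_len={start_len} exceeds "
            f"prompt_tokens={prompt_tokens}"
        )
```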
Tests:

- `TestRoutedExpertsStartLen`: 4 test cases covering default behavior, row-count correctness, bounds checking (abort on too-large start_len), and cache-hit interaction (radix prefix extends past start_len)

Test plan
- `TestReturnRoutedExperts` passes (no regression to full-sequence return)
- `TestRoutedExpertsStartLen` passes with TP=2 on Qwen3-30B-A3B
- `start_len=None` (default) produces identical output to omitting the field
- `start_len=N` returns exactly `seqlen - 1 - N` rows matching the tail of the full return (sketched below)
- `start_len > prompt_tokens` aborts with a clear error message
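A minimal self-contained sketch of the row-count relationship from the test plan. The server is simulated by a local helper; only the `seqlen - 1 - N` arithmetic and the tail-match property come from the PR:

```python
def simulate_routing_rows(seqlen: int, start_len: int = 0) -> list:
    # Stand-in for the server: in full mode the routing data covers
    # [0, seqlen - 1); with start_len it covers [start_len, seqlen - 1).
    return list(range(start_len, seqlen - 1))

def test_start_len_row_count(seqlen: int = 32, n: int = 10) -> None:
    full = simulate_routing_rows(seqlen)
    tail = simulate_routing_rows(seqlen, start_len=n)
    assert len(tail) == seqlen - 1 - n  # exactly seqlen - 1 - N rows
    assert tail == full[n:]             # matching the tail of the full return

test_start_len_row_count()
```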