4 changes: 2 additions & 2 deletions docs_new/docs/basic_usage/openai_api_completions.ipynb
Original file line number Diff line number Diff line change
@@ -374,7 +374,7 @@
"source": [
"#### Returning Routed Experts (MoE Models)\n",
"\n",
"For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`."
"For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. By default this returns `[0, seqlen - 1)`, the full available sequence, because RL workflows need routed experts for the full sequence. Set `routed_experts_start_len` in `extra_body` to an absolute prefix length to return only `[routed_experts_start_len, seqlen - 1)`. For example, in multi-turn RL rollouts, routed experts for tokens from previous turns have already been collected, so setting this value avoids unnecessary transfer that cause bottlenecks."
]
},
{
@@ -468,7 +468,7 @@
"source": [
"#### Returning Routed Experts (MoE Models)\n",
"\n",
"For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`."
"For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. By default this returns `[0, seqlen - 1)`, the full available sequence, because RL workflows need routed experts for the full sequence. Set `routed_experts_start_len` in `extra_body` to an absolute prefix length to return only `[routed_experts_start_len, seqlen - 1)`. For example, in multi-turn RL rollouts, routed experts for tokens from previous turns have already been collected, so setting this value avoids unnecessary transfer that cause bottlenecks."
]
},
{
4 changes: 2 additions & 2 deletions docs_new/docs/basic_usage/openai_api_completions.mdx
@@ -342,7 +342,7 @@ for chunk in stream:

#### Returning Routed Experts (MoE Models)

For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`.
For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires the `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. By default this returns `[0, seqlen - 1)`, the full available sequence, since RL workflows typically need routed experts for every token. Set `routed_experts_start_len` in `extra_body` to an absolute prefix length to return only `[routed_experts_start_len, seqlen - 1)`. For example, in multi-turn RL rollouts, routed experts for tokens from previous turns have already been collected, so setting this value avoids unnecessary transfers that can cause bottlenecks.

```python Example
# Example with logit_bias parameter for completions API
@@ -406,7 +406,7 @@ print_highlight(f"Response: {response}")

#### Returning Routed Experts (MoE Models)

For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`.
For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires the `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. By default this returns `[0, seqlen - 1)`, the full available sequence, since RL workflows typically need routed experts for every token. Set `routed_experts_start_len` in `extra_body` to an absolute prefix length to return only `[routed_experts_start_len, seqlen - 1)`. For example, in multi-turn RL rollouts, routed experts for tokens from previous turns have already been collected, so setting this value avoids unnecessary transfers that can cause bottlenecks.
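The multi-turn bookkeeping above can be sketched as follows. This is a minimal illustration, not part of the SGLang API: `rows_returned` is a hypothetical helper that only computes the row count of the payload, which covers `[start_len, seqlen - 1)` and therefore holds `seqlen - 1 - start_len` rows.

```python
def rows_returned(seqlen: int, start_len: int = 0) -> int:
    """Number of routed-experts rows for a response covering
    [start_len, seqlen - 1); start_len=0 means the full sequence."""
    if start_len < 0:
        raise ValueError(f"{start_len=} must be non-negative")
    return seqlen - 1 - min(start_len, seqlen - 1)


# Turn 1: 10 prompt tokens + 5 completion tokens, full sequence requested.
turn1_seqlen = 10 + 5
assert rows_returned(turn1_seqlen) == 14

# Turn 2: the 15 tokens from turn 1 were already collected, so skip them.
turn2_seqlen = turn1_seqlen + 8 + 6  # + new user turn + new completion
assert rows_returned(turn2_seqlen, start_len=turn1_seqlen) == 13
```

Passing the accumulated prefix length as `routed_experts_start_len` in turn 2 means only the 13 new rows are transferred instead of all 28.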

## Structured Outputs (JSON, Regex, EBNF)

7 changes: 6 additions & 1 deletion docs_new/docs/basic_usage/sampling_params.mdx
@@ -107,7 +107,12 @@ The `/generate` endpoint accepts the following parameters in JSON format. For de
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>return_routed_experts</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`bool = False`</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Whether to return routed experts for MoE models. Requires `--enable-return-routed-experts` server flag. Returns base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`.</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Whether to return routed experts for MoE models. Requires `--enable-return-routed-experts` server flag. With the default `routed_experts_start_len=0`, returns the full available sequence `[0, seqlen - 1)` because RL workflows need routed experts for the full sequence. The result is base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`.</td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>routed_experts_start_len</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`int = 0`</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>If `return_routed_experts`, the absolute start position for returned routed-experts rows. `0` preserves the default full sequence; set it to an accumulated prefix length to return only `[routed_experts_start_len, seqlen - 1)`. For example, in multi-turn RL rollouts, routed experts for tokens from previous turns have already been collected, so setting this value avoids unnecessary transfer that cause bottlenecks. Must be in `[0, prompt_tokens]`.</td>
</tr>
</tbody>
</table>
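Not shown in the diff is how a client decodes the payload described above. Here is a minimal sketch using synthetic data — the dimensions and expert IDs are stand-ins; real values come from the model config and the server response:

```python
import base64

import numpy as np

# Synthetic stand-in for the `routed_experts` string in `sgl_ext`;
# a real payload comes from a server started with --enable-return-routed-experts.
num_tokens, num_layers, top_k = 3, 4, 2  # assumed dimensions
fake_ids = np.arange(num_tokens * num_layers * top_k, dtype=np.int32)
payload = base64.b64encode(fake_ids.tobytes()).decode("ascii")

# Decode: base64 -> raw bytes -> int32 -> [num_tokens, num_layers, top_k]
raw = base64.b64decode(payload)
experts = np.frombuffer(raw, dtype=np.int32).reshape(-1, num_layers, top_k)
print(experts.shape)  # (3, 4, 2)
```

Reshaping with `-1` in the first axis recovers `num_tokens` from the flat buffer, so the client only needs to know `num_layers` and `top_k` from the model config.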
4 changes: 2 additions & 2 deletions python/sglang/srt/entrypoints/engine.py
@@ -333,7 +333,7 @@ def generate(
custom_logit_processor: Optional[Union[List[str], str]] = None,
return_hidden_states: bool = False,
return_routed_experts: bool = False,
routed_experts_start_len: Optional[int] = None,
routed_experts_start_len: int = 0,
stream: bool = False,
bootstrap_host: Optional[Union[List[str], str]] = None,
bootstrap_port: Optional[Union[List[int], int]] = None,
@@ -425,7 +425,7 @@ async def async_generate(
custom_logit_processor: Optional[Union[List[str], str]] = None,
return_hidden_states: bool = False,
return_routed_experts: bool = False,
routed_experts_start_len: Optional[int] = None,
routed_experts_start_len: int = 0,
stream: bool = False,
bootstrap_host: Optional[Union[List[str], str]] = None,
bootstrap_port: Optional[Union[List[int], int]] = None,
4 changes: 2 additions & 2 deletions python/sglang/srt/entrypoints/openai/protocol.py
@@ -285,7 +285,7 @@ class CompletionRequest(BaseModel):
user: Optional[str] = None
return_hidden_states: bool = False
return_routed_experts: bool = False
routed_experts_start_len: Optional[int] = None
routed_experts_start_len: int = 0
return_cached_tokens_details: bool = False

# Extra parameters for SRT backend only and will be ignored by OpenAI models.
@@ -633,7 +633,7 @@ class ChatCompletionRequest(BaseModel):
parallel_tool_calls: bool = True
return_hidden_states: bool = False
return_routed_experts: bool = False
routed_experts_start_len: Optional[int] = None
routed_experts_start_len: int = 0
return_cached_tokens_details: bool = False
reasoning_effort: Optional[Literal["none", "low", "medium", "high", "max"]] = Field(
default=None,
6 changes: 3 additions & 3 deletions python/sglang/srt/managers/io_struct.py
@@ -176,8 +176,8 @@ class GenerateReqInput(BaseReq):
return_indexer_topk: bool = False
# Absolute start position for returned routings; response covers
# `[routed_experts_start_len, seqlen - 1)`. Must be in [0, prompt_tokens].
# None = full sequence.
routed_experts_start_len: Optional[int] = None
# 0 = full sequence.
routed_experts_start_len: int = 0

# The modalities of the image data [image, multi-images, video]
modalities: Optional[List[str]] = None
@@ -734,7 +734,7 @@ class TokenizedGenerateReqInput(BaseReq):
# Whether to return captured routed experts
return_routed_experts: bool = False
# See GenerateReqInput.routed_experts_start_len.
routed_experts_start_len: Optional[int] = None
routed_experts_start_len: int = 0

return_indexer_topk: bool = False

2 changes: 1 addition & 1 deletion python/sglang/srt/managers/schedule_batch.py
@@ -599,7 +599,7 @@ def __init__(
require_reasoning: bool = False,
return_hidden_states: bool = False,
return_routed_experts: bool = False,
routed_experts_start_len: Optional[int] = None,
routed_experts_start_len: int = 0,
return_indexer_topk: bool = False,
eos_token_ids: Optional[Set[int]] = None,
bootstrap_host: Optional[str] = None,
Expand Down
33 changes: 20 additions & 13 deletions python/sglang/srt/managers/scheduler.py
@@ -2161,19 +2161,26 @@ def handle_generate_request(
self._add_request_to_queue(req)
return

if (
recv_req.routed_experts_start_len is not None
and recv_req.routed_experts_start_len > len(req.origin_input_ids)
):
error_msg = (
f"{recv_req.routed_experts_start_len=} is higher than the "
f"number of input tokens {len(req.origin_input_ids)=}. Please "
f"use a smaller routed_experts_start_len."
)
req.routed_experts_start_len = None
req.set_finish_with_abort(error_msg)
self._add_request_to_queue(req)
return
if recv_req.return_routed_experts:
error_msg = None
if recv_req.routed_experts_start_len < 0:
error_msg = (
f"{recv_req.routed_experts_start_len=} is lower than 0. "
"Please use a non-negative routed_experts_start_len."
)

if recv_req.routed_experts_start_len > len(req.origin_input_ids):
error_msg = (
f"{recv_req.routed_experts_start_len=} is higher than the "
f"number of input tokens {len(req.origin_input_ids)=}. Please "
f"use a smaller routed_experts_start_len."
)

if error_msg is not None:
req.routed_experts_start_len = 0
req.set_finish_with_abort(error_msg)
self._add_request_to_queue(req)
return

added_to_grammar_queue = self.grammar_manager.process_req_with_grammar(req)
if not added_to_grammar_queue:
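The checks added in this hunk define the accepted range for the parameter. Condensed into a standalone sketch (`start_len_error` is illustrative, not the scheduler's actual code path — aborting and re-queueing are omitted):

```python
from typing import Optional


def start_len_error(start_len: int, num_input_tokens: int) -> Optional[str]:
    """A routed_experts_start_len is valid iff it lies in [0, num_input_tokens]."""
    if start_len < 0:
        return f"{start_len=} is lower than 0."
    if start_len > num_input_tokens:
        return f"{start_len=} is higher than the number of input tokens."
    return None


assert start_len_error(0, 10) is None        # full sequence
assert start_len_error(10, 10) is None       # equal to prompt length is allowed
assert start_len_error(-1, 10) is not None   # request would be aborted
assert start_len_error(11, 10) is not None   # request would be aborted
```

Validating up front lets the scheduler abort the request with a clear message instead of silently clamping on the capture path.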
11 changes: 4 additions & 7 deletions python/sglang/srt/managers/scheduler_output_processor_mixin.py
@@ -113,9 +113,9 @@ def maybe_collect_routed_experts(self: Scheduler, req: Req):
Returns immediately if `return_routed_experts` was not set on the
request, so non-opted-in reqs don't pay the host-gather cost.

When `req.routed_experts_start_len` is set, honor the caller's
absolute start so the response covers `[start_len, seqlen - 1)`.
Otherwise the full sequence is returned (`start_len = 0`).
Honors the caller's absolute start so the response covers
`[start_len, seqlen - 1)`. The default start_len is 0, which returns
the full sequence.

Logs a soft warning if the resulting tensor's row count differs from
the expected `seqlen - 1 - start_len`, to catch silent regressions.
@@ -125,10 +125,7 @@ def maybe_collect_routed_experts(self: Scheduler, req: Req):
capturer = get_global_experts_capturer()
if capturer is None:
return
if req.routed_experts_start_len is not None:
start_len = req.routed_experts_start_len
else:
start_len = 0
start_len = req.routed_experts_start_len
req.routed_experts = capturer.get_topk(
req_pool_idx=req.req_pool_idx,
seqlen=req.seqlen,
4 changes: 3 additions & 1 deletion python/sglang/srt/state_capturer/base.py
@@ -149,7 +149,9 @@ def get_topk(
req_to_token_pool: ReqToTokenPool,
start_len: int = 0,
) -> torch.Tensor:
start_len = max(0, min(start_len, seqlen - 1))
if start_len < 0:
raise ValueError(f"{start_len=} must be non-negative")
start_len = min(start_len, seqlen - 1)
cache_pool_idx = (
req_to_token_pool.req_to_token[req_pool_idx][start_len : seqlen - 1]
.cpu()
22 changes: 10 additions & 12 deletions test/registered/rl/test_return_routed_experts.py
@@ -272,7 +272,7 @@ def compare_baseline_w_reference(baseline, reference):
class TestRoutedExpertsStartLen(CustomTestCase):
"""Verify the `routed_experts_start_len` parameter:

- default (None) returns the full sequence
- default (0) returns the full sequence
- explicit start_len crops the response and the cropped tail matches
the corresponding tail of the full response
"""
@@ -326,25 +326,23 @@ def _seqlen(self, resp_json: dict) -> int:
meta = resp_json["meta_info"]
return meta["prompt_tokens"] + meta["completion_tokens"]

def test_start_len_none_is_default(self):
"""Omitting the field must match `routed_experts_start_len=None`,
def test_start_len_zero_is_default(self):
"""Omitting the field must match `routed_experts_start_len=0`,
which returns the full sequence (start_len=0)."""
resp_default = self._send(self._build_payload()).json()
resp_none = self._send(
self._build_payload(routed_experts_start_len=None)
).json()
resp_zero = self._send(self._build_payload(routed_experts_start_len=0)).json()

rows_default = self._routed_experts(resp_default)
rows_none = self._routed_experts(resp_none)
rows_zero = self._routed_experts(resp_zero)

seqlen_default = self._seqlen(resp_default)
seqlen_none = self._seqlen(resp_none)
self.assertEqual(seqlen_default, seqlen_none)
seqlen_zero = self._seqlen(resp_zero)
self.assertEqual(seqlen_default, seqlen_zero)
self.assertEqual(rows_default.shape[0], seqlen_default - 1)
self.assertEqual(rows_none.shape[0], seqlen_none - 1)
self.assertEqual(rows_zero.shape[0], seqlen_zero - 1)
self.assertTrue(
np.array_equal(rows_default, rows_none),
"default and explicit None must produce identical routed experts",
np.array_equal(rows_default, rows_zero),
"default and explicit 0 must produce identical routed experts",
)

def test_start_len_controls_row_count(self):