[perf] support return_routed_experts with overlap scheduling by Qiaolin-Yu · Pull Request #22911 · sgl-project/sglang

Qiaolin-Yu · 2026-04-15T23:12:38Z

Motivation

Before,

After,

Modifications

Accuracy Tests

Speed Tests and Profiling

H200

python3 -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --port 30088 --enable-return-routed-experts --tp 4 --disable-flashinfer-autotune

python3 -m sglang.bench_serving   --backend sglang   --host 127.0.0.1   --port 30088   --dataset-name random   --num-prompts 5   --random-input 1024   --random-output 1024   --max-concurrency 1   --return-routed-experts

Before this PR:
Output token throughput (tok/s): 172.58

After this PR:
Output token throughput (tok/s): 260.37
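
To put the two figures above in relative terms, here is a small arithmetic sketch. It uses only the throughput numbers reported in this benchmark; the variable names are illustrative.

```python
# Throughput figures copied from the benchmark output above (tok/s).
before_tok_s = 172.58
after_tok_s = 260.37

# Relative speedup and percentage gain of the "after" run over the "before" run.
speedup = after_tok_s / before_tok_s
gain_pct = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.2f}x, gain: {gain_pct:.1f}%")
# prints "speedup: 1.51x, gain: 50.9%"
```

This is a single-concurrency run with 5 prompts, so the ratio is best read as indicative rather than as a general result.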

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-04-15T23:12:42Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Qiaolin-Yu · 2026-04-15T23:44:12Z

/rerun-test test/registered/rl/test_return_routed_experts.py

github-actions · 2026-04-15T23:44:41Z

✅ 2-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/rl/test_return_routed_experts.py

Qiaolin-Yu · 2026-04-16T00:01:05Z

/tag-and-rerun-ci

Qiaolin-Yu · 2026-04-21T08:41:50Z

/rerun-test test/registered/rl/test_return_routed_experts.py

github-actions · 2026-04-21T08:42:20Z

✅ 2-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/rl/test_return_routed_experts.py


Cherry-pick of upstream PR #22911 (commit c560326) onto sglang-miles.

Adds a `RoutedExpertsOutput` dataclass and `no_copy_to_cpu` path so the routed-experts capture can defer the device-to-host copy until after the forward stream's `copy_done` event, allowing overlap scheduling to keep the GPU busy. Result is plumbed through `GenerationBatchResult`, `TpModelWorker`, and both EAGLE v2 spec workers; the scheduler output processor finalizes the host-side write after `copy_done.synchronize()`.

Conflicts in `routed_experts_capturer.py` and `model_runner.py` were resolved to keep miles-side changes on top of upstream's `_get_local_range` / `no_copy_to_cpu` refactor:
- draft-worker guard around `on_forward_end`
- `bs * num_tokens_per_bs` cuda_graph token-count fix for spec decoding
- DeepEP all-gather path (skip DP-rank slicing when DeepEP is on)

Verified on H200 TP=4 Qwen3-30B-A3B (batch=64, in=1024, out=512): output throughput 7453.48 -> 8609.21 tok/s (+15.5%). Router replay accuracy test (test_return_routed_experts) passes 3/3 with 0 mismatches across ~26.7M expert IDs.

Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
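
The deferred-copy idea in the commit message above (return from the forward step without waiting, then finalize the host-side data only after a `copy_done` signal) can be illustrated with a minimal, CPU-only sketch. This is not SGLang's implementation: `DeferredRoutedExperts`, `forward_step`, and `finalize` are hypothetical names, a `threading.Event` stands in for the CUDA event, and plain lists stand in for device and host tensors.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeferredRoutedExperts:
    """Hypothetical stand-in for a routed-experts result whose host copy is deferred."""

    device_buffer: List[int]        # stands in for a GPU tensor
    copy_done: threading.Event      # stands in for a CUDA event
    host_buffer: Optional[List[int]] = None

    def finalize(self) -> List[int]:
        # Block only here, at the point where the host actually needs the data.
        self.copy_done.wait()
        if self.host_buffer is None:
            self.host_buffer = list(self.device_buffer)
        return self.host_buffer


def forward_step(expert_ids: List[int]) -> DeferredRoutedExperts:
    """Start simulated asynchronous work and return without waiting for it."""
    out = DeferredRoutedExperts(device_buffer=[], copy_done=threading.Event())

    def simulated_gpu_work() -> None:
        out.device_buffer.extend(expert_ids)  # simulated expert-ID capture
        out.copy_done.set()                   # simulated event completion

    threading.Thread(target=simulated_gpu_work).start()
    return out


# The caller can launch the next step before finalizing the previous one.
first = forward_step([3, 7, 1])
second = forward_step([4, 4, 2])
print(first.finalize())
print(second.finalize())
```

The point of the pattern is that the blocking wait moves out of the forward path and into the result-processing path, so the next batch can be launched while the previous copy is still in flight.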

… scheduling (#23860) Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

poc

9511984

Qiaolin-Yu requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, hnyls2002, ispobock, merrymercy and xiezhq-hermann as code owners April 15, 2026 23:12

Qiaolin-Yu added 3 commits April 15, 2026 23:27

refine

c108460

refine

d89cc46

upd

284adef

Qiaolin-Yu assigned hnyls2002 Apr 16, 2026

Merge branch 'main' into overlap_r3

388dbe7

github-actions Bot added the run-ci label Apr 16, 2026

zyzshishui approved these changes Apr 21, 2026


fix

9aa6401

Merge branch 'main' into overlap_r3

814674a

zyzshishui mentioned this pull request Apr 21, 2026

Fix Overlap R3 #23349

Closed

5 tasks

hnyls2002 added the high priority label Apr 21, 2026

no diff

b9189ba

hnyls2002 approved these changes Apr 21, 2026


hnyls2002 merged commit c560326 into main Apr 21, 2026
22 of 43 checks passed

hnyls2002 deleted the overlap_r3 branch April 21, 2026 21:42

zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026

[perf] support return_routed_experts with overlap scheduling (sgl-pro…

0b005ff

…ject#22911) Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>

ByronHsu mentioned this pull request Apr 27, 2026

[sglang-miles] Cherry-pick #22911: return_routed_experts with overlap scheduling #23854

Closed

3 tasks

This was referenced Apr 27, 2026

[perf] cherry-pick #22911: support return_routed_experts with overlap scheduling #23860

Merged

[Disagg] Finalize routed_experts_output in process_batch_result_disagg_prefill #23885

Merged

hnyls2002 mentioned this pull request Apr 29, 2026

Deepseek V4 #23882

Merged

hnyls2002 merged 8 commits into main from overlap_r3

3 participants
