[perf] support return_routed_experts with overlap scheduling #22911

Merged: hnyls2002 merged 8 commits into main from overlap_r3 on Apr 21, 2026

@Qiaolin-Yu (Collaborator) commented Apr 15, 2026

Motivation

Before:

[image]

After:

[image]

Modifications

Accuracy Tests

Speed Tests and Profiling

H200:

python3 -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --port 30088 --enable-return-routed-experts --tp 4 --disable-flashinfer-autotune
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 30088 --dataset-name random --num-prompts 5 --random-input 1024 --random-output 1024 --max-concurrency 1 --return-routed-experts

Before this PR:
Output token throughput (tok/s): 172.58

After this PR:
Output token throughput (tok/s): 260.37
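
As a quick sanity check on the two figures above, the relative gain can be computed directly. This is only an arithmetic sketch; `relative_gain` is a hypothetical helper name and not part of sglang.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Output token throughput (tok/s) quoted above for the H200 run.
gain = relative_gain(172.58, 260.37)
print(f"{gain:.1f}%")  # prints 50.9%
```

In other words, at a max concurrency of 1 the reported output throughput rises by roughly half.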

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@Qiaolin-Yu
Collaborator Author

/rerun-test test/registered/rl/test_return_routed_experts.py

@github-actions
Contributor

2-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/rl/test_return_routed_experts.py

@Qiaolin-Yu
Collaborator Author

/tag-and-rerun-ci

@Qiaolin-Yu
Collaborator Author

/rerun-test test/registered/rl/test_return_routed_experts.py

@github-actions
Contributor

2-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/rl/test_return_routed_experts.py

@zyzshishui mentioned this pull request Apr 21, 2026 (5 tasks)
@hnyls2002 merged commit c560326 into main Apr 21, 2026 (22 of 43 checks passed)
@hnyls2002 deleted the overlap_r3 branch April 21, 2026 21:42
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
…ject#22911)

Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
ByronHsu pushed a commit that referenced this pull request Apr 27, 2026
Cherry-pick of upstream PR #22911 (commit c560326) onto sglang-miles.

Adds a `RoutedExpertsOutput` dataclass and `no_copy_to_cpu` path so the
routed-experts capture can defer the device-to-host copy until after the
forward stream's `copy_done` event, allowing overlap scheduling to keep
the GPU busy. Result is plumbed through `GenerationBatchResult`,
`TpModelWorker`, and both EAGLE v2 spec workers; the scheduler output
processor finalizes the host-side write after `copy_done.synchronize()`.

Conflicts in `routed_experts_capturer.py` and `model_runner.py` were
resolved to keep miles-side changes on top of upstream's
`_get_local_range` / `no_copy_to_cpu` refactor:
- draft-worker guard around `on_forward_end`
- `bs * num_tokens_per_bs` cuda_graph token-count fix for spec decoding
- DeepEP all-gather path (skip DP-rank slicing when DeepEP is on)

Verified on H200 TP=4 Qwen3-30B-A3B (batch=64, in=1024, out=512):
output throughput 7453.48 -> 8609.21 tok/s (+15.5%). Router replay
accuracy test (test_return_routed_experts) passes 3/3 with 0
mismatches across ~26.7M expert IDs.

Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
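
The deferred-copy idea described in the commit message above can be illustrated with a small, CPU-only sketch. It is not the sglang implementation: a `threading.Event` stands in for the forward stream's `copy_done` event, plain lists stand in for device and host buffers, and `RoutedExpertsOutputSketch`, `forward_step`, and `finalize` are hypothetical names chosen for illustration. Only the `no_copy_to_cpu` flag and the "wait on `copy_done` before the host-side write is used" ordering are taken from the text.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoutedExpertsOutputSketch:
    """Hypothetical stand-in for the RoutedExpertsOutput dataclass."""
    device_buffer: List[int]                 # pretend this lives on the GPU
    host_buffer: Optional[List[int]] = None  # filled in by the copy
    copy_done: threading.Event = field(default_factory=threading.Event)


def forward_step(expert_ids: List[int], no_copy_to_cpu: bool) -> RoutedExpertsOutputSketch:
    """Capture routed expert IDs; optionally defer the device-to-host copy."""
    out = RoutedExpertsOutputSketch(device_buffer=list(expert_ids))

    def copy_to_host() -> None:
        out.host_buffer = list(out.device_buffer)
        out.copy_done.set()  # signal that the host copy is complete

    if no_copy_to_cpu:
        # Deferred path: the copy runs in the background, so the caller can
        # go on to schedule the next batch without waiting.
        threading.Thread(target=copy_to_host).start()
    else:
        # Blocking path: the caller waits for the copy before returning.
        copy_to_host()
    return out


def finalize(out: RoutedExpertsOutputSketch) -> List[int]:
    """Output-processor side: wait for the copy, then read the host data."""
    out.copy_done.wait()  # plays the role of copy_done.synchronize()
    assert out.host_buffer is not None
    return out.host_buffer


result = forward_step([3, 1, 4, 1], no_copy_to_cpu=True)
# ... the scheduler would prepare the next batch here ...
print(finalize(result))  # prints [3, 1, 4, 1]
```

The point of the sketch is the ordering: the producer returns immediately on the deferred path, and only the consumer blocks, and only at the moment it actually needs the host-side data.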
ByronHsu added a commit that referenced this pull request Apr 28, 2026
… scheduling (#23860)

Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hnyls2002 mentioned this pull request Apr 29, 2026