[R3] Avoid implicit CUDA sync in routed experts DP slicing#24550

Merged
hnyls2002 merged 5 commits into sgl-project:main from zyzshishui:fix-r3-overlap
May 7, 2026

Conversation

@zyzshishui
Contributor

Motivation

Fixes #24514

Modifications

Compute the routed-experts DP local slice indices from forward_batch.global_num_tokens_cpu instead of forward_batch.global_num_tokens_gpu.

This keeps the slice bounds as Python integers while preserving the existing indexing semantics:

  • non-CUDA-graph path: use the prefix sum of per-DP token counts
  • CUDA graph path: use dp_rank * cuda_graph_batch
  • DeepEP path remains unchanged, since capture already all-gathers into the head of the per-rank buffer
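A minimal sketch of the idea behind the fix (the function below is illustrative, not SGLang's actual code; only the names `global_num_tokens_cpu`, `dp_rank`, and `cuda_graph_batch` come from the PR description). The key point is that slice bounds derived from host-side Python ints never touch the GPU, whereas reading bounds out of a GPU tensor forces an implicit device-to-host sync that breaks overlap:

```python
def local_slice_bounds(global_num_tokens_cpu, dp_rank, cuda_graph_batch=None):
    """Return (start, end) as Python ints for this DP rank's token slice.

    global_num_tokens_cpu: list of per-DP-rank token counts, already on host.
    cuda_graph_batch: padded per-rank batch size when replaying a CUDA graph.
    """
    if cuda_graph_batch is not None:
        # CUDA graph path: every rank is padded to the captured batch size.
        start = dp_rank * cuda_graph_batch
        end = start + cuda_graph_batch
    else:
        # Non-CUDA-graph path: prefix sum of per-DP-rank token counts.
        start = sum(global_num_tokens_cpu[:dp_rank])
        end = start + global_num_tokens_cpu[dp_rank]
    return start, end
```

Because `start` and `end` are plain ints, indexing like `buffer[start:end]` enqueues work asynchronously; deriving the same bounds from `global_num_tokens_gpu` would require a `.item()`-style read that stalls the stream.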

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbefc8d428

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Three comment threads on python/sglang/srt/state_capturer/routed_experts.py (all marked Outdated)
@hnyls2002
Collaborator

/rerun-test test_return_routed_experts.py

@github-actions
Contributor

github-actions Bot commented May 6, 2026

4-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/rl/test_return_routed_experts.py

@hnyls2002 hnyls2002 merged commit 4a279d9 into sgl-project:main May 7, 2026
85 of 101 checks passed
@zyzshishui zyzshishui deleted the fix-r3-overlap branch May 7, 2026 02:20
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026


Development

Successfully merging this pull request may close these issues.

[R3] Non-DeepEP DP attention codepath causes implicit CUDA sync in routed experts overlap

3 participants