Fix Overlap R3 #23349
Conversation
Code Review
This pull request modifies the copy_to_cpu method in python/sglang/srt/managers/utils.py by removing the conditional check for return_routed_experts when copying routed expert outputs to the CPU. Feedback indicates that this change makes the return_routed_experts parameter unused and leads to inefficient memory management and unnecessary data transfers. It is recommended to restore the original conditional logic to maintain efficiency.
```python
if self.routed_experts_output is not None:
    self.routed_experts_output.copy_to_cpu()
else:
    self.routed_experts_output = None
```
The parameter return_routed_experts is now unused in this function. Without that check, the guard that decides between copying and clearing self.routed_experts_output no longer depends on whether the user requested routed experts, which leads to unnecessary D2H copies and keeps GPU tensors alive when they are not needed. If the intention was to always copy these experts, the parameter should be removed from the function signature; otherwise, the previous conditional logic should be restored to preserve efficiency and proper memory management.
Suggested change:

```diff
-if self.routed_experts_output is not None:
+if self.routed_experts_output is not None and return_routed_experts:
     self.routed_experts_output.copy_to_cpu()
 else:
     self.routed_experts_output = None
```
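To make the requested behavior concrete, here is a minimal runnable sketch of the gated pattern. The RoutedExpertsCapturer and _DeviceTensor names are hypothetical illustrations, not sglang's actual classes; only the conditional mirrors the suggested change.

```python
from typing import Optional


class _DeviceTensor:
    """Hypothetical stand-in for a GPU-resident tensor wrapper."""

    def copy_to_cpu(self) -> None:
        # In sglang this would issue a device-to-host (D2H) transfer;
        # here it is stubbed out for illustration.
        print("D2H copy issued")


class RoutedExpertsCapturer:
    """Hypothetical holder for captured routed-expert outputs."""

    def __init__(self) -> None:
        self.routed_experts_output: Optional[_DeviceTensor] = None

    def copy_to_cpu(self, return_routed_experts: bool) -> None:
        # The gating the review asks to restore: transfer only on request.
        if self.routed_experts_output is not None and return_routed_experts:
            self.routed_experts_output.copy_to_cpu()
        else:
            # Drop the reference so the GPU tensor can be freed promptly.
            self.routed_experts_output = None


capturer = RoutedExpertsCapturer()
capturer.routed_experts_output = _DeviceTensor()
capturer.copy_to_cpu(return_routed_experts=False)  # no transfer; tensor released
```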
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a1c4236d6
```python
if self.routed_experts_output is not None:
    self.routed_experts_output.copy_to_cpu()
```
Restore routed-expert copy gating by request flag
copy_to_cpu() now always calls self.routed_experts_output.copy_to_cpu() whenever the capturer is enabled, even when no request asked for routed experts. In overlap mode, batch.return_routed_experts is computed as any(req.return_routed_experts for req in reqs) (see schedule_batch.py), so this removed guard turns an optional D2H path into a per-batch cost. For MoE models this can add large host transfers and finalize work on every step, materially reducing throughput/latency for workloads that enable routed-expert support but only occasionally request it.
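The any(...) aggregation the comment cites can be illustrated with a small sketch. The Req and Batch dataclasses below are simplified stand-ins for the structures in schedule_batch.py, shown only to make clear why the guard turns the D2H copy into a per-request opt-in rather than an unconditional per-batch cost.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    """Simplified stand-in for a scheduled request."""

    return_routed_experts: bool = False


@dataclass
class Batch:
    """Simplified stand-in for a scheduled batch in overlap mode."""

    reqs: List[Req] = field(default_factory=list)

    @property
    def return_routed_experts(self) -> bool:
        # Mirrors the aggregation cited from schedule_batch.py: the flag
        # is set iff at least one request in the batch opted in.
        return any(req.return_routed_experts for req in self.reqs)


batch = Batch(reqs=[Req(), Req(return_routed_experts=True)])
assert batch.return_routed_experts  # one opted-in request enables the copy
```

With the guard restored, capturer.copy_to_cpu(batch.return_routed_experts) issues a transfer only when at least one request in the batch opted in.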
Motivation
Merged into #22911
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci