Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request reverts a set of performance optimizations introduced in a prior commit (#19190). The primary motivation is to diagnose and resolve a persistent CI test failure.
/rerun-stage stage-b-test-large-1-gpu (✅ Triggered)
Code Review
This pull request reverts a previous set of changes aimed at improving performance by removing GPU synchronization points. The revert is motivated by CI test flakiness that appeared after the original optimizations. The changes re-introduce synchronization during tensor creation and data transfers by removing pin_memory=True and non_blocking=True options, and also remove an optimization in torch.repeat_interleave by dropping the output_size parameter. The logic for creating certain tensors, such as mamba_track_indices, is simplified, which may result in more CPU-GPU interaction. The field all_extend_in_batch is also removed across several files. This revert seems to be a reasonable step to restore stability.
Tested on local machine:
Re-applies #19190 (reverted in #19581) but excludes the logits_processor.py changes that caused the KL divergence regression in test_swa_radix_cache_kl. The logits_processor changes switched sample_indices, input_logprob_indices, and pruned_lens from synchronous device creation to pin_memory + non_blocking transfers, and added output_size to repeat_interleave. This removed implicit GPU sync points that changed numerical behavior, roughly doubling baseline KL divergence from ~0.001 to ~0.002+ and causing CI flakes.

All other optimizations are preserved:

- Mamba track indices: GPU-only construction without scalar extraction
- Mamba cache zeroing: expand-from-scalar pattern (no CPU-GPU sync)
- Ping-pong buffer: avoid Python-list advanced indexing on device tensors
- all_extend_in_batch field propagation
- connector: device parameter default
- model_runner: empty_cache=False for memory check
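The excluded logits_processor.py change revolves around the difference between creating a tensor directly on the device and staging it in pinned host memory with an asynchronous copy. A minimal sketch of both patterns (illustrative only, not the actual sglang code; function names are hypothetical):

```python
import torch

def to_device_sync(values, device):
    # Synchronous path: the tensor is materialized directly on the
    # target device, which introduces an implicit CPU-GPU sync point.
    return torch.tensor(values, dtype=torch.int64, device=device)

def to_device_async(values, device):
    # Async path (the #19190 pattern): build the tensor in pinned host
    # memory, then copy with non_blocking=True so the host thread does
    # not wait for the transfer to complete.
    pin = device.type == "cuda"
    host = torch.tensor(values, dtype=torch.int64, pin_memory=pin)
    return host.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sample_indices = to_device_async([0, 2, 5], device)
```

Both paths produce the same values; the async path only changes when the data becomes visible on the device, which is the kind of ordering change that can shift numerics in downstream kernels.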
- Vectorize mamba_track_indices via torch.gather instead of per-req scalar extraction
- Vectorize mamba_track_mask via tensor arithmetic instead of Python list comprehension
- Replace Python-list advanced indexing in free_mamba_cache with integer slicing
- Use GPU zero-expand pattern in MambaPool.alloc to avoid implicit CPU-GPU sync
- Keep tensor references in HybridReqToTokenPool.alloc instead of .tolist() roundtrip
- Add all_extend_in_batch field for prefill cudagraph with DP attention
- Default device=None in create_remote_connector
- Avoid unnecessary cache clearing in weight update logging

Split from sgl-project#19190 (reverted in sgl-project#19581): excludes logits_processor.py changes that caused SWA KL test regression. Mamba decode vectorization from internal PR.
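The gather-based vectorization in the first bullet can be sketched as follows (a simplified illustration with hypothetical names, not the actual sglang code):

```python
import torch

def track_indices_loop(table, req_ids):
    # Per-request scalar extraction: every .item() call forces a
    # CPU-GPU synchronization when table lives on the GPU.
    return torch.tensor([table[r].item() for r in req_ids],
                        device=table.device)

def track_indices_gather(table, req_ids):
    # Vectorized: a single gather stays on-device with no scalar sync.
    return torch.gather(table, 0, req_ids)

table = torch.tensor([10, 11, 12, 13])
req_ids = torch.tensor([3, 1])
print(track_indices_gather(table, req_ids))  # tensor([13, 11])
```

Both functions return the same values; the gather version simply keeps the whole lookup on the device.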
…nsfers

- Use pin_memory + non_blocking for sample_indices and input_logprob_indices in _get_pruned_states to avoid CPU-GPU sync
- Use pin_memory + non_blocking for pruned_lens in _expand_metadata_for_logprobs
- Add output_size to repeat_interleave calls to avoid implicit device sync

Note: This was split from sgl-project#19190 (reverted in sgl-project#19581) because these changes caused SWA KL test regression. Landing separately to allow independent validation.
…P, disable cache reset in mem check (sgl-project#19190)" (sgl-project#19581) Co-authored-by: Alison Shao <alisonshao@mac.lan>
Motivation
Revert #19190 to verify it is the root cause of the test_swa_radix_cache_kl CI flake on stage-b-test-large-1-gpu.

Bisect results (on H200):

- 8c0f2d40b (before #19190)
- b5a8e4179 (#19190)

The sync patch removes GPU synchronization points in logits_processor.py (pin_memory + non_blocking transfers, output_size in repeat_interleave). This roughly doubles the baseline KL divergence from ~0.001 to ~0.002+, pushing it past the test threshold of 0.002.

CI failure: https://github.com/sgl-project/sglang/actions/runs/22514914602/job/65230926406
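A KL-divergence check of the shape the test applies can be sketched like this (a hypothetical illustration, not the actual test_swa_radix_cache_kl implementation):

```python
import torch
import torch.nn.functional as F

def kl_from_logits(ref_logits, test_logits):
    # KL(ref || test) averaged over the batch; both inputs are raw
    # logits, converted to log-probabilities before comparison.
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    test_logprobs = F.log_softmax(test_logits, dim=-1)
    return F.kl_div(test_logprobs, ref_logprobs,
                    log_target=True, reduction="batchmean")

ref = torch.randn(4, 8)
kl = kl_from_logits(ref, ref)  # identical inputs give KL of exactly 0
```

A threshold test like the one described would then assert that the measured divergence stays below 0.002, so a change that doubles the baseline from ~0.001 leaves essentially no margin for noise.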
Test plan
test_swa_radix_cache_kl passes on stage-b-test-large-1-gpu