
[Utils] Refactor device cache emptying #24861

Merged

ByronHsu merged 5 commits into main from codex/refactor-device-empty-cache on May 10, 2026

Conversation

@hebiao064 (Collaborator) commented May 9, 2026

Motivation

SGLang has several paths that empty the PyTorch device allocator cache while also using flush_cache to clear internal memory pools such as KV cache and Mamba cache. Some scheduler paths still hard-code torch.cuda.empty_cache(), which makes the allocator-emptying behavior CUDA-specific even though SGLang supports other device backends.

This PR keeps the existing external API behavior while making the internal distinction clearer:

  • flush_cache clears SGLang memory pools.
  • empty_device_cache only releases unused cached blocks from the active device allocator.

Modifications

  • Add empty_device_cache() as a small common helper for backend allocator cache emptying (see the sketch after this list).
  • Use the helper in scheduler flush_cache and idle periodic cache emptying instead of directly calling torch.cuda.empty_cache().
  • Reuse the helper inside get_available_gpu_memory for CUDA, XPU, NPU, and MUSA empty-cache paths.
  • Deduplicate weight-update flush handling through flush_cache_after_weight_update.
  • Clarify the flush_cache docstring around memory pools such as KV cache and Mamba cache.
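
For illustration, a minimal sketch of what a backend-agnostic empty_device_cache() helper could look like. The name comes from this PR's description; the body, the hasattr guards, and the dispatch order are assumptions and do not reproduce SGLang's actual implementation (torch.npu and torch.musa are supplied by the out-of-tree torch_npu and torch_musa packages).

    import torch


    def empty_device_cache() -> None:
        """Release unused cached blocks held by the active device allocator.

        This intentionally does not touch SGLang's internal memory pools
        (KV cache, Mamba cache); those are cleared by flush_cache.
        """
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.empty_cache()
        elif hasattr(torch, "npu") and torch.npu.is_available():
            torch.npu.empty_cache()
        elif hasattr(torch, "musa") and torch.musa.is_available():
            torch.musa.empty_cache()

With a helper along these lines, scheduler paths that previously hard-coded torch.cuda.empty_cache() can call the shared helper instead, so the same flush and idle cache-emptying code covers CUDA, XPU, NPU, and MUSA backends.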

Accuracy Tests

Not applicable. This PR does not change model forward behavior or numerical outputs.

Speed Tests and Profiling

Not applicable. This is a small cache-management refactor and preserves existing defaults.

Validation

  • python3 -m py_compile python/sglang/srt/utils/common.py python/sglang/srt/managers/scheduler.py python/sglang/srt/managers/scheduler_update_weights_mixin.py python/sglang/srt/managers/io_struct.py
  • git diff --check

Note: pytest could not be run locally via uv (--directory python --extra test) on macOS arm64 because sgl-deep-gemm==0.0.1 has no wheel for that platform.

@gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@hebiao064 changed the title from "refactor device empty cache" to "Refactor device cache emptying" on May 9, 2026
@ByronHsu (Collaborator) commented May 9, 2026

/tag-and-rerun-ci

@ByronHsu changed the title from "Refactor device cache emptying" to "[Utils] Refactor device cache emptying" on May 10, 2026
@ByronHsu (Collaborator) commented:

/tag-and-rerun-ci

@ByronHsu merged commit 9578ba1 into main on May 10, 2026
182 of 211 checks passed
@ByronHsu deleted the codex/refactor-device-empty-cache branch on May 10, 2026 at 04:28
ByronHsu pushed a commit to ByronHsu/sglang that referenced this pull request May 10, 2026
Replace direct torch.cuda.empty_cache() / memory_reserved() calls in
continue_generation with the empty_device_cache() helper from sgl-project#24861,
making the in-place pause resume path work on all device backends.

Co-authored-by: Cursor <cursoragent@cursor.com>
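
As a hypothetical illustration of that follow-up commit (the surrounding code and names are not taken from SGLang's sources), the change amounts to swapping the CUDA-only call for the shared helper introduced in this PR:

    # before (CUDA-only):
    #     torch.cuda.empty_cache()
    # after (backend-agnostic):
    #     empty_device_cache()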