[RL] Call torch.cuda.empty_cache() for in-place pause mode to avoid OOM #24854
Merged
ByronHsu merged 5 commits into sgl-project:main on May 10, 2026
Conversation
Post-weight-update code paths (e.g. DeepSeek MLA w_kc/w_vc derivation, FP8 block-quant scale rebuild) do many alloc/free cycles. Same-shape later allocations don't always reuse the freed blocks because of allocator split/merge heuristics and transient peaks during the cycle, so the PyTorch caching allocator's working footprint grows over weight-update cycles until it hits steady state. Live tensor count is stable; the growth is cached-but-unused blocks held by the allocator.

Add an empty_cache field on ContinueGenerationReqInput, defaulted to True. When set, the scheduler calls torch.cuda.empty_cache() while the engine is still paused, before flipping _engine_paused = False, returning the cached blocks to the driver with no race against active streams.

An INFO log reports CUDA reserved memory before/after and how much was freed, making it easy to verify the empty_cache step is firing and to see how much transient memory it returns. Set empty_cache=False on the request to opt out.

Co-authored-by: Cursor <cursoragent@cursor.com>
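A standalone toy illustration (not sglang code, with hypothetical shapes) of the allocator behavior described above: reserved memory can stay elevated after transient tensors are freed, until empty_cache() hands the cached blocks back to the driver.

    import torch

    def reserved_mb() -> float:
        return torch.cuda.memory_reserved() / (1024 * 1024)

    # Simulate post-weight-update churn with transient tensors of varying shapes.
    for step in range(5):
        tmp = [torch.empty(1024, 1024 + 37 * i, device="cuda") for i in range(8)]
        del tmp  # tensors are freed, but their blocks stay cached by the allocator
        print(f"step {step}: reserved = {reserved_mb():.1f} MiB")  # may creep upward

    torch.cuda.empty_cache()  # return cached-but-unused blocks to the driver
    print(f"after empty_cache: reserved = {reserved_mb():.1f} MiB")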
Contributor
ByronHsu
commented
May 9, 2026
    # during post-weight-update processing) back to the driver before
    # inference resumes, with no race against active streams. Set to
    # False to skip the empty_cache call.
    empty_cache: bool = True
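For orientation, a hedged sketch of where the field presumably sits; the field name, type, default, and comment semantics come from the hunk above, while the dataclass shape and elided fields are assumptions:

    from dataclasses import dataclass

    @dataclass
    class ContinueGenerationReqInput:
        # ... other request fields elided ...
        # Release cached-but-unused allocator blocks back to the driver before
        # inference resumes; set to False to skip the empty_cache call.
        empty_cache: bool = True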
hebiao064
reviewed
May 9, 2026
Align with the naming convention used in update_* request structs (e.g. UpdateWeightsFromDistributedReqInput.torch_empty_cache). Co-authored-by: Cursor <cursoragent@cursor.com>
hebiao064
reviewed
May 9, 2026
    def continue_generation(self, recv_req: ContinueGenerationReqInput):
        if recv_req.empty_cache:
            before_mb = torch.cuda.memory_reserved() / (1024 * 1024)
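The hunk is cut off above; based on the PR description, the branch presumably continues with the cache release and an INFO log of reserved memory before and after. A hedged reconstruction (the exact log wording and the logger variable are assumptions):

            # Engine is still paused here, so no stream races with the release.
            torch.cuda.empty_cache()
            after_mb = torch.cuda.memory_reserved() / (1024 * 1024)
            logger.info(
                f"continue_generation: empty_cache freed {before_mb - after_mb:.1f} MB "
                f"(reserved {before_mb:.1f} MB -> {after_mb:.1f} MB)"
            )
        self._engine_paused = False  # resume inference only after the cache is returned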
Collaborator
Maybe make it compatible with AMD and other accelerators?
sglang/python/sglang/srt/utils/common.py, line 498 in 12f42f2
Collaborator
Author
It is a bit too messy, and it contradicts #24854 (comment). I will keep it simple and just support torch for now.
hebiao064
approved these changes
May 9, 2026
Collaborator
hebiao064
left a comment
LGTM with two minor comments
Collaborator
/tag-and-rerun-ci
ispobock
approved these changes
May 10, 2026
…y-cache-on-resume
Replace direct torch.cuda.empty_cache() / memory_reserved() calls in continue_generation with the empty_device_cache() helper from sgl-project#24861, making the in-place pause resume path work on all device backends. Co-authored-by: Cursor <cursoragent@cursor.com>
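For reference, a plausible shape for the helper named in this commit (note it is reverted in the next commit); this is a hedged sketch, not the actual code from #24861:

    import torch

    def empty_device_cache() -> None:
        # Hedged sketch of a backend-agnostic cache release. ROCm builds of
        # PyTorch expose the HIP allocator via the torch.cuda namespace, so one
        # branch covers both NVIDIA and AMD; torch.xpu covers Intel GPUs.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.empty_cache()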
…n resume" This reverts commit ab9ad0a.
Collaborator
Author
/tag-and-rerun-ci
ByronHsu
added a commit that referenced this pull request on May 10, 2026
…in-place pause mode to avoid OOM (#24905) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Cursor <cursoragent@cursor.com>
ltcs11
added a commit to ltcs11/sglang that referenced this pull request on May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU] Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py
Motivation
Post-weight-update processing (e.g. DeepSeek MLA w_kc/w_vc derivation, FP8 scale rebuild) creates transient CUDA allocations that fragment PyTorch's block cache. Without empty_cache(), reserved memory grows each iteration and eventually OOMs.

| Pause mode | flush_cache called? | empty_cache before this PR | empty_cache after this PR |
| --- | --- | --- | --- |
| abort | yes | yes (flush_cache) | yes (flush_cache + resume) |
| retract | yes | yes (flush_cache) | yes (flush_cache + resume) |
| in_place | no | no | yes |

abort and retract are safer because they already call empty_cache() as part of flush_cache. The in_place path never calls flush_cache (to preserve KV cache), so empty_cache() was never triggered — this PR closes that gap.

With this change, abort and retract will call empty_cache() twice (once in flush_cache, once on resume), but the second call is benign — it's a no-op when there are no cached blocks.

Changes
Add empty_cache: bool = True to ContinueGenerationReqInput. The scheduler calls torch.cuda.empty_cache() while still paused (no race with active streams). Opt out by setting empty_cache=False on the request, as sketched below.
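A hedged client-side sketch of the opt-out; the endpoint path and payload shape are assumptions inferred from the request struct, not taken from sglang's documented HTTP API:

    import requests

    # Hypothetical: resume generation but skip the resume-time empty_cache.
    requests.post(
        "http://localhost:30000/continue_generation",
        json={"empty_cache": False},
    )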
Before

OOM after repeated weight updates:
Full traceback
After
Stable memory, no OOM.
Test plan