
[RL] Call torch.cuda.empty_cache() for in-place pause mode to avoid OOM #24854

Merged
ByronHsu merged 5 commits into sgl-project:main from ByronHsu:byron/upstream-empty-cache-on-resume on May 10, 2026

Conversation

@ByronHsu (Collaborator) commented May 9, 2026

Motivation

Post-weight-update processing (e.g. DeepSeek MLA w_kc/w_vc derivation, FP8 scale rebuild) creates transient CUDA allocations that fragment PyTorch's block cache. Without empty_cache(), reserved memory grows each iteration and eventually OOMs.

| Pause mode | flush_cache called? | empty_cache before this PR | empty_cache after this PR |
| --- | --- | --- | --- |
| abort | Yes | Yes (via flush_cache) | Yes (via flush_cache + resume) |
| retract | Yes | Yes (via flush_cache) | Yes (via flush_cache + resume) |
| in_place | No | No ← the gap | Yes (via resume) |

abort and retract are safer because they already call empty_cache() as part of flush_cache. The in_place path never calls flush_cache (to preserve KV cache), so empty_cache() was never triggered — this PR closes that gap.

With this change, abort and retract will call empty_cache() twice (once in flush_cache, once on resume), but the second call is benign — it's a no-op when there are no cached blocks.
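
For reference, PyTorch's standard memory counters make the live-vs-cached distinction (and the no-op behaviour) easy to see. This is a standalone snippet, not code from this PR:

```python
import torch

# Standalone illustration (assumes a CUDA device; not code from this PR).
# memory_allocated(): bytes held by live tensors.
# memory_reserved():  bytes the caching allocator keeps from the driver,
#                     including cached-but-unused blocks.
if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")    # live tensor (~4 MB)
    y = torch.empty(4096, 4096, device="cuda")    # transient tensor (~64 MB)
    del y                                          # its block stays cached
    print("allocated MB:", torch.cuda.memory_allocated() / 2**20)
    print("reserved  MB:", torch.cuda.memory_reserved() / 2**20)

    torch.cuda.empty_cache()                       # returns cached blocks only
    print("reserved after empty_cache MB:", torch.cuda.memory_reserved() / 2**20)
    # memory_allocated() is unchanged: x is still live. With no cached
    # blocks at all, empty_cache() would simply be a no-op.
```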

Changes

  • Add empty_cache: bool = True to ContinueGenerationReqInput (a field-level sketch follows this list). The scheduler calls torch.cuda.empty_cache() while still paused (no race with active streams).
  • Log reserved-memory delta at INFO level for observability.
  • Callers can opt out with empty_cache=False.
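
A minimal sketch of the request-side change, assuming ContinueGenerationReqInput is a dataclass in io_struct.py with its other fields elided (a review thread below suggests aligning the name with the torch_empty_cache convention used by the update_* structs):

```python
from dataclasses import dataclass


@dataclass
class ContinueGenerationReqInput:
    # ... existing fields elided ...

    # Return cached-but-unused allocator blocks (accumulated during
    # post-weight-update processing) to the driver before inference
    # resumes. Set to False to skip the empty_cache call.
    empty_cache: bool = True
```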

Before


OOM after repeated weight updates:

Full traceback
      ret, can_run_graph = self.forward_extend(
                           ^^^^^^^^^^^^^^^^^^^^
    File "sglang/srt/model_executor/model_runner.py", line 2780, in forward_extend
      self.model.forward(
    File "torch/utils/_contextlib.py", line 120, in decorate_context
      return func(*args, **kwargs)
    File "sglang/srt/models/deepseek_v2.py", line 2298, in forward
      hidden_states = self.model(
    File "sglang/srt/models/deepseek_v2.py", line 2061, in forward
      hidden_states, residual, topk_indices = layer(
    File "sglang/srt/models/deepseek_v2.py", line 1750, in forward
      hidden_states = self.mlp(
    File "sglang/srt/models/deepseek_v2.py", line 574, in forward
      return self.forward_deepep(hidden_states, forward_batch)
    File "sglang/srt/models/deepseek_v2.py", line 782, in forward_deepep
      shared_output = self._forward_shared_experts(hidden_states)
    File "sglang/srt/models/deepseek_v2.py", line 979, in _forward_shared_experts
      return self.shared_experts(
    File "sglang/srt/models/deepseek_v2.py", line 258, in forward
      x, _ = self.down_proj(
    File "sglang/srt/layers/linear.py", line 1509, in forward
      output_parallel = self.quant_method.apply(self, input_parallel, bias=bias_)
    File "sglang/srt/layers/quantization/fp8.py", line 737, in apply
      return self.w8a8_block_fp8_linear(
    File "sglang/srt/layers/quantization/fp8_utils.py", line 678, in deepgemm_w8a8_block_fp8_linear_with_fallback
      output = w8a8_block_fp8_matmul_deepgemm(
    File "sglang/srt/layers/quantization/fp8_kernel.py", line 1107, in w8a8_block_fp8_matmul_deepgemm
      deep_gemm_fp8_fp8_bf16_nt(A, As, B, Bs, C)
    File "sglang/srt/layers/quantization/fp8_kernel.py", line 123, in deep_gemm_fp8_fp8_bf16_nt
      deep_gemm_wrapper.gemm_nt_f8f8bf16((A, As), (B, Bs), C)
    File "sglang/srt/layers/deep_gemm_wrapper/entrypoint.py", line 98, in gemm_nt_f8f8bf16
      deep_gemm.fp8_gemm_nt(
    File "deep_gemm/__init__.py", line 50, in _fn
      return func(*args, **kwargs)
  RuntimeError: CUDA driver error (/sgl-kernel/build/_deps/repo-deepgemm-src/csrc/apis/../jit_kernels/impls/../../jit/handle.hpp:84): 2
  (CUDA_ERROR_OUT_OF_MEMORY, out of memory)

After


Stable memory, no OOM.

Test plan

  • 1-node Qwen3-30B-A3B agg recipe with in-place pause + routing replay: the log fires on every resume; ~2 MB reclaimed on the first resume, ~0 MB thereafter.
  • ESS, loss, exit code unchanged.

Post-weight-update code paths (e.g. DeepSeek MLA w_kc/w_vc derivation,
FP8 block-quant scale rebuild) do many alloc/free cycles. Same-shape
later allocations don't always reuse the freed blocks because of
allocator split/merge heuristics and transient peaks during the cycle,
so the PyTorch caching allocator's working footprint grows over
weight-update cycles until it hits steady state. Live tensor count is
stable; the growth is cached-but-unused blocks held by the allocator.

Add an empty_cache field on ContinueGenerationReqInput, defaulted True.
When set, the scheduler calls torch.cuda.empty_cache() while the engine
is still paused, before flipping _engine_paused = False, returning the
cached blocks to the driver with no race against active streams.

An INFO log reports CUDA reserved memory before/after and how much was
freed, making it easy to verify the empty_cache step is firing and to
see how much transient memory it returns.

Set empty_cache=False on the request to opt out.

Co-authored-by: Cursor <cursoragent@cursor.com>
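
A minimal sketch of the resume-side handling described above; it follows the snippet shown in the review thread below, but the logger, message wording, and surrounding scheduler state are assumptions, and the merged version later switched to the empty_device_cache() helper:

```python
import logging

import torch

from sglang.srt.managers.io_struct import ContinueGenerationReqInput

logger = logging.getLogger(__name__)


class Scheduler:
    def continue_generation(self, recv_req: ContinueGenerationReqInput):
        if recv_req.empty_cache:
            before_mb = torch.cuda.memory_reserved() / (1024 * 1024)
            # The engine is still paused: no kernels are in flight on active
            # streams, so returning cached blocks to the driver cannot race
            # with them.
            torch.cuda.empty_cache()
            after_mb = torch.cuda.memory_reserved() / (1024 * 1024)
            logger.info(
                "continue_generation: reserved %.1f MB -> %.1f MB "
                "(freed %.1f MB via empty_cache)",
                before_mb, after_mb, before_mb - after_mb,
            )
        # Resume only after the cache has been returned.
        self._engine_paused = False
```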
@gemini-code-assist (Contributor) commented

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

ByronHsu changed the title from "[Scheduler] Call torch.cuda.empty_cache() before resuming generation" to "[RL] Call torch.cuda.empty_cache() before resuming generation to avoid memory increase from fragmentation" on May 9, 2026
ByronHsu changed the title from "[RL] Call torch.cuda.empty_cache() before resuming generation to avoid memory increase from fragmentation" to "[RL] Call torch.cuda.empty_cache() for in-place pause mode" on May 9, 2026
ByronHsu changed the title from "[RL] Call torch.cuda.empty_cache() for in-place pause mode" to "[RL] Call torch.cuda.empty_cache() for in-place pause mode to avoid OOM" on May 9, 2026
Comment thread python/sglang/srt/managers/io_struct.py Outdated
# during post-weight-update processing) back to the driver before
# inference resumes, with no race against active streams. Set to
# False to skip the empty_cache call.
empty_cache: bool = True

Comment thread python/sglang/srt/managers/io_struct.py Outdated
Align with the naming convention used in update_* request structs
(e.g. UpdateWeightsFromDistributedReqInput.torch_empty_cache).

Co-authored-by: Cursor <cursoragent@cursor.com>

def continue_generation(self, recv_req: ContinueGenerationReqInput):
    if recv_req.empty_cache:
        before_mb = torch.cuda.memory_reserved() / (1024 * 1024)
Collaborator

maybe make it compatible with AMD and other accelerator?

def get_available_gpu_memory(

Collaborator Author

it is a bit too messy. also it contradicts with #24854 (comment). i will keep it simple and just support torch for now

@hebiao064 (Collaborator) left a comment

LGTM with two minor comments

@ispobock (Collaborator)

/tag-and-rerun-ci

Byron Hsu and others added 3 commits May 10, 2026 04:29
Replace direct torch.cuda.empty_cache() / memory_reserved() calls in
continue_generation with the empty_device_cache() helper from sgl-project#24861,
making the in-place pause resume path work on all device backends.

Co-authored-by: Cursor <cursoragent@cursor.com>
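
The empty_device_cache() helper itself lives in #24861 and is not reproduced in this thread; purely as an illustration of the idea, a backend-dispatching flush might look roughly like the hypothetical sketch below (function name and branches are illustrative, not the actual helper):

```python
import torch


def empty_device_cache_sketch() -> None:
    """Hypothetical sketch; the real helper is defined in sgl-project#24861."""
    if torch.cuda.is_available():
        # torch.cuda covers both NVIDIA CUDA and AMD ROCm builds of PyTorch.
        torch.cuda.empty_cache()
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        # Intel XPU backend in recent PyTorch releases.
        torch.xpu.empty_cache()
    # Other accelerators (e.g. NPU plugins) expose their own cache APIs.
```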
@ByronHsu (Collaborator Author)

/tag-and-rerun-ci

@ByronHsu ByronHsu merged commit cfd3fd0 into sgl-project:main May 10, 2026
104 of 133 checks passed
ByronHsu added a commit that referenced this pull request May 10, 2026
…in-place pause mode to avoid OOM (#24905)

Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 11, 2026
* main: (87 commits)
  [Fix] Disable FlashInfer allreduce fusion under deterministic inference (sgl-project#24629)
  fix: STANDALONE spec-decode hidden-size mismatch crash (sgl-project#24217)
  Followup fix for Custom AR V2 in non NVL scenarios (sgl-project#24742)
  Fix reduce_scatterv producer contract for SUM_LEN (sgl-project#24785)
  [NPU]Documentation update for communications quantization feature (sgl-project#24668)
  [Session R3] Add routed_experts_start_len for absolute routing slice control (sgl-project#24851)
  [Model] Add MiniCPM-V 4.6 support (sgl-project#24855)
  Support Intern-S2-Preview (sgl-project#24875)
  [PD] Unify dsv4 dispatch with swa (sgl-project#24888)
  Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (sgl-project#24775)
  Fix PD bootstrap failure handling (sgl-project#24772)
  [Spec] Cleanup idle stub and shape-check patterns (sgl-project#24881)
  [Bug] Add dsv4 state_type branch to mooncake disaggregation (sgl-project#24878)
  [Spec V1] Split draft-extend phase from `EagleDraftInput` into new `EagleDraftExtendInput` (sgl-project#24859)
  [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (sgl-project#24696)
  [spec decoding] support kimi-k2.5-eagle3-mla (sgl-project#24826)
  [SPEC V2] fix: skip stale state updates in spec-v2 overlap (sgl-project#23456)
  [RL] Call torch.cuda.empty_cache() for `in-place` pause mode to avoid OOM (sgl-project#24854)
  [diffusion] CI: add cache-dit CI tests (sgl-project#19213)
  [Utils] Make request dump robust to unpicklable server_args and large meta_info (sgl-project#24767)
  ...

# Conflicts:
#	python/sglang/srt/utils/common.py