
[feat] Reduce GPU memory overhead by using weakref#9673

Merged
zhyncs merged 2 commits into sgl-project:main from yhyang201:reduce-gpu-memory
Aug 28, 2025

Conversation

yhyang201 (Collaborator) commented Aug 27, 2025

Motivation

Due to circular references, some objects containing Tensor instances (likely ScheduleBatch objects) were detected by Python’s garbage collector but not released immediately. Instead, they accumulated for a period of time before being freed, causing torch.cuda.memory_allocated() and torch.cuda.memory_reserved() to remain significantly higher than the actual requirement.

This PR removes the circular references, allowing these objects to be released as soon as they become unreachable, thereby reducing the actual peak memory usage.
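The mechanism can be sketched with stand-in classes (a minimal illustration, not the actual ScheduleBatch code; `Batch` and `Req` here are hypothetical): holding the back-reference as a weakref turns the cycle into a one-way chain, so plain reference counting frees the objects the moment the last strong reference disappears, without waiting for a garbage-collection pass.

```python
import gc
import weakref

class Batch:
    """Stand-in for an object that owns large GPU buffers."""
    def __init__(self):
        self.data = bytearray(1 << 20)  # placeholder for a Tensor
        self.req = None

class Req:
    """Holds only a weak back-reference, so no Batch -> Req -> Batch cycle forms."""
    def __init__(self, batch):
        self._batch = weakref.ref(batch)

    @property
    def batch(self):
        return self._batch()  # None once the Batch has been freed

freed = []

def make_pair():
    b = Batch()
    b.req = Req(b)
    weakref.finalize(b, freed.append, "batch freed")
    return b

gc.disable()  # prove that refcounting alone reclaims the pair
try:
    b = make_pair()
    assert b.req.batch is b          # the weakref still resolves while b is alive
    del b                            # last strong reference gone ...
    assert freed == ["batch freed"]  # ... and the Batch is freed immediately
finally:
    gc.enable()
```

With a strong back-reference in place of the weakref, the same `del` would leave the pair to the cyclic garbage collector, and the buffers would linger until a collection happened to run.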

Workload:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --port 30005

python -m sglang.bench_serving --backend sglang-oai-chat --dataset-name random-image --num-prompts 100 --host 127.0.0.1 --random-range-ratio 1 --random-input-len 1024 --random-output-len 1 --random-image-num-images 3 --max-concurrency 1 --port 30005 --random-image-resolution 1080p --warmup-requests 0

Before this PR:

[screenshot: memory usage before the fix]

After this PR (using weakref):

[screenshot: memory usage after the fix]

Attempting to address: #9365

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist


hebiao064 (Collaborator)

weakref might work for this case: https://peps.python.org/pep-0205/#circular-references

@hnyls2002 hnyls2002 self-assigned this Aug 27, 2025
yhyang201 (Collaborator, Author) commented Aug 27, 2025

weakref might work for this case: https://peps.python.org/pep-0205/#circular-references

I think using weakref is the better choice! I have updated the code and the corresponding experimental results (both approaches break the reference cycle, so the results are unchanged).

hebiao064 (Collaborator) left a comment

@merrymercy FYI, there is a circular dependency in the schedule batch class which caused OOM, and this PR fixes it.

@hebiao064 hebiao064 changed the title [WIP] Reduce GPU memory overhead by breaking circular references [feat] Reduce GPU memory overhead by breaking circular references Aug 27, 2025
@hebiao064 hebiao064 changed the title [feat] Reduce GPU memory overhead by breaking circular references [feat] Reduce GPU memory overhead by using weakref Aug 27, 2025
JustinTong0323 (Collaborator)

Thanks Yuhao!

Swipe4057 (Contributor)

Thank you very much!

@zhyncs zhyncs merged commit c377923 into sgl-project:main Aug 28, 2025
188 of 194 checks passed
narutolhy (Contributor)

Thank you very much! I tested this PR in my task.
After applying it, torch.cuda.memory_allocated() is stable after each step, and no gradual increase is seen.

zhaochenyang20 (Collaborator)

Thank you very much! I tested this PR in my task. After applying it, torch.cuda.memory_allocated() is stable after each step, and no gradual increase is seen.

@yhyang201 Great to see this!

WingEdge777 (Contributor)

Great job! But I'm more interested in how you found the key point. Amazing.

yhyang201 (Collaborator, Author)

import gc

gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # trigger garbage collection
print(gc.garbage)  # list objects that cannot be reclaimed

I added this code in sglang and noticed that SchedulerBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (as it requires filtering out irrelevant logs).
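One way to tame that output (a small sketch along the same lines; the `ScheduleBatch` filter is just the suspect type from this PR, and any type name could be substituted) is to tally `gc.garbage` by type and then pull out only the type under suspicion:

```python
import gc
from collections import Counter

gc.set_debug(gc.DEBUG_SAVEALL)  # keep everything the collector finds in gc.garbage
gc.collect()

# Tally unreachable objects by type instead of printing them all
counts = Counter(type(o).__name__ for o in gc.garbage)
for name, n in counts.most_common(10):
    print(f"{n:6d}  {name}")

# Then pull out just the suspect type for closer inspection
suspects = [o for o in gc.garbage if type(o).__name__ == "ScheduleBatch"]
print(f"{len(suspects)} ScheduleBatch objects were stuck in reference cycles")
```

The type histogram usually makes the leaking class stand out immediately, without wading through the full object dump.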

zhaochenyang20 (Collaborator)

import gc

gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # trigger garbage collection
print(gc.garbage)  # list objects that cannot be reclaimed

I added this code in sglang and noticed that SchedulerBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (as it requires filtering out irrelevant logs).

🐂🍺

WingEdge777 (Contributor)

import gc

gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # trigger garbage collection
print(gc.garbage)  # list objects that cannot be reclaimed

I added this code in sglang and noticed that SchedulerBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (as it requires filtering out irrelevant logs).

Simple, rough, but efficient. Learned something, thanks a lot.

UnlceYang commented Sep 4, 2025

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
@b8zhong b8zhong mentioned this pull request Sep 11, 2025
Jimmy-L99 (Contributor)

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

I am currently using Qwen3-VL-32B with 4 * 48GB GPUs and sglang v0.5.4. During batch inference, the GPU0 memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?

yhyang201 (Collaborator, Author)

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

I am currently using Qwen3-VL-32B with 4 * 48GB GPUs and sglang v0.5.4. During batch inference, the GPU0 memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?

  1. You can try again using the latest main version.
  2. Could you share the launch command? Thanks.

Jimmy-L99 (Contributor) commented Nov 19, 2025

  1. You can try again using the latest main version.
  2. Could you share the launch command? Thanks.

Sure.
Here is the command:

    environment:
      - CUDA_VISIBLE_DEVICES=4,5,6,7
      - SGLANG_VLM_CACHE_SIZE_MB=5120
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /model/Qwen3-VL-32B-Instruct-FP8
      --host xxxx
      --port xxxx
      --context-length 40960
      --tp-size 4
      --mem-fraction-static 0.60
      --chunked-prefill-size 8192
      --keep-mm-feature-on-device
      --tool-call-parser qwen25
      --attention-backend fa3
      --mm-attention-backend fa3
      --max-running-requests 20
      --enable-cache-report
      --enable-metrics

yhyang201 (Collaborator, Author) commented Nov 19, 2025

If it may cause an OOM, it’s best not to enable --keep-mm-feature-on-device. @Jimmy-L99

Jimmy-L99 (Contributor) commented Nov 19, 2025

If it may cause an OOM, it’s best not to enable --keep-mm-feature-on-device. @Jimmy-L99

@yhyang201 Thanks, after disabling --keep-mm-feature-on-device the situation improved somewhat: GPU 0's memory usage no longer increased rapidly as before. However, after running batch inference continuously for about 10 minutes, the memory usage of all GPUs gradually increased to over 90%. Although sglang freed some of the KV cache, which provided some relief, it eventually OOMed.
I initially thought the 32B model was too demanding, but the same issue persists even after switching to the 8B model.

environment:

docker image: sglang v0.5.5post3
GPU: RTX 5880 * 4, 192G GPU memory total.

launch command as follow:

    environment:
      - CUDA_VISIBLE_DEVICES=4,5,6,7
      - SGLANG_VLM_CACHE_SIZE_MB=4096
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /model/Qwen3-VL-8B-Instruct
      --host 192.168.10.87
      --port 8018
      --context-length 40960
      --tp-size 4
      --mem-fraction-static 0.50
      --chunked-prefill-size 8192
      --tool-call-parser qwen25
      --attention-backend fa3
      --mm-attention-backend fa3
      --max-running-requests 10
      --enable-cache-report
      --enable-metrics
[screenshot: GPU memory usage during batch inference]

yuan-luo (Collaborator) commented Jan 7, 2026

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

I am currently using Qwen3-VL-32B with 4 * 48GB GPUs and sglang v0.5.4. During batch inference, the GPU0 memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?

Did you enable CUDA_IPC? If yes, try this fix out.
#16118
