
[feat] Reduce GPU memory overhead by using weakref#9673

Merged
zhyncs merged 2 commits into sgl-project:main from yhyang201:reduce-gpu-memory
Aug 28, 2025

Conversation

yhyang201 (Collaborator) commented Aug 27, 2025

Motivation

Due to circular references, some objects containing Tensor instances (likely ScheduleBatch objects) were detected by Python’s garbage collector but not released immediately. Instead, they accumulated for a period of time before being freed, causing torch.cuda.memory_allocated() and torch.cuda.memory_reserved() to remain significantly higher than the actual requirement.

This PR removes the circular references, allowing these objects to be released as soon as they become unreachable, thereby reducing the actual peak memory usage.
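The mechanism can be sketched with stand-in classes (a minimal illustration, not the actual ScheduleBatch code; `Batch` and `Req` here are hypothetical): holding the back-reference as a weakref turns the cycle into a one-way chain, so plain reference counting frees the objects the moment the last strong reference disappears, without waiting for a garbage-collection pass.

```python
import gc
import weakref

class Batch:
    """Stand-in for an object that owns large GPU buffers."""
    def __init__(self):
        self.data = bytearray(1 << 20)  # placeholder for a Tensor
        self.req = None

class Req:
    """Holds only a weak back-reference, so no Batch -> Req -> Batch cycle forms."""
    def __init__(self, batch):
        self._batch = weakref.ref(batch)

    @property
    def batch(self):
        return self._batch()  # None once the Batch has been freed

freed = []

def make_pair():
    b = Batch()
    b.req = Req(b)
    weakref.finalize(b, freed.append, "batch freed")
    return b

gc.disable()  # prove that refcounting alone reclaims the pair
try:
    b = make_pair()
    assert b.req.batch is b          # the weakref still resolves while b is alive
    del b                            # last strong reference gone ...
    assert freed == ["batch freed"]  # ... and the Batch is freed immediately
finally:
    gc.enable()
```

With a strong back-reference in place of the weakref, the same `del` would leave the pair to the cyclic garbage collector, and the buffers would linger until a collection happened to run.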

Workload:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --port 30005

python -m sglang.bench_serving --backend sglang-oai-chat --dataset-name random-image --num-prompts 100 --host 127.0.0.1 --random-range-ratio 1 --random-input-len 1024 --random-output-len 1 --random-image-num-images 3 --max-concurrency 1 --port 30005 --random-image-resolution 1080p --warmup-requests 0

Before this PR:

[screenshot: memory usage before the fix]

After this PR (using weakref):

[screenshot: memory usage after the fix]

Attempting to address: #9365

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist


hebiao064 (Collaborator)

weakref might work for this case: https://peps.python.org/pep-0205/#circular-references

@hnyls2002 hnyls2002 self-assigned this Aug 27, 2025
yhyang201 (Collaborator, Author) commented Aug 27, 2025

weakref might work for this case: https://peps.python.org/pep-0205/#circular-references

I think using weakref is the better choice! I have updated the code and the corresponding experimental results (both approaches break the reference cycle, so the results are unchanged).

hebiao064 (Collaborator) left a comment

@merrymercy FYI, there is a circular dependency in the schedule batch class which caused OOM, and this PR fixes it.

@hebiao064 hebiao064 changed the title [WIP] Reduce GPU memory overhead by breaking circular references [feat] Reduce GPU memory overhead by breaking circular references Aug 27, 2025
@hebiao064 hebiao064 changed the title [feat] Reduce GPU memory overhead by breaking circular references [feat] Reduce GPU memory overhead by using weakref Aug 27, 2025
JustinTong0323 (Collaborator)

Thanks Yuhao!

Swipe4057 (Contributor)

Thank you very much!

@zhyncs zhyncs merged commit c377923 into sgl-project:main Aug 28, 2025
188 of 194 checks passed
narutolhy (Contributor)

Thank you very much! I tested this PR in my task.
After applying it, torch.cuda.memory_allocated() is stable after each step, and no gradual increase is seen.

zhaochenyang20 (Collaborator)

Thank you very much! I tested this PR in my task. After applying it, torch.cuda.memory_allocated() is stable after each step, and no gradual increase is seen.

@yhyang201 Great to see this!

WingEdge777 (Contributor)

Great job! But I'm more interested in how you found the key point. Amazing.

yhyang201 (Collaborator, Author)

import gc

gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # trigger garbage collection
print(gc.garbage)  # list objects that cannot be reclaimed

I added this code in sglang and noticed that SchedulerBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (as it requires filtering out irrelevant logs).
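One way to tame that output (a small sketch along the same lines; the `ScheduleBatch` filter is just the suspect type from this PR, and any type name could be substituted) is to tally `gc.garbage` by type and then pull out only the type under suspicion:

```python
import gc
from collections import Counter

gc.set_debug(gc.DEBUG_SAVEALL)  # keep everything the collector finds in gc.garbage
gc.collect()

# Tally unreachable objects by type instead of printing them all
counts = Counter(type(o).__name__ for o in gc.garbage)
for name, n in counts.most_common(10):
    print(f"{n:6d}  {name}")

# Then pull out just the suspect type for closer inspection
suspects = [o for o in gc.garbage if type(o).__name__ == "ScheduleBatch"]
print(f"{len(suspects)} ScheduleBatch objects were stuck in reference cycles")
```

The type histogram usually makes the leaking class stand out immediately, without wading through the full object dump.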

zhaochenyang20 (Collaborator)

import gc

gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # trigger garbage collection
print(gc.garbage)  # list objects that cannot be reclaimed

I added this code in sglang and noticed that SchedulerBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (as it requires filtering out irrelevant logs).

🐂🍺

WingEdge777 (Contributor)

import gc

gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()  # trigger garbage collection
print(gc.garbage)  # list objects that cannot be reclaimed

I added this code in sglang and noticed that SchedulerBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (as it requires filtering out irrelevant logs).

Simple, rough, but efficient. Learned something, thanks a lot.

UnlceYang commented Sep 4, 2025

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
@b8zhong b8zhong mentioned this pull request Sep 11, 2025
Jimmy-L99 (Contributor)

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

I am currently using Qwen3-VL-32B with 4 * 48GB GPUs and sglang v0.5.4. During batch inference, the GPU0 memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?

yhyang201 (Collaborator, Author)

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

I am currently using Qwen3-VL-32B with 4 * 48GB GPUs and sglang v0.5.4. During batch inference, the GPU0 memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?

  1. You can try again using the latest main version.
  2. Could you share the launch command? Thanks.

Jimmy-L99 (Contributor) commented Nov 19, 2025

  1. You can try again using the latest main version.
  2. Could you share the launch command? Thanks.

Sure.
Here is the command:

    environment:
      - CUDA_VISIBLE_DEVICES=4,5,6,7
      - SGLANG_VLM_CACHE_SIZE_MB=5120
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /model/Qwen3-VL-32B-Instruct-FP8
      --host xxxx
      --port xxxx
      --context-length 40960
      --tp-size 4
      --mem-fraction-static 0.60
      --chunked-prefill-size 8192
      --keep-mm-feature-on-device
      --tool-call-parser qwen25
      --attention-backend fa3
      --mm-attention-backend fa3
      --max-running-requests 20
      --enable-cache-report
      --enable-metrics

yhyang201 (Collaborator, Author) commented Nov 19, 2025

If it may cause an OOM, it’s best not to enable --keep-mm-feature-on-device. @Jimmy-L99

Jimmy-L99 (Contributor) commented Nov 19, 2025

If it may cause an OOM, it’s best not to enable --keep-mm-feature-on-device. @Jimmy-L99

@yhyang201 Thanks, after disabling --keep-mm-feature-on-device the situation improved somewhat: GPU 0's memory usage no longer increased rapidly as before. However, after running batch inference continuously for about 10 minutes, the memory usage of all GPUs gradually increased to over 90%. Although sglang freed some of the KV cache, which provided some relief, it eventually OOMed.
I initially thought the 32B model was too demanding, but the same issue persists even after switching to the 8B model.

environment:

docker image: sglang v0.5.5post3
GPU: RTX 5880 * 4, 192G GPU memory total.

launch command as follow:

    environment:
      - CUDA_VISIBLE_DEVICES=4,5,6,7
      - SGLANG_VLM_CACHE_SIZE_MB=4096
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /model/Qwen3-VL-8B-Instruct
      --host 192.168.10.87
      --port 8018
      --context-length 40960
      --tp-size 4
      --mem-fraction-static 0.50
      --chunked-prefill-size 8192
      --tool-call-parser qwen25
      --attention-backend fa3
      --mm-attention-backend fa3
      --max-running-requests 10
      --enable-cache-report
      --enable-metrics
[screenshot: GPU memory usage during batch inference]

yuan-luo (Collaborator) commented Jan 7, 2026

@yhyang201 I was very confused by this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.

I am currently using Qwen3-VL-32B with 4 * 48GB GPUs and sglang v0.5.4. During batch inference, the GPU0 memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?

Did you enable CUDA_IPC? If yes, try this fix out.
#16118
