[feat] Reduce GPU memory overhead by using weakref #9673
zhyncs merged 2 commits into sgl-project:main from
Conversation
weakref might work for this case: https://peps.python.org/pep-0205/#circular-references
I think using weakref would be a better choice! The code and corresponding experimental results (both approaches break the reference cycle, so the results remain unchanged) have already been updated.
Force-pushed from b34e245 to dc96baf
hebiao064 left a comment
@merrymercy FYI, there is a circular dependency in the ScheduleBatch class which caused OOM, and this PR fixes it.
Thanks Yuhao!
Thank you very much!
Thank you very much! I tested this PR in my task.
@yhyang201 Great to see this! |
Great job! But I'm more interested in how you found the key point. Amazing.
I added this code in sglang and noticed that ScheduleBatch was not being reclaimed in time. However, the drawback of this approach is that it produces an overwhelming amount of output, which makes it cumbersome to review (it requires filtering out irrelevant logs).
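The thread does not include the snippet itself; a minimal sketch of this kind of (very verbose) diagnostic, assuming the goal was to see which objects only die in cyclic-GC passes, is Python's `gc` debug flags. The `SchedulerBatch` class here is an illustrative stand-in with a deliberate reference cycle, not sglang's actual class:

```python
import gc

# Hypothetical stand-in for a batch object caught in a reference cycle.
class SchedulerBatch:
    def __init__(self):
        self.self_ref = self  # cycle: refcount never reaches zero on `del`

gc.set_debug(gc.DEBUG_COLLECTABLE)  # print every object the cyclic GC frees

batch = SchedulerBatch()
del batch                 # NOT freed here -- the cycle keeps it alive
collected = gc.collect()  # the batch shows up in the (very noisy) debug output
gc.set_debug(0)
```

This matches the observation above: `DEBUG_COLLECTABLE` logs every collectable object found in each pass, so the output quickly becomes overwhelming and needs filtering.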
🐂🍺 (Awesome!)
Simple and rough, but effective. Learned something; thanks a lot.
@yhyang201 I was very confused about this when deploying Qwen3-VL-32B in a production environment; you indeed solved my problem, thanks.
I am currently using Qwen3-VL-32B with 4 × 48 GB GPUs and sglang v0.5.4. During batch inference, GPU0's memory usage keeps increasing until it runs out of memory (OOM). May I ask how to resolve this problem?
Sure. |
If it may cause an OOM, it’s best not to enable |
@yhyang201 Thanks, that fixed it after disabling it. Environment: docker image sglang v0.5.5post3; launch command as follows:
Did you enable CUDA_IPC? If yes, try this fix out. |

Motivation
Due to circular references, some objects containing Tensor instances (likely ScheduleBatch objects) were detected by Python’s garbage collector but not released immediately. Instead, they accumulated for a period of time before being freed, causing torch.cuda.memory_allocated() and torch.cuda.memory_reserved() to remain significantly higher than the actual requirement.
This PR removes the circular references, allowing these objects to be released as soon as they become unreachable, thereby reducing the actual peak memory usage.
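As a rough illustration of the mechanism (plain Python objects stand in for the batch/request classes; the names and shapes below are illustrative, not the PR's actual code): with a strong back-reference, `del` leaves the object alive until a collector pass, while a `weakref`-based back-reference lets reference counting free it immediately.

```python
import gc
import weakref

class Req:
    """Stand-in request with a strong back-reference (creates the cycle)."""
    def __init__(self, batch):
        self.batch = batch                 # strong ref: batch -> reqs -> batch

class StrongBatch:
    def __init__(self):
        self.reqs = [Req(self) for _ in range(4)]

class WeakReq:
    """Same shape, but the back-reference is weak, so there is no cycle."""
    def __init__(self, batch):
        self._batch = weakref.ref(batch)
    @property
    def batch(self):                       # resolves to None once batch is gone
        return self._batch()

class WeakBatch:
    def __init__(self):
        self.reqs = [WeakReq(self) for _ in range(4)]

freed = {"strong": False, "weak": False}
gc.disable()                               # simulate the window between GC passes

b = StrongBatch()
weakref.finalize(b, freed.__setitem__, "strong", True)
del b
strong_freed_immediately = freed["strong"]  # False: the cycle keeps it alive

b = WeakBatch()
weakref.finalize(b, freed.__setitem__, "weak", True)
del b
weak_freed_immediately = freed["weak"]      # True: freed by refcounting alone

gc.collect()                               # only now is the cyclic batch freed
gc.enable()
```

If the "batch" owned CUDA tensors, the strong-cycle version would keep them resident until the next collector pass, which is exactly the delayed-release behavior described above.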
Workload:
Before this PR:

After this PR (using weakref):
Attempting to address: #9365
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist