[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching #34183
Conversation
Signed-off-by: Roger Wang <hey@rogerw.io>
Code Review
This pull request effectively addresses a CPU memory leak in prefix caching by resolving a reference cycle within the Request class. The original implementation using functools.partial to create a block hasher inadvertently caused Request objects to be retained in memory longer than necessary. The fix, which involves storing the block hasher directly and applying it explicitly, is a clean and correct way to break this cycle. The related changes in the scheduler and tests are consistent with this fix. I have one suggestion to improve the robustness of the new recompute_block_hashes method to prevent potential future bugs.
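A minimal sketch of the cycle the review describes and the pattern that breaks it. The class and function names (`LeakyRequest`, `FixedRequest`, `hash_blocks`) are illustrative, not vLLM's actual API; only `recompute_block_hashes` is named in the review, and its body here is an assumption:

```python
import functools

def hash_blocks(request, block_size=16):
    # Stand-in for the real block hashing logic (illustrative only).
    return [hash(request.prompt[i:i + block_size])
            for i in range(0, len(request.prompt), block_size)]

class LeakyRequest:
    """Pre-fix pattern: the stored partial closes over self."""
    def __init__(self, prompt):
        self.prompt = prompt
        # Cycle: self -> self.block_hasher -> self (via the partial's args),
        # so the whole request survives until a full gc pass runs.
        self.block_hasher = functools.partial(hash_blocks, self)

class FixedRequest:
    """Post-fix pattern: store the plain function and pass self explicitly."""
    def __init__(self, prompt):
        self.prompt = prompt
        self.block_hasher = hash_blocks  # no back-reference to the instance

    def recompute_block_hashes(self):
        # Applying the hasher explicitly keeps the instance cycle-free.
        return self.block_hasher(self)
```

With the fixed pattern, a `Request` dropped by the scheduler is reclaimed by plain reference counting instead of waiting for a cyclic-GC pass.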
Hi @ywang96, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Roger Wang <hey@rogerw.io>
Thanks for the fix 🚀 I tested it again with the following command, running the benchmark multiple times and reporting the memory usage after each run, both with the default memory allocator and with jemalloc. Memory usage is now very stable:
Even without the fix, the GC occasionally seems to be able to reclaim the memory (especially with jemalloc), but not very consistently. In any case this fixes the memory growth 🎉
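The inconsistent reclamation observed above is expected in CPython: objects in a reference cycle are never freed by reference counting alone, only by an eventual collector pass. A minimal sketch (the `Req` class and its `hasher` attribute are illustrative stand-ins):

```python
import functools
import gc
import weakref

class Req:
    def __init__(self):
        # A partial holding self makes the object part of a reference cycle.
        self.hasher = functools.partial(id, self)

gc.disable()          # make the timing deterministic for this demo
r = Req()
ref = weakref.ref(r)
del r
# Reference counting alone cannot free the cycle...
still_alive = ref() is not None
gc.collect()          # ...only a collector pass reclaims it
freed = ref() is None
gc.enable()
```

Whether memory shrinks between benchmark runs therefore depends on when a gen-2 collection happens to fire, which matches the "occasionally, but not consistently" behavior reported here.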
/gemini review
Code Review
This pull request effectively resolves a critical CPU memory leak caused by a reference cycle within the Request object, which is particularly impactful for multimodal models. The use of functools.partial was correctly identified as the source of the cycle. The fix, which involves replacing the partial with a dedicated method update_block_hashes that explicitly passes self, is a clean and standard approach to breaking such cycles. The changes are consistently applied across all relevant files, including tests. The provided performance metrics clearly validate the fix, showing a significant reduction in memory growth. The implementation is solid, and I have no further recommendations.
DarkLight1337 left a comment
Thanks @reaganjlee @lgeiger for help looking into this as well!
…efix caching (vllm-project#34183) Signed-off-by: Roger Wang <hey@rogerw.io>
Purpose
FIXES #28726
#24964 and #27896 introduced a change so that each request cycle creates fewer GC-tracked objects, and overall gen-2 collections run less frequently.
While this is acceptable for text-only inference, where each Request holds very little data, for multimodal models it results in unbounded memory growth, since each Request can contain mm_features that are sometimes tens of MB in size.
Test Plan
The original issue was reproducible with this test script (test.py) and is confirmed fixed by this PR.
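The linked script is not reproduced here, but the leak can be demonstrated in miniature: with automatic collection suppressed (as effectively happens when gen-2 passes are rare), every dropped request survives because of the partial's back-reference. All names (`MMRequest`, `compute_hashes`, `mm_features` payloads) are illustrative assumptions:

```python
import functools
import gc

def compute_hashes(request):
    # Stand-in for block hashing; only the reference pattern matters here.
    return len(request.mm_features)

class MMRequest:
    def __init__(self, mm_features):
        self.mm_features = mm_features  # tens of MB for real multimodal inputs
        # Pre-fix pattern: the partial keeps each dropped request alive.
        self.block_hasher = functools.partial(compute_hashes, self)

gc.disable()  # simulate infrequent gen-2 collections
for _ in range(100):
    MMRequest(bytearray(1024))  # dropped immediately, but the cycle persists
leaked = sum(isinstance(o, gc.get_objects().__class__ and MMRequest)
             if False else isinstance(o, MMRequest) for o in gc.get_objects())
gc.collect()
remaining = sum(isinstance(o, MMRequest) for o in gc.get_objects())
gc.enable()
```

Every one of the 100 discarded requests is still reachable before the explicit `gc.collect()`, and none after it; with the fix in this PR the objects never enter a cycle in the first place.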
Test Result
Main branch
This PR
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.