
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching#34183

Merged
DarkLight1337 merged 9 commits into vllm-project:main from ywang96:fix-mm-cpu-leak
Feb 10, 2026

Conversation

Member

@ywang96 ywang96 commented Feb 9, 2026

Purpose

FIXES #28726

#24964 and #27896 introduced a change so that each request cycle creates fewer GC-tracked objects, making gen-2 collections less frequent overall.

While this is fine for text-only inference, where each Request holds very little data, for multimodal models it causes memory consumption to grow without bound, since each Request can contain mm_features that sometimes reach tens of MBs.
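The retention mechanism can be shown in miniature: an object caught in a reference cycle is reclaimed only by the cyclic garbage collector, never by reference counting alone, so when gen-2 collections are infrequent, large per-request payloads pile up between passes. A minimal sketch with hypothetical names (toy code, not the actual vLLM Request):

```python
import gc

class Request:
    """Toy stand-in for an engine request holding a large payload."""

    def __init__(self, payload: bytes):
        self.payload = payload          # stands in for large mm_features
        # Storing a bound method on the instance creates a cycle:
        # instance -> __dict__ -> bound method -> __self__ -> instance.
        self.block_hasher = self.compute_hash

    def compute_hash(self) -> int:
        return hash(self.payload)

gc.disable()                            # simulate infrequent gen-2 collections
for _ in range(100):
    Request(b"x" * 1_000_000)           # ~1 MB each, no external refs kept

# Reference counting alone cannot reclaim the cycles.
alive_before = sum(isinstance(o, Request) for o in gc.get_objects())

gc.enable()
gc.collect()                            # only a full cyclic collection frees them
alive_after = sum(isinstance(o, Request) for o in gc.get_objects())
```

All 100 instances remain alive until the explicit `gc.collect()` runs; the fix in this PR avoids creating the cycle in the first place, so Request objects die promptly via reference counting.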

Test Plan

The original issue was reproducible with this test script test.py and is confirmed fixed by this PR.

Test Result

Main branch

  Branch:               main (4d3965096)
  Total RSS idle:       3.67 GB
  Total RSS after R1:   11.10 GB  (warmup)
  Total RSS after R5:   18.08 GB
  Total growth R2-5:    +6.98 GB (avg +1.74 GB/round)

  EngineCore RSS idle:  3.67 GB
  EngineCore after R1:  11.10 GB  (warmup)
  EngineCore after R5:  18.08 GB
  EngineCore growth:    +6.98 GB (avg +1.74 GB/round)

This PR

  Branch:               fix-mm-cpu-leak (1aed1a1ef)
  Total RSS idle:       3.67 GB
  Total RSS after R1:   9.47 GB  (warmup)
  Total RSS after R5:   10.25 GB
  Total growth R2-5:    +0.78 GB (avg +0.20 GB/round)

  EngineCore RSS idle:  3.67 GB
  EngineCore after R1:  9.47 GB  (warmup)
  EngineCore after R5:  10.25 GB
  EngineCore growth:    +0.78 GB (avg +0.20 GB/round)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roger Wang <hey@rogerw.io>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a CPU memory leak in prefix caching by resolving a reference cycle within the Request class. The original implementation using functools.partial to create a block hasher inadvertently caused Request objects to be retained in memory longer than necessary. The fix, which involves storing the block hasher directly and applying it explicitly, is a clean and correct way to break this cycle. The related changes in the scheduler and tests are consistent with this fix. I have one suggestion to improve the robustness of the new recompute_block_hashes method to prevent potential future bugs.
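The before/after shapes described in this review can be sketched as follows (hypothetical class and function names, not the actual vLLM code): storing a `functools.partial` bound to `self` on the instance forms a reference cycle, while storing the bare function and passing `self` at call time does not.

```python
import gc
import weakref
from functools import partial

def hash_blocks(req) -> int:
    """Stand-in for the block hasher."""
    return hash(req.data)

class LeakyRequest:
    """Leaky shape: the stored partial holds a strong reference to self."""

    def __init__(self, data: bytes):
        self.data = data
        # Cycle: self -> __dict__ -> partial -> args -> self.
        self.block_hasher = partial(hash_blocks, self)

class FixedRequest:
    """Fixed shape: store the plain function, pass self explicitly."""

    def __init__(self, data: bytes):
        self.data = data
        self.block_hasher = hash_blocks   # no back-reference to self

    def recompute_block_hashes(self) -> int:
        return self.block_hasher(self)

gc.disable()                              # rule out the cyclic GC for the demo
leaky_ref = weakref.ref(LeakyRequest(b"abc"))
fixed_ref = weakref.ref(FixedRequest(b"abc"))

leaky_survives = leaky_ref() is not None  # kept alive by the cycle
fixed_freed = fixed_ref() is None         # freed by CPython refcounting alone
gc.enable()
gc.collect()                              # the cycle is only freed here
```

After the `gc.collect()` call, `leaky_ref()` also returns None, illustrating why the leaked memory was eventually reclaimable but only at the mercy of full GC passes.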

Contributor

mergify bot commented Feb 9, 2026

Hi @ywang96, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 ywang96 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 9, 2026
Contributor

lgeiger commented Feb 10, 2026

Thanks for the fix 🚀

I tested it again with the following commands, running the benchmark multiple times and reporting the memory usage after each run, both with the default memory allocator and with jemalloc:

vllm serve Qwen/Qwen2.5-VL-3B-Instruct --limit-mm-per-prompt.video 0 --max-model-len 25000

vllm bench serve --backend openai-chat --model Qwen/Qwen2.5-VL-3B-Instruct --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000

Memory usage is now very stable:

| version | memory allocator | idle | after 1st | after 2nd | after 3rd | after 4th | after 5th | after 6th | after 7th | after 8th | after 9th | after 10th |
|---------|------------------|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|------------|
| #34183 | jemalloc | 3.6 GB | 8.2 GB | 8.7 GB | 8.9 GB | 9.0 GB | 8.9 GB | 8.9 GB | 9.4 GB | 9.0 GB | 9.2 GB | 9.4 GB |
| #34183 | default | 4.0 GB | 12.9 GB | 14.4 GB | 14.4 GB | 14.2 GB | | | | | | |
| 1d5922f | jemalloc | 3.8 GB | 9.8 GB | 15.1 GB | 18.1 GB | 19.8 GB | 21.4 GB | 23.5 GB | 24.3 GB | 9.2 GB | 10.0 GB | 12.2 GB |
| 1d5922f | default | 3.9 GB | 13.6 GB | 21.6 GB | 27.5 GB | OOM | | | | | | |

Even without the fix, the GC occasionally seems to be able to reclaim the memory (especially with jemalloc) but not very consistently. In any case this fixes the memory growth 🎉
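The "occasionally reclaims" behaviour is consistent with CPython's generational collector: cyclic garbage is only found by a collection pass, and full (gen-2) passes run rarely relative to allocation volume. The standard `gc` module lets you inspect and force this:

```python
import gc

# Thresholds controlling automatic collection: a gen-0 pass runs after
# roughly threshold0 net container allocations; a gen-1 pass after
# threshold1 gen-0 passes; a gen-2 pass only after threshold2 gen-1
# passes -- so full passes are comparatively rare.
thresholds = gc.get_threshold()   # e.g. (700, 10, 10) on many CPython builds
counts = gc.get_count()           # current per-generation allocation counters

# A full pass can be forced; it returns the number of unreachable
# objects found (cyclic garbage reclaimed right now).
unreachable = gc.collect(2)
```

This is also why allocator choice changes the picture: jemalloc can return freed pages to the OS more readily, but neither allocator can help until the cycle detector actually frees the Python objects.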

Signed-off-by: Roger Wang <hey@rogerw.io>
Member Author

ywang96 commented Feb 10, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a critical CPU memory leak caused by a reference cycle within the Request object, which is particularly impactful for multimodal models. The use of functools.partial was correctly identified as the source of the cycle. The fix, which involves replacing the partial with a dedicated method update_block_hashes that explicitly passes self, is a clean and standard approach to breaking such cycles. The changes are consistently applied across all relevant files, including tests. The provided performance metrics clearly validate the fix, showing a significant reduction in memory growth. The implementation is solid, and I have no further recommendations.

Member

@DarkLight1337 DarkLight1337 left a comment


Thanks @reaganjlee @lgeiger for help looking into this as well!

@DarkLight1337 DarkLight1337 merged commit 8a5e0e2 into vllm-project:main Feb 10, 2026
43 checks passed
@khluu khluu added this to the v0.16.0 cherry picks milestone Feb 11, 2026
khluu pushed a commit that referenced this pull request Feb 11, 2026
…efix caching (#34183)

Signed-off-by: Roger Wang <hey@rogerw.io>
(cherry picked from commit 8a5e0e2)
qdanik added a commit to qdanik/vllm that referenced this pull request Feb 20, 2026
qdanik added a commit to qdanik/vllm that referenced this pull request Feb 20, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…efix caching (vllm-project#34183)

Signed-off-by: Roger Wang <hey@rogerw.io>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…efix caching (vllm-project#34183)

Signed-off-by: Roger Wang <hey@rogerw.io>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
…efix caching (vllm-project#34183)

Signed-off-by: Roger Wang <hey@rogerw.io>

Labels

  • bug: Something isn't working
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Unbounded CPU Memory Growth When Using Prefix Caching

5 participants