[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching #34183
Conversation
Signed-off-by: Roger Wang <hey@rogerw.io>
Code Review
This pull request effectively addresses a CPU memory leak in prefix caching by resolving a reference cycle within the Request class. The original implementation using functools.partial to create a block hasher inadvertently caused Request objects to be retained in memory longer than necessary. The fix, which involves storing the block hasher directly and applying it explicitly, is a clean and correct way to break this cycle. The related changes in the scheduler and tests are consistent with this fix. I have one suggestion to improve the robustness of the new recompute_block_hashes method to prevent potential future bugs.
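A minimal sketch of the cycle the review describes and the pattern that breaks it. The class and function names (`LeakyRequest`, `FixedRequest`, `hash_blocks`) are illustrative, not vLLM's actual API; only `recompute_block_hashes` is named in the review, and its body here is an assumption:

```python
import functools

def hash_blocks(request, block_size=16):
    # Stand-in for the real block hashing logic (illustrative only).
    return [hash(request.prompt[i:i + block_size])
            for i in range(0, len(request.prompt), block_size)]

class LeakyRequest:
    """Pre-fix pattern: the stored partial closes over self."""
    def __init__(self, prompt):
        self.prompt = prompt
        # Cycle: self -> self.block_hasher -> self (via the partial's args),
        # so the whole request survives until a full gc pass runs.
        self.block_hasher = functools.partial(hash_blocks, self)

class FixedRequest:
    """Post-fix pattern: store the plain function and pass self explicitly."""
    def __init__(self, prompt):
        self.prompt = prompt
        self.block_hasher = hash_blocks  # no back-reference to the instance

    def recompute_block_hashes(self):
        # Applying the hasher explicitly keeps the instance cycle-free.
        return self.block_hasher(self)
```

With the fixed pattern, a `Request` dropped by the scheduler is reclaimed by plain reference counting instead of waiting for a cyclic-GC pass.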
Hi @ywang96, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Roger Wang <hey@rogerw.io>
Thanks for the fix 🚀 I tested it again with the following command, running the benchmark multiple times and reporting the memory usage after each run, both with the default memory allocator and with jemalloc. Memory usage is now very stable:
Even without the fix, the GC occasionally seems to be able to reclaim the memory (especially with jemalloc), but not very consistently. In any case this fixes the memory growth 🎉
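The inconsistent reclamation observed above is expected in CPython: objects in a reference cycle are never freed by reference counting alone, only by an eventual collector pass. A minimal sketch (the `Req` class and its `hasher` attribute are illustrative stand-ins):

```python
import functools
import gc
import weakref

class Req:
    def __init__(self):
        # A partial holding self makes the object part of a reference cycle.
        self.hasher = functools.partial(id, self)

gc.disable()          # make the timing deterministic for this demo
r = Req()
ref = weakref.ref(r)
del r
# Reference counting alone cannot free the cycle...
still_alive = ref() is not None
gc.collect()          # ...only a collector pass reclaims it
freed = ref() is None
gc.enable()
```

Whether memory shrinks between benchmark runs therefore depends on when a gen-2 collection happens to fire, which matches the "occasionally, but not consistently" behavior reported here.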
/gemini review
Code Review
This pull request effectively resolves a critical CPU memory leak caused by a reference cycle within the Request object, which is particularly impactful for multimodal models. The use of functools.partial was correctly identified as the source of the cycle. The fix, which involves replacing the partial with a dedicated method update_block_hashes that explicitly passes self, is a clean and standard approach to breaking such cycles. The changes are consistently applied across all relevant files, including tests. The provided performance metrics clearly validate the fix, showing a significant reduction in memory growth. The implementation is solid, and I have no further recommendations.
DarkLight1337 left a comment
Thanks @reaganjlee @lgeiger for help looking into this as well!
…efix caching (vllm-project#34183) Signed-off-by: Roger Wang <hey@rogerw.io>
Purpose
FIXES #28726
#24964 and #27896 introduced a change so that each request cycle creates fewer GC-tracked objects, and overall gen-2 collections run less frequently.
While this is acceptable for text-only inference, where each Request holds very little data, for multimodal models it results in unbounded memory growth, since each Request can contain mm_features that are sometimes tens of MB in size.
Test Plan
The original issue was reproducible with this test script (test.py) and is confirmed fixed by this PR.
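The linked script is not reproduced here, but the leak can be demonstrated in miniature: with automatic collection suppressed (as effectively happens when gen-2 passes are rare), every dropped request survives because of the partial's back-reference. All names (`MMRequest`, `compute_hashes`, `mm_features` payloads) are illustrative assumptions:

```python
import functools
import gc

def compute_hashes(request):
    # Stand-in for block hashing; only the reference pattern matters here.
    return len(request.mm_features)

class MMRequest:
    def __init__(self, mm_features):
        self.mm_features = mm_features  # tens of MB for real multimodal inputs
        # Pre-fix pattern: the partial keeps each dropped request alive.
        self.block_hasher = functools.partial(compute_hashes, self)

gc.disable()  # simulate infrequent gen-2 collections
for _ in range(100):
    MMRequest(bytearray(1024))  # dropped immediately, but the cycle persists
leaked = sum(isinstance(o, gc.get_objects().__class__ and MMRequest)
             if False else isinstance(o, MMRequest) for o in gc.get_objects())
gc.collect()
remaining = sum(isinstance(o, MMRequest) for o in gc.get_objects())
gc.enable()
```

Every one of the 100 discarded requests is still reachable before the explicit `gc.collect()`, and none after it; with the fix in this PR the objects never enter a cycle in the first place.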
Test Result
Main branch
This PR
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.