[Bugfix] Fix Whisper/encoder-decoder GPU memory leak #32789

DarkLight1337 merged 2 commits into vllm-project:main
Conversation
Code Review
The pull request effectively addresses a critical memory leak in the EncoderDecoderCacheManager for encoder-decoder models like Whisper. The original issue stemmed from encoder cache entries being prematurely marked for freeing in the same scheduling step they were allocated, leading to missed deallocation attempts and an ever-growing cache. The introduced double-buffering mechanism, utilizing self.allocated and self.to_free lists, correctly delays the freeing of cache entries by one scheduling step. This ensures that encoder outputs are consumed by the model before being eligible for deallocation, thereby preventing the memory leak. The addition of test_encoder_cache_cleanup is a valuable regression test that validates the fix.
```python
self.allocated: list[str] = []
self.to_free: list[str] = []
```

```python
# As the encoder cache is not used for enc-dec models, we can free the entries here.
# The actual free happens in the runner, *before* the model is executed.
# Therefore, `to_free` acts as a buffer to free the entries only after the
# model is executed, mimicking the state transition of `EncoderCacheManager`.
to_free = self.to_free
self.to_free = self.allocated
self.allocated = []
return to_free
```
This new logic for get_freed_mm_hashes is the core of the fix. By moving self.allocated to self.to_free and then returning the previous self.to_free list, the system ensures that cache entries are only freed after a full scheduling cycle. This effectively prevents the race condition that caused the memory leak by delaying the deallocation until the encoder outputs have been consumed by the model.
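The buffer rotation described above can be sketched as a tiny standalone class. This is an illustration only, not the actual vLLM implementation: the class name `EncoderDecoderCacheSketch` and the `allocate` helper are hypothetical, and only the two buffers from the PR are modeled.

```python
# Minimal sketch of the double-buffering fix (hypothetical names, not vLLM code).
class EncoderDecoderCacheSketch:
    def __init__(self) -> None:
        self.allocated: list[str] = []  # entries allocated in the current step
        self.to_free: list[str] = []    # entries allocated in the previous step

    def allocate(self, mm_hash: str) -> None:
        # Record the entry; it must survive at least one full scheduling step.
        self.allocated.append(mm_hash)

    def get_freed_mm_hashes(self) -> list[str]:
        # Return only entries from the *previous* step, then rotate the buffers,
        # so nothing is ever freed in the same step it was allocated.
        to_free = self.to_free
        self.to_free = self.allocated
        self.allocated = []
        return to_free
```

An entry allocated in step N first shows up in the return value of `get_freed_mm_hashes()` at step N+1, which matches the one-step delay the review comment describes.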
```diff
  mm_hash = request.mm_features[input_id].identifier
- self.freed.append(mm_hash)
+ self.allocated.append(mm_hash)
```
```python
@pytest.mark.core_model
@pytest.mark.parametrize("model", ["openai/whisper-large-v3-turbo"])
def test_encoder_cache_cleanup(
    vllm_runner,
    model: str,
    input_audios,
    monkeypatch,
) -> None:
    """Test that encoder cache is properly cleaned up after requests complete.

    This is a regression test for a bug where encoder cache entries were freed
    in the same scheduling step they were allocated, before the model could use
    them.
    """
    # Set single-process mode to access the model runner's encoder cache directly
    monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    check_model_available(model)

    with vllm_runner(
        model,
        dtype="half",
        max_model_len=448,
        tensor_parallel_size=1,
        limit_mm_per_prompt={"audio": 2},
        enforce_eager=True,
    ) as vllm_model:
        engine_core = vllm_model.llm.llm_engine.engine_core.engine_core
        model_runner = engine_core.model_executor.driver_worker.worker.model_runner
        encoder_cache = model_runner.encoder_cache

        # Run multiple sequential requests to ensure cache is properly managed
        for vllm_prompts, _, audios in input_audios:
            vllm_model.generate_greedy(vllm_prompts, max_tokens=50, audios=audios)

        # After all requests complete, encoder cache should be empty
        cache_size = len(encoder_cache)
        assert cache_size == 0, (
            f"Encoder cache should be empty after all requests complete, "
            f"but has {cache_size} entries. This indicates encoder cache "
            f"entries are not being properly freed."
        )
```
Maybe this could also solve the other memory leak issues

@DarkLight1337 it looks related, I think the delay in the freeing flow could be the case, but this PR is really targeted at enc-dec since I only edited the `EncoderDecoderCacheManager`.
DarkLight1337
left a comment
There was a problem hiding this comment.
Let's merge this to fix #31577 at least
Signed-off-by: NickLucche <nlucches@redhat.com> (cherry picked from commit ea6102b)
* Replace urllib's `urlparse` with urllib3's `parse_url` (vllm-project#32746)
  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> (cherry picked from commit 8ebf271)
* Bump opencv-python dependency version to 4.13 (vllm-project#32668)
  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> (cherry picked from commit 444e2e7)
* Fix Whisper/encoder-decoder GPU memory leak (vllm-project#32789)
  Signed-off-by: NickLucche <nlucches@redhat.com> (cherry picked from commit ea6102b)

---------

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
- [Misc] Implement `TokenizerLike.convert_tokens_to_ids` (vllm-project/vllm#31796) [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [Bug] Revert torch warning fix (vllm-project/vllm#31585) [INFERENG-4152](https://issues.redhat.com/browse/INFERENG-4152)
- [Bug] Fix AttributeError: `ColumnParallelLinear` object has no attribute `weight_scale_inv` (vllm-project/vllm#30823) [INFERENG-4153](https://issues.redhat.com/browse/INFERENG-4153)
- Avoid `opencv-python-headless==4.13.0.90`, it's broken. See opencv/opencv-python#1183
- [Bugfix] Handle mistral tokenizer in get_hf_processor (vllm-project/vllm#31817) [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [Bugfix] Fix Whisper/encoder-decoder GPU memory leak vllm-project/vllm#32789
- [Model] Handle `trust_remote_code` for transformers backend (vllm-project/vllm#32194) (fixes GHSA-2pc9-4j83-qjmr)
- [Bugfix] CUDA: fix segfault by bumping numba to `numba==0.63.1` ([AIPCC-9384](https://issues.redhat.com/browse/AIPCC-9384))
- [Bugfix] pin `mistral_common==1.8.5` to avoid crash with Voxtral ([INFERENG-4154](https://issues.redhat.com/browse/INFERENG-4154))
- [Bugfix] fix tokenizer loading for mistral models (vllm-project/vllm#33175) [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [build] fix cu130 related release pipeline steps and publish as nightly image (vllm-project#32522)
- [Misc] Replace urllib's `urlparse` with urllib3's `parse_url` (vllm-project#32746)
- [Misc] Bump opencv-python dependency version to 4.13 (vllm-project#32668)
- [Bugfix] Fix Whisper/encoder-decoder GPU memory leak (vllm-project#32789)
- [CI] fix version comparison and exclusion patterns in upload-release-wheels.sh (vllm-project#32971)
- tokenizers: mistral: fix merge conflict
- `Dockerfile.tpu.ubi`: add `git` to allow `pip install git+https`
Fix #31577.
Overview
For encoder-decoder models (e.g., Whisper), the encoder cache manager was returning newly allocated entries from `get_freed_mm_hashes()` during the same scheduling step they were allocated. Since the runner frees cache entries *before* model execution, the free attempts missed the encoder outputs, leading to an ever-growing encoder cache.
Flow:
`EncoderCacheManager.get_freed_mm_hashes` -> runner "free mm hashes" -> `runner._execute_mm_encoder`

This PR delays freeing by one scheduling step with a basic double buffer: entries allocated in step N are only returned for freeing in step N+1, ensuring the model has executed and used the encoder output before it's removed from the cache.
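To make the timing concrete, here is a small standalone simulation of the leak (all names are hypothetical and do not correspond to vLLM APIs). With the old same-step behavior the runner's free runs before the encoder output is stored, so every discard misses and the cache grows by one entry per step; with the double buffer, at most the last step's entry is pending.

```python
# Hypothetical simulation of the scheduling loop (not vLLM code).
def run_steps(double_buffer: bool, steps: int = 3) -> int:
    runner_cache: set[str] = set()   # encoder outputs held by the runner
    allocated: list[str] = []        # entries allocated this step
    to_free: list[str] = []          # entries allocated last step
    for step in range(steps):
        entry = f"mm_hash_{step}"
        allocated.append(entry)      # scheduler allocates an entry this step
        # get_freed_mm_hashes():
        if double_buffer:
            freed = to_free          # fixed: only last step's entries
            to_free = allocated
            allocated = []
        else:
            freed = allocated        # old: same-step entries returned
            allocated = []
        # Runner frees *before* model execution; with the old behavior the
        # entry is not in the cache yet, so the discard silently misses it.
        for h in freed:
            runner_cache.discard(h)
        # Model executes and stores the encoder output.
        runner_cache.add(entry)
    return len(runner_cache)
```

Running the simulation, `run_steps(False)` grows linearly with the number of steps (the leak), while `run_steps(True)` stays at a single pending entry regardless of step count.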
Test with the newly added `test_encoder_cache_cleanup`.
cc @DarkLight1337