
[Bugfix] Fix Whisper/encoder-decoder GPU memory leak #32789

Merged
DarkLight1337 merged 2 commits into vllm-project:main from NickLucche:fix-whisper-cache-leak
Jan 22, 2026
Conversation

@NickLucche
Collaborator

@NickLucche NickLucche commented Jan 21, 2026

Fix #31577.

Overview

For encoder-decoder models (e.g., Whisper), the encoder cache manager was returning newly allocated entries from get_freed_mm_hashes() during the same scheduling step in which they were allocated.
Since the runner frees cache entries before model execution, the free attempts landed before the corresponding encoder outputs were cached, so they missed and the entries were never reclaimed, leading to an ever-growing encoder cache.
Flow: EncoderCacheManager.get_freed_mm_hashes() -> runner frees mm hashes -> runner._execute_mm_encoder()

This PR effectively delays freeing by one scheduling step with a basic double buffer.
Entries allocated in step N are only returned for freeing in step N+1, ensuring the model has executed and used the encoder output before it's removed from cache.
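The double-buffer scheme described above can be sketched as follows. This is a minimal standalone sketch, not the actual vLLM class: the `allocated`/`to_free` names mirror the PR, but `DoubleBufferedFreeList` and its methods are simplified for illustration.

```python
class DoubleBufferedFreeList:
    """Minimal sketch of the double-buffer idea: entries allocated in
    scheduling step N are only reported as freeable in step N+1."""

    def __init__(self) -> None:
        self.allocated: list[str] = []  # entries allocated this step
        self.to_free: list[str] = []    # entries allocated last step

    def allocate(self, mm_hash: str) -> None:
        # Newly allocated entries are deferred, not reported immediately.
        self.allocated.append(mm_hash)

    def get_freed_mm_hashes(self) -> list[str]:
        # Return last step's allocations; defer this step's to the next call.
        to_free = self.to_free
        self.to_free = self.allocated
        self.allocated = []
        return to_free
```

An entry allocated during step N is only returned by the call in step N+1, by which time the runner has already executed the model for step N and consumed the encoder output.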

Test with

Newly added test:

pytest -s -v -x tests/models/multimodal/generation/test_whisper.py::test_encoder_cache_cleanup

cc @DarkLight1337

Signed-off-by: NickLucche <nlucches@redhat.com>
@mergify mergify bot added multi-modality Related to multi-modality (#4194) v1 bug Something isn't working labels Jan 21, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request effectively addresses a critical memory leak in the EncoderDecoderCacheManager for encoder-decoder models like Whisper. The original issue stemmed from encoder cache entries being prematurely marked for freeing in the same scheduling step they were allocated, leading to missed deallocation attempts and an ever-growing cache. The introduced double-buffering mechanism, utilizing self.allocated and self.to_free lists, correctly delays the freeing of cache entries by one scheduling step. This ensures that encoder outputs are consumed by the model before being eligible for deallocation, thereby preventing the memory leak. The addition of test_encoder_cache_cleanup is a valuable regression test that validates the fix.

Comment on lines +360 to +361
self.allocated: list[str] = []
self.to_free: list[str] = []
Contributor


critical

The change from a single freed list to allocated and to_free lists is crucial for implementing the double-buffering mechanism. This ensures that entries are not prematurely marked for freeing before the model has a chance to use them, directly addressing the memory leak.

Comment on lines +397 to +403
# As the encoder cache is not used for enc-dec models, we can free the entries here.
# The actual free happens in the runner, *before* the model is executed.
# Therefore, `to_free` acts as a buffer so that entries are freed only after the
# model has executed, mimicking the state transition of `EncoderCacheManager`.
to_free = self.to_free
self.to_free = self.allocated
self.allocated = []
Contributor


critical

This new logic for get_freed_mm_hashes is the core of the fix. By moving self.allocated to self.to_free and then returning the previous self.to_free list, the system ensures that cache entries are only freed after a full scheduling cycle. This effectively prevents the race condition that caused the memory leak by delaying the deallocation until the encoder outputs have been consumed by the model.
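To make that ordering concrete, here is a self-contained trace of the swap across two scheduling steps. The state and function here are a hypothetical minimal model (module-level variables instead of the real manager class), mirroring only the variable names from the reviewed snippet.

```python
# Hypothetical minimal model of the manager's state.
allocated: list[str] = []  # entries allocated in the current step
to_free: list[str] = []    # entries allocated in the previous step

def get_freed_mm_hashes() -> list[str]:
    """Swap the buffers: report last step's allocations, defer this step's."""
    global allocated, to_free
    freed, to_free, allocated = to_free, allocated, []
    return freed

# Step N: an encoder output is allocated during scheduling.
allocated.append("audio-hash-1")
# The runner frees *before* executing the model; nothing is reported yet,
# so the model can still read "audio-hash-1" from the cache in step N.
assert get_freed_mm_hashes() == []
# Step N+1: the runner's pre-execution free now reclaims the entry.
assert get_freed_mm_hashes() == ["audio-hash-1"]
```

Because the right-hand side of the tuple assignment is evaluated before rebinding, the swap atomically rotates both buffers in one statement.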


  mm_hash = request.mm_features[input_id].identifier
- self.freed.append(mm_hash)
+ self.allocated.append(mm_hash)
Contributor


high

Updating the allocate method to append to self.allocated correctly marks newly allocated entries for processing in the next scheduling step, aligning with the double-buffering strategy to prevent premature freeing.

Comment on lines +181 to +221
@pytest.mark.core_model
@pytest.mark.parametrize("model", ["openai/whisper-large-v3-turbo"])
def test_encoder_cache_cleanup(
vllm_runner,
model: str,
input_audios,
monkeypatch,
) -> None:
"""Test that encoder cache is properly cleaned up after requests complete.

This is a regression test for a bug where encoder cache entries were freed
in the same scheduling step they were allocated, before the model could use
them.
"""
# Set single-process mode to access the model runner's encoder cache directly
monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
check_model_available(model)

with vllm_runner(
model,
dtype="half",
max_model_len=448,
tensor_parallel_size=1,
limit_mm_per_prompt={"audio": 2},
enforce_eager=True,
) as vllm_model:
engine_core = vllm_model.llm.llm_engine.engine_core.engine_core
model_runner = engine_core.model_executor.driver_worker.worker.model_runner
encoder_cache = model_runner.encoder_cache

# Run multiple sequential requests to ensure cache is properly managed
for vllm_prompts, _, audios in input_audios:
vllm_model.generate_greedy(vllm_prompts, max_tokens=50, audios=audios)

# After all requests complete, encoder cache should be empty
cache_size = len(encoder_cache)
assert cache_size == 0, (
f"Encoder cache should be empty after all requests complete, "
f"but has {cache_size} entries. This indicates encoder cache "
f"entries are not being properly freed."
)
Contributor


high

The addition of test_encoder_cache_cleanup is an excellent and necessary regression test. It directly targets the memory leak issue by verifying that the encoder cache is empty after multiple sequential requests, providing strong assurance that the fix is effective and robust.

Member

@DarkLight1337 DarkLight1337 left a comment


@DarkLight1337
Member

Maybe this could also solve the other memory leak issues

@NickLucche NickLucche changed the title [Bugfix] Fix Whisper/encoder-decoder memory leak [Bugfix] Fix Whisper/encoder-decoder GPU memory leak Jan 21, 2026
@NickLucche
Collaborator Author

@DarkLight1337 it looks related; I think the delay in the freeing flow could be the cause, but this PR is really targeted at enc-dec since I only edited the EncoderDecoderCacheManager.

Member

@DarkLight1337 DarkLight1337 left a comment


Let's merge this to fix #31577 at least

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) January 22, 2026 08:36
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 22, 2026
@DarkLight1337 DarkLight1337 merged commit ea6102b into vllm-project:main Jan 22, 2026
52 of 55 checks passed
khluu pushed a commit that referenced this pull request Jan 23, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
(cherry picked from commit ea6102b)
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: 陈建华 <1647430658@qq.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026

Signed-off-by: NickLucche <nlucches@redhat.com>
yma11 pushed a commit to yma11/vllm that referenced this pull request Feb 3, 2026
* Replace urllib's `urlparse` with urllib3's `parse_url` (vllm-project#32746)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
(cherry picked from commit 8ebf271)

* Bump opencv-python dependency version to 4.13 (vllm-project#32668)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
(cherry picked from commit 444e2e7)

* Fix Whisper/encoder-decoder GPU memory leak (vllm-project#32789)

Signed-off-by: NickLucche <nlucches@redhat.com>
(cherry picked from commit ea6102b)

---------

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
npanpaliya pushed a commit to odh-on-pz/vllm-cpu that referenced this pull request Feb 16, 2026
npanpaliya pushed a commit to odh-on-pz/vllm-cpu that referenced this pull request Feb 16, 2026
- [Misc] Implement `TokenizerLike.convert_tokens_to_ids`
(vllm-project/vllm#31796)
  [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [Bug] Revert torch warning fix (vllm-project/vllm#31585)
  [INFERENG-4152](https://issues.redhat.com/browse/INFERENG-4152)
- [Bug] Fix AttributeError: `ColumnParallelLinear` object has no
attribute `weight_scale_inv` (vllm-project/vllm#30823)
  [INFERENG-4153](https://issues.redhat.com/browse/INFERENG-4153)
- Avoid `opencv-python-headless==4.13.0.90`, it's broken. See
opencv/opencv-python#1183
- [Bugfix] Handle mistral tokenizer in get_hf_processor
(vllm-project/vllm#31817)
  [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [Bugfix] Fix Whisper/encoder-decoder GPU memory leak
vllm-project/vllm#32789
- [Model] Handle `trust_remote_code` for transformers backend
(vllm-project/vllm#32194) (fixes
GHSA-2pc9-4j83-qjmr)
- [Bugfix] CUDA: fix segfault by bumping numba to `numba==0.63.1`
([AIPCC-9384](https://issues.redhat.com/browse/AIPCC-9384))
- [Bugfix] pin `mistral_common==1.8.5` to avoid crash with Voxtral
([INFERENG-4154](https://issues.redhat.com/browse/INFERENG-4154))
- [Bugfix] fix tokenizer loading for mistral models
(vllm-project/vllm#33175)
  [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Signed-off-by: NickLucche <nlucches@redhat.com>
dtrifiro added a commit to dtrifiro/vllm that referenced this pull request Mar 9, 2026
- [build] fix cu130 related release pipeline steps and publish as
nightly image (vllm-project#32522)
- [Misc] Replace urllib's `urlparse` with urllib3's `parse_url`
(vllm-project#32746)
- [Misc] Bump opencv-python dependency version to 4.13
(vllm-project#32668)
- [Bugfix] Fix Whisper/encoder-decoder GPU memory leak
(vllm-project#32789)
- [CI] fix version comparison and exclusion patterns in
upload-release-wheels.sh (vllm-project#32971)
- tokenizers: mistral: fix merge conflict
- `Dockerfile.tpu.ubi`: add `git` to allow `pip install git+https`

Labels

bug Something isn't working multi-modality Related to multi-modality (#4194) ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Memory leak in serving Whisper

2 participants