[Bugfix] Fix Whisper/encoder-decoder GPU memory leak #32789

DarkLight1337 merged 2 commits into vllm-project:main
Conversation
Code Review
The pull request effectively addresses a critical memory leak in the EncoderDecoderCacheManager for encoder-decoder models like Whisper. The original issue stemmed from encoder cache entries being prematurely marked for freeing in the same scheduling step they were allocated, leading to missed deallocation attempts and an ever-growing cache. The introduced double-buffering mechanism, utilizing self.allocated and self.to_free lists, correctly delays the freeing of cache entries by one scheduling step. This ensures that encoder outputs are consumed by the model before being eligible for deallocation, thereby preventing the memory leak. The addition of test_encoder_cache_cleanup is a valuable regression test that validates the fix.
```python
self.allocated: list[str] = []
self.to_free: list[str] = []
```

```python
# As the encoder cache is not used for enc-dec models, we can free the entries here.
# The actual free happens in the runner, *before* the model is executed.
# Therefore, `to_free` acts as a buffer to free the entries only after the
# model is executed, mimicking the state transition of `EncoderCacheManager`.
to_free = self.to_free
self.to_free = self.allocated
self.allocated = []
return to_free
```
This new logic for get_freed_mm_hashes is the core of the fix. By moving self.allocated to self.to_free and then returning the previous self.to_free list, the system ensures that cache entries are only freed after a full scheduling cycle. This effectively prevents the race condition that caused the memory leak by delaying the deallocation until the encoder outputs have been consumed by the model.
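The buffer rotation described above can be sketched as a tiny standalone class. This is an illustration only, not the actual vLLM implementation: the class name `EncoderDecoderCacheSketch` and the `allocate` helper are hypothetical, and only the two buffers from the PR are modeled.

```python
# Minimal sketch of the double-buffering fix (hypothetical names, not vLLM code).
class EncoderDecoderCacheSketch:
    def __init__(self) -> None:
        self.allocated: list[str] = []  # entries allocated in the current step
        self.to_free: list[str] = []    # entries allocated in the previous step

    def allocate(self, mm_hash: str) -> None:
        # Record the entry; it must survive at least one full scheduling step.
        self.allocated.append(mm_hash)

    def get_freed_mm_hashes(self) -> list[str]:
        # Return only entries from the *previous* step, then rotate the buffers,
        # so nothing is ever freed in the same step it was allocated.
        to_free = self.to_free
        self.to_free = self.allocated
        self.allocated = []
        return to_free
```

An entry allocated in step N first shows up in the return value of `get_freed_mm_hashes()` at step N+1, which matches the one-step delay the review comment describes.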
```diff
  mm_hash = request.mm_features[input_id].identifier
- self.freed.append(mm_hash)
+ self.allocated.append(mm_hash)
```
```python
@pytest.mark.core_model
@pytest.mark.parametrize("model", ["openai/whisper-large-v3-turbo"])
def test_encoder_cache_cleanup(
    vllm_runner,
    model: str,
    input_audios,
    monkeypatch,
) -> None:
    """Test that encoder cache is properly cleaned up after requests complete.

    This is a regression test for a bug where encoder cache entries were freed
    in the same scheduling step they were allocated, before the model could use
    them.
    """
    # Set single-process mode to access the model runner's encoder cache directly
    monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    check_model_available(model)

    with vllm_runner(
        model,
        dtype="half",
        max_model_len=448,
        tensor_parallel_size=1,
        limit_mm_per_prompt={"audio": 2},
        enforce_eager=True,
    ) as vllm_model:
        engine_core = vllm_model.llm.llm_engine.engine_core.engine_core
        model_runner = engine_core.model_executor.driver_worker.worker.model_runner
        encoder_cache = model_runner.encoder_cache

        # Run multiple sequential requests to ensure cache is properly managed
        for vllm_prompts, _, audios in input_audios:
            vllm_model.generate_greedy(vllm_prompts, max_tokens=50, audios=audios)

        # After all requests complete, encoder cache should be empty
        cache_size = len(encoder_cache)
        assert cache_size == 0, (
            f"Encoder cache should be empty after all requests complete, "
            f"but has {cache_size} entries. This indicates encoder cache "
            f"entries are not being properly freed."
        )
```
Maybe this could also solve the other memory leak issues

@DarkLight1337 it looks related, I think the delay in the freeing flow could be the case, but this PR is really targeted at enc-dec since I only edited the `EncoderDecoderCacheManager`.
DarkLight1337
left a comment
There was a problem hiding this comment.
Let's merge this to fix #31577 at least
Signed-off-by: NickLucche <nlucches@redhat.com> (cherry picked from commit ea6102b)
* Replace urllib's `urlparse` with urllib3's `parse_url` (vllm-project#32746)
  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> (cherry picked from commit 8ebf271)
* Bump opencv-python dependency version to 4.13 (vllm-project#32668)
  Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> (cherry picked from commit 444e2e7)
* Fix Whisper/encoder-decoder GPU memory leak (vllm-project#32789)
  Signed-off-by: NickLucche <nlucches@redhat.com> (cherry picked from commit ea6102b)

---------

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
- [Misc] Implement `TokenizerLike.convert_tokens_to_ids` (vllm-project/vllm#31796) [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [Bug] Revert torch warning fix (vllm-project/vllm#31585) [INFERENG-4152](https://issues.redhat.com/browse/INFERENG-4152)
- [Bug] Fix AttributeError: `ColumnParallelLinear` object has no attribute `weight_scale_inv` (vllm-project/vllm#30823) [INFERENG-4153](https://issues.redhat.com/browse/INFERENG-4153)
- Avoid `opencv-python-headless==4.13.0.90`, it's broken. See opencv/opencv-python#1183
- [Bugfix] Handle mistral tokenizer in get_hf_processor (vllm-project/vllm#31817) [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [Bugfix] Fix Whisper/encoder-decoder GPU memory leak vllm-project/vllm#32789
- [Model] Handle `trust_remote_code` for transformers backend (vllm-project/vllm#32194) (fixes GHSA-2pc9-4j83-qjmr)
- [Bugfix] CUDA: fix segfault by bumping numba to `numba==0.63.1` ([AIPCC-9384](https://issues.redhat.com/browse/AIPCC-9384))
- [Bugfix] pin `mistral_common==1.8.5` to avoid crash with Voxtral ([INFERENG-4154](https://issues.redhat.com/browse/INFERENG-4154))
- [Bugfix] fix tokenizer loading for mistral models (vllm-project/vllm#33175) [INFERENG-4151](https://issues.redhat.com/browse/INFERENG-4151)
- [build] fix cu130 related release pipeline steps and publish as nightly image (vllm-project#32522)
- [Misc] Replace urllib's `urlparse` with urllib3's `parse_url` (vllm-project#32746)
- [Misc] Bump opencv-python dependency version to 4.13 (vllm-project#32668)
- [Bugfix] Fix Whisper/encoder-decoder GPU memory leak (vllm-project#32789)
- [CI] fix version comparison and exclusion patterns in upload-release-wheels.sh (vllm-project#32971)
- tokenizers: mistral: fix merge conflict
- `Dockerfile.tpu.ubi`: add `git` to allow `pip install git+https`
Fix #31577.
Overview
For encoder-decoder models (e.g., Whisper), the encoder cache manager was returning newly allocated entries from `get_freed_mm_hashes()` during the same scheduling step they were allocated. Since the runner frees cache entries *before* model execution, the free attempts missed the encoder outputs, leading to an ever-growing encoder cache.
Flow:
`EncoderCacheManager.get_freed_mm_hashes` -> runner "free mm hashes" -> `runner._execute_mm_encoder`

This PR delays freeing by one scheduling step with a basic double buffer: entries allocated in step N are only returned for freeing in step N+1, ensuring the model has executed and used the encoder output before it's removed from the cache.
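To make the timing concrete, here is a small standalone simulation of the leak (all names are hypothetical and do not correspond to vLLM APIs). With the old same-step behavior the runner's free runs before the encoder output is stored, so every discard misses and the cache grows by one entry per step; with the double buffer, at most the last step's entry is pending.

```python
# Hypothetical simulation of the scheduling loop (not vLLM code).
def run_steps(double_buffer: bool, steps: int = 3) -> int:
    runner_cache: set[str] = set()   # encoder outputs held by the runner
    allocated: list[str] = []        # entries allocated this step
    to_free: list[str] = []          # entries allocated last step
    for step in range(steps):
        entry = f"mm_hash_{step}"
        allocated.append(entry)      # scheduler allocates an entry this step
        # get_freed_mm_hashes():
        if double_buffer:
            freed = to_free          # fixed: only last step's entries
            to_free = allocated
            allocated = []
        else:
            freed = allocated        # old: same-step entries returned
            allocated = []
        # Runner frees *before* model execution; with the old behavior the
        # entry is not in the cache yet, so the discard silently misses it.
        for h in freed:
            runner_cache.discard(h)
        # Model executes and stores the encoder output.
        runner_cache.add(entry)
    return len(runner_cache)
```

Running the simulation, `run_steps(False)` grows linearly with the number of steps (the leak), while `run_steps(True)` stays at a single pending entry regardless of step count.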
Test with the newly added `test_encoder_cache_cleanup`.
cc @DarkLight1337