[Core] Free KV cache GPU memory on engine shutdown #28953
markmc wants to merge 8 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses the GPU memory leak on engine shutdown, especially for the single-process case, by introducing an explicit cleanup path. The refactoring in tests/utils.py to extract check_gpu_memory_usage is a nice improvement for test clarity.
I have two main points of feedback regarding the robustness of the shutdown mechanism:
- The reliance on `__del__` for cleanup in `LLMEngine` can be unreliable in complex applications with reference cycles.
- The `shutdown` method in `GPUWorker` is now less defensive, which could lead to errors during shutdown if a worker failed to initialize completely.
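To illustrate the first concern, here is a minimal sketch (a toy class, not vLLM's actual `LLMEngine`) of why `__del__`-based cleanup is deferred once a reference cycle is involved:

```python
import gc

class Engine:
    """Toy stand-in for an engine that frees GPU resources in __del__."""
    freed = []

    def __init__(self, name):
        self.name = name
        self.on_error = None

    def __del__(self):
        Engine.freed.append(self.name)

# Without a cycle, refcounting frees the object immediately.
e = Engine("acyclic")
del e
assert Engine.freed == ["acyclic"]

# With a reference cycle, cleanup waits for the cyclic garbage
# collector, which may run much later (or not before process exit).
e = Engine("cyclic")
e.on_error = e  # self-reference forms a cycle
del e
assert "cyclic" not in Engine.freed
gc.collect()  # __del__ only fires once the collector runs
assert "cyclic" in Engine.freed
```

This is why an explicit `shutdown()` call is a more dependable release path than relying on `__del__` alone.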
Details are in the line comments.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 244269a to 6a0c159
This is turning out to be a bit of a saga:
Force-pushed from 1753889 to 0f6b13a
To allow using check_gpu_memory_usage() at the start of a test. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Rather than failing with:

```
Failed: Timeout (>120.0s) from pytest-timeout.
```

fail with this instead:

```
ValueError: Memory of devices devices=[0] not free after dur_s=120.00 (threshold='2.0 GiB')
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
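The pattern behind that commit can be sketched as a generic poll-with-deadline helper (names and signature are illustrative, not vLLM's actual `wait_for_gpu_memory_to_clear()`):

```python
import time

def wait_for(check, timeout_s, describe):
    """Poll `check` until it passes, or raise a descriptive error on timeout.

    Raising our own ValueError means the failure message explains *what*
    was not satisfied, instead of pytest-timeout killing the whole test
    with an opaque stack dump.
    """
    start = time.monotonic()
    while not check():
        dur_s = time.monotonic() - start
        if dur_s > timeout_s:
            raise ValueError(f"{describe} not satisfied after dur_s={dur_s:.2f}")
        time.sleep(0.05)
```

A passing check returns immediately; a failing one produces an error that names the unmet condition and how long it was waited for.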
Fixes the shutdown test in the single-process case.

Start of test:

```
gpu memory used/total (GiB): 0=0.86/80.00
```

end of test:

```
gpu memory used/total (GiB): 0=1.41/80.00
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
In the TP=1 and disable_multiprocessing case, the test process gets polluted by CUDA being initialized. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from 0f6b13a to 0070d5a
Add a public shutdown() method to the LLM class so library users can explicitly release GPU memory and engine resources without waiting for garbage collection.

```python
llm = LLM(model="my-model")
results = llm.generate(prompts)
llm.shutdown()  # free resources immediately
```

The method delegates to engine_core.shutdown(timeout=...) and respects VllmConfig.shutdown_timeout when no explicit timeout is given. A __del__ fallback ensures resources are freed even when shutdown() is not called explicitly.

Related to RFC vllm-project#24885, complements vllm-project#28953 (KV cache GPU memory cleanup on engine shutdown).

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
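The delegation-plus-fallback design described above can be sketched with toy classes (illustrative only; the real vLLM `LLM` and engine core differ):

```python
class FakeEngineCore:
    """Stand-in engine core that records shutdown calls and their timeouts."""
    def __init__(self):
        self.shutdown_calls = []

    def shutdown(self, timeout=None):
        self.shutdown_calls.append(timeout)

class TinyLLM:
    """Toy wrapper: explicit, idempotent shutdown() with a __del__ fallback."""
    def __init__(self, engine_core, config_timeout=5.0):
        self.engine_core = engine_core
        self.config_timeout = config_timeout  # stand-in for VllmConfig.shutdown_timeout
        self._shut_down = False

    def shutdown(self, timeout=None):
        if self._shut_down:
            return  # idempotent: safe to call from user code and __del__ alike
        self._shut_down = True
        self.engine_core.shutdown(
            timeout=timeout if timeout is not None else self.config_timeout
        )

    def __del__(self):
        # Fallback so resources are freed even without an explicit call.
        self.shutdown()

core = FakeEngineCore()
llm = TinyLLM(core)
llm.shutdown()   # explicit release, uses the config default timeout
del llm          # __del__ fires, but the second shutdown() is a no-op
assert core.shutdown_calls == [5.0]
```

Making `shutdown()` idempotent is what lets `__del__` delegate to it unconditionally without double-freeing.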
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Related to #24885
Addresses the `enable_multiprocessing=False` TODO in `tests/v1/shutdown/test_delete.py::test_llm_delete`.

The trickiest part is the single-process case ("inproc" engine and "uniproc" executor) where we can't rely on shutting down child processes to release GPU memory. To address that, we:

- call `engine_core.shutdown()` from `LLMEngine.__del__()`
- free KV cache GPU memory in `GPUWorker.shutdown()`

Other changes include:

- `wait_for_gpu_memory_to_clear()` timeout parameter, to get a nice `Memory of devices ... not free` error instead of a pytest timeout
- `assert_mp_fork_context()` added to avoid `We must use the 'spawn' multiprocessing start method....` silently breaking the `evil_forward()` monkey patch
- `TP=1` `enable_multiprocessing=False` tests moved into a forked subprocess, to avoid initializing CUDA in the parent process
- `test_forward_error::test_llm_model_error_without_multiprocessing` disabled in the `enable_multiprocessing=False` case - it is still broken
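The forked-subprocess trick mentioned above can be sketched as follows (a hypothetical helper, not the actual test utility; assumes a Linux host where the `fork` start method is available):

```python
import multiprocessing as mp
import os

def run_in_forked_child(fn, *args):
    """Run fn in a forked child process and return its result.

    Side effects such as CUDA initialization then happen only in the
    child, leaving the parent (e.g. the pytest process) unpolluted.
    """
    ctx = mp.get_context("fork")  # 'fork' is cheap and inherits state; Linux-only
    queue = ctx.Queue()

    def _target():
        queue.put(fn(*args))

    child = ctx.Process(target=_target)
    child.start()
    result = queue.get()  # read before join() to avoid a queue-buffer deadlock
    child.join()
    assert child.exitcode == 0
    return result

# The child reports a different PID, proving fn really ran out-of-process.
parent_pid = os.getpid()
child_pid = run_in_forked_child(os.getpid)
assert child_pid != parent_pid
```

Because the child exits when the test body returns, any GPU state it created is released with the process, which is exactly the guarantee the in-process case lacks.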