[Core] Free KV cache GPU memory on engine shutdown #28953
markmc wants to merge 8 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses the GPU memory leak on engine shutdown, especially for the single-process case, by introducing an explicit cleanup path. The refactoring in tests/utils.py to extract check_gpu_memory_usage is a nice improvement for test clarity.
I have two main points of feedback regarding the robustness of the shutdown mechanism:
- The reliance on `__del__` for cleanup in `LLMEngine` can be unreliable in complex applications with reference cycles.
- The `shutdown` method in `GPUWorker` is now less defensive, which could lead to errors during shutdown if a worker failed to initialize completely.
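To illustrate the first concern, here is a minimal sketch (a toy class, not vLLM's actual `LLMEngine`) of why `__del__`-based cleanup is deferred once a reference cycle is involved:

```python
import gc

class Engine:
    """Toy stand-in for an engine that frees GPU resources in __del__."""
    freed = []

    def __init__(self, name):
        self.name = name
        self.on_error = None

    def __del__(self):
        Engine.freed.append(self.name)

# Without a cycle, refcounting frees the object immediately.
e = Engine("acyclic")
del e
assert Engine.freed == ["acyclic"]

# With a reference cycle, cleanup waits for the cyclic garbage
# collector, which may run much later (or not before process exit).
e = Engine("cyclic")
e.on_error = e  # self-reference forms a cycle
del e
assert "cyclic" not in Engine.freed
gc.collect()  # __del__ only fires once the collector runs
assert "cyclic" in Engine.freed
```

This is why an explicit `shutdown()` call is a more dependable release path than relying on `__del__` alone.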
Details are in the line comments.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 244269a to 6a0c159
This is turning out to be a bit of a saga:
Force-pushed from 1753889 to 0f6b13a
To allow using check_gpu_memory_usage() at the start of a test. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Rather than failing with:

```
Failed: Timeout (>120.0s) from pytest-timeout.
```

fail with this instead:

```
ValueError: Memory of devices devices=[0] not free after dur_s=120.00 (threshold='2.0 GiB')
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
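The pattern behind that commit can be sketched as a generic poll-with-deadline helper (names and signature are illustrative, not vLLM's actual `wait_for_gpu_memory_to_clear()`):

```python
import time

def wait_for(check, timeout_s, describe):
    """Poll `check` until it passes, or raise a descriptive error on timeout.

    Raising our own ValueError means the failure message explains *what*
    was not satisfied, instead of pytest-timeout killing the whole test
    with an opaque stack dump.
    """
    start = time.monotonic()
    while not check():
        dur_s = time.monotonic() - start
        if dur_s > timeout_s:
            raise ValueError(f"{describe} not satisfied after dur_s={dur_s:.2f}")
        time.sleep(0.05)
```

A passing check returns immediately; a failing one produces an error that names the unmet condition and how long it was waited for.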
Fixes the shutdown test in the single-process case.

Start of test:

```
gpu memory used/total (GiB): 0=0.86/80.00
```

end of test:

```
gpu memory used/total (GiB): 0=1.41/80.00
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
In the TP=1 and disable_multiprocessing case, the test process gets polluted by CUDA being initialized. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from 0f6b13a to 0070d5a
Add a public shutdown() method to the LLM class so library users can explicitly release GPU memory and engine resources without waiting for garbage collection.

```python
llm = LLM(model="my-model")
results = llm.generate(prompts)
llm.shutdown()  # free resources immediately
```

The method delegates to engine_core.shutdown(timeout=...) and respects VllmConfig.shutdown_timeout when no explicit timeout is given. A __del__ fallback ensures resources are freed even when shutdown() is not called explicitly.

Related to RFC vllm-project#24885, complements vllm-project#28953 (KV cache GPU memory cleanup on engine shutdown).

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
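The delegation-plus-fallback design described above can be sketched with toy classes (illustrative only; the real vLLM `LLM` and engine core differ):

```python
class FakeEngineCore:
    """Stand-in engine core that records shutdown calls and their timeouts."""
    def __init__(self):
        self.shutdown_calls = []

    def shutdown(self, timeout=None):
        self.shutdown_calls.append(timeout)

class TinyLLM:
    """Toy wrapper: explicit, idempotent shutdown() with a __del__ fallback."""
    def __init__(self, engine_core, config_timeout=5.0):
        self.engine_core = engine_core
        self.config_timeout = config_timeout  # stand-in for VllmConfig.shutdown_timeout
        self._shut_down = False

    def shutdown(self, timeout=None):
        if self._shut_down:
            return  # idempotent: safe to call from user code and __del__ alike
        self._shut_down = True
        self.engine_core.shutdown(
            timeout=timeout if timeout is not None else self.config_timeout
        )

    def __del__(self):
        # Fallback so resources are freed even without an explicit call.
        self.shutdown()

core = FakeEngineCore()
llm = TinyLLM(core)
llm.shutdown()   # explicit release, uses the config default timeout
del llm          # __del__ fires, but the second shutdown() is a no-op
assert core.shutdown_calls == [5.0]
```

Making `shutdown()` idempotent is what lets `__del__` delegate to it unconditionally without double-freeing.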
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Related to #24885
Addresses the `enable_multiprocessing=False` TODO in `tests/v1/shutdown/test_delete.py::test_llm_delete`.

The trickiest part is the single-process case ("inproc" engine and "uniproc" executor) where we can't rely on shutting down child processes to release GPU memory. To address that, we:

- call `engine_core.shutdown()` from `LLMEngine.__del__()`
- free KV cache GPU memory in `GPUWorker.shutdown()`

Other changes include:

- `wait_for_gpu_memory_to_clear()` timeout parameter, to get a nice `Memory of devices ... not free` error instead of a pytest timeout
- `assert_mp_fork_context()` added to avoid `We must use the 'spawn' multiprocessing start method....` silently breaking the `evil_forward()` monkey patch
- `TP=1` `enable_multiprocessing=False` tests moved into a forked subprocess, to avoid initializing CUDA in the parent process
- `test_forward_error::test_llm_model_error_without_multiprocessing` disabled in the `enable_multiprocessing=False` case - it is still broken
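The forked-subprocess trick mentioned above can be sketched as follows (a hypothetical helper, not the actual test utility; assumes a Linux host where the `fork` start method is available):

```python
import multiprocessing as mp
import os

def run_in_forked_child(fn, *args):
    """Run fn in a forked child process and return its result.

    Side effects such as CUDA initialization then happen only in the
    child, leaving the parent (e.g. the pytest process) unpolluted.
    """
    ctx = mp.get_context("fork")  # 'fork' is cheap and inherits state; Linux-only
    queue = ctx.Queue()

    def _target():
        queue.put(fn(*args))

    child = ctx.Process(target=_target)
    child.start()
    result = queue.get()  # read before join() to avoid a queue-buffer deadlock
    child.join()
    assert child.exitcode == 0
    return result

# The child reports a different PID, proving fn really ran out-of-process.
parent_pid = os.getpid()
child_pid = run_in_forked_child(os.getpid)
assert child_pid != parent_pid
```

Because the child exits when the test body returns, any GPU state it created is released with the process, which is exactly the guarantee the in-process case lacks.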