
[Core] Free KV cache GPU memory on engine shutdown #28953

Open
markmc wants to merge 8 commits into vllm-project:main from markmc:llm-engine-shutdown

Conversation

@markmc
Member

@markmc markmc commented Nov 18, 2025

Related to #24885

Addresses the enable_multiprocessing=False TODO in tests/v1/shutdown/test_delete.py::test_llm_delete

The trickiest part is the single-process case ("inproc" engine and "uniproc" executor), where we can't rely on shutting down child processes to release GPU memory. To address that, we:

  1. Call engine_core.shutdown() from LLMEngine.__del__()
  2. Free KV cache GPU memory in GPUWorker.shutdown()
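The two-step cleanup chain above can be sketched as follows. The class and attribute names (LLMEngine, EngineCore, GPUWorker, kv_caches) mirror vLLM's structure, but this is an illustrative sketch of the approach, not the PR's actual implementation:

```python
# Hypothetical sketch of the shutdown chain: __del__ -> engine_core.shutdown()
# -> worker.shutdown(), which drops the KV cache references.

class GPUWorker:
    def __init__(self):
        # stand-in for the KV cache tensors held on the GPU
        self.kv_caches = ["kv_tensor_0", "kv_tensor_1"]

    def shutdown(self):
        # drop the references so the allocator can reclaim the memory;
        # the real worker would also need to release the CUDA allocations
        self.kv_caches = None


class EngineCore:
    def __init__(self):
        self.worker = GPUWorker()
        self.shut_down = False

    def shutdown(self):
        if not self.shut_down:  # keep shutdown idempotent
            self.worker.shutdown()
            self.shut_down = True


class LLMEngine:
    def __init__(self):
        self.engine_core = EngineCore()

    def __del__(self):
        # best-effort cleanup when the engine is garbage collected
        self.engine_core.shutdown()
```

Note that relying on `__del__` alone is best-effort (as the review below points out), which is why having an explicit, idempotent `shutdown()` underneath it matters.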

Other changes include:

  1. Avoid the pytest timeout when waiting for GPU cleanup: use the wait_for_gpu_memory_to_clear() timeout parameter to get a descriptive "Memory of devices ... not free" error instead
  2. Print memory usage at the start of the shutdown tests
  3. Added assert_mp_fork_context() so that the "We must use the 'spawn' multiprocessing start method..." override cannot silently break the evil_forward() monkey patch
  4. Move the TP=1 enable_multiprocessing=False tests into a forked subprocess, to avoid initializing CUDA in the parent process
  5. Leave test_forward_error::test_llm_model_error_without_multiprocessing disabled in the enable_multiprocessing=False case - it is still broken
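A helper like assert_mp_fork_context() (point 3 above) could look roughly like this. This is a hedged sketch, not the PR's actual code; the point is that monkey patches applied in the parent only survive into workers created with fork, since spawn re-imports modules in the child:

```python
# Illustrative sketch of an assert_mp_fork_context() helper.
import multiprocessing


def assert_mp_fork_context():
    """Fail loudly if the 'fork' start method is not in effect.

    With 'spawn', worker processes re-import modules from scratch,
    so in-process monkey patches (like evil_forward) are silently lost.
    """
    method = multiprocessing.get_start_method(allow_none=True)
    assert method in (None, "fork"), (
        f"expected 'fork' multiprocessing start method, got {method!r}; "
        "monkey patches will not propagate to spawned workers"
    )
```

Calling this at the top of a patch-dependent test turns a silent no-op into an immediate, diagnosable failure.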

@mergify mergify bot added the v1 label Nov 18, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the GPU memory leak on engine shutdown, especially for the single-process case, by introducing an explicit cleanup path. The refactoring in tests/utils.py to extract check_gpu_memory_usage is a nice improvement for test clarity.

I have two main points of feedback regarding the robustness of the shutdown mechanism:

  1. The reliance on __del__ for cleanup in LLMEngine can be unreliable in complex applications with reference cycles.
  2. The shutdown method in GPUWorker is now less defensive, which could lead to errors during shutdown if a worker failed to initialize completely.

Details are in the line comments.

@markmc markmc requested a review from njhill November 18, 2025 19:36
@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025
@mergify

mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 21, 2025
@markmc markmc force-pushed the llm-engine-shutdown branch 2 times, most recently from 244269a to 6a0c159 Compare November 21, 2025 11:15
@mergify mergify bot removed the needs-rebase label Nov 21, 2025
@markmc
Member Author

markmc commented Nov 21, 2025

This is turning out to be a bit of a saga:

  1. The test_forward_error test was failing because the evil_forward monkey patch wasn't working, due to this warning: "We must use the 'spawn' multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'." - I've added assert_mp_fork_context() so this doesn't silently break the monkey patch in future
  2. The reason this was happening is that the TP=1 enable_multiprocessing=False test was initializing CUDA in the parent process - I moved the enable_multiprocessing=False tests into a forked subprocess
  3. This wasn't working for me locally because of [Test] Fix pytest termination with @create_new_process_for_each_test("fork") #29130
  4. test_forward_error::test_llm_model_error_without_multiprocessing still isn't working - something is holding onto references - for now, I've left it disabled (as it currently is)
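The forked-subprocess isolation in point 2 above can be sketched like this. It is not the decorator vLLM actually uses (that is @create_new_process_for_each_test("fork"), referenced in point 3); it just shows the underlying pattern of running the test body in a forked child so the parent process never initializes CUDA:

```python
# Illustrative sketch: run a test body in a forked child process so
# side effects like CUDA initialization never touch the parent.
import os


def run_in_forked_child(test_fn):
    """Run test_fn in a forked child; return True if it passed."""
    pid = os.fork()
    if pid == 0:
        # child: never return into the parent's test framework
        try:
            test_fn()
        except BaseException:
            os._exit(1)
        os._exit(0)
    # parent: the child's exit status tells us whether the test passed
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

Because fork copies the parent's memory, the child also inherits any monkey patches already applied, which is exactly what the spawn start method breaks.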

@markmc markmc force-pushed the llm-engine-shutdown branch 2 times, most recently from 1753889 to 0f6b13a Compare November 24, 2025 11:53
@markmc markmc requested a review from WoosukKwon as a code owner November 24, 2025 11:53
To allow using check_gpu_memory_usage() at the start of a test.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Rather than failing with:

```
Failed: Timeout (>120.0s) from pytest-timeout.
```

fail with this instead:

```
ValueError: Memory of devices devices=[0] not free after dur_s=120.00 (threshold='2.0 GiB')
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
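The shape of a polling helper that fails with a descriptive error instead of a generic pytest timeout, as this commit describes, is roughly the following. This is a hedged sketch in the spirit of wait_for_gpu_memory_to_clear(); the real vLLM test utility has a different signature and queries device memory via NVML:

```python
# Illustrative sketch: poll until memory drops below a threshold, and
# raise a descriptive ValueError (rather than hitting pytest-timeout)
# if it doesn't clear within the deadline.
import time


def wait_for_memory_to_clear(get_used_gib, threshold_gib,
                             timeout_s=120.0, poll_s=0.5):
    start = time.monotonic()
    while True:
        used = get_used_gib()
        if used <= threshold_gib:
            return
        dur_s = time.monotonic() - start
        if dur_s >= timeout_s:
            raise ValueError(
                f"Memory not free after dur_s={dur_s:.2f} "
                f"(used={used:.2f} GiB, threshold={threshold_gib} GiB)"
            )
        time.sleep(poll_s)
```

The key difference from a test-runner timeout is that the error message carries the measurements needed to diagnose the leak.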
Fixes the shutdown test in the single-process case.

Start of test:

```
gpu memory used/total (GiB): 0=0.86/80.00;
```

end of test:

```
gpu memory used/total (GiB): 0=1.41/80.00
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
In the TP=1 and disable_multiprocessing case, the test process
gets polluted by CUDA initialization.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the llm-engine-shutdown branch from 0f6b13a to 0070d5a Compare December 18, 2025 14:43
wojciech-wais added a commit to wojciech-wais/vllm that referenced this pull request Mar 6, 2026
Add a public shutdown() method to the LLM class so library users
can explicitly release GPU memory and engine resources without
waiting for garbage collection.

  llm = LLM(model="my-model")
  results = llm.generate(prompts)
  llm.shutdown()          # free resources immediately

The method delegates to engine_core.shutdown(timeout=...) and
respects VllmConfig.shutdown_timeout when no explicit timeout is
given.  A __del__ fallback ensures resources are freed even when
shutdown() is not called explicitly.

Related to RFC vllm-project#24885, complements vllm-project#28953 (KV cache GPU memory
cleanup on engine shutdown).

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Mar 19, 2026

Labels

- ready (ONLY add when PR is ready to merge/full CI is needed)
- stale (Over 90 days of inactivity)
- v1
