[Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs by wojciech-wais · Pull Request #36258 · vllm-project/vllm

wojciech-wais · 2026-03-06T15:38:48Z

Follow-up items from RFC #24885 (shutdown semantics):

- Add /live liveness probe endpoint that returns 200 during graceful shutdown/drain (so Kubernetes does not restart the pod mid-drain) and returns 503 only on a fatal engine error. The /health endpoint continues to serve as the readiness probe (503 during drain).
- Add is_engine_dead property to the EngineClient protocol to distinguish fatal errors from the graceful shutdown state.
- Exempt /live and /metrics from ScalingMiddleware so liveness probes and metrics scraping keep working during elastic EP scaling.
- Add LLM.shutdown(timeout) so library/offline users can explicitly trigger a clean shutdown instead of relying on garbage collection. Also adds a __del__ fallback.
- Update docs/deployment/k8s.md with:
  - Graceful shutdown section explaining --shutdown-timeout
  - Probe endpoint behavior table (/health vs /live)
  - Complete Deployment YAML with terminationGracePeriodSeconds
  - Optional preStop hook pattern
  - Fix to the existing NVIDIA GPU example so it uses /live for livenessProbe
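
The probe split described above can be summarized as a small status table. The following is a minimal sketch of that table only, not the PR's implementation: the `EngineState` enum and both function names are hypothetical, and the 503 from /health on a dead engine is an assumption (the description only states the drain case for /health).

```python
from enum import Enum


class EngineState(Enum):
    """Hypothetical engine states, named here for illustration only."""

    HEALTHY = "healthy"
    DRAINING = "draining"  # graceful shutdown in progress
    DEAD = "dead"          # fatal engine error


def health_status(state: EngineState) -> int:
    """Readiness (/health): only a healthy engine should receive new traffic.

    Returning 503 for DEAD is an assumption; the PR text only states 503 for drain.
    """
    return 200 if state is EngineState.HEALTHY else 503


def live_status(state: EngineState) -> int:
    """Liveness (/live): fail only on a fatal error, so a draining pod is not restarted."""
    return 503 if state is EngineState.DEAD else 200


if __name__ == "__main__":
    for state in EngineState:
        print(f"{state.value:>8}: /health={health_status(state)} /live={live_status(state)}")
```

The draining row is the one that matters for Kubernetes: readiness fails so the pod is taken out of rotation, while liveness keeps passing so the kubelet lets in-flight requests finish.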

Fixes #24885

Purpose

Address RFC #24885

Test Plan

Test Result

mergify · 2026-03-06T15:39:25Z

Documentation preview: https://vllm--36258.org.readthedocs.build/en/36258/

gemini-code-assist

Code Review

This pull request introduces valuable features for graceful shutdown and Kubernetes integration, including a new /live liveness probe and an LLM.shutdown() method. The accompanying documentation updates are clear and comprehensive. The overall implementation is solid, but I've identified one critical issue in the EngineClient protocol where the default implementation of is_engine_dead contradicts its documentation. This could lead to incorrect liveness probe behavior in Kubernetes. My review includes a suggestion to fix this by making the property abstract.

gemini-code-assist · 2026-03-06T15:42:30Z

vllm/engine/protocol.py

+    def is_engine_dead(self) -> bool:
+        """Return True only when the engine has encountered a fatal error.
+        This is distinct from ``errored`` which also returns True during
+        graceful shutdown/drain."""
+        return self.errored


The implementation return self.errored contradicts the docstring, which states that is_engine_dead is distinct from errored. The errored property is expected to be true during graceful shutdown, while is_engine_dead should only be true on fatal errors. This default implementation will cause incorrect behavior for consumers of the EngineClient protocol if they don't override it, potentially leading Kubernetes to kill pods that are gracefully shutting down.

To enforce the correct behavior in subclasses, this property should be an abstract method.

Suggested change

-    def is_engine_dead(self) -> bool:
-        """Return True only when the engine has encountered a fatal error.
-        This is distinct from ``errored`` which also returns True during
-        graceful shutdown/drain."""
-        return self.errored
+    @abstractmethod
+    def is_engine_dead(self) -> bool:
+        """Return True only when the engine has encountered a fatal error.
+        This is distinct from ``errored`` which also returns True during
+        graceful shutdown/drain."""
+        ...
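
To illustrate why the reviewer prefers an abstract member over a default that falls back to errored, here is a minimal sketch. All class names are hypothetical stand-ins rather than vLLM's actual classes, and it assumes is_engine_dead is declared as a property (the quoted diff does not show the decorator).

```python
from abc import ABC, abstractmethod


class EngineClientSketch(ABC):
    """Hypothetical stand-in for the EngineClient protocol."""

    @property
    @abstractmethod
    def errored(self) -> bool:
        """True on fatal error and, per the docstring under review, during drain."""

    @property
    @abstractmethod
    def is_engine_dead(self) -> bool:
        """True only on a fatal engine error."""


class DrainAwareClient(EngineClientSketch):
    """Tracks draining and fatal failure separately (illustrative fields)."""

    def __init__(self) -> None:
        self._draining = False
        self._fatal = False

    @property
    def errored(self) -> bool:
        return self._draining or self._fatal

    @property
    def is_engine_dead(self) -> bool:
        return self._fatal


class ForgetfulClient(EngineClientSketch):
    """Omits is_engine_dead, so it cannot be instantiated."""

    @property
    def errored(self) -> bool:
        return False


if __name__ == "__main__":
    client = DrainAwareClient()
    client._draining = True
    print(client.errored, client.is_engine_dead)
    try:
        ForgetfulClient()
    except TypeError as exc:
        print(f"rejected: {exc}")
```

With the abstract declaration, a subclass that forgets to override the property fails at instantiation time instead of silently reporting a draining engine as dead.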

markmc · 2026-03-06T16:11:04Z

> Follow-up items from RFC #24885 (shutdown semantics):

Thank you, will try to review soon 👍

> Add LLM.shutdown(timeout) for library/offline users to explicitly trigger clean shutdown instead of relying on garbage collection. Also adds __del__ fallback.

Let's deal with LLM.shutdown() in a separate PR. It's quite orthogonal to the Kubernetes integration, and I'd also like to see more testing of it. There is probably some overlap with #28953 as well.

wojciech-wais · 2026-03-06T21:47:52Z

Thanks for the feedback, @markmc.
I have prepared a separate PR for the shutdown:

#36283

I have also updated this PR to remove the explicit shutdown call from it.

mergify · 2026-03-08T15:42:06Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wojciech-wais.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-03-13T19:53:37Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wojciech-wais.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

- Add /live endpoint that returns 200 during graceful drain (unlike /health which returns 503). This lets Kubernetes distinguish between "draining" and "dead" via separate liveness and readiness probes.
- Add is_engine_dead property to EngineClient/AsyncLLM so the /live endpoint only fails on fatal engine errors, not graceful shutdown.
- Exempt /live and /metrics from ScalingMiddleware 503 blocking.
- Add comprehensive "Graceful Shutdown" section to k8s deployment docs with probe configuration examples and terminationGracePeriodSeconds.

Part of RFC vllm-project#24885

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>

…ngMiddleware

- TestLiveEndpoint: test /live returns 200 for healthy/draining engines, 503 for dead engines, 200 for render-only servers.
- TestHealthDraining: test /health returns 503 when draining (skipping check_health), 200 when healthy, 200 for render-only servers.
- TestScalingMiddlewareExemptions: test /live and /metrics are exempt from 503 during scaling, other paths are blocked.

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
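
The exemption behavior these tests describe can be sketched as a pure-ASGI middleware with no framework dependency. This is an illustration under assumptions, not vLLM's ScalingMiddleware: the class name, the `is_scaling` callable, the response body, and the `status_for` helper are all hypothetical; only the exempt paths come from the PR text.

```python
import asyncio

EXEMPT_PATHS = {"/live", "/metrics"}  # paths named in the PR description


class ScalingGateSketch:
    """Hypothetical pure-ASGI middleware; not vLLM's ScalingMiddleware."""

    def __init__(self, app, is_scaling):
        self.app = app
        self.is_scaling = is_scaling  # callable returning True while scaling

    async def __call__(self, scope, receive, send):
        blocked = (
            scope["type"] == "http"
            and self.is_scaling()
            and scope["path"] not in EXEMPT_PATHS
        )
        if not blocked:
            # Exempt paths and non-scaling periods pass straight through.
            await self.app(scope, receive, send)
            return
        await send({"type": "http.response.start", "status": 503, "headers": []})
        await send({"type": "http.response.body", "body": b"scaling in progress"})


async def ok_app(scope, receive, send):
    """Trivial downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(path: str, scaling: bool) -> int:
    """Drive the middleware once and return the response status code."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    app = ScalingGateSketch(ok_app, lambda: scaling)
    asyncio.run(app({"type": "http", "path": path}, receive, send))
    return sent[0]["status"]


if __name__ == "__main__":
    for path in ("/live", "/metrics", "/v1/completions"):
        print(path, status_for(path, scaling=True))
```

Checking the path before short-circuiting keeps probes and metrics scraping reachable while every other route is refused during scaling.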

mergify · 2026-03-20T09:20:56Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wojciech-wais.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

wojciech-wais requested review from DarkLight1337 and njhill as code owners March 6, 2026 15:38

mergify bot added documentation Improvements or additions to documentation frontend v1 labels Mar 6, 2026

gemini-code-assist bot reviewed Mar 6, 2026

wojciech-wais force-pushed the feat/shutdown-followups branch from 5dd9a85 to 428a55c Compare March 6, 2026 21:47

mergify bot added the needs-rebase label Mar 8, 2026

wojciech-wais force-pushed the feat/shutdown-followups branch from 428a55c to c58c099 Compare March 11, 2026 06:28

mergify bot removed the needs-rebase label Mar 11, 2026

wojciech-wais requested review from NickLucche, aarnphm and robertgshaw2-redhat as code owners March 11, 2026 06:54

wojciech-wais mentioned this pull request Mar 11, 2026

[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish #36666

Merged

mergify bot added the needs-rebase label Mar 13, 2026

wojciech-wais added 2 commits March 18, 2026 21:28

wojciech-wais force-pushed the feat/shutdown-followups branch from a4c65e5 to e3b3804 Compare March 18, 2026 20:28

mergify bot removed the needs-rebase label Mar 18, 2026

mergify bot added the needs-rebase label Mar 20, 2026

wojciech-wais wants to merge 2 commits into vllm-project:main from wojciech-wais:feat/shutdown-followups
