[Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs #36258

Open
wojciech-wais wants to merge 2 commits into vllm-project:main from wojciech-wais:feat/shutdown-followups


@wojciech-wais wojciech-wais commented Mar 6, 2026

Follow-up items from RFC #24885 (shutdown semantics):

  1. Add /live liveness probe endpoint that returns 200 during graceful shutdown/drain (so Kubernetes does not restart the pod mid-drain) and only returns 503 on fatal engine error. The /health endpoint continues to serve as the readiness probe (503 during drain).

  2. Add is_engine_dead property to EngineClient protocol to distinguish fatal errors from graceful shutdown state.

  3. Exempt /live and /metrics from ScalingMiddleware so liveness probes and metrics scraping work during elastic EP scaling.

  4. Add LLM.shutdown(timeout) so that library/offline users can explicitly trigger a clean shutdown instead of relying on garbage collection. Also adds a __del__ fallback.

  5. Update docs/deployment/k8s.md with:

    • Graceful shutdown section explaining --shutdown-timeout
    • Probe endpoint behavior table (/health vs /live)
    • Complete Deployment YAML with terminationGracePeriodSeconds
    • Optional preStop hook pattern
    • Fix existing NVIDIA GPU example to use /live for livenessProbe
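
The probe semantics described in item 1 can be sketched as a pure function from engine state to HTTP status code. This is a minimal illustration, not the PR's code: the `EngineState` enum and the `probe_status` helper are hypothetical names introduced here, and the 503 returned by /health for a dead engine is an assumption; the remaining status codes come from the description above.

```python
from enum import Enum


class EngineState(Enum):
    """Hypothetical engine states, named here for illustration only."""

    HEALTHY = "healthy"
    DRAINING = "draining"  # graceful shutdown in progress
    DEAD = "dead"          # fatal engine error


def probe_status(path: str, state: EngineState) -> int:
    """Return the HTTP status a probe would see for the given state."""
    if path == "/live":
        # Liveness fails only on a fatal error, so Kubernetes does not
        # restart a pod that is still draining in-flight requests.
        return 503 if state is EngineState.DEAD else 200
    if path == "/health":
        # Readiness fails as soon as the server should stop receiving
        # new traffic, which includes the drain phase.
        return 200 if state is EngineState.HEALTHY else 503
    raise ValueError(f"unknown probe path: {path}")
```

With this mapping, a draining pod reports 503 on /health (removed from Service endpoints) while still reporting 200 on /live (not restarted by the kubelet).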

Fixes #24885

Purpose

Address RFC #24885

Test Plan

Test Result




mergify bot commented Mar 6, 2026

Documentation preview: https://vllm--36258.org.readthedocs.build/en/36258/

@mergify mergify bot added the documentation, frontend, and v1 labels Mar 6, 2026

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces valuable features for graceful shutdown and Kubernetes integration, including a new /live liveness probe and an LLM.shutdown() method. The accompanying documentation updates are clear and comprehensive. The overall implementation is solid, but I've identified one critical issue in the EngineClient protocol where the default implementation of is_engine_dead contradicts its documentation. This could lead to incorrect liveness probe behavior in Kubernetes. My review includes a suggestion to fix this by making the property abstract.

Comment on lines +124 to +128
    def is_engine_dead(self) -> bool:
        """Return True only when the engine has encountered a fatal error.
        This is distinct from ``errored`` which also returns True during
        graceful shutdown/drain."""
        return self.errored

critical

The implementation return self.errored contradicts the docstring, which states that is_engine_dead is distinct from errored. The errored property is expected to be true during graceful shutdown, while is_engine_dead should only be true on fatal errors. This default implementation will cause incorrect behavior for consumers of the EngineClient protocol if they don't override it, potentially leading Kubernetes to kill pods that are gracefully shutting down.

To enforce the correct behavior in subclasses, this property should be an abstract method.

Suggested change
-    def is_engine_dead(self) -> bool:
-        """Return True only when the engine has encountered a fatal error.
-        This is distinct from ``errored`` which also returns True during
-        graceful shutdown/drain."""
-        return self.errored
+    @abstractmethod
+    def is_engine_dead(self) -> bool:
+        """Return True only when the engine has encountered a fatal error.
+        This is distinct from ``errored`` which also returns True during
+        graceful shutdown/drain."""
+        ...
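
The effect of the suggested change can be sketched with a small abstract base class. This is a hedged illustration only: `EngineClientSketch`, `DrainingClient`, and `IncompleteClient` are hypothetical names, and the sketch assumes the real interface is built on Python's `abc` machinery so that `@abstractmethod` is enforced at instantiation time.

```python
from abc import ABC, abstractmethod


class EngineClientSketch(ABC):
    """Hypothetical stand-in for the EngineClient interface."""

    @property
    @abstractmethod
    def errored(self) -> bool:
        """True on a fatal error and, per the review, during graceful drain."""

    @property
    @abstractmethod
    def is_engine_dead(self) -> bool:
        """True only after a fatal engine error."""


class DrainingClient(EngineClientSketch):
    """Hypothetical client that is draining but has not failed."""

    @property
    def errored(self) -> bool:
        return True

    @property
    def is_engine_dead(self) -> bool:
        return False


class IncompleteClient(EngineClientSketch):
    """Hypothetical client that forgets to override is_engine_dead."""

    @property
    def errored(self) -> bool:
        return False
```

Instantiating `IncompleteClient()` raises `TypeError`, so an implementation cannot silently inherit a default that conflates draining with dead; `DrainingClient()` shows the two properties diverging during a drain.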

markmc (Member) commented Mar 6, 2026

> Follow-up items from RFC #24885 (shutdown semantics):

Thank you, will try to review soon 👍

>   1. Add LLM.shutdown(timeout) for library/offline users to explicitly trigger clean shutdown instead of relying on garbage collection. Also adds __del__ fallback.

Let's deal with LLM.shutdown() as a separate PR - it's quite orthogonal to the Kubernetes integration - and I'd also like to see more testing of this. Probably some overlap with #28953 also
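
For context, the pattern under discussion (an explicit, idempotent shutdown with a best-effort `__del__` fallback) might look roughly like the sketch below. It is not vLLM's `LLM` class: `OfflineEngineSketch` and its background thread are hypothetical stand-ins for whatever resources a real engine would hold.

```python
import threading
from typing import Optional


class OfflineEngineSketch:
    """Hypothetical wrapper around a background worker; not vLLM's LLM."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        # The worker simply blocks until asked to stop.
        self._worker = threading.Thread(target=self._stop.wait, daemon=True)
        self._worker.start()
        self._closed = False

    def shutdown(self, timeout: Optional[float] = None) -> bool:
        """Stop the worker; return True if it exited within `timeout`."""
        if self._closed:
            return True  # idempotent: a second call is a no-op
        self._stop.set()
        self._worker.join(timeout)
        self._closed = not self._worker.is_alive()
        return self._closed

    def __del__(self) -> None:
        # Best-effort fallback only: the timing of __del__ is not
        # guaranteed, so callers should prefer an explicit shutdown().
        try:
            self.shutdown(timeout=1.0)
        except Exception:
            pass
```

Calling `shutdown()` explicitly gives the caller a bounded wait and a return value to check, which garbage collection alone cannot provide.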

@wojciech-wais wojciech-wais force-pushed the feat/shutdown-followups branch from 5dd9a85 to 428a55c Compare March 6, 2026 21:47
wojciech-wais (Author) commented

Thanks for the feedback, @markmc.
I have prepared a separate PR for LLM.shutdown():

#36283

I have also updated this PR to remove the explicit shutdown call.


mergify bot commented Mar 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wojciech-wais.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented Mar 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wojciech-wais.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 13, 2026
- Add /live endpoint that returns 200 during graceful drain (unlike
  /health which returns 503). This lets Kubernetes distinguish between
  "draining" and "dead" via separate liveness and readiness probes.

- Add is_engine_dead property to EngineClient/AsyncLLM so the /live
  endpoint only fails on fatal engine errors, not graceful shutdown.

- Exempt /live and /metrics from ScalingMiddleware 503 blocking.

- Add comprehensive "Graceful Shutdown" section to k8s deployment docs
  with probe configuration examples and terminationGracePeriodSeconds.

Part of RFC vllm-project#24885

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
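
The path exemption described in this commit can be illustrated with a dependency-free ASGI middleware. This is a sketch under stated assumptions, not vLLM's ScalingMiddleware: `ScalingGateSketch`, `EXEMPT_PATHS`, `ok_app`, and `status_for` are hypothetical names, and the `is_scaling` callable stands in for however the real server tracks scaling state.

```python
import asyncio

# Paths that must stay reachable while scaling is in progress.
EXEMPT_PATHS = frozenset({"/live", "/metrics"})


class ScalingGateSketch:
    """Hypothetical ASGI middleware that returns 503 while scaling."""

    def __init__(self, app, is_scaling):
        self.app = app
        self.is_scaling = is_scaling  # callable returning bool

    async def __call__(self, scope, receive, send):
        blocked = (
            scope["type"] == "http"
            and self.is_scaling()
            and scope["path"] not in EXEMPT_PATHS
        )
        if not blocked:
            # Exempt paths and non-scaling periods pass straight through.
            await self.app(scope, receive, send)
            return
        await send({
            "type": "http.response.start",
            "status": 503,
            "headers": [(b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"scaling in progress"})


async def ok_app(scope, receive, send):
    """Trivial downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(path: str, scaling: bool) -> int:
    """Drive the middleware once and return the response status."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    app = ScalingGateSketch(ok_app, lambda: scaling)
    asyncio.run(app({"type": "http", "path": path}, receive, send))
    return sent[0]["status"]
```

Keeping the exemption as a small, explicit set makes it easy to see which endpoints remain available to the kubelet and to metrics scrapers during a scaling event.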
…ngMiddleware

- TestLiveEndpoint: test /live returns 200 for healthy/draining engines,
  503 for dead engines, 200 for render-only servers.
- TestHealthDraining: test /health returns 503 when draining (skipping
  check_health), 200 when healthy, 200 for render-only servers.
- TestScalingMiddlewareExemptions: test /live and /metrics are exempt
  from 503 during scaling, other paths are blocked.

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
@wojciech-wais wojciech-wais force-pushed the feat/shutdown-followups branch from a4c65e5 to e3b3804 Compare March 18, 2026 20:28
@mergify mergify bot removed the needs-rebase label Mar 18, 2026

mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wojciech-wais.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
Labels

documentation, frontend, needs-rebase, v1

Development

Successfully merging this pull request may close these issues.

[RFC] Clarifying vLLM Shutdown Semantics

2 participants