[Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs#36258
[Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs#36258wojciech-wais wants to merge 2 commits intovllm-project:mainfrom
Conversation
|
Documentation preview: https://vllm--36258.org.readthedocs.build/en/36258/ |
There was a problem hiding this comment.
Code Review
This pull request introduces valuable features for graceful shutdown and Kubernetes integration, including a new /live liveness probe and an LLM.shutdown() method. The accompanying documentation updates are clear and comprehensive. The overall implementation is solid, but I've identified one critical issue in the EngineClient protocol where the default implementation of is_engine_dead contradicts its documentation. This could lead to incorrect liveness probe behavior in Kubernetes. My review includes a suggestion to fix this by making the property abstract.
| def is_engine_dead(self) -> bool: | ||
| """Return True only when the engine has encountered a fatal error. | ||
| This is distinct from ``errored`` which also returns True during | ||
| graceful shutdown/drain.""" | ||
| return self.errored |
There was a problem hiding this comment.
The implementation return self.errored contradicts the docstring, which states that is_engine_dead is distinct from errored. The errored property is expected to be true during graceful shutdown, while is_engine_dead should only be true on fatal errors. This default implementation will cause incorrect behavior for consumers of the EngineClient protocol if they don't override it, potentially leading Kubernetes to kill pods that are gracefully shutting down.
To enforce the correct behavior in subclasses, this property should be an abstract method.
| def is_engine_dead(self) -> bool: | |
| """Return True only when the engine has encountered a fatal error. | |
| This is distinct from ``errored`` which also returns True during | |
| graceful shutdown/drain.""" | |
| return self.errored | |
| @abstractmethod | |
| def is_engine_dead(self) -> bool: | |
| """Return True only when the engine has encountered a fatal error. | |
| This is distinct from ``errored`` which also returns True during | |
| graceful shutdown/drain.""" | |
| ... |
Thank you, will try to review soon 👍
Let's deal with |
5dd9a85 to
428a55c
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
428a55c to
c58c099
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
- Add /live endpoint that returns 200 during graceful drain (unlike /health which returns 503). This lets Kubernetes distinguish between "draining" and "dead" via separate liveness and readiness probes. - Add is_engine_dead property to EngineClient/AsyncLLM so the /live endpoint only fails on fatal engine errors, not graceful shutdown. - Exempt /live and /metrics from ScalingMiddleware 503 blocking. - Add comprehensive "Graceful Shutdown" section to k8s deployment docs with probe configuration examples and terminationGracePeriodSeconds. Part of RFC vllm-project#24885 Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
…ngMiddleware - TestLiveEndpoint: test /live returns 200 for healthy/draining engines, 503 for dead engines, 200 for render-only servers. - TestHealthDraining: test /health returns 503 when draining (skipping check_health), 200 when healthy, 200 for render-only servers. - TestScalingMiddlewareExemptions: test /live and /metrics are exempt from 503 during scaling, other paths are blocked. Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
a4c65e5 to
e3b3804
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Follow-up items from RFC #24885 (shutdown semantics):
Add /live liveness probe endpoint that returns 200 during graceful shutdown/drain (so Kubernetes does not restart the pod mid-drain) and only returns 503 on fatal engine error. The /health endpoint continues to serve as the readiness probe (503 during drain).
Add is_engine_dead property to EngineClient protocol to distinguish fatal errors from graceful shutdown state.
Exempt /live and /metrics from ScalingMiddleware so liveness probes and metrics scraping work during elastic EP scaling.
Add LLM.shutdown(timeout) for library/offline users to explicitly trigger clean shutdown instead of relying on garbage collection. Also adds del fallback.
Update docs/deployment/k8s.md with:
Fixes #24885
Purpose
Address RFC #24885
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.