[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish#36666
[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish#36666njhill merged 13 commits intovllm-project:mainfrom
Conversation
6e94dc4 to
a664812
Compare
|
I expect |
There was a problem hiding this comment.
Code Review
This pull request introduces a shutdown timeout feature to vLLM, allowing for a grace period to complete in-flight requests before shutting down. The changes include adding a shutdown_timeout parameter to the VllmConfig and EngineArgs, modifying the signal handlers to initiate a graceful shutdown, and updating the process management to respect the timeout. The close() methods were renamed to shutdown().
|
@markmc I wanted to note a couple of follow-on things I'd noticed, probably not directly related to the 4-GPU test breakages but thought I would record here anyhow:
Don't necessarily need to address these as part of the re-added PR of course. |
|
@markmc @njhill On the 4-GPU test failure (#36624): From the bisection in #36624, the failure is a race condition in A few things I could investigate:
I don't have a 4-GPU setup to reproduce |
|
Thanks @wojciech-wais for looking at this... I will defer to @markmc because I don't know of his plan or whether he's already started on this, but would be great if we can confirm the issue and incorporate a clean fix in this PR. I am interested and will try to have a closer look today too. The follow-on points could also be fixed here or handled in a follow-on PR... |
When the engine core receives SIGTERM/SIGINT, abort all in-flight requests via the scheduler and send abort outputs to clients before exiting. Then signal the output thread to drain all pending messages (including abort responses) by pushing ENGINE_CORE_DEAD sentinel and joining the thread with a 5s timeout. Previously, the engine core would exit immediately on signal without notifying clients that their requests were aborted. This left clients hanging with no response. Now clients receive proper abort responses before the process terminates. Addresses follow-up points from njhill on PR vllm-project#36666: 1. Abort requests when shutdown timeout expires 2. Wait for output thread to drain before exiting Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
Add _send_error_output() helper method to consolidate error output creation. Model it on existing _send_abort_outputs() Refactor _handle_request_preproc_error() to use the new helper. This is a pure refactoring with no behavior change, preparing for shutdown request rejection logic. Note: The original code set engine_index=self.engine_index when creating EngineCoreOutputs, but this is redundant. The process_output_sockets() thread overwrites engine_index when dequeuing outputs from output_queue, so setting it during construction is pointless. The new helper follows the existing pattern from _send_abort_outputs() and the scheduler which also omit engine_index during construction. Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Add shutdown_timeout parameter. 0 = abort immediately, >0 = wait. Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Add --shutdown-timeout CLI argument to EngineArgs. The arguments exist but don't affect behavior yet. Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Add shutdown(timeout) methods to CoreEngineProcManager, APIServerProcessManager, and DPCoordinator to encapsulate process shutdown logic within manager classes. This eliminates the need for callers to directly access .processes or .proc attributes while maintaining backward compatibility. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Add shutdown() method with timeout support through the engine client hierarchy to enable graceful shutdown coordination. Changes: - Add abstract shutdown() method to EngineClient protocol - Update AsyncLLM.shutdown() to accept optional timeout parameter - Update EngineCoreClient abstract method and all implementations: - InprocClient: Accept but ignore timeout (no processes to terminate) - MPClient: Detach finalizer, call engine_manager.shutdown(), clean up resources The finalizer must be detached in MPClient to prevent automatic cleanup during garbage collection. This allows explicit shutdown with configurable timeout, followed by manual resource cleanup. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Restructure signal handling to support graceful shutdown with configurable timeout for the single API server launcher. Changes: - Replace immediate task cancellation in signal_handler with shutdown_event.set() to trigger graceful shutdown - Add _handle_shutdown() coroutine that: - Waits for shutdown signal - Retrieves shutdown timeout from vllm_config - Calls engine_client.shutdown(timeout=timeout) for graceful engine shutdown - Then cancels server tasks and stops SSL refresher - Create shutdown_task to run the shutdown handler in background - Cancel shutdown_task in finally block for cleanup This enables the single API server to respect the shutdown mode configuration, waiting for in-flight requests to complete before shutting down when shutdown_timeout>0. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Use encapsulated shutdown() methods on managers with a shared deadline instead of manually collecting processes. This ensures the total shutdown time doesn't exceed the configured timeout while eliminating direct attribute access. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Complete graceful shutdown support for run_headless() by calling engine_manager.shutdown() with the configured shutdown_timeout when shutdown is requested via signal. When shutdown_requested is set (by SIGTERM/SIGINT signal handler), the finally block now retrieves the configured wait_timeout and calls engine_manager.shutdown(timeout=timeout) to allow the engine processes time to shutdown gracefully before being forcefully terminated. This completes the shutdown mode implementation for headless mode, following the same timeout-based approach as the single API server. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Implement graceful shutdown for the engine core with configurable
timeout using a clean state machine pattern.
State Machine Design:
- Add EngineShutdownState(IntEnum) with ordered states:
* RUNNING (0): Normal operation, accepting requests
* REQUESTED (1): SIGTERM received, shutdown not yet initiated
* SHUTTING_DOWN (2): Actively draining/aborting requests
- Track state via self.shutdown_state in EngineCoreProc
Shutdown Flow:
1. Signal handler sets shutdown_state = REQUESTED (instead of immediate exit)
2. Busy loop calls _handle_shutdown() to check state and initiate shutdown
3. Abort mode: finish_requests() with FINISHED_ABORTED status
4. Wait mode: log draining message, continue processing existing requests
5. Exit when no work remaining: not has_work()
Implementation:
_handle_shutdown() -> bool:
- Returns True to continue loop, False to exit
- On REQUESTED: initiate shutdown (abort or wait mode)
- On SHUTTING_DOWN: exit when no work remains
- Logs "Shutdown initiated" and "Shutdown complete"
_reject_add_in_shutdown(request) -> bool:
- Reject ADD requests during shutdown
- Send error output to client
- Returns True if rejected
_reject_utility_in_shutdown(client_idx, call_id, method_name) -> bool:
- Reject UTILITY calls during shutdown
- Send failure message to client
- Returns True if rejected
Signal Handler:
- Set engine_core.shutdown_state = REQUESTED (non-blocking)
- Raise SystemExit immediately if engine not yet created
- No complex work in signal handler (async-signal-safe)
Busy Loop:
- Change from infinite loop to: while self._handle_shutdown()
- Clear separation: shutdown logic vs. work processing
- Explicit SystemExit on loop exit
Request Handling:
- Early return pattern for rejected requests
- ABORT requests continue to work during shutdown
- ADD/UTILITY requests rejected with appropriate errors
The IntEnum ordering enables natural state checks:
if self.shutdown_state >= EngineShutdownState.REQUESTED:
# Any shutdown state (REQUESTED or SHUTTING_DOWN)
This enables tests in test_shutdown_timeout.py to verify:
- Abort mode exits quickly without completing requests
- Wait mode completes in-flight requests before exit
- New requests rejected after SIGTERM
- Timeout enforcement via shutdown_timeout
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
a664812 to
5934a72
Compare
|
Hi @markmc, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
When the engine is idle and blocked on input_queue.get(), it cannot detect when shutdown state becomes REQUESTED after receiving SIGTERM. Pushing a WAKEUP message onto the input_queue is a straightforward way to wake up the main thread from input_queue.get(), however it is not safe to do this from a signal handler - so we do this from a separate callback thread. This commit implements an event-based solution: 1. Add WAKEUP sentinel to EngineCoreRequestType for waking idle engine 2. Add a new SignalCallback for safely triggering a callback from a handler 3. Add a signal callback that pushes WAKEUP to input_queue 4. Trigger the signal callback from the signal handler 5. Main loop wakes from blocking get(), checks shutdown_state, proceeds The watchdog thread is daemon and exits cleanly when process terminates. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Add comprehensive integration tests for shutdown modes using RemoteOpenAIServer test infrastructure for consistent server lifecycle management. Test Infrastructure: - Uses RemoteOpenAIServer for automatic server startup/cleanup - _get_child_pids: Find child processes recursively - _assert_children_cleaned_up: Verify process cleanup - ShutdownState: Track request outcomes during shutdown - _concurrent_request_loop: Generate concurrent load Test Coverage: test_wait_mode_completes_requests: - Start server with --shutdown-mode wait - Send concurrent requests in background - SIGTERM during processing - Verify in-flight requests complete successfully - Verify new requests rejected (503/500/connection errors) - Verify clean shutdown and child process cleanup test_abort_mode_exits_quickly: - Default mode (--shutdown-mode abort) - SIGTERM exits quickly without waiting - Verify fast shutdown (<10s) - Verify child process cleanup test_abort_mode_fails_inflight_requests: - Explicit abort mode - SIGTERM immediately aborts in-flight requests - Verify requests fail with errors - Verify fast shutdown test_wait_mode_with_short_timeout: - Short --shutdown-wait-timeout (3s) - Verify timeout enforcement - Server exits within timeout + overhead - Child process cleanup verified test_request_rejection_during_shutdown: - SIGTERM starts shutdown - New requests rejected with errors - Verifies engine core rejection logic test_multi_api_server_shutdown: - Launch with --api-server-count 2 - Concurrent requests across multiple servers - SIGTERM propagates to all children - All API servers and engine cores terminate Edge Cases Tested: - Default vs explicit mode - Short vs long timeouts - Multiple API servers - Request rejection timing - Child process cleanup This test suite validates the complete shutdown flow from HTTP layer through engine coordination to clean process termination. Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Will Eaton <weaton@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
When requests are rejected due to server shutdown or fail during pre-processing, use FinishReason.ABORT instead of ERROR. This better reflects that these failures are due to administrative action rather than actual processing errors. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
5934a72 to
88a65db
Compare
|
@njhill
This is a standalone PR against main. It will work independently of the shutdown timeout relanding here, but will compose cleanly with it. Once #36666 will be merged, the same mechanism will apply when the timeout |
I'd prefer merge this now as is, so we can get back to the baseline of what was previously working and merged - I'll note Nick's suggested improvements in #24885 |
…s to finish (vllm-project#36666) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
…s to finish (vllm-project#36666) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
…s to finish (vllm-project#36666) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
…s to finish (vllm-project#36666) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
…s to finish (vllm-project#36666) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
Re-apply #34730 and #36270 which were reverted by #36628 because of
test_external_lb_dp.pyfailures inDistributed Tests (4 GPUs)Relates to #24885 and supersedes a subset of #32420
🤖 Co-authored with Claude Code
Co-authored-by: @njhill
Overview
This PR adds support for a shutdown timeout in the API server and engine core, allowing controlled handling of a SIGTERM signal - either immediate termination by default (
--shutdown-timeout=0) or wait for in-flight requests to finish (--shutdown-timeout=SECONDS).Background
Previously, SIGTERM to the API server or multi-API server launcher would cause immediate process termination without coordinating request handling. This PR addresses that by:
--shutdown-timeout:--shutdown-timeoutseconds)Scope and Constraints
This PR is intentionally scoped to the API server use case with SIGTERM handling. The following are explicitly out of scope:
LLMorAsyncLLMusage (though shutdown() methods are added for future use)Implementation Approach
Core Mechanism: EngineShutdownState State Machine
The implementation uses a separate
EngineShutdownStateenum to manage shutdown lifecycle independently from pause state:Why a separate state machine:
How it works:
engine_core.shutdown_state = EngineShutdownState.REQUESTED_handle_shutdown()method checks state each iteration:RUNNING: Continue normal operationREQUESTED: Abort/wait based on config, transition toSHUTTING_DOWN, reject new requestsSHUTTING_DOWN: Exit when no work remaining_handle_client_request()calls_reject_add_in_shutdown()and_reject_utility_in_shutdown()which check shutdown state and reject new workrun_busy_loop()exits when_handle_shutdown()returnsFalseIdle Engine Wake-up
When the engine is idle and blocked on
input_queue.get(), it cannot detect shutdown state changes. To deal with this, we:Shutdown Flow
Single API Server:
Multi-API Server Launcher:
Headless Mode: