
[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish#36666

Merged
njhill merged 13 commits into vllm-project:main from markmc:shutdown-timeout-v2
Mar 13, 2026

Conversation

@markmc
Member

@markmc markmc commented Mar 10, 2026

Re-apply #34730 and #36270, which were reverted by #36628 because of test_external_lb_dp.py failures in Distributed Tests (4 GPUs).


Relates to #24885 and supersedes a subset of #32420

🤖 Co-authored with Claude Code
Co-authored-by: @njhill

Overview

This PR adds support for a shutdown timeout in the API server and engine core, allowing controlled handling of a SIGTERM signal: either immediate termination (the default, --shutdown-timeout=0) or waiting for in-flight requests to finish (--shutdown-timeout=SECONDS).

Background

Previously, SIGTERM to the API server or multi-API server launcher would cause immediate process termination without coordinating request handling. This PR addresses that by:

  1. Rejecting new requests when shutdown is initiated (not queuing them like pause does)
  2. Handling in-flight requests according to --shutdown-timeout:
    • Default, no timeout - abort all in-flight requests immediately
    • With timeout - Wait for in-flight requests to complete (up to --shutdown-timeout seconds)
  3. Using a separate shutdown state machine in the engine core to manage the shutdown lifecycle
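
The default-vs-timeout behavior above can be sketched as a small helper (illustrative only: `has_work()` and the polling loop are stand-ins for the engine core's actual bookkeeping, not the PR's implementation):

```python
import time

def wait_for_inflight(has_work, shutdown_timeout: float,
                      poll: float = 0.01) -> bool:
    """Return True if all in-flight work finished before the deadline,
    False if the timeout expired (remaining requests would then be aborted)."""
    if shutdown_timeout == 0:
        # Default behavior: no waiting, abort whatever is in flight.
        return not has_work()
    deadline = time.monotonic() + shutdown_timeout
    while has_work():
        if time.monotonic() >= deadline:
            return False  # timeout expired with work still outstanding
        time.sleep(poll)
    return True
```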

Scope and Constraints

This PR is intentionally scoped to the API server use case with SIGTERM handling. The following are explicitly out of scope:

  • KV transfers: unlike [Frontend] Enable drain shutdown mode for non-DP deployments #32420, prefill instances are not (yet) waiting for KV transfers to complete before shutting down
  • Library use case: No changes to direct LLM or AsyncLLM usage (though shutdown() methods are added for future use)
  • ScalingMiddleware reuse: Request rejection happens at engine core level, not middleware
  • Signal handling redesign: No new shutdown pipe mechanism; minimal changes to existing handlers
  • Ctrl-C behavior: Process groups and Ctrl-C handling remain unchanged; only SIGTERM is handled
  • /health endpoint changes: No new /live endpoint or changes to health check behavior
  • External/hybrid load balancer validation: Assumes parallel SIGTERM to multiple API servers works (untested, similar to pause behavior)

Implementation Approach

Core Mechanism: EngineShutdownState State Machine

The implementation uses a separate EngineShutdownState enum to manage shutdown lifecycle independently from pause state:

class EngineShutdownState(IntEnum):
    RUNNING = 0
    REQUESTED = 1
    SHUTTING_DOWN = 2

Why a separate state machine:

  • Shutdown has different semantics than pause (permanent vs temporary)
  • Shutdown needs request rejection, pause needs request queuing
  • Different lifecycle: pause can be resumed, shutdown cannot
  • Cleaner separation of concerns

How it works:

  1. SIGTERM signal handler sets engine_core.shutdown_state = EngineShutdownState.REQUESTED
  2. Engine core's _handle_shutdown() method checks state each iteration:
    • RUNNING: Continue normal operation
    • REQUESTED: Abort/wait based on config, transition to SHUTTING_DOWN, reject new requests
    • SHUTTING_DOWN: Exit when no work remaining
  3. Engine core's _handle_client_request() calls _reject_add_in_shutdown() and _reject_utility_in_shutdown() which check shutdown state and reject new work
  4. Engine core's run_busy_loop() exits when _handle_shutdown() returns False

Idle Engine Wake-up

When the engine is idle and blocked on input_queue.get(), it cannot detect shutdown state changes. To deal with this, we:

  1. Add WAKEUP sentinel to EngineCoreRequestType for waking idle engine
  2. Add a new signal-context-safe SignalCallback to allow triggering a callback from a signal handler
  3. Add a signal callback that pushes WAKEUP to input_queue
  4. Trigger the signal callback from the signal handler
  5. Main loop wakes from blocking get(), checks shutdown_state, proceeds
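
The steps above can be sketched roughly as follows (a hypothetical reconstruction: `WAKEUP`, the queue, and the helper names are illustrative, not vLLM's actual identifiers):

```python
import queue
import signal
import threading

WAKEUP = object()  # stands in for the WAKEUP sentinel request type
input_queue: queue.Queue = queue.Queue()

wake_event = threading.Event()

def _wakeup_worker() -> None:
    # Runs in a daemon thread: wait for the flag, then do the queue push.
    # Queue.put() involves locks, so it is kept out of the signal handler.
    wake_event.wait()
    input_queue.put(WAKEUP)  # unblocks the main loop's input_queue.get()

threading.Thread(target=_wakeup_worker, daemon=True).start()

def _sigterm_handler(signum, frame) -> None:
    # Minimal work in the handler: just flip the event.
    wake_event.set()

if threading.current_thread() is threading.main_thread():
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGTERM, _sigterm_handler)
```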

Shutdown Flow

Single API Server:

SIGTERM → signal_handler() sets shutdown_event
       → _handle_shutdown() coroutine wakes up
       → Retrieves timeout from vllm_config.shutdown_config.wait_timeout
       → Calls engine_client.shutdown(timeout=timeout)
       → Shutdown propagates through client hierarchy:
          - AsyncLLM.shutdown() → EngineCoreClient.shutdown()
          - MPClient detaches finalizer, calls engine_manager.shutdown()
          - CoreEngineProcManager.shutdown() terminates/waits for processes
       → Then cancels server tasks and stops SSL refresher
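
The single-server flow can be condensed into an asyncio sketch (hedged: `handle_shutdown`, `engine_client`, and `server_task` are illustrative stand-ins for the launcher's real objects):

```python
import asyncio

async def handle_shutdown(shutdown_event: asyncio.Event,
                          engine_client,
                          server_task: asyncio.Task,
                          timeout: float) -> None:
    await shutdown_event.wait()                    # set by the SIGTERM handler
    await engine_client.shutdown(timeout=timeout)  # drain/abort in the engine
    server_task.cancel()                           # then tear down the server
```

In the launcher itself the timeout comes from vllm_config.shutdown_config.wait_timeout, and the SSL refresher is stopped alongside the server tasks.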

Multi-API Server Launcher:

SIGTERM → signal_handler() sets shutdown_requested flag, raises SystemExit
       → finally block executes
       → Computes shared deadline from configured wait_timeout
       → Calls shutdown(deadline=deadline) on all managers:
          - api_server_manager.shutdown(deadline=deadline)
          - local_engine_manager.shutdown(deadline=deadline) if present
          - coordinator.shutdown(deadline=deadline) if present
       → Each manager terminates/waits for processes within deadline
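
The shared-deadline computation can be sketched as (illustrative helper names; the real managers accept a deadline argument and do their own process waiting):

```python
import time

def compute_deadline(wait_timeout: float) -> float:
    """Absolute deadline, so shutting down several managers in sequence
    never exceeds the configured total timeout."""
    return time.monotonic() + wait_timeout

def remaining(deadline: float) -> float:
    """Time budget left for the next manager's shutdown (never negative)."""
    return max(0.0, deadline - time.monotonic())
```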

Headless Mode:

SIGTERM → signal_handler() sets shutdown_requested flag, raises SystemExit
       → finally block executes
       → Retrieves timeout from vllm_config.shutdown_config.wait_timeout
       → Calls engine_manager.shutdown(timeout=timeout)
       → Process manager terminates/waits for processes with timeout

@markmc
Member Author

markmc commented Mar 10, 2026

I expect Distributed Tests (4 GPUs) to fail - triggering tests just to re-establish that as a baseline

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a shutdown timeout feature to vLLM, allowing for a grace period to complete in-flight requests before shutting down. The changes include adding a shutdown_timeout parameter to the VllmConfig and EngineArgs, modifying the signal handlers to initiate a graceful shutdown, and updating the process management to respect the timeout. The close() methods were renamed to shutdown().

@njhill
Member

njhill commented Mar 10, 2026

@markmc I wanted to note a couple of follow-on things I'd noticed, probably not directly related to the 4-GPU test breakages but thought I would record here anyhow:

  • I realized that in the timeout case, we don't abort requests when the timeout has expired. It would be nice if we did that consistently whether it's 0 or some larger number, and in all cases allowed a little additional time for those abort responses to propagate
  • Related, I'm not sure, even in the existing abort case, whether we properly allow for those abort responses to propagate before things are shut down. I need to look more closely, but in particular I was wondering whether we wait until the output processing thread has finished processing work in the output queue before continuing to exit the engine core proc.

Don't necessarily need to address these as part of the re-added PR of course.

@wojciech-wais

@markmc @njhill
I'd like to help with this relanding effort. I was the co-author on the original #34730 and have been working on the related shutdown improvements in the RFC #24885 space (#36258).

On the 4-GPU test failure (#36624):

From the bisection in #36624, the failure is a race condition in run_engine_core(): with 4 API servers per DP rank, EngineCore_DP1's worker hits BrokenPipeError because the parent exits before initialization completes. I suspect the SignalCallback thread + shutdown_state enum replacement of the simple signal handler changes process lifecycle timing enough to expose this race under heavier initialization load. The [1] variant (single API server) passes because initialization is lighter.

A few things I could investigate:

  1. Narrow the culprit in run_engine_core(): the original PR changed both the signal handler mechanism (SignalCallback thread) and added os.setpgrp(). We already confirmed that os.setpgrp() alone is risky, since it can orphan engine cores if the parent crashes. I can test whether keeping the simpler signal handler while still adding the shutdown state machine avoids the race.

  2. Address @njhill's follow-on points: abort requests when the timeout expires and allow abort responses to propagate before the engine core exits. I can look into whether the output processing thread drains its queue before run_busy_loop() exits.

  3. Our /live probe PR ([Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs #36258) depends on this landing: it adds app.state.draining = True in the signal handler and a /live endpoint that stays 200 during drain. Happy to rebase it on top of this once the test failure is resolved.

I don't have a 4-GPU setup to reproduce test_external_lb_dp[4] locally, but I can do code analysis and prepare fixes for review.
What do you think about addressing (1) and (2)?

@njhill
Member

njhill commented Mar 12, 2026

Thanks @wojciech-wais for looking at this... I will defer to @markmc because I don't know of his plan or whether he's already started on this, but would be great if we can confirm the issue and incorporate a clean fix in this PR. I am interested and will try to have a closer look today too.

The follow-on points could also be fixed here or handled in a follow-on PR...

@njhill
Member

njhill commented Mar 13, 2026

@markmc fyi I separately found/fixed this issue #36950; I thought there was a chance it could be implicated in the problem here. I didn't test that fix with this PR yet though.

wojciech-wais added a commit to wojciech-wais/vllm that referenced this pull request Mar 13, 2026
When the engine core receives SIGTERM/SIGINT, abort all in-flight
requests via the scheduler and send abort outputs to clients before
exiting. Then signal the output thread to drain all pending messages
(including abort responses) by pushing ENGINE_CORE_DEAD sentinel and
joining the thread with a 5s timeout.

Previously, the engine core would exit immediately on signal without
notifying clients that their requests were aborted. This left clients
hanging with no response. Now clients receive proper abort responses
before the process terminates.

Addresses follow-up points from njhill on PR vllm-project#36666:
1. Abort requests when shutdown timeout expires
2. Wait for output thread to drain before exiting

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
markmc and others added 5 commits March 13, 2026 03:34
Add _send_error_output() helper method to consolidate error output creation.

Model it on existing _send_abort_outputs()

Refactor _handle_request_preproc_error() to use the new helper.

This is a pure refactoring with no behavior change, preparing for shutdown
request rejection logic.

Note: The original code set engine_index=self.engine_index when creating
EngineCoreOutputs, but this is redundant. The process_output_sockets()
thread overwrites engine_index when dequeuing outputs from output_queue,
so setting it during construction is pointless. The new helper follows the
existing pattern from _send_abort_outputs() and the scheduler which also
omit engine_index during construction.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Add shutdown_timeout parameter. 0 = abort immediately, >0 = wait.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Add --shutdown-timeout CLI argument to EngineArgs.

The arguments exist but don't affect behavior yet.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Add shutdown(timeout) methods to CoreEngineProcManager,
APIServerProcessManager, and DPCoordinator to encapsulate process
shutdown logic within manager classes.

This eliminates the need for callers to directly access .processes
or .proc attributes while maintaining backward compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Add shutdown() method with timeout support through the engine client
hierarchy to enable graceful shutdown coordination.

Changes:
- Add abstract shutdown() method to EngineClient protocol
- Update AsyncLLM.shutdown() to accept optional timeout parameter
- Update EngineCoreClient abstract method and all implementations:
  - InprocClient: Accept but ignore timeout (no processes to terminate)
  - MPClient: Detach finalizer, call engine_manager.shutdown(), clean up resources

The finalizer must be detached in MPClient to prevent automatic cleanup
during garbage collection. This allows explicit shutdown with configurable
timeout, followed by manual resource cleanup.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc added 4 commits March 13, 2026 03:34
Restructure signal handling to support graceful shutdown with
configurable timeout for the single API server launcher.

Changes:
- Replace immediate task cancellation in signal_handler with
  shutdown_event.set() to trigger graceful shutdown
- Add _handle_shutdown() coroutine that:
  - Waits for shutdown signal
  - Retrieves shutdown timeout from vllm_config
  - Calls engine_client.shutdown(timeout=timeout) for graceful engine shutdown
  - Then cancels server tasks and stops SSL refresher
- Create shutdown_task to run the shutdown handler in background
- Cancel shutdown_task in finally block for cleanup

This enables the single API server to respect the shutdown mode
configuration, waiting for in-flight requests to complete before
shutting down when shutdown_timeout>0.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Use encapsulated shutdown() methods on managers with a shared deadline
instead of manually collecting processes. This ensures the total shutdown
time doesn't exceed the configured timeout while eliminating direct
attribute access.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Complete graceful shutdown support for run_headless() by calling
engine_manager.shutdown() with the configured shutdown_timeout when
shutdown is requested via signal.

When shutdown_requested is set (by SIGTERM/SIGINT signal handler),
the finally block now retrieves the configured wait_timeout and
calls engine_manager.shutdown(timeout=timeout) to allow the engine
processes time to shutdown gracefully before being forcefully
terminated.

This completes the shutdown mode implementation for headless mode,
following the same timeout-based approach as the single API server.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Implement graceful shutdown for the engine core with configurable
timeout using a clean state machine pattern.

State Machine Design:
- Add EngineShutdownState(IntEnum) with ordered states:
  * RUNNING (0): Normal operation, accepting requests
  * REQUESTED (1): SIGTERM received, shutdown not yet initiated
  * SHUTTING_DOWN (2): Actively draining/aborting requests
- Track state via self.shutdown_state in EngineCoreProc

Shutdown Flow:
1. Signal handler sets shutdown_state = REQUESTED (instead of immediate exit)
2. Busy loop calls _handle_shutdown() to check state and initiate shutdown
3. Abort mode: finish_requests() with FINISHED_ABORTED status
4. Wait mode: log draining message, continue processing existing requests
5. Exit when no work remaining: not has_work()

Implementation:

_handle_shutdown() -> bool:
- Returns True to continue loop, False to exit
- On REQUESTED: initiate shutdown (abort or wait mode)
- On SHUTTING_DOWN: exit when no work remains
- Logs "Shutdown initiated" and "Shutdown complete"

_reject_add_in_shutdown(request) -> bool:
- Reject ADD requests during shutdown
- Send error output to client
- Returns True if rejected

_reject_utility_in_shutdown(client_idx, call_id, method_name) -> bool:
- Reject UTILITY calls during shutdown
- Send failure message to client
- Returns True if rejected

Signal Handler:
- Set engine_core.shutdown_state = REQUESTED (non-blocking)
- Raise SystemExit immediately if engine not yet created
- No complex work in signal handler (async-signal-safe)

Busy Loop:
- Change from infinite loop to: while self._handle_shutdown()
- Clear separation: shutdown logic vs. work processing
- Explicit SystemExit on loop exit

Request Handling:
- Early return pattern for rejected requests
- ABORT requests continue to work during shutdown
- ADD/UTILITY requests rejected with appropriate errors

The IntEnum ordering enables natural state checks:
  if self.shutdown_state >= EngineShutdownState.REQUESTED:
      # Any shutdown state (REQUESTED or SHUTTING_DOWN)

This enables tests in test_shutdown_timeout.py to verify:
- Abort mode exits quickly without completing requests
- Wait mode completes in-flight requests before exit
- New requests rejected after SIGTERM
- Timeout enforcement via shutdown_timeout

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the shutdown-timeout-v2 branch from a664812 to 5934a72 on March 13, 2026 at 07:34
@markmc
Member Author

markmc commented Mar 13, 2026

@markmc fyi I separately found/fixed this issue #36950; I thought there was a chance it could be implicated in the problem here.

Thanks, let's find out!

@mergify

mergify bot commented Mar 13, 2026

Hi @markmc, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

markmc and others added 4 commits March 13, 2026 03:48
When the engine is idle and blocked on input_queue.get(), it cannot
detect when shutdown state becomes REQUESTED after receiving SIGTERM.

Pushing a WAKEUP message onto the input_queue is a straightforward
way to wake up the main thread from input_queue.get(), however it
is not safe to do this from a signal handler - so we do this from
a separate callback thread.

This commit implements an event-based solution:

1. Add WAKEUP sentinel to EngineCoreRequestType for waking idle engine
2. Add a new SignalCallback for safely triggering a callback from a handler
3. Add a signal callback that pushes WAKEUP to input_queue
4. Trigger the signal callback from the signal handler
5. Main loop wakes from blocking get(), checks shutdown_state, proceeds

The watchdog thread is daemon and exits cleanly when process terminates.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Add comprehensive integration tests for shutdown modes using
RemoteOpenAIServer test infrastructure for consistent server lifecycle
management.

Test Infrastructure:
- Uses RemoteOpenAIServer for automatic server startup/cleanup
- _get_child_pids: Find child processes recursively
- _assert_children_cleaned_up: Verify process cleanup
- ShutdownState: Track request outcomes during shutdown
- _concurrent_request_loop: Generate concurrent load

Test Coverage:

test_wait_mode_completes_requests:
- Start server with --shutdown-mode wait
- Send concurrent requests in background
- SIGTERM during processing
- Verify in-flight requests complete successfully
- Verify new requests rejected (503/500/connection errors)
- Verify clean shutdown and child process cleanup

test_abort_mode_exits_quickly:
- Default mode (--shutdown-mode abort)
- SIGTERM exits quickly without waiting
- Verify fast shutdown (<10s)
- Verify child process cleanup

test_abort_mode_fails_inflight_requests:
- Explicit abort mode
- SIGTERM immediately aborts in-flight requests
- Verify requests fail with errors
- Verify fast shutdown

test_wait_mode_with_short_timeout:
- Short --shutdown-wait-timeout (3s)
- Verify timeout enforcement
- Server exits within timeout + overhead
- Child process cleanup verified

test_request_rejection_during_shutdown:
- SIGTERM starts shutdown
- New requests rejected with errors
- Verifies engine core rejection logic

test_multi_api_server_shutdown:
- Launch with --api-server-count 2
- Concurrent requests across multiple servers
- SIGTERM propagates to all children
- All API servers and engine cores terminate

Edge Cases Tested:
- Default vs explicit mode
- Short vs long timeouts
- Multiple API servers
- Request rejection timing
- Child process cleanup

This test suite validates the complete shutdown flow from HTTP layer
through engine coordination to clean process termination.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Will Eaton <weaton@redhat.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
When requests are rejected due to server shutdown or fail during
pre-processing, use FinishReason.ABORT instead of ERROR. This better
reflects that these failures are due to administrative action rather
than actual processing errors.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@wojciech-wais

@njhill
Following up on your two points, I've opened #36964, which addresses both of them:

  1. Abort requests on shutdown
    When the engine core receives SIGTERM/SIGINT, all in-flight requests are now explicitly aborted via scheduler.finish_requests(None, FINISHED_ABORTED) and abort outputs are sent to clients (reusing the existing _send_abort_outputs() from the pause path).

  2. Drain output thread before exit
    After sending abort outputs, the ENGINE_CORE_DEAD sentinel is pushed and the output thread is joined with a 5s timeout, ensuring all pending messages (including abort responses) are flushed through ZMQ before the process terminates.

This is a standalone PR against main. It works independently of the shutdown timeout relanding here, but composes cleanly with it. Once #36666 is merged, the same mechanism will apply when the timeout expires (abort + drain).

@markmc
Member Author

markmc commented Mar 13, 2026

@markmc fyi I separately found/fixed this issue #36950; I thought there was a chance it could be implicated in the problem here.

Thanks, let's find out!

Distributed Tests (4 GPUs) is indeed passing now

I'd prefer to merge this now as is, so we can get back to the baseline of what was previously working and merged. I'll note Nick's suggested improvements in #24885.

@njhill njhill merged commit 7afe0fa into vllm-project:main Mar 13, 2026
50 checks passed
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 15, 2026
…s to finish (vllm-project#36666)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
…s to finish (vllm-project#36666)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…s to finish (vllm-project#36666)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…s to finish (vllm-project#36666)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…s to finish (vllm-project#36666)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed v1
