Skip to content

[Core][AMD] Propagate shutdown timeout to MultiprocExecutor#43154

Open
rjrock wants to merge 12 commits into
vllm-project:mainfrom
rjrock:rocprof-worker-shutdown
Open

[Core][AMD] Propagate shutdown timeout to MultiprocExecutor#43154
rjrock wants to merge 12 commits into
vllm-project:mainfrom
rjrock:rocprof-worker-shutdown

Conversation

@rjrock
Copy link
Copy Markdown
Contributor

@rjrock rjrock commented May 19, 2026

Purpose

rocprofv3 requires a grace period during process shutdown in order to emit trace data. This PR adds the environment variable VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS that sets a shutdown grace period for worker processes of MultiProcExecutor. The env var is also passed to the engine manager shutdown.

Previously, running a command like the below would fail.

rocprofv3 \
  --disable-signal-handlers \
  --output-format pftrace \
  -r -- \
    vllm \
      bench throughput \
      --shutdown-timeout 60 \
      --model Qwen/Qwen3-32B \
      --num-prompts=1 \
      --tensor-parallel-size 2

Similarly, any rocprofv3 trace command that took longer than the 4 second shutdown period in multiproc_executor.py::_ensure_worker_termination would fail.

With this change merged, a successful run would look like the below.

export VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS=120
rocprofv3 \
  --disable-signal-handlers \
  --output-format pftrace \
  -r -- \
    vllm \
      bench throughput \
      --shutdown-timeout 60 \
      --model Qwen/Qwen3-32B \
      --num-prompts=1 \
      --tensor-parallel-size 2

Test Plan

  • pytest tests/v1/executor/test_executor.py::test_multiproc_executor_worker_termination_timeout
  • pytest -s -v tests/v1/engine/test_core_engine_actor_manager.py::test_background_resources_passes_worker_shutdown_timeout

Test Result

Success


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mergify mergify Bot added rocm Related to AMD ROCm v1 labels May 19, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 19, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a configurable shutdown timeout for the V1 engine and multiprocess executor. It adds a shutdown_timeout attribute to BackgroundResources and updates the MultiprocExecutor to use this value, ensuring a minimum grace period during worker termination. A review comment correctly identified a potential TypeError in multiproc_executor.py that could occur if the timeout configuration is None, suggesting a default value to prevent the crash.

Comment thread vllm/v1/executor/multiproc_executor.py Outdated
@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from a947bd5 to d895571 Compare May 20, 2026 17:50
@rjrock
Copy link
Copy Markdown
Contributor Author

rjrock commented May 20, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a configurable shutdown timeout for the MultiprocExecutor in the V1 engine. Changes include adding a shutdown_timeout field to BackgroundResources, passing this value to the engine manager during shutdown, and updating MultiprocExecutor to use the configured timeout with a 4-second minimum. Unit tests were added to verify worker termination behavior. Feedback points out a potential TypeError in MultiprocExecutor if the shutdown_timeout is None and provides a suggestion to handle this case safely.

Comment thread vllm/v1/executor/multiproc_executor.py Outdated
@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from 1a048aa to dbb1bf8 Compare May 20, 2026 18:18
@rjrock rjrock marked this pull request as ready for review May 20, 2026 18:36
@rjrock rjrock requested a review from njhill as a code owner May 20, 2026 18:36
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 20, 2026

Hi @rjrock, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from dbb1bf8 to eaf54b2 Compare May 20, 2026 19:33
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 1, 2026
@AndreasKaratzas
Copy link
Copy Markdown
Member

cc @njhill PTAL

Copy link
Copy Markdown
Collaborator

@dllehr-amd dllehr-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take a quick peak at my note? I'm trying to confirm that we won't negatively impact the default operation mode if we don't set the time ourselves.

Comment thread vllm/v1/engine/core_client.py Outdated
# when the client is garbage collected, even if an
# exception is raised mid-construction.
self.resources = BackgroundResources(ctx=sync_ctx)
self.resources = BackgroundResources(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjrock Can you confirm here that we won't change the default behavior? If we're always passing in the vllm_config.shutdown_timeout. Will this change the shutdown_timeout to always be non-None?

I'm worried this may cause an issue if the vllm_config's values are set to 0 (https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L376)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shutdown_timeout will always be non-None, since this is the only constructor call to BackgroundResources. I added | None = None to match the other attributes.

shutdown_timeout did not previously exist for BackgroundResources. I had to add it such that the CLI option affects

vllm/vllm/v1/utils.py

Lines 563 to 580 in 266b9d9

if timeout is None:
# Keep a small grace period for best-effort cleanup paths that do not
# have a user-configured shutdown timeout.
timeout = 5.0
# Shutdown the process.
for proc in procs:
if proc.is_alive():
proc.terminate()
# Allow time for remaining procs to terminate.
deadline = time.monotonic() + timeout
for proc in procs:
remaining = deadline - time.monotonic()
if remaining <= 0:
break
if proc.is_alive():
proc.join(remaining)
via self.engine_manager.shutdown.

So the default behavior was for shutdown_timeout to be None and then 5 in v1.utils.shutdown.

Copy link
Copy Markdown
Contributor Author

@rjrock rjrock Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit this originally branched from had a max call such that whether timeout was 0 or None didn't matter:

vllm/vllm/v1/utils.py

Lines 335 to 339 in 2a43b40

if timeout is None:
timeout = 0.0
# Allow at least 5 seconds for remaining procs to terminate.
timeout = max(timeout, 5.0)

Looks like that max call was removed by @AndreasKaratzas , though, #43016.

So these changes would affect the new default, since timeout in v1.utils.shutdown would become 0.

rjrock and others added 5 commits June 1, 2026 15:52
rocprofv3 requires a grace period during process shutdown in
order to emit trace data.

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
This reverts commit c20b9a8.

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@rjrock rjrock force-pushed the rocprof-worker-shutdown branch from eaf54b2 to 0a59310 Compare June 1, 2026 20:52
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@rjrock
Copy link
Copy Markdown
Contributor Author

rjrock commented Jun 1, 2026

Added a max call to BackgroundResources to maintain the previous behavior.

@rjrock rjrock requested a review from dllehr-amd June 1, 2026 21:47
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@njhill
Copy link
Copy Markdown
Member

njhill commented Jun 3, 2026

Thanks @rjrock. The shutdown_timeout option in the config is for a global graceful shutdown where we wait for in-fight requests to complete rather than immediately aborting them.

So I'm not sure we should use that value here. By the time we are shutting down the executor we are in tear-down mode and the 4 second timeout is just to allow the resources to be released/process to exit cleanly. Perhaps for this purpose it would be better to just add a new VLLM_WORKER_SHUTDOWN_TIMEOUT env var in envs.py?

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Jun 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rjrock.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 4, 2026
rjrock added 4 commits June 4, 2026 19:48
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@mergify mergify Bot removed the needs-rebase label Jun 5, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@rjrock
Copy link
Copy Markdown
Contributor Author

rjrock commented Jun 6, 2026

Thanks @rjrock. The shutdown_timeout option in the config is for a global graceful shutdown where we wait for in-fight requests to complete rather than immediately aborting them.

So I'm not sure we should use that value here. By the time we are shutting down the executor we are in tear-down mode and the 4 second timeout is just to allow the resources to be released/process to exit cleanly. Perhaps for this purpose it would be better to just add a new VLLM_WORKER_SHUTDOWN_TIMEOUT env var in envs.py?

That makes sense. I rewrote it to use the env var VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS. Please take another look when you get a chance, @njhill.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

4 participants