[Core][AMD] Propagate shutdown timeout to MultiprocExecutor by rjrock · Pull Request #43154 · vllm-project/vllm

rjrock · 2026-05-19T22:10:13Z

Purpose

rocprofv3 requires a grace period during process shutdown in order to emit trace data. This PR adds the environment variable VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS, which sets a shutdown grace period for the worker processes of MultiprocExecutor. The env var is also passed to the engine manager shutdown.

Previously, running a command like the one below would fail.

rocprofv3 \
  --disable-signal-handlers \
  --output-format pftrace \
  -r -- \
    vllm \
      bench throughput \
      --shutdown-timeout 60 \
      --model Qwen/Qwen3-32B \
      --num-prompts=1 \
      --tensor-parallel-size 2

Similarly, any rocprofv3 trace command that took longer than the 4-second shutdown period in multiproc_executor.py::_ensure_worker_termination would fail.

With this change merged, a successful run would look like the one below.

export VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS=120
rocprofv3 \
  --disable-signal-handlers \
  --output-format pftrace \
  -r -- \
    vllm \
      bench throughput \
      --shutdown-timeout 60 \
      --model Qwen/Qwen3-32B \
      --num-prompts=1 \
      --tensor-parallel-size 2
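
For readers unfamiliar with how an environment variable like this can feed a shutdown grace period, here is a minimal sketch. It is not the code from this PR: the helper name `worker_shutdown_timeout`, the fallback to 4 seconds when the variable is unset, and the rejection of invalid or negative values are all assumptions made for illustration. Only the variable name and the 4-second figure come from the description above.

```python
import os

# Variable name taken from the PR description.
ENV_VAR = "VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS"
# The existing grace period the description attributes to
# _ensure_worker_termination; using it as the fallback is an assumption.
DEFAULT_TIMEOUT_SECONDS = 4.0


def worker_shutdown_timeout(environ=None) -> float:
    """Hypothetical helper: return the worker shutdown grace period in seconds."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        # Variable not set: keep the short default grace period.
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(
            f"{ENV_VAR} must be a number of seconds, got {raw!r}"
        ) from None
    if value < 0:
        raise ValueError(f"{ENV_VAR} must be non-negative, got {value}")
    return value
```

With `VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS=120` exported as in the command above, this helper would return `120.0`, which a caller could then use as the time to wait for workers before force-killing them.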

Test Plan

pytest tests/v1/executor/test_executor.py::test_multiproc_executor_worker_termination_timeout
pytest -s -v tests/v1/engine/test_core_engine_actor_manager.py::test_background_resources_passes_worker_shutdown_timeout

Test Result

Success

gemini-code-assist

Code Review

This pull request implements a configurable shutdown timeout for the V1 engine and multiprocess executor. It adds a shutdown_timeout attribute to BackgroundResources and updates the MultiprocExecutor to use this value, ensuring a minimum grace period during worker termination. A review comment correctly identified a potential TypeError in multiproc_executor.py that could occur if the timeout configuration is None, suggesting a default value to prevent the crash.

rjrock · 2026-05-20T18:01:23Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a configurable shutdown timeout for the MultiprocExecutor in the V1 engine. Changes include adding a shutdown_timeout field to BackgroundResources, passing this value to the engine manager during shutdown, and updating MultiprocExecutor to use the configured timeout with a 4-second minimum. Unit tests were added to verify worker termination behavior. Feedback points out a potential TypeError in MultiprocExecutor if the shutdown_timeout is None and provides a suggestion to handle this case safely.
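
The TypeError mentioned in the review is the one Python raises when `max` has to order `None` against a number. The sketch below illustrates the failure and one way to guard against it. The helper name `effective_timeout` is hypothetical, and the code is an illustration of the general pattern rather than the suggestion the bot actually posted; the 4-second minimum is taken from the review summary.

```python
from typing import Optional

# Minimum grace period mentioned in the review summary.
MIN_TIMEOUT_SECONDS = 4.0


def effective_timeout(shutdown_timeout: Optional[float]) -> float:
    """Hypothetical helper: clamp an optional timeout to the minimum."""
    if shutdown_timeout is None:
        # No configured value: fall back to the minimum instead of
        # passing None into max().
        return MIN_TIMEOUT_SECONDS
    return max(shutdown_timeout, MIN_TIMEOUT_SECONDS)


# The unguarded form fails because None and float cannot be ordered.
try:
    max(None, MIN_TIMEOUT_SECONDS)
except TypeError as exc:
    unguarded_error = type(exc).__name__
```

An explicit `is None` check is used rather than `shutdown_timeout or MIN_TIMEOUT_SECONDS` so that the handling of `None` stays separate from the handling of `0`, a distinction that matters later in this thread.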

mergify · 2026-05-20T18:41:47Z

Hi @rjrock, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

AndreasKaratzas · 2026-06-01T18:51:38Z

cc @njhill PTAL

dllehr-amd

Can you take a quick peek at my note? I'm trying to confirm that we won't negatively impact the default operation mode if we don't set the time ourselves.

dllehr-amd · 2026-06-01T19:33:59Z

        # when the client is garbage collected, even if an
        # exception is raised mid-construction.
-        self.resources = BackgroundResources(ctx=sync_ctx)
+        self.resources = BackgroundResources(


@rjrock Can you confirm that this won't change the default behavior? If we always pass in vllm_config.shutdown_timeout, will shutdown_timeout always be non-None?

I'm worried this may cause an issue if the vllm_config's values are set to 0 (https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L376)

rjrock · Jun 1, 2026

shutdown_timeout will always be non-None, since this is the only constructor call to BackgroundResources. I added | None = None to match the other attributes.

shutdown_timeout did not previously exist on BackgroundResources. I had to add it so that the CLI option reaches the following code via self.engine_manager.shutdown:

vllm/vllm/v1/utils.py

Lines 563 to 580 in 266b9d9

if timeout is None:
    # Keep a small grace period for best-effort cleanup paths that do not
    # have a user-configured shutdown timeout.
    timeout = 5.0

# Shutdown the process.
for proc in procs:
    if proc.is_alive():
        proc.terminate()

# Allow time for remaining procs to terminate.
deadline = time.monotonic() + timeout
for proc in procs:
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        break
    if proc.is_alive():
        proc.join(remaining)

So by default shutdown_timeout was None, which then became 5 seconds in v1.utils.shutdown.

The commit this originally branched from had a max call such that whether timeout was 0 or None didn't matter:

vllm/vllm/v1/utils.py

Lines 335 to 339 in 2a43b40

if timeout is None:
    timeout = 0.0

# Allow at least 5 seconds for remaining procs to terminate.
timeout = max(timeout, 5.0)

It looks like that max call was removed by @AndreasKaratzas in #43016, though.

So these changes would affect the new default, since timeout in v1.utils.shutdown would become 0.
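
The difference between the two quoted snippets can be seen by putting them side by side. This is only an illustrative sketch: the function names `timeout_with_floor` and `timeout_default_only` are hypothetical, and the bodies mirror the quoted lines rather than the full functions in vllm/v1/utils.py.

```python
from typing import Optional


def timeout_with_floor(timeout: Optional[float]) -> float:
    """Mirrors the older snippet (2a43b40): None and 0 both become 5 seconds."""
    if timeout is None:
        timeout = 0.0
    # Allow at least 5 seconds for remaining procs to terminate.
    return max(timeout, 5.0)


def timeout_default_only(timeout: Optional[float]) -> float:
    """Mirrors the newer snippet (266b9d9): only None becomes 5 seconds."""
    if timeout is None:
        timeout = 5.0
    return timeout


# Compare the two behaviors for an unset, zero, and explicit timeout.
for value in (None, 0.0, 60.0):
    print(value, timeout_with_floor(value), timeout_default_only(value))
```

The two versions agree for `None` and for values of 5 seconds or more, and differ for anything in between. In particular, a timeout of `0` keeps a 5-second grace period under the older logic but leaves no grace period under the newer one, which is the change in default behavior being discussed.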

rjrock · 2026-06-01T21:46:02Z

Added a max call to BackgroundResources to maintain the previous behavior.

njhill · 2026-06-03T22:27:00Z

Thanks @rjrock. The shutdown_timeout option in the config is for a global graceful shutdown where we wait for in-flight requests to complete rather than immediately aborting them.

So I'm not sure we should use that value here. By the time we are shutting down the executor we are in tear-down mode and the 4 second timeout is just to allow the resources to be released/process to exit cleanly. Perhaps for this purpose it would be better to just add a new VLLM_WORKER_SHUTDOWN_TIMEOUT env var in envs.py?

mergify · 2026-06-04T18:47:19Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rjrock.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

rjrock · 2026-06-06T01:22:18Z

> Thanks @rjrock. The shutdown_timeout option in the config is for a global graceful shutdown where we wait for in-flight requests to complete rather than immediately aborting them.
>
> So I'm not sure we should use that value here. By the time we are shutting down the executor we are in tear-down mode and the 4 second timeout is just to allow the resources to be released/process to exit cleanly. Perhaps for this purpose it would be better to just add a new VLLM_WORKER_SHUTDOWN_TIMEOUT env var in envs.py?

That makes sense. I rewrote it to use the env var VLLM_WORKER_SHUTDOWN_TIMEOUT_SECONDS. Please take another look when you get a chance, @njhill.

mergify Bot added the rocm and v1 labels May 19, 2026

github-project-automation Bot added this to AMD May 19, 2026

github-project-automation Bot moved this to Todo in AMD May 19, 2026

gemini-code-assist Bot reviewed May 19, 2026

Comment thread vllm/v1/executor/multiproc_executor.py Outdated

rjrock force-pushed the rocprof-worker-shutdown branch from a947bd5 to d895571 Compare May 20, 2026 17:50

gemini-code-assist Bot reviewed May 20, 2026

Comment thread vllm/v1/executor/multiproc_executor.py Outdated

rjrock force-pushed the rocprof-worker-shutdown branch from 1a048aa to dbb1bf8 Compare May 20, 2026 18:18

rjrock marked this pull request as ready for review May 20, 2026 18:36

rjrock requested a review from njhill as a code owner May 20, 2026 18:36

rjrock force-pushed the rocprof-worker-shutdown branch from dbb1bf8 to eaf54b2 Compare May 20, 2026 19:33

jwzheng96 mentioned this pull request May 30, 2026

[Bugfix] Reject non-positive values for ParallelConfig int knobs #44057

Merged

3 tasks

AndreasKaratzas added the ready label Jun 1, 2026

dllehr-amd requested changes Jun 1, 2026

rjrock and others added 5 commits June 1, 2026 15:52

[Core][AMD] Propagate shutdown timeout to MultiprocExecutor

e528a29

rocprofv3 requires a grace period during process shutdown in order to emit trace data. Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Update vllm/v1/executor/multiproc_executor.py

759c168

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Revert "Update vllm/v1/executor/multiproc_executor.py"

c27614f

This reverts commit c20b9a8. Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Add tests

fd61043

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Condense pytest.param lines

0a59310

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

rjrock force-pushed the rocprof-worker-shutdown branch from eaf54b2 to 0a59310 Compare June 1, 2026 20:52

Set GPU worker shutdown to at least 5 seconds

309fb76

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

rjrock requested a review from dllehr-amd June 1, 2026 21:47

Fix test_engine_core_client failure

5f911de

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

mergify Bot added the needs-rebase label Jun 4, 2026

rjrock added 4 commits June 4, 2026 19:48

Revert changes

0b9052f

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Use env var instead of CLI option

fc00cf7

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Add tests

f2059ff

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

Merge branch 'main' into rocprof-worker-shutdown

a1f12ae

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

mergify Bot removed the needs-rebase label Jun 5, 2026

Remove newline from merge

f1dd9a1

Signed-off-by: Ryan Rock <ryan.rock@amd.com>

