Skip to content

[DO NOT MERGE][Core] Revert "Fix benign error log during normal shutdown (#36270)"#36646

Closed
markmc wants to merge 1 commit intovllm-project:mainfrom
markmc:revert-fix-benign-error-log
Closed

[DO NOT MERGE][Core] Revert "Fix benign error log during normal shutdown (#36270)"#36646
markmc wants to merge 1 commit intovllm-project:mainfrom
markmc:revert-fix-benign-error-log

Conversation

@markmc
Copy link
Member

@markmc markmc commented Mar 10, 2026

Possible simpler alternative to #36628 fixing #36624

…ect#36270)"

This reverts commit 6a18d87.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc requested a review from njhill as a code owner March 10, 2026 11:43
@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2026
@mergify mergify bot added the v1 label Mar 10, 2026
@markmc
Copy link
Member Author

markmc commented Mar 10, 2026

Looking at:

  1. Shutdown timeout, failed - https://buildkite.com/vllm/ci/builds/54863
  2. Busy loop fixes, passed when I triggered it today - https://buildkite.com/vllm/ci/builds/54834
  3. Nick "fix benign error log", failed when I triggered it today - https://buildkite.com/vllm/ci/builds/55015

Strongly suggests simply reverting #36270 might be sufficient to fix the test

(Even if the revert gets us green again though, that doesn't mean the code and tests are correct! But we can follow-up with fixes that don't break the job)

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reverts a previous fix for a benign error log during shutdown. While this may be a step towards an alternative solution, the current changes re-introduce two critical issues. First, the shutdown method is no longer idempotent, which could cause errors if called multiple times. Second, a race condition is re-introduced in monitor_engine_cores, which will likely cause spurious error logs during a normal shutdown. I've added specific comments with suggestions to restore the previous, safer behavior.

Comment on lines +642 to +645
self._finalizer.detach()
if self.resources.engine_manager is not None:
self.resources.engine_manager.shutdown(timeout=timeout)
self.resources()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

By removing the check on the return value of self._finalizer.detach(), the shutdown method is no longer idempotent. If this method is called multiple times, it will attempt to shut down the engine manager and clean up resources repeatedly. This could lead to unexpected errors if the underlying shutdown and cleanup operations are not idempotent. It's safer to restore the check to ensure shutdown logic runs only once.

Suggested change
self._finalizer.detach()
if self.resources.engine_manager is not None:
self.resources.engine_manager.shutdown(timeout=timeout)
self.resources()
if self._finalizer.detach() is not None:
if self.resources.engine_manager is not None:
self.resources.engine_manager.shutdown(timeout=timeout)
self.resources()

died = multiprocessing.connection.wait(sentinels)
_self = self_ref()
if not _self or not _self._finalizer.alive or _self.resources.engine_dead:
if not _self or _self.resources.engine_dead:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Removing the not _self._finalizer.alive check re-introduces a race condition. During a normal shutdown, engine processes are terminated. This monitor thread can wake up and incorrectly interpret this as an unexpected crash, leading to spurious error logs. The check for _self._finalizer.alive is crucial to distinguish between a controlled shutdown and an actual failure.

Suggested change
if not _self or _self.resources.engine_dead:
if not _self or not _self._finalizer.alive or _self.resources.engine_dead:

@markmc
Copy link
Member Author

markmc commented Mar 10, 2026

Looking at:

  1. Shutdown timeout, failed - https://buildkite.com/vllm/ci/builds/54863
  2. Busy loop fixes, passed when I triggered it today - https://buildkite.com/vllm/ci/builds/54834

This only passed because it is the PR branch, which did not include the shutdown timeout change

The tests were still broken on main with this "busy loop fixes" PR - https://buildkite.com/vllm/ci/builds/54959

And sure enough, this revert did not help - https://buildkite.com/vllm/ci/builds/55478

@markmc markmc closed this Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant