[DO NOT MERGE][Core] Revert "Fix benign error log during normal shutdown (#36270)"#36646
[DO NOT MERGE][Core] Revert "Fix benign error log during normal shutdown (#36270)"#36646markmc wants to merge 1 commit intovllm-project:mainfrom
Conversation
…ect#36270)" This reverts commit 6a18d87. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
Looking at:
Strongly suggests simply reverting #36270 might be sufficient to fix the test (Even if the revert gets us green again though, that doesn't mean the code and tests are correct! But we can follow-up with fixes that don't break the job) |
There was a problem hiding this comment.
Code Review
This pull request reverts a previous fix for a benign error log during shutdown. While this may be a step towards an alternative solution, the current changes re-introduce two critical issues. First, the shutdown method is no longer idempotent, which could cause errors if called multiple times. Second, a race condition is re-introduced in monitor_engine_cores, which will likely cause spurious error logs during a normal shutdown. I've added specific comments with suggestions to restore the previous, safer behavior.
| self._finalizer.detach() | ||
| if self.resources.engine_manager is not None: | ||
| self.resources.engine_manager.shutdown(timeout=timeout) | ||
| self.resources() |
There was a problem hiding this comment.
By removing the check on the return value of self._finalizer.detach(), the shutdown method is no longer idempotent. If this method is called multiple times, it will attempt to shut down the engine manager and clean up resources repeatedly. This could lead to unexpected errors if the underlying shutdown and cleanup operations are not idempotent. It's safer to restore the check to ensure shutdown logic runs only once.
| self._finalizer.detach() | |
| if self.resources.engine_manager is not None: | |
| self.resources.engine_manager.shutdown(timeout=timeout) | |
| self.resources() | |
| if self._finalizer.detach() is not None: | |
| if self.resources.engine_manager is not None: | |
| self.resources.engine_manager.shutdown(timeout=timeout) | |
| self.resources() |
| died = multiprocessing.connection.wait(sentinels) | ||
| _self = self_ref() | ||
| if not _self or not _self._finalizer.alive or _self.resources.engine_dead: | ||
| if not _self or _self.resources.engine_dead: |
There was a problem hiding this comment.
Removing the not _self._finalizer.alive check re-introduces a race condition. During a normal shutdown, engine processes are terminated. This monitor thread can wake up and incorrectly interpret this as an unexpected crash, leading to spurious error logs. The check for _self._finalizer.alive is crucial to distinguish between a controlled shutdown and an actual failure.
| if not _self or _self.resources.engine_dead: | |
| if not _self or not _self._finalizer.alive or _self.resources.engine_dead: |
This only passed because it is the PR branch, which did not include the shutdown timeout change The tests were still broken on main with this "busy loop fixes" PR - https://buildkite.com/vllm/ci/builds/54959 And sure enough, this revert did not help - https://buildkite.com/vllm/ci/builds/55478 |
Possible simpler alternative to #36628 fixing #36624