[Enhancement] Escalate stage timeout to error #1558
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2419589c8
lishunyang12
left a comment
Left a couple of comments. Also:
- test_wait_for_stages_ready_timeout (and the diffusion variant) currently assert len(omni._stages_ready) == 0 after constructing Omni(...). That will blow up now, since __init__ raises TimeoutError before returning. Needs a pytest.raises(TimeoutError) wrapper.
- +1 on the codex bot's point about self.close() being a no-op here: _weak_finalizer isn't registered until Omni.__init__ returns from super().__init__(), so the cleanup never fires.
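A minimal sketch of the pytest.raises wrapper suggested above. FakeOmni is a hypothetical stand-in for Omni (the real constructor signature is not shown in this thread); it only reproduces the behaviour that matters here, namely that the constructor raises before returning.

```python
import pytest


class FakeOmni:
    """Hypothetical stand-in for Omni: the constructor times out waiting for stages."""

    def __init__(self, num_stages, timeout):
        self._stages_ready = set()
        # With this PR, a stage timeout raises instead of returning normally.
        raise TimeoutError(f"0/{num_stages} stages ready after {timeout}s")


def test_wait_for_stages_ready_timeout():
    # The constructor never returns, so there is no object left to inspect;
    # the assertion moves onto the exception itself.
    with pytest.raises(TimeoutError, match="stages ready"):
        FakeOmni(num_stages=2, timeout=0.01)
```

The old assertion on len(omni._stages_ready) cannot run in this shape, because omni is never bound when __init__ raises.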
@lishunyang12 I addressed your comments and fixed pytests. Regarding |
    self.close()
    raise TimeoutError(
        f"{self._name}: {len(self._stages_ready)}/{num_stages} stages ready after {timeout}s. Missing stages: {not_ready}"
    )
Resource leak: self.close() here is a no-op because _weak_finalizer is not registered until Omni.__init__ returns from super().__init__() (see omni.py:524 and async_omni.py). The timeout occurs inside super().__init__(), so cleanup never fires and orphan workers/IPC resources leak. Consider setting up finalizer earlier or using try-finally pattern as discussed in comments.
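One way to read the try-finally suggestion is sketched below: the base constructor releases what it started before letting the timeout propagate, so cleanup does not depend on a finalizer that is registered later. All names here (OrchestratorBase, live_workers, _shutdown_workers) are hypothetical and do not reflect the actual vllm-omni classes.

```python
class OrchestratorBase:
    """Hypothetical base class: starts workers, then waits for them to be ready."""

    def __init__(self, num_stages, ready_stages, live_workers):
        # live_workers is a caller-owned list standing in for real processes/IPC.
        self._live_workers = live_workers
        self._live_workers.extend(f"worker-{i}" for i in range(num_stages))
        try:
            self._wait_for_stages_ready(num_stages, ready_stages)
        except BaseException:
            # No finalizer exists yet at this point, so clean up explicitly
            # before the exception leaves the constructor.
            self._shutdown_workers()
            raise

    def _wait_for_stages_ready(self, num_stages, ready_stages):
        not_ready = sorted(set(range(num_stages)) - set(ready_stages))
        if not_ready:
            raise TimeoutError(
                f"{len(ready_stages)}/{num_stages} stages ready. "
                f"Missing stages: {not_ready}"
            )

    def _shutdown_workers(self):
        self._live_workers.clear()
```

Registering the finalizer before the wait begins would be the other option mentioned in the comment; the sketch covers only the explicit-cleanup variant.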
1d35c48 to a58d38e
@hsliuustc0106 @lishunyang12 I added in a |
Please resolve the conflicts.
a58d38e to 69c821b
@wtomin I have resolved the conflicts. I also fixed some tests related to timeout errors that assumed a timeout would simply return early. Replacing the stage with a mock stage that is ready immediately resolves this.
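The mock-stage approach described above might look roughly like this. The wait_ready method and the wait_for_stages_ready helper are assumptions made for illustration; the real stage interface is not shown in this thread.

```python
from unittest import mock


def wait_for_stages_ready(stages, timeout):
    """Hypothetical readiness check: raise if any stage fails to report ready."""
    not_ready = [i for i, stage in enumerate(stages) if not stage.wait_ready(timeout)]
    if not_ready:
        raise TimeoutError(f"Missing stages: {not_ready}")
    return len(stages)


def make_ready_stage():
    # The mock reports ready at once, so no real worker is started and
    # the timeout path is never taken.
    stage = mock.Mock()
    stage.wait_ready.return_value = True
    return stage


stages = [make_ready_stage(), make_ready_stage()]
ready_count = wait_for_stages_ready(stages, timeout=0.01)
```

Tests that are not about the timeout itself can then construct the orchestrator without tripping the new error.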
17ffbdb to 5cbb668
@pi314ever Could you help resolve the recent conflicts? Thanks |
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
d67651e to d9fc76c
Is this PR ready now?
@david6666666 Yes, the PR is ready now.
Fixed in #1908
Purpose
Escalates a stage timeout from a warning to an error. This prevents an invalid orchestrator state in which the orchestrator reports ready while individual stages are hanging or dead. Component of #1557, relating to issue #1346.
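As a rough illustration of the behaviour change (not the actual implementation), a readiness wait that raises on timeout could be sketched as follows. The poll_ready callable and the polling interval are assumptions; only the error message format follows the snippet quoted in the review.

```python
import time


def wait_for_stages_ready(poll_ready, num_stages, timeout, interval=0.01):
    """Poll until every stage is ready, or raise TimeoutError naming the missing ones.

    poll_ready is a hypothetical callable returning the ids of ready stages.
    """
    deadline = time.monotonic() + timeout
    while True:
        ready = set(poll_ready())
        if len(ready) >= num_stages:
            return ready
        if time.monotonic() >= deadline:
            not_ready = sorted(set(range(num_stages)) - ready)
            # Before this change, the equivalent branch only logged a warning
            # and returned, leaving the orchestrator looking ready.
            raise TimeoutError(
                f"{len(ready)}/{num_stages} stages ready after {timeout}s. "
                f"Missing stages: {not_ready}"
            )
        time.sleep(interval)
```

Raising here means a caller can no longer receive an orchestrator whose stages never came up.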
Test Plan
Test Result