[Enhancement] Escalate stage timeout to error #1558

Closed
pi314ever wants to merge 10 commits into vllm-project:main from pi314ever:raise-stage-timeout

@pi314ever
Contributor

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Escalates a stage timeout from a warning to an error. This prevents an invalid orchestrator state in which the orchestrator reports ready while individual stages are hanging or dead. Component of #1557, relating to issue #1346.
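
A minimal sketch of the behaviour change, not the code in this PR: the wait loop raises `TimeoutError` where it would previously only have logged a warning. The function name `wait_for_stages_ready`, the `poll_ready` callable, and the `interval` parameter are assumptions made for illustration; the error message follows the format quoted later in this conversation.

```python
import time


def wait_for_stages_ready(name, poll_ready, num_stages, timeout, interval=0.01):
    """Block until every stage reports ready, or raise TimeoutError.

    `poll_ready` is a hypothetical callable returning the ids of the stages
    that are ready so far; the real orchestrator tracks this differently.
    """
    deadline = time.monotonic() + timeout
    stages_ready = set()
    while True:
        stages_ready |= set(poll_ready())
        if len(stages_ready) >= num_stages:
            return stages_ready
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    not_ready = sorted(set(range(num_stages)) - stages_ready)
    # Before this change, the equivalent branch only emitted a warning and
    # returned, so the caller could not tell that startup was incomplete.
    raise TimeoutError(
        f"{name}: {len(stages_ready)}/{num_stages} stages ready after {timeout}s. "
        f"Missing stages: {not_ready}"
    )
```

With this shape, a caller that constructs the orchestrator either gets a fully ready object or an exception, never a half-started one.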

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2419589c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vllm_omni/entrypoints/omni.py Outdated
@pi314ever
Contributor Author

@xuechendi

Comment thread vllm_omni/entrypoints/omni.py Outdated
Comment thread vllm_omni/entrypoints/omni.py Outdated
Comment thread vllm_omni/entrypoints/omni.py
Collaborator

@lishunyang12 lishunyang12 left a comment

Left a couple of comments. Also:

test_wait_for_stages_ready_timeout (and the diffusion variant) currently assert len(omni._stages_ready) == 0 after constructing Omni(...) — that will blow up now since __init__ raises TimeoutError before returning. Needs a pytest.raises(TimeoutError) wrapper.
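
The test change described above could look roughly like the sketch below. `FakeOmni` is a hypothetical stand-in used so the example is self-contained; the real tests construct `Omni(...)` with their own fixtures, which are not shown in this thread.

```python
import pytest


class FakeOmni:
    """Hypothetical stand-in for Omni whose stages never become ready."""

    def __init__(self, timeout):
        self._stages_ready = set()
        # Mirrors the new behaviour: construction fails instead of
        # returning a half-initialised object.
        raise TimeoutError(f"0/1 stages ready after {timeout}s. Missing stages: [0]")


def test_wait_for_stages_ready_timeout():
    # Old style: omni = FakeOmni(...); assert len(omni._stages_ready) == 0
    # That assertion is unreachable once __init__ raises, so the test
    # asserts on the exception instead.
    with pytest.raises(TimeoutError, match="stages ready"):
        FakeOmni(timeout=0.01)
```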

+1 on the codex bot's point about self.close() being a no-op here — _weak_finalizer isn't registered until Omni.__init__ returns from super().__init__(), so the cleanup never fires.

@pi314ever
Contributor Author

@lishunyang12 I addressed your comments and fixed the pytest tests. Regarding _weak_finalizer, I proposed two ideas above. Is either of them acceptable to you?

self.close()
raise TimeoutError(
    f"{self._name}: {len(self._stages_ready)}/{num_stages} stages ready after {timeout}s. Missing stages: {not_ready}"
)
Collaborator

Resource leak: self.close() here is a no-op because _weak_finalizer is not registered until Omni.__init__ returns from super().__init__() (see omni.py:524 and async_omni.py). The timeout occurs inside super().__init__(), so cleanup never fires and orphaned workers and IPC resources leak. Consider setting up the finalizer earlier or using a try-finally pattern, as discussed in the comments.
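
One way to picture the "set up the finalizer earlier" option is the sketch below. It is an illustration under stated assumptions, not the fix that was merged: `Worker`, `Orchestrator`, `_shutdown_all`, and the `stages_become_ready` flag are hypothetical names. Only `weakref.finalize` is a real standard-library API.

```python
import weakref


class Worker:
    """Hypothetical stand-in for a stage worker process or IPC handle."""

    def __init__(self):
        self.alive = True

    def shutdown(self):
        self.alive = False


def _shutdown_all(workers):
    # Module-level function: it must not hold a reference to the owner,
    # otherwise weakref.finalize would keep the owner alive indefinitely.
    for w in workers:
        w.shutdown()


class Orchestrator:
    def __init__(self, workers, stages_become_ready):
        self._workers = workers
        # Register the finalizer before anything that can raise, so that
        # close() does real work even when startup fails part-way through.
        self._finalizer = weakref.finalize(self, _shutdown_all, self._workers)
        try:
            if not stages_become_ready:
                raise TimeoutError(f"0/{len(workers)} stages ready")
        except BaseException:
            self.close()
            raise

    def close(self):
        # Calling a finalize object runs its callback at most once.
        self._finalizer()
```

Because the finalizer exists before the readiness wait starts, the `close()` call on the timeout path shuts the workers down instead of silently doing nothing.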

@pi314ever pi314ever force-pushed the raise-stage-timeout branch from 1d35c48 to a58d38e Compare March 3, 2026 07:49
@pi314ever
Contributor Author

@hsliuustc0106 @lishunyang12 I added a BackgroundResources dataclass that handles resource management for the Omni/AsyncOmni classes. What are your thoughts on this implementation?
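
For readers without the diff at hand, a dataclass of this kind might be shaped like the sketch below. Only the name `BackgroundResources` comes from this thread; the `cleanups` and `closed` fields, the `register` method, and the callable interface are assumptions, and the actual implementation in the PR may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BackgroundResources:
    """Hypothetical sketch; the fields in the PR may differ."""

    # Cleanup callbacks, run in reverse registration order.
    cleanups: List[Callable[[], None]] = field(default_factory=list)
    closed: bool = False

    def register(self, cleanup: Callable[[], None]) -> None:
        self.cleanups.append(cleanup)

    def __call__(self) -> None:
        # Idempotent, so the same object can serve as an explicit close()
        # and as a weakref.finalize callback.
        if self.closed:
            return
        self.closed = True
        while self.cleanups:
            self.cleanups.pop()()
```

Keeping the resources in a separate object means the cleanup logic does not reference the owning `Omni` instance, which is what lets it run even if the owner's `__init__` never completes.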

@wtomin
Collaborator

wtomin commented Mar 9, 2026

Please resolve the conflicts.

@pi314ever pi314ever force-pushed the raise-stage-timeout branch from a58d38e to 69c821b Compare March 9, 2026 19:13
@pi314ever
Contributor Author

@wtomin I have resolved the conflicts.

I also fixed some timeout-related tests that assumed a timeout would simply return early. Replacing the stage with a mock that is ready immediately resolves this.
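
A rough sketch of that kind of test double, using `unittest.mock` from the standard library. The `is_ready` method, the `stage_id` attribute, and the toy `wait_for_stages` function are assumed names for illustration; the real stage interface is not shown in this thread.

```python
from unittest.mock import MagicMock


def make_ready_stage(stage_id):
    """Build a hypothetical mock stage that reports ready on the first poll."""
    stage = MagicMock(name=f"stage-{stage_id}")
    stage.stage_id = stage_id
    stage.is_ready.return_value = True  # `is_ready` is an assumed method name
    return stage


def wait_for_stages(stages):
    """Toy readiness check standing in for the real wait loop."""
    not_ready = [s.stage_id for s in stages if not s.is_ready()]
    if not_ready:
        raise TimeoutError(f"Missing stages: {not_ready}")
    return {s.stage_id for s in stages}
```

Tests that only need a constructed orchestrator can use ready mocks like these, while tests that target the timeout path flip `is_ready.return_value` to `False` and assert on the exception.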

@fhfuih
Contributor

fhfuih commented Mar 13, 2026

@pi314ever Could you help resolve the recent conflicts? Thanks

pi314ever and others added 8 commits March 13, 2026 09:19
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
@pi314ever pi314ever force-pushed the raise-stage-timeout branch from d67651e to d9fc76c Compare March 13, 2026 16:39
@david6666666
Collaborator

Is this PR ready now?

@pi314ever
Contributor Author

@david6666666 Yes, the PR is ready now.

@pi314ever
Contributor Author

Fixed in #1908

@pi314ever pi314ever closed this Mar 18, 2026
Labels

ready label to trigger buildkite CI

9 participants