[Enhancement] Resolve Various Hanging Issues During Init Process #1557

Closed
pi314ever wants to merge 4 commits into vllm-project:main from pi314ever:error-on-stage-timeout

@pi314ever (Contributor) commented Feb 27, 2026

Purpose

Resolves several issues mentioned in #1346 that cause processes to hang or time out. Namely:

  • Raise TimeoutError on stage initialization timeout to prevent invalid orchestrator state (ready with one or more stages timed out)
  • Exit early from the test OmniServer if the server process is dead
  • Resolve hanging issues in OmniStage.try_collect() by checking for dead processes
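
To illustrate the dead-process check in the last item, here is a minimal sketch. It is not the code from this PR: `collect_or_fail` and `WorkerDiedError` are hypothetical names, and the helper assumes only a queue whose `get(timeout=...)` raises `queue.Empty` and a worker object that exposes `is_alive()`, which holds for both `multiprocessing.Process` and `threading.Thread`.

```python
import queue
import time


class WorkerDiedError(RuntimeError):
    """Hypothetical error: the producer exited before delivering a result."""


def collect_or_fail(out_q, worker, poll_interval=0.1, timeout=5.0):
    """Return the next item from out_q, or raise instead of blocking forever.

    out_q:  a queue whose get(timeout=...) raises queue.Empty
            (queue.Queue and multiprocessing.Queue both behave this way).
    worker: any object with is_alive().
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Wait in short slices so liveness can be re-checked between them.
            return out_q.get(timeout=poll_interval)
        except queue.Empty:
            pass
        if not worker.is_alive():
            # The worker may have enqueued an item just before exiting,
            # so look once more before giving up.
            try:
                return out_q.get_nowait()
            except queue.Empty:
                raise WorkerDiedError("worker exited without producing a result") from None
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result within {timeout} s")
```

A caller that previously blocked on a bare `get()` would instead see `WorkerDiedError` shortly after the worker dies, or `TimeoutError` if the worker is alive but silent for too long.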

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste a comparison of the results before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
…eue get errors

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever changed the title from "Error on stage timeout" to "[BUGFIX] Resolve Various Timeout Issues" on Feb 27, 2026
@pi314ever changed the title from "[BUGFIX] Resolve Various Timeout Issues" to "[BUGFIX] Resolve Various Timeout/Hanging Issues" on Feb 27, 2026
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever changed the title from "[BUGFIX] Resolve Various Timeout/Hanging Issues" to "[Enhancement] Resolve Various Timeout/Hanging Issues" on Feb 27, 2026
@pi314ever changed the title from "[Enhancement] Resolve Various Timeout/Hanging Issues" to "[Enhancement] Resolve Various Hanging Issues During Init Process" on Feb 27, 2026
return profiler is not None
logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
self.close()
raise TimeoutError
Contributor

@hsliuustc0106, when one stage fails to initialize because of a timeout, why would we want to keep it running instead of terminating it?
Is it OK to switch to terminate?
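
For readers following the question above, the clean-up-then-raise pattern in the reviewed hunk can be sketched in isolation as below. This is an illustration under stated assumptions, not the PR's code: `wait_until_ready`, `is_ready`, and `shutdown` are hypothetical names, and `shutdown` stands in for whichever clean-up is chosen (a graceful close or a hard terminate). The sketch does not settle which is appropriate here.

```python
import time


def wait_until_ready(is_ready, shutdown, timeout, poll_interval=0.05):
    """Poll is_ready() until it returns True or the deadline passes.

    On timeout, shutdown() runs first so no half-initialized worker is left
    behind, then TimeoutError is raised so the caller cannot proceed as if
    the stage were ready. Both callables are supplied by the caller.
    """
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            shutdown()
            raise TimeoutError(f"stage not ready after {timeout} s")
        time.sleep(poll_interval)
```

Passing the clean-up step in as a callable keeps the waiting logic independent of the close-versus-terminate decision.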

@pi314ever (Contributor, Author)

I split this into 4 pull requests because each integration may need its own discussion. This pull request will be closed in favor of them:

@pi314ever closed this on Feb 27, 2026