[Enhancement] Patch AsyncOmniEngine try_get_output[_async] hanging issues#2153
Conversation
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
|
@yinpeiqi @Bounty-hunter ptal thx |
|
@codex review |
|
Codex Review: Didn't find any major issues. Bravo. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
LGTM |
There was a problem hiding this comment.
I think it's not that necessary to add tests for the log info, since the logic is straightforward
There was a problem hiding this comment.
It was requested by @david6666666 in #1346, but I can remove it if the consensus is that it is not needed.
There was a problem hiding this comment.
I think it's necessary to add test to corer handle out-of-memory (OOM) errors in a manner consistent with vLLM.
| return self.output_queue.sync_q.get(timeout=timeout) | ||
| except queue.Empty: | ||
| if not self.is_alive(): | ||
| raise RuntimeError("Orchestrator died unexpectedly. See logs above.") |
There was a problem hiding this comment.
See logs above. seems too vague. Is it really helpful for users to debug the reason why Orchestrator died?
There was a problem hiding this comment.
What message would you suggest instead? There is no meaningful way currently to obtain the specific error thrown by the Orchestrator thread. This is meant to be the catch-all worst-case scenario when orchestrator aborts ungracefully or forcefully that was not caught by the standard paths of sending to the output queue.
| return self.output_queue.sync_q.get_nowait() | ||
| except queue.Empty: | ||
| if not self.is_alive(): | ||
| raise RuntimeError("Orchestrator died unexpectedly. See logs above.") |
There was a problem hiding this comment.
I guess the error messages might be asynchronously printed in stdout. Is the error message really shown above?
…sues (vllm-project#2153) Signed-off-by: Daniel Huang <daniel1.huang@intel.com> Signed-off-by: Zhang <jianmusings@gmail.com>
…sues (vllm-project#2153) Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
…sues (vllm-project#2153) Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
…sues (vllm-project#2153) Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Adds #1560 functionality post #1908 refactor. Fixes issues during init for #1346
Test Plan
Added L1 test in
tests/engine/test_async_omni_engine_outputs.py.Test Result
Both passed.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)