[Enhancement] Patch OmniStage.try_collect() with _proc alive checks#1560
Conversation
|
@ApsarasX PTAL |
lishunyang12
left a comment
There was a problem hiding this comment.
Left a couple comments. The main one is a race between the is_alive() check and the queue read — the current ordering can discard valid results that are already queued when the worker exits normally.
|
@lishunyang12 I resolved your comments with minor tweaks. |
|
@lishunyang12 Bumping, can you please take a look? |
lishunyang12
left a comment
There was a problem hiding this comment.
LGTM — race condition fixed, queue-first ordering is correct
|
LGTM. |
|
@wtomin , seems the AMD CI fail issue is HW resource issue. |
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
328b87c to
8e79a78
Compare
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
| fake_process_instance.start = mocker.MagicMock() | ||
| fake_process_instance.join = mocker.MagicMock() | ||
| fake_process_instance.is_alive = mocker.MagicMock(return_value=False) | ||
| fake_process_instance.is_alive = mocker.MagicMock(return_value=True) |
There was a problem hiding this comment.
This value needs to be set True to allow _proc.is_alive() check on mocks to pass. There is no reason to set is_alive to False unless testing process failure test case.
…llm-project#1560) Signed-off-by: Daniel Huang <daniel1.huang@intel.com> Signed-off-by: Megha Agarwal <agarwalmegha1308@gmail.com>
…llm-project#1560) Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
…llm-project#1560) Signed-off-by: Daniel Huang <daniel1.huang@intel.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Waiting for OmniStage involves checking the output queue for results. However,
try_collect()does not check if process has died and will hang indefinitely. This fixes this issue by explicitly checking that the process is alive before attempting to read the output queue. Component of #1557 relating to issue #1346Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)