[Entrypoint][Refactor]Stage CLI Refactor#2020
[Entrypoint][Refactor]Stage CLI Refactor#2020hsliuustc0106 merged 9 commits intovllm-project:mainfrom
Conversation
2da5ae7 to
916b1b0
Compare
There was a problem hiding this comment.
A few concerns on this refactor:
-
stage_id: int→int | None(arg_utils.py) — downstream code doing arithmetic onstage_idwithout a None check will break._effective_stage_idonly coverscreate_model_config; other callers are unguarded. -
init_timeoutdefault silently dropped from 600 → 300 — existing deployments relying on 600s will start timing out. -
Unsafe manual
__enter__/__exit__in_create_remote_llm_stage— if anything fails between them, the socket leaks. Use awithstatement orcontextlib.ExitStack.
c1767ee to
4e39584
Compare
|
@princepride please help validate bagel for other input/output pairs |
| @@ -0,0 +1,589 @@ | |||
| """Helpers for launching and handshaking omni engine cores.""" | |||
There was a problem hiding this comment.
omni_core_engine.py name is misleading
There was a problem hiding this comment.
Yes, renamed it to stage_engine_startup.py
| close_started_llm_stage(started_stage) | ||
| raise | ||
|
|
||
| def _launch_registered_diffusion_stage( |
There was a problem hiding this comment.
why not call it _launch_diffusion_stage
There was a problem hiding this comment.
why the input args looks so different compared with:
def _launch_llm_stage(
self,
stage_cfg: Any,
metadata: Any,
stage_connector_spec: dict[str, Any],
stage_init_timeout: int,
llm_stage_launch_lock: threading.Lock,
omni_kv_connector: tuple[dict[str, Any] | None, str | None, str | None] = (None, None, None),
) -> StartedLlmStage:
There was a problem hiding this comment.
why not call it _launch_diffusion_stage
Yeah, just rename it.
There was a problem hiding this comment.
why the input args looks so different compared with:
def _launch_llm_stage( self, stage_cfg: Any, metadata: Any, stage_connector_spec: dict[str, Any], stage_init_timeout: int, llm_stage_launch_lock: threading.Lock, omni_kv_connector: tuple[dict[str, Any] | None, str | None, str | None] = (None, None, None), ) -> StartedLlmStage:
Because they are initially writed in different ways. I think it could be redesigned to unify the code logic.
8e24dea to
169e121
Compare
| # Register the stage once and reuse the returned per-stage handshake | ||
| # address for all local engine-core processes. | ||
| handshake_address = register_stage_with_omni_master( | ||
| omni_master_address=omni_master_server._address, |
There was a problem hiding this comment.
omni_master_server._address / ._port are private. This pattern repeats in _launch_diffusion_stage and launch_omni_core_engines — can you expose address and port as public properties on OmniMasterServer?
There was a problem hiding this comment.
Good catch — I exposed address and port as public properties on OmniMasterServer and updated _launch_diffusion_stage and launch_omni_core_engines to use them instead of the private fields.
| return started_stage | ||
| except Exception: | ||
| except BaseException as exc: | ||
| launch_stack.__exit__(type(exc), exc, exc.__traceback__) |
There was a problem hiding this comment.
Manual __exit__ call is fragile — if the stack has already been .close()'d on the happy path and then something throws between close and return, this double-calls exit. Wrapping launch_stack in a with statement (or at least guarding with a flag) would be safer.
There was a problem hiding this comment.
Good catch — I updated _launch_llm_stage to manage launch_stack with with ExitStack() so cleanup is handled safely without a manual __exit__ call.
| assert started_stage is not None | ||
| return started_stage | ||
| except Exception: | ||
| except BaseException as exc: |
There was a problem hiding this comment.
Why BaseException instead of Exception? KeyboardInterrupt / SystemExit propagation through ExitStack.__exit__ can mask the original signal.
There was a problem hiding this comment.
Good catch — now that the manual ExitStack.__exit__ call is gone, I’ve narrowed this to Exception so KeyboardInterrupt and SystemExit still propagate cleanly.
02f1442 to
cb9f559
Compare
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
Signed-off-by: wuhang <wuhang6@huawei.com>
|
Do we have a usage doc for the new CLI way? |
I'm working on this in #1462. |
Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Since a refactor of entrypoint( #1908 ), stage based CLI( #939 ) is needed to refactor too because it is mainly implemented in entrypoint module.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)