[Core] Refactor CFG companion tracker and use in Orchestrator#2623
Conversation
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
|
do we have any perf/acc regression example tests or just run the CI? |
Just need to run the CI for BAGEL, this PR do not change the execution logic. |
You can refer to for both online and offline L3 test @yinpeiqi |
| @@ -1,89 +1,29 @@ | |||
| """CFG companion request tracker for the Omni orchestrator. | |||
There was a problem hiding this comment.
At first glance, it feels very strange to put this file under the entrypoint.
There was a problem hiding this comment.
agree, this too feature specific
There was a problem hiding this comment.
Moved to engine folder.
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
|
This PR could be merge? @gcanlin @hsliuustc0106 |
gcanlin
left a comment
There was a problem hiding this comment.
LGTM. @princepride Could you double-check it?
lishunyang12
left a comment
There was a problem hiding this comment.
Good refactor. Moving CFG companion state out of Orchestrator and into a dedicated CfgCompanionTracker class makes the responsibilities clearer and the orchestrator loop easier to follow. The key rename from "output" to "engine_outputs" is consistently applied across defer_parent, pop_pending_parent, and the consuming code in _handle_cfg_companion_ready. Tests cover the main paths well.
A few observations:
1. abort_parents does not clean up companion-only aborts
If abort_parents receives a companion ID (not a parent), it passes it through without removing it from _companion_ids, _companion_to_parent, or _done. This leaves stale tracking state. In practice this is probably fine because _handle_abort is always called with parent IDs from the external layer, but it is worth a comment or a defensive cleanup to avoid confusion if the method is reused elsewhere.
2. on_companion_completed assumes _done[parent_id] exists
Line self._done[parent_id].add(companion_id) will KeyError if the parent was never registered. This is safe today because register_companion always initializes _done via register_parent, but a guard or assertion (assert parent_id in self._done) would make the invariant explicit and give a better error message if the contract is violated.
3. Removed _upgrade_to_omni_request call in async_omni_engine.py
The deletion of request = _upgrade_to_omni_request(request, companion_prompt) at line 1075 looks intentional but is not mentioned in the PR description. Is this line dead code after #1908, or was it still doing something? Worth confirming this does not affect the companion request payload.
4. Removal of timeout / failure-propagation paths
The old tracker had check_timeouts(), on_companion_error(), is_parent_failed(), and consume_parent_failure(). The PR description says these are unused in the current orchestrator flow. That seems correct from the diff -- the orchestrator never called them. Just confirming this is intentional: if a companion hangs indefinitely the parent will remain deferred forever. If that is an acceptable risk today, a TODO comment noting the lack of timeout would be helpful for future readers.
5. Minor nit: typo in PR title
"compaion" should be "companion".
Overall this is clean and well-scoped. LGTM.
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Fixed according to the comments. For 3, _upgrade_to_omni_request is used only for the TTS model with input additional_information. Not affect to BAGEL, so i directly remove this one here. |
…roject#2623) Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
…roject#2623) Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
…roject#2623) Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
The CfgCompaionTracker, originally used in the AsyncOmni and Omni class (before #1908 refactor), and now is deperated.
Before this change, the CFG companion lifecycle was managed directly inside Orchestrator. That made the orchestration layer responsible for too many low-level details:
This had a few problems:
What changed
This refactor moves CFG companion state ownership out of Orchestrator and into CfgCompanionTracker.
Orchestrator now initializes a single tracker instance and delegates CFG-specific state management to it. The tracker is responsible for:
At the same time, old tracker logic that no longer matches the current orchestrator flow was removed, including unused prompt-expansion, timeout, and failure-propagation paths.
cc: @fake0fan @princepride @natureofnature @gcanlin @hsliuustc0106
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)