
[Core] Refactor CFG companion tracker and use in Orchestrator#2623

Merged
gcanlin merged 10 commits into
vllm-project:main from
yinpeiqi:ref-cfg-orch
Apr 17, 2026

@yinpeiqi
Contributor

@yinpeiqi yinpeiqi commented Apr 9, 2026

Purpose

The CfgCompanionTracker was originally used in the AsyncOmni and Omni classes (before the #1908 refactor) and is now deprecated.

Before this change, the CFG companion lifecycle was managed directly inside Orchestrator. That made the orchestration layer responsible for too many low-level details:

  • parent -> companion mapping
  • companion completion tracking
  • deferred parent forwarding
  • cleanup and abort handling
  • diffusion-stage CFG KV request ID attachment

This had a few problems:

  • Orchestrator exposed too much CFG-specific state and bookkeeping logic.
  • The CFG flow was spread across multiple methods, which made the control flow harder to follow.
  • State ownership was unclear: companion-related data lived in Orchestrator, while a separate CfgCompanionTracker already existed but did not match the current runtime path.
  • The abstraction boundary was weak, which made future CFG changes riskier and harder to reason about.

What changed

This refactor moves CFG companion state ownership out of Orchestrator and into CfgCompanionTracker.

Orchestrator now initializes a single tracker instance and delegates CFG-specific state management to it. The tracker is responsible for:

  • registering parent/companion relationships
  • checking whether a request is a companion
  • tracking companion completion
  • deferring parent forwarding until companions are ready
  • attaching cfg_kv_request_ids for diffusion requests
  • cleaning up companion state
  • expanding aborts from parent requests to companion requests
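
The responsibilities listed above can be sketched as a small class. This is a minimal illustration under stated assumptions, not the code in this PR: the method names `register_companion`, `on_companion_completed`, `defer_parent`, `pop_pending_parent`, and `abort_parents` and the fields `_companion_to_parent` and `_done` appear in the review discussion, but every signature, the helpers `is_companion`, `all_companions_done`, and `cleanup`, the `_pending` and `_parent_to_companions` fields, and the request ID strings are hypothetical.

```python
from typing import Any, Optional


class CfgCompanionTrackerSketch:
    """Illustrative stand-in for a CFG companion tracker (not the vllm-omni code)."""

    def __init__(self) -> None:
        self._companion_to_parent: dict[str, str] = {}
        self._parent_to_companions: dict[str, set[str]] = {}  # hypothetical field
        self._done: dict[str, set[str]] = {}
        self._pending: dict[str, Any] = {}  # hypothetical: deferred parent outputs

    def register_companion(self, parent_id: str, companion_id: str) -> None:
        # Record the parent -> companion relationship in both directions.
        self._parent_to_companions.setdefault(parent_id, set()).add(companion_id)
        self._done.setdefault(parent_id, set())
        self._companion_to_parent[companion_id] = parent_id

    def is_companion(self, request_id: str) -> bool:
        return request_id in self._companion_to_parent

    def all_companions_done(self, parent_id: str) -> bool:
        expected = self._parent_to_companions.get(parent_id, set())
        return self._done.get(parent_id, set()) >= expected

    def defer_parent(self, parent_id: str, engine_outputs: Any) -> None:
        # Hold the parent's outputs until its companions have finished.
        self._pending[parent_id] = engine_outputs

    def on_companion_completed(self, companion_id: str) -> Optional[str]:
        # Returns the parent ID when a deferred parent is now ready to forward.
        parent_id = self._companion_to_parent[companion_id]
        self._done[parent_id].add(companion_id)
        if parent_id in self._pending and self.all_companions_done(parent_id):
            return parent_id
        return None

    def pop_pending_parent(self, parent_id: str) -> Any:
        return self._pending.pop(parent_id, None)

    def cfg_kv_request_ids(self, parent_id: str) -> list[str]:
        # Companion IDs a diffusion request would carry for CFG KV lookup.
        return sorted(self._parent_to_companions.get(parent_id, set()))

    def cleanup(self, parent_id: str) -> None:
        for companion_id in self._parent_to_companions.pop(parent_id, set()):
            self._companion_to_parent.pop(companion_id, None)
        self._done.pop(parent_id, None)
        self._pending.pop(parent_id, None)

    def abort_parents(self, request_ids: list[str]) -> list[str]:
        # Expand each parent abort to include its companions, then drop state.
        expanded: list[str] = []
        for request_id in request_ids:
            expanded.append(request_id)
            expanded.extend(sorted(self._parent_to_companions.get(request_id, set())))
            self.cleanup(request_id)
        return expanded


tracker = CfgCompanionTrackerSketch()
tracker.register_companion("req-1", "req-1__cfg")  # hypothetical IDs
tracker.defer_parent("req-1", ["stage-0 output"])
ready_parent = tracker.on_companion_completed("req-1__cfg")
```

In this sketch the orchestrator would only ask the tracker whether a parent is ready and then call `pop_pending_parent`, so none of the bookkeeping dictionaries need to be visible outside the tracker.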

At the same time, old tracker logic that no longer matches the current orchestrator flow was removed, including unused prompt-expansion, timeout, and failure-propagation paths.

cc: @fake0fan @princepride @natureofnature @gcanlin @hsliuustc0106

Test Plan

Test Result


yinpeiqi added 2 commits April 9, 2026 12:49
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
@yinpeiqi yinpeiqi requested a review from hsliuustc0106 as a code owner April 9, 2026 05:42
@hsliuustc0106
Collaborator

do we have any perf/acc regression example tests or just run the CI?

@yinpeiqi
Contributor Author

yinpeiqi commented Apr 9, 2026

> do we have any perf/acc regression example tests or just run the CI?

Just need to run the CI for BAGEL; this PR does not change the execution logic.

@natureofnature
Contributor

natureofnature commented Apr 9, 2026

> > do we have any perf/acc regression example tests or just run the CI?
>
> Just need to run the CI for BAGEL; this PR does not change the execution logic.

You can refer to

VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_TEST_CLEAN_GPU_MEMORY=1 VLLM_IMAGE_FETCH_TIMEOUT=60  pytest tests/e2e/offline_inference/test_bagel_img2img.py tests/e2e/offline_inference/test_bagel_text2img.py tests/e2e/online_serving/test_bagel_online.py tests/e2e/online_serving/test_bagel_expansion.py -v -m "advanced_model" --run-level advanced_model

for both the online and offline L3 tests @yinpeiqi

@@ -1,89 +1,29 @@
"""CFG companion request tracker for the Omni orchestrator.
Contributor

At first glance, it feels very strange to put this file under the entrypoint.

Collaborator

Agree, this is too feature-specific.

Contributor Author

Moved to the engine folder.

yinpeiqi and others added 2 commits April 9, 2026 16:26
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
@gcanlin gcanlin added the ready label to trigger buildkite CI label Apr 11, 2026
@gcanlin gcanlin requested a review from princepride April 11, 2026 14:09
Comment thread vllm_omni/engine/cfg_companion_tracker.py Outdated
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
@yinpeiqi
Contributor Author

Could this PR be merged? @gcanlin @hsliuustc0106

Collaborator

@gcanlin gcanlin left a comment

LGTM. @princepride Could you double-check it?

lishunyang12 previously approved these changes Apr 16, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment

Good refactor. Moving CFG companion state out of Orchestrator and into a dedicated CfgCompanionTracker class makes the responsibilities clearer and the orchestrator loop easier to follow. The key rename from "output" to "engine_outputs" is consistently applied across defer_parent, pop_pending_parent, and the consuming code in _handle_cfg_companion_ready. Tests cover the main paths well.

A few observations:

1. abort_parents does not clean up companion-only aborts
If abort_parents receives a companion ID (not a parent), it passes it through without removing it from _companion_ids, _companion_to_parent, or _done. This leaves stale tracking state. In practice this is probably fine because _handle_abort is always called with parent IDs from the external layer, but it is worth a comment or a defensive cleanup to avoid confusion if the method is reused elsewhere.

2. on_companion_completed assumes _done[parent_id] exists
Line self._done[parent_id].add(companion_id) will KeyError if the parent was never registered. This is safe today because register_companion always initializes _done via register_parent, but a guard or assertion (assert parent_id in self._done) would make the invariant explicit and give a better error message if the contract is violated.

3. Removed _upgrade_to_omni_request call in async_omni_engine.py
The deletion of request = _upgrade_to_omni_request(request, companion_prompt) at line 1075 looks intentional but is not mentioned in the PR description. Is this line dead code after #1908, or was it still doing something? Worth confirming this does not affect the companion request payload.

4. Removal of timeout / failure-propagation paths
The old tracker had check_timeouts(), on_companion_error(), is_parent_failed(), and consume_parent_failure(). The PR description says these are unused in the current orchestrator flow. That seems correct from the diff -- the orchestrator never called them. Just confirming this is intentional: if a companion hangs indefinitely the parent will remain deferred forever. If that is an acceptable risk today, a TODO comment noting the lack of timeout would be helpful for future readers.

5. Minor nit: typo in PR title
"compaion" should be "companion".
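
Observations 1 and 2 could be addressed roughly as follows. This is a minimal sketch, assuming a tracker that keeps the `_companion_ids`, `_companion_to_parent`, and `_done` fields named above; the `_parent_to_companions` field, the `_forget_companion` helper, all signatures, and the choice to raise `KeyError` with a clearer message are hypothetical and not taken from the PR.

```python
class TrackerWithGuards:
    """Illustrative only: explicit invariant check plus defensive abort cleanup."""

    def __init__(self) -> None:
        self._companion_ids: set[str] = set()
        self._companion_to_parent: dict[str, str] = {}
        self._parent_to_companions: dict[str, set[str]] = {}  # hypothetical field
        self._done: dict[str, set[str]] = {}

    def register_parent(self, parent_id: str) -> None:
        self._parent_to_companions.setdefault(parent_id, set())
        self._done.setdefault(parent_id, set())

    def register_companion(self, parent_id: str, companion_id: str) -> None:
        # Always initialise the parent's state first, so _done[parent_id] exists.
        self.register_parent(parent_id)
        self._parent_to_companions[parent_id].add(companion_id)
        self._companion_ids.add(companion_id)
        self._companion_to_parent[companion_id] = parent_id

    def on_companion_completed(self, companion_id: str) -> None:
        parent_id = self._companion_to_parent.get(companion_id)
        # Observation 2: make the "parent was registered" invariant explicit.
        if parent_id is None or parent_id not in self._done:
            raise KeyError(f"companion {companion_id!r} has no registered parent")
        self._done[parent_id].add(companion_id)

    def _forget_companion(self, companion_id: str) -> None:
        # Hypothetical helper: drop every trace of a single companion.
        self._companion_ids.discard(companion_id)
        parent_id = self._companion_to_parent.pop(companion_id, None)
        if parent_id is not None:
            self._done.get(parent_id, set()).discard(companion_id)
            self._parent_to_companions.get(parent_id, set()).discard(companion_id)

    def abort_parents(self, request_ids: list[str]) -> list[str]:
        expanded: list[str] = []
        for request_id in request_ids:
            expanded.append(request_id)
            if request_id in self._companion_ids:
                # Observation 1: a companion-only abort leaves no stale state.
                self._forget_companion(request_id)
                continue
            for companion_id in sorted(self._parent_to_companions.pop(request_id, set())):
                expanded.append(companion_id)
                self._forget_companion(companion_id)
            self._done.pop(request_id, None)
        return expanded
```

One design choice to note: in this sketch an aborted companion is also removed from its parent's expected set, so the parent no longer waits on it. Whether that is the right behaviour for the real orchestrator is an open question.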

Overall this is clean and well-scoped. LGTM.

@lishunyang12 lishunyang12 dismissed their stale review April 16, 2026 14:56

Replacing with inline comments

@yinpeiqi yinpeiqi changed the title [Core] Refactor CFG compaion tracker and use in Orchestrator [Core] Refactor CFG companion tracker and use in Orchestrator Apr 17, 2026
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
yinpeiqi and others added 2 commits April 17, 2026 11:15
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
@yinpeiqi
Contributor Author

Fixed according to the comments. For 3, _upgrade_to_omni_request is only used for the TTS model, whose input carries additional_information. It does not affect BAGEL, so I removed it directly here.

@gcanlin gcanlin enabled auto-merge (squash) April 17, 2026 07:39
@gcanlin gcanlin merged commit bbd6a44 into vllm-project:main Apr 17, 2026
8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Labels

ready label to trigger buildkite CI

6 participants