
[BugFix]: Fix multi-stage cfg bug#2801

Merged
princepride merged 9 commits into
vllm-project:mainfrom
princepride:fix-multi-stage-cfg-bug
Apr 18, 2026

Conversation

@princepride
Collaborator

@princepride princepride commented Apr 14, 2026


Purpose

In BAGEL, the default negative prompt for cfg_text should be [] (no tokens) rather than "<|im_start|><|im_end|>".

Test Plan

python3 examples/offline_inference/bagel/end2end.py --model ByteDance-Seed/BAGEL-7B-MoT --modality text2img
python3 examples/offline_inference/bagel/end2end.py --model ByteDance-Seed/BAGEL-7B-MoT --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_single_stage.yaml --modality text2img

Test Result

image
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@princepride
Collaborator Author

@natureofnature PTAL

@hsliuustc0106
Collaborator

Blocker scan:

  • Correctness: PASS
  • Reliability/Safety: PASS
  • Breaking Changes: ISSUES
  • Test Coverage: ISSUES
  • Documentation: ISSUES
  • Security: PASS

OVERALL: 3 BLOCKERS FOUND

VERDICT: REQUEST_CHANGES

Issues:

  1. Breaking Change without explanation: _get_negative_prompt now returns "" (empty string) instead of "<|im_start|><|im_end|>". This changes default CFG behavior for all BAGEL models. The PR description doesn't explain why this change is necessary or what impact it has on generation quality.

  2. Missing regression test: No test added to verify the multi-stage CFG bug is actually fixed. The pixel reference updates just show output changed — need a test that:

    • Reproduces original bug (multi-stage CFG with empty negative prompt)
    • Verifies new behavior is correct
    • Doesn't rely on exact pixel matching
  3. Incomplete PR description:

    • Checklist items not checked (all items are unchecked)
    • Missing test results comparison (only shows one image, no before/after)
    • Purpose statement is vague: "use [] rather then use <|im_start|><|im_end|> as default negative prompt" — doesn't explain WHY this fixes the bug or what the bug actually was

Also, the load_format: dummy addition to test configs is unrelated to the bugfix — split into separate PR if it's for CI resource optimization.
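
One way to meet the "doesn't rely on exact pixel matching" request is to compare sampled pixels against a reference within a tolerance. The sketch below is a hypothetical helper, not code from this PR: the names `mean_abs_diff` and `assert_pixels_close`, the pixel representation, and the default tolerance are all assumptions chosen for illustration.

```python
from typing import Sequence

# Assumption: a pixel is an (R, G, B) tuple of ints in [0, 255].
Pixel = tuple[int, int, int]


def mean_abs_diff(actual: Sequence[Pixel], reference: Sequence[Pixel]) -> float:
    """Mean absolute per-channel difference between two equally sized pixel lists."""
    if len(actual) != len(reference):
        raise ValueError("pixel lists must have the same length")
    if not actual:
        raise ValueError("pixel lists must not be empty")
    total = sum(
        abs(a - r)
        for pixel_a, pixel_r in zip(actual, reference)
        for a, r in zip(pixel_a, pixel_r)
    )
    return total / (3 * len(actual))


def assert_pixels_close(
    actual: Sequence[Pixel],
    reference: Sequence[Pixel],
    tolerance: float = 8.0,  # hypothetical threshold, to be tuned per test
) -> None:
    """Fail if the two pixel lists differ by more than `tolerance` on average."""
    diff = mean_abs_diff(actual, reference)
    assert diff <= tolerance, f"mean abs diff {diff:.2f} exceeds {tolerance}"
```

A check like this tolerates small numerical drift between runs while still failing when guidance is silently disabled and the output changes substantially.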

@hsliuustc0106
Collaborator

Blocker scan:

  • Correctness: PASS
  • Reliability/Safety: PASS
  • Breaking Changes: PASS (behavioral change is the bugfix)
  • Test Coverage: PASS (updated e2e tests)
  • Documentation: PASS
  • Security: PASS

OVERALL: NO BLOCKERS

VERDICT: REQUEST_CHANGES

The fix is reasonable (empty string for default negative prompt, early return if neg_prompt is empty), but a few issues:

  1. Test reference pixels: The PR updates REFERENCE_PIXELS values without documenting the change. Is this due to the CFG fix? What's the magnitude of the change? Provide before/after comparison or comment the reason.

  2. Multi-stage testing: PR title says "multi-stage cfg bug" but only offline_inference tests are updated. Consider adding a test for the actual multi-stage scenario (mooncake/sharedmemory configs) to ensure the fix works there.

  3. Checklist items: None are checked in the PR description. Please fill in the checklist (especially "The purpose of the PR" and "The test results").

  4. Documentation: The comment in pipeline_bagel.py ("original BAGEL uses an empty KV cache (0 tokens)") is helpful. Consider adding a similar comment in bagel.py explaining why the default negative prompt is empty (not empty tokens, but no prompt at all).

Minor: load_format: dummy change in CI configs is unrelated to the bugfix. Split into a separate PR if it's intentional.

@princepride
Collaborator Author

  1. Yes, the reference pixel values changed because of the CFG fix.
  2. The current tests already include the mooncake and shared-memory (shm) configs.
  3. Updated.

@hsliuustc0106 Could you help approve it?

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride force-pushed the fix-multi-stage-cfg-bug branch from ba46a24 to f1b230e on April 16, 2026 08:46
@princepride
Collaborator Author

@hsliuustc0106 conflict resolved

@hsliuustc0106 added the "ready" label (label to trigger buildkite CI) on Apr 16, 2026
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
lishunyang12 previously approved these changes Apr 16, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment


Review: [BugFix]: Fix multi-stage cfg bug

Verdict: Approve

The core logic change is correct and well-reasoned. For text2img, the original BAGEL model uses an empty KV cache (0 tokens) as the text-unconditional CFG branch. The previous code was sending <|im_start|><|im_end|> through the LLM stage to produce a KV cache for this branch, which was incorrect in the multi-stage setup. The fix correctly uses the default empty NaiveCache instead, preserving CFG guidance with the right unconditional baseline.
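
To see why the choice of unconditional baseline matters, here is a minimal sketch of the textbook classifier-free guidance combination. It is not BAGEL's actual guidance code, which may add renormalization and a separate image-guidance term; the function name `cfg_combine` and the toy vectors are assumptions for illustration only.

```python
def cfg_combine(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Textbook CFG: start at the unconditional prediction and move toward the conditional one."""
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]
empty_cache_uncond = [0.0, 0.0]  # toy stand-in for the prediction from an empty KV cache
other_uncond = [0.5, 0.5]        # toy stand-in for a prediction from a different baseline

print(cfg_combine(cond, empty_cache_uncond, 4.0))  # [4.0, 8.0]
print(cfg_combine(cond, other_uncond, 4.0))        # [2.5, 6.5]
print(cfg_combine(cond, empty_cache_uncond, 1.0))  # [1.0, 2.0], same as cond
```

With the same conditional prediction, a different baseline shifts the guided result, and a scale of 1.0 returns the conditional prediction unchanged, which is why forcing scales to 1.0 amounts to turning guidance off.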

What I verified

  1. _get_negative_prompt returning "": This correctly causes expand_cfg_prompts / expand_cfg_prompts_think to skip creating a companion request for text2img when no user-specified negative prompt exists. The empty KV cache (NaiveCache) in pipeline_bagel.py serves as the text-unconditional branch, matching reference BAGEL behavior.

  2. cfg_img_kv fallback change (if cfg_img_kv is None instead of if cfg_img_kv is None and cfg_text_kv is not None): This is correct. For text2img, cfg_img should always fall back to the gen KV cache (injected_kv) regardless of whether cfg_text_kv was received. The old guard incorrectly skipped this assignment when cfg_text_kv was also None.

  3. Removal of cfg_parallel_contract fallback: The old code disabled CFG entirely (scales=1.0) when no companion KV caches were received. This was wrong for text2img because text2img legitimately has no cfg_text companion -- its unconditional branch is an empty cache. The removal is correct.

  4. img2img path is unaffected: The early return if not neg_prompt: return [] is correctly placed only inside the "image" in modalities branch, not the "img2img" branch. For img2img, even an empty negative prompt is meaningful because the companion still carries the image data.

  5. Test updates: Reference pixel values changed, which is expected since the CFG is now actually being applied (previously it was silently disabled, producing degraded results).
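
The cache fallbacks described in items 2 and 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in pipeline_bagel.py: `FakeCache` is a hypothetical stand-in for NaiveCache, and `resolve_cfg_caches` is an invented helper name. Only the variable names `injected_kv`, `cfg_text_kv`, and `cfg_img_kv` come from the review.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeCache:
    """Hypothetical stand-in for a KV cache; `tokens` is the cached sequence."""
    tokens: list[str] = field(default_factory=list)


def resolve_cfg_caches(
    injected_kv: FakeCache,
    cfg_text_kv: Optional[FakeCache] = None,
    cfg_img_kv: Optional[FakeCache] = None,
) -> tuple[FakeCache, FakeCache]:
    """Pick the caches for the two CFG branches when companion caches may be missing."""
    # Text-unconditional branch: no companion means an empty cache (0 tokens),
    # rather than disabling CFG altogether.
    if cfg_text_kv is None:
        cfg_text_kv = FakeCache()
    # Image-unconditional branch: fall back to the gen cache whenever it is
    # missing, without also requiring that cfg_text_kv was received.
    if cfg_img_kv is None:
        cfg_img_kv = injected_kv
    return cfg_text_kv, cfg_img_kv


gen_kv = FakeCache(tokens=["a", "cat"])
text_kv, img_kv = resolve_cfg_caches(gen_kv)
print(text_kv.tokens)    # []
print(img_kv is gen_kv)  # True
```

In this sketch the text2img case with no companions still yields two usable branches, which matches the reviewer's point that a missing cfg_text companion is legitimate rather than a reason to disable guidance.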

Minor nit (non-blocking)

The docstring of _get_negative_prompt still reads:

"An empty string is treated the same as absent (falls through to the Bagel default token pair), because an empty negative prompt is not meaningful for CFG guidance."

This is now stale -- the function returns "" rather than a token pair. Consider updating the docstring to reflect the new behavior, e.g. "Returns an empty string when no negative prompt is configured, which signals the caller to skip creating a CFG text companion."
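
A sketch of how the updated docstring and the companion-skipping behavior could fit together is shown below. The function bodies are simplified stand-ins, not the code in bagel.py: treating `sampling_params` as a plain dict, the `negative_prompt` key, and the companion dict shape are all assumptions for illustration.

```python
def get_negative_prompt(sampling_params: dict) -> str:
    """Return the user's negative prompt, or "" when none is configured.

    An empty string signals the caller to skip creating a CFG text companion.
    """
    neg = sampling_params.get("negative_prompt")  # hypothetical key
    if neg:
        return neg
    return ""


def expand_cfg_prompts(modalities: list[str], sampling_params: dict) -> list[dict]:
    """Build companion requests for the text-unconditional CFG branch (simplified)."""
    neg_prompt = get_negative_prompt(sampling_params)
    if "image" in modalities:
        # text2img: with no negative prompt, the empty KV cache already serves
        # as the unconditional branch, so no companion request is needed.
        if not neg_prompt:
            return []
        return [{"role": "cfg_text", "prompt": neg_prompt}]
    if "img2img" in modalities:
        # img2img: an empty negative prompt is still meaningful because the
        # companion carries the image data.
        return [{"role": "cfg_text", "prompt": neg_prompt}]
    return []


print(expand_cfg_prompts(["image"], {}))    # []
print(expand_cfg_prompts(["img2img"], {}))  # [{'role': 'cfg_text', 'prompt': ''}]
```

Keeping the early return inside the text2img branch only is what leaves the img2img path unchanged.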

@lishunyang12 lishunyang12 dismissed their stale review April 16, 2026 14:55

Replacing with inline comments

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Comment thread vllm_omni/diffusion/models/bagel/pipeline_bagel.py Outdated
princepride and others added 2 commits April 17, 2026 20:13
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Fix indentation error in pipeline_bagel.py

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride princepride enabled auto-merge (squash) April 18, 2026 03:16
pass

if cfg_img_kv is None:
    cfg_img_kv = injected_kv
Collaborator


With the else, cfg_img_kv = injected_kv is dead — cfg_img_kv isn't read after this block. Was the intent to also populate cfg_img_context with injected_kv when it's None? The single-stage path populates cfg_img_context with the positive prompt KV via forward_cache_update_text, and for text2img multi-stage injected_kv is the equivalent. If leaving cfg_img_context empty here is intentional, drop the assignment.

@@ -300,4 +304,4 @@ def _get_negative_prompt(
if neg:
Collaborator


The docstring above (lines 293-296) is stale: it still says "falls through to the Bagel default token pair", but we now return "". Please update.

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride princepride merged commit 768931e into vllm-project:main Apr 18, 2026
8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>