[Bagel] Eliminate broadcast in CFG parallel denoising loop#1695

Merged
princepride merged 3 commits into vllm-project:main from nussejzz:optimize-cfg-parallel-comm
Mar 9, 2026
Conversation

Contributor

@nussejzz nussejzz commented Mar 6, 2026

Purpose

Optimize _generate_image_parallel in Bagel's CFG parallel denoising loop by eliminating all broadcast calls.

Before: every timestep (e.g. 50 steps) required an all_gather plus a broadcast. Only rank 0 performed the CFG combine and the x_t update, then broadcast the result; outside the CFG interval, only rank 0 computed while ranks 1 and 2 idled.

After:

  • CFG interval steps: All ranks perform all_gather → each rank independently runs _combine_cfg and updates x_t. No broadcast needed since all ranks have identical gathered tensors and _combine_cfg is a deterministic pure function.
  • Non-CFG interval steps: All ranks redundantly compute with gen branch inputs. Since update_past_key_values=False (KV cache is frozen/read-only during denoising), identical inputs produce identical v_t across all ranks. No communication needed.

Result: Broadcast reduced from N per loop (N = num_timesteps) to zero. Communication is now only all_gather during CFG interval steps, which is the theoretical minimum.

| Metric | Before | After |
| --- | --- | --- |
| broadcast per loop | N (e.g. 50) | 0 |
| all_gather per loop | CFG steps only (~30) | CFG steps only (~30) |
| Rank 1/2 utilization (non-CFG steps) | idle | computing (redundant) |
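The per-rank flow described above can be sketched as a single-process simulation (hypothetical helper names `combine_cfg` and `denoise_step`; the real logic lives in `bagel_transformer.py`). Because the combine is a deterministic pure function applied to identically gathered tensors, every rank's `x_t` stays bitwise identical without any broadcast:

```python
import numpy as np

def combine_cfg(v_cond, v_uncond, guidance_scale):
    # Deterministic pure function: classifier-free guidance combine.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def denoise_step(x_t, v_t, dt):
    # Euler update of the flow ODE.
    return x_t + dt * v_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)              # shared initial noise (broadcast once)
world = {r: x0.copy() for r in (0, 1)}   # each rank's local x_t

for step in range(3):
    # Each rank runs its own branch; an all_gather then makes both
    # velocities visible on every rank.
    v_cond = world[0] * 0.9              # stand-in for the cond-branch forward
    v_uncond = world[1] * 0.8            # stand-in for the uncond-branch forward
    gathered = [v_cond, v_uncond]        # what all_gather yields on every rank
    for r in world:
        # No broadcast: every rank applies the same pure combine to the
        # same gathered tensors, so x_t stays bitwise identical.
        v_t = combine_cfg(gathered[0], gathered[1], guidance_scale=3.0)
        world[r] = denoise_step(world[r], v_t, dt=0.1)

assert np.array_equal(world[0], world[1])
```

The simulation is only a correctness argument for the communication pattern; in the real code the gather is a `torch.distributed` collective and the branch forwards run on separate ranks.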

Test Plan

  • Existing unit tests for _combine_cfg cover correctness: pytest tests/diffusion/models/bagel/test_combine_cfg.py
  • No new tests needed: this is a communication-only optimization with no change to math logic. The _combine_cfg function, _forward_flow_single_branch inputs, and update_past_key_values=False semantics are all unchanged.
  • E2E validation with CFG parallel mode (cfg_parallel_size=2/3) to verify generated images are identical.
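The identical-image check could be scripted along these lines (a minimal sketch; `images_match` and the tolerance handling are illustrative, not part of the test suite):

```python
import numpy as np

def images_match(img_a, img_b, atol=0.0):
    # atol=0.0 demands bitwise equality; loosen slightly if floating-point
    # reduction order differs between single-rank and cfg_parallel runs.
    return bool(np.allclose(img_a, img_b, rtol=0.0, atol=atol))

# Stand-ins for decoded images from cfg_parallel_size=1 vs cfg_parallel_size=2 runs.
baseline = np.linspace(0.0, 1.0, 48).reshape(4, 4, 3)
candidate = baseline.copy()
assert images_match(baseline, candidate)
```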

Test Result

Pending E2E validation on multi-GPU setup.


@nussejzz nussejzz requested a review from hsliuustc0106 as a code owner March 6, 2026 02:50
@nussejzz nussejzz force-pushed the optimize-cfg-parallel-comm branch from 2b25a5d to fd9b534 on March 6, 2026 02:52
@nussejzz nussejzz marked this pull request as draft March 6, 2026 02:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b25a5dc69

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@nussejzz nussejzz marked this pull request as ready for review March 6, 2026 03:14
Contributor Author

nussejzz commented Mar 6, 2026

@princepride PTAL 😊
While preparing the project presentation, I discovered that the code could be simplified and the logic optimized. Although the non-associativity of GPU floating-point operations could in theory introduce minute numerical differences, these should be imperceptible in the denoised output.
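The non-associativity mentioned here is easy to demonstrate in a few lines (plain Python doubles; GPU reductions show the same effect at smaller magnitudes):

```python
# Floating-point addition is not associative: regrouping the same terms
# can change the result, which is why per-rank reduction order matters.
a, b, c = 1e16, -1e16, 1.0
assert (a + b) + c == 1.0
assert a + (b + c) == 0.0  # b + c rounds back to b, so the 1.0 is lost

# The discrepancy is tiny relative to typical activation magnitudes,
# which is why such differences are expected to wash out after denoising.
```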


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd9b5342ea


Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@nussejzz nussejzz force-pushed the optimize-cfg-parallel-comm branch from fd9b534 to ca8387d on March 6, 2026 03:21
Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-Authored-By: princepride <wangzhipeng628@gmail.com>
@nussejzz nussejzz force-pushed the optimize-cfg-parallel-comm branch from ca8387d to 347481c on March 6, 2026 03:24
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review

Rating: 9/10 | Verdict: ✅ Approved (pending E2E validation)

Summary

Excellent communication optimization for BAGEL CFG parallel mode, eliminating all broadcast calls (N → 0) while maintaining correctness through deterministic computation and frozen KV cache.

Multi-Category Review Coverage

Primary: [Perf] (vllm-omni-perf)

  • ✅ Communication reduction: broadcast N→0 per loop
  • ✅ Rank utilization: non-CFG steps now computing (vs idle)
  • ⚠️ Missing: actual benchmark data (latency improvement?)

Secondary: [Distributed] (implicit)

  • ✅ Communication pattern: only all_gather in CFG interval
  • ✅ Deterministic assumption: _combine_cfg is a pure function
  • ✅ KV cache invariant: update_past_key_values=False
  • ⚠️ Pending: E2E validation on multi-GPU

Key Optimizations

  1. Initial broadcast only: Ensures all ranks start with same x_t (noise)
  2. Deterministic _combine_cfg: All ranks independently compute identical result
  3. Redundant non-CFG computation: All ranks compute (no idle time) vs rank 0 only
  4. Frozen KV cache: No side effects, safe for redundant computation

Correctness Analysis

| Property | Status | Reason |
| --- | --- | --- |
| `_combine_cfg` deterministic | ✅ | Pure function: same inputs → same outputs |
| `update_past_key_values=False` | ✅ | KV cache frozen/read-only during denoising |
| Identical gen inputs | ✅ | All ranks use the same x_t after the initial broadcast |
| No side effects | ✅ | Redundant computation is safe |

Highlights

  • ✅ Clean mathematical reasoning (deterministic pure function)
  • ✅ Leverages frozen KV cache property
  • ✅ Minimal code change (33 lines, focused)
  • ✅ Existing unit tests cover _combine_cfg correctness

Concerns

  1. Missing benchmark data: PR description mentions optimization but doesn't provide actual latency measurements. How much faster is this in practice?

  2. Pending E2E validation: "Pending E2E validation on multi-GPU setup" - please provide results before merge.

  3. Redundant computation trade-off: Non-CFG steps now compute on all ranks (vs rank 0 only). This increases compute but reduces communication. Verify this is a net win.

Minor Suggestions (non-blocking)

  1. Add benchmark: Even a simple latency comparison (e.g., "50 steps, cfg_parallel_size=3, before/after") would help quantify the improvement.

  2. Comment clarity: Consider adding a comment explaining why redundant computation is safe: `# Safe because update_past_key_values=False (KV cache is frozen)`
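For the benchmark suggestion, even a bare wall-clock harness would do (sketch only: `step_fn` stands in for the model forward plus collectives, and a real GPU measurement would also need `torch.cuda.synchronize()` before each clock read):

```python
import time

def run_denoise_loop(num_steps, step_fn):
    # Minimal timing harness: wall-clock the denoising loop end to end.
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return time.perf_counter() - start

# Stand-in workload; replace with the actual per-step forward call and
# compare cfg_parallel runs before/after this PR at num_steps=50.
elapsed = run_denoise_loop(50, lambda: sum(range(1000)))
print(f"50 steps: {elapsed * 1e3:.2f} ms")
```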

Pitfalls Check

| Directory | Pitfall | Status |
| --- | --- | --- |
| diffusion/models/bagel/ | Communication pattern | ✅ Optimized |
| diffusion/models/bagel/ | Side effects | ✅ None (frozen KV) |

Recommendation

Approve with requirement: provide E2E validation results before merge. The optimization is sound, but empirical confirmation is needed.


Reviewed by OpenClaw with vllm-omni-skills 🦐

Multi-Category Review: Primary=vllm-omni-perf, Secondary=distributed patterns

Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@princepride princepride enabled auto-merge (squash) March 6, 2026 16:37
@princepride princepride added the ready label to trigger buildkite CI label Mar 8, 2026
@princepride princepride merged commit fb717a4 into vllm-project:main Mar 9, 2026
6 of 7 checks passed
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
…ect#1695)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
