[Bagel] Eliminate broadcast in CFG parallel denoising loop#1695

Merged
princepride merged 3 commits into vllm-project:main from nussejzz:optimize-cfg-parallel-comm
Mar 9, 2026
Conversation

Contributor

@nussejzz nussejzz commented Mar 6, 2026

Purpose

Optimize _generate_image_parallel in Bagel's CFG parallel denoising loop by eliminating all broadcast calls.

Before: every timestep (e.g. 50 steps) required an all_gather plus a broadcast. Only rank 0 performed the CFG combine and the x_t update, then broadcast the result; outside the CFG interval, only rank 0 computed while ranks 1 and 2 idled.

After:

  • CFG interval steps: All ranks perform all_gather → each rank independently runs _combine_cfg and updates x_t. No broadcast needed since all ranks have identical gathered tensors and _combine_cfg is a deterministic pure function.
  • Non-CFG interval steps: All ranks redundantly compute with gen branch inputs. Since update_past_key_values=False (KV cache is frozen/read-only during denoising), identical inputs produce identical v_t across all ranks. No communication needed.

Result: Broadcast reduced from N per loop (N = num_timesteps) to zero. Communication is now only all_gather during CFG interval steps, which is the theoretical minimum.

| Metric | Before | After |
| --- | --- | --- |
| broadcast per loop | N (e.g. 50) | 0 |
| all_gather per loop | CFG steps only (~30) | CFG steps only (~30) |
| Rank 1/2 utilization (non-CFG steps) | idle | computing (redundant) |
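The per-rank flow described above can be sketched as a single-process simulation (hypothetical helper names `combine_cfg` and `denoise_step`; the real logic lives in `bagel_transformer.py`). Because the combine is a deterministic pure function applied to identically gathered tensors, every rank's `x_t` stays bitwise identical without any broadcast:

```python
import numpy as np

def combine_cfg(v_cond, v_uncond, guidance_scale):
    # Deterministic pure function: classifier-free guidance combine.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def denoise_step(x_t, v_t, dt):
    # Euler update of the flow ODE.
    return x_t + dt * v_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)              # shared initial noise (broadcast once)
world = {r: x0.copy() for r in (0, 1)}   # each rank's local x_t

for step in range(3):
    # Each rank runs its own branch; an all_gather then makes both
    # velocities visible on every rank.
    v_cond = world[0] * 0.9              # stand-in for the cond-branch forward
    v_uncond = world[1] * 0.8            # stand-in for the uncond-branch forward
    gathered = [v_cond, v_uncond]        # what all_gather yields on every rank
    for r in world:
        # No broadcast: every rank applies the same pure combine to the
        # same gathered tensors, so x_t stays bitwise identical.
        v_t = combine_cfg(gathered[0], gathered[1], guidance_scale=3.0)
        world[r] = denoise_step(world[r], v_t, dt=0.1)

assert np.array_equal(world[0], world[1])
```

The simulation is only a correctness argument for the communication pattern; in the real code the gather is a `torch.distributed` collective and the branch forwards run on separate ranks.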

Test Plan

  • Existing unit tests for _combine_cfg cover correctness: pytest tests/diffusion/models/bagel/test_combine_cfg.py
  • No new tests needed: this is a communication-only optimization with no change to math logic. The _combine_cfg function, _forward_flow_single_branch inputs, and update_past_key_values=False semantics are all unchanged.
  • E2E validation with CFG parallel mode (cfg_parallel_size=2/3) to verify generated images are identical.
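The identical-image check could be scripted along these lines (a minimal sketch; `images_match` and the tolerance handling are illustrative, not part of the test suite):

```python
import numpy as np

def images_match(img_a, img_b, atol=0.0):
    # atol=0.0 demands bitwise equality; loosen slightly if floating-point
    # reduction order differs between single-rank and cfg_parallel runs.
    return bool(np.allclose(img_a, img_b, rtol=0.0, atol=atol))

# Stand-ins for decoded images from cfg_parallel_size=1 vs cfg_parallel_size=2 runs.
baseline = np.linspace(0.0, 1.0, 48).reshape(4, 4, 3)
candidate = baseline.copy()
assert images_match(baseline, candidate)
```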

Test Result

Pending E2E validation on multi-GPU setup.


@nussejzz nussejzz requested a review from hsliuustc0106 as a code owner March 6, 2026 02:50
@nussejzz nussejzz force-pushed the optimize-cfg-parallel-comm branch from 2b25a5d to fd9b534 on March 6, 2026 02:52
@nussejzz nussejzz marked this pull request as draft March 6, 2026 02:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b25a5dc69

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@nussejzz nussejzz marked this pull request as ready for review March 6, 2026 03:14
Contributor Author

nussejzz commented Mar 6, 2026

@princepride PTAL 😊
While preparing the project presentation, I discovered that the code could be simplified and the logic optimized. Although the non-associativity of GPU floating-point operations could in theory introduce minute numerical differences, these should be imperceptible in the denoised output.
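The non-associativity mentioned here is easy to demonstrate in a few lines (plain Python doubles; GPU reductions show the same effect at smaller magnitudes):

```python
# Floating-point addition is not associative: regrouping the same terms
# can change the result, which is why per-rank reduction order matters.
a, b, c = 1e16, -1e16, 1.0
assert (a + b) + c == 1.0
assert a + (b + c) == 0.0  # b + c rounds back to b, so the 1.0 is lost

# The discrepancy is tiny relative to typical activation magnitudes,
# which is why such differences are expected to wash out after denoising.
```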


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd9b5342ea


Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@nussejzz nussejzz force-pushed the optimize-cfg-parallel-comm branch from fd9b534 to ca8387d on March 6, 2026 03:21
Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-Authored-By: princepride <wangzhipeng628@gmail.com>
@nussejzz nussejzz force-pushed the optimize-cfg-parallel-comm branch from ca8387d to 347481c on March 6, 2026 03:24
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review

Rating: 9/10 | Verdict: ✅ Approved (pending E2E validation)

Summary

Excellent communication optimization for BAGEL CFG parallel mode, eliminating all broadcast calls (N → 0) while maintaining correctness through deterministic computation and frozen KV cache.

Multi-Category Review Coverage

Primary: [Perf] (vllm-omni-perf)

  • ✅ Communication reduction: broadcast N→0 per loop
  • ✅ Rank utilization: non-CFG steps now computing (vs idle)
  • ⚠️ Missing: actual benchmark data (latency improvement?)

Secondary: [Distributed] (implicit)

  • ✅ Communication pattern: only all_gather in CFG interval
  • ✅ Deterministic assumption: _combine_cfg is a pure function
  • ✅ KV cache invariant: update_past_key_values=False
  • ⚠️ Pending: E2E validation on multi-GPU

Key Optimizations

  1. Initial broadcast only: Ensures all ranks start with same x_t (noise)
  2. Deterministic _combine_cfg: All ranks independently compute identical result
  3. Redundant non-CFG computation: All ranks compute (no idle time) vs rank 0 only
  4. Frozen KV cache: No side effects, safe for redundant computation

Correctness Analysis

| Property | Status | Reason |
| --- | --- | --- |
| `_combine_cfg` deterministic | ✅ | Pure function: same inputs → same outputs |
| `update_past_key_values=False` | ✅ | KV cache frozen/read-only during denoising |
| Identical gen inputs | ✅ | All ranks use the same x_t after the initial broadcast |
| No side effects | ✅ | Redundant computation is safe |

Highlights

  • ✅ Clean mathematical reasoning (deterministic pure function)
  • ✅ Leverages frozen KV cache property
  • ✅ Minimal code change (33 lines, focused)
  • ✅ Existing unit tests cover _combine_cfg correctness

Concerns

  1. Missing benchmark data: PR description mentions optimization but doesn't provide actual latency measurements. How much faster is this in practice?

  2. Pending E2E validation: "Pending E2E validation on multi-GPU setup" - please provide results before merge.

  3. Redundant computation trade-off: Non-CFG steps now compute on all ranks (vs rank 0 only). This increases compute but reduces communication. Verify this is a net win.

Minor Suggestions (non-blocking)

  1. Add benchmark: Even a simple latency comparison (e.g., "50 steps, cfg_parallel_size=3, before/after") would help quantify the improvement.

  2. Comment clarity: Consider adding a comment explaining why redundant computation is safe: `# Safe because update_past_key_values=False (KV cache is frozen)`
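For the benchmark suggestion, even a bare wall-clock harness would do (sketch only: `step_fn` stands in for the model forward plus collectives, and a real GPU measurement would also need `torch.cuda.synchronize()` before each clock read):

```python
import time

def run_denoise_loop(num_steps, step_fn):
    # Minimal timing harness: wall-clock the denoising loop end to end.
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return time.perf_counter() - start

# Stand-in workload; replace with the actual per-step forward call and
# compare cfg_parallel runs before/after this PR at num_steps=50.
elapsed = run_denoise_loop(50, lambda: sum(range(1000)))
print(f"50 steps: {elapsed * 1e3:.2f} ms")
```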

Pitfalls Check

| Directory | Pitfall | Status |
| --- | --- | --- |
| diffusion/models/bagel/ | Communication pattern | ✅ Optimized |
| diffusion/models/bagel/ | Side effects | ✅ None (frozen KV) |

Recommendation

Approve with requirement: provide E2E validation results before merge. The optimization is sound, but empirical confirmation is needed.


Reviewed by OpenClaw with vllm-omni-skills 🦐

Multi-Category Review: Primary=vllm-omni-perf, Secondary=distributed patterns

Comment thread vllm_omni/diffusion/models/bagel/bagel_transformer.py
@princepride princepride enabled auto-merge (squash) March 6, 2026 16:37
@princepride princepride added the ready label to trigger buildkite CI label Mar 8, 2026
@princepride princepride merged commit fb717a4 into vllm-project:main Mar 9, 2026
6 of 7 checks passed
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
…ect#1695)

Signed-off-by: Ding Zuhao <e1583181@u.nus.edu>
Co-authored-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
