[Bagel] Eliminate broadcast in CFG parallel denoising loop #1695
princepride merged 3 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b25a5dc69
@princepride PTAL😊
💡 Codex Review
Reviewed commit: fd9b5342ea
Signed-off-by: Ding Zuhao <e1583181@u.nus.edu> Co-Authored-By: princepride <wangzhipeng628@gmail.com>
hsliuustc0106 left a comment
Review
Rating: 9/10 | Verdict: ✅ Approved (pending E2E validation)
Summary
Excellent communication optimization for BAGEL CFG parallel mode, eliminating all broadcast calls (N → 0) while maintaining correctness through deterministic computation and frozen KV cache.
Multi-Category Review Coverage
Primary: [Perf] (vllm-omni-perf)
- ✅ Communication reduction: broadcast N→0 per loop
- ✅ Rank utilization: non-CFG steps now computing (vs idle)
- ⚠️ Missing: actual benchmark data (latency improvement?)
Secondary: [Distributed] (implicit)
- ✅ Communication pattern: only all_gather in CFG interval
- ✅ Deterministic assumption: `_combine_cfg` is a pure function
- ✅ KV cache invariant: `update_past_key_values=False`
- ⚠️ Pending: E2E validation on multi-GPU
Key Optimizations
- Initial broadcast only: Ensures all ranks start with the same `x_t` (noise)
- Deterministic `_combine_cfg`: All ranks independently compute an identical result
- Redundant non-CFG computation: All ranks compute (no idle time) vs rank 0 only
- Frozen KV cache: No side effects, safe for redundant computation
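The determinism argument can be made concrete with a small sketch. The combine rule below is the standard classifier-free guidance formula and is an assumption; the real `_combine_cfg` signature and branch set may differ.

```python
def combine_cfg(v_cond, v_uncond, cfg_scale):
    # Standard CFG combine (assumed form; the real _combine_cfg
    # signature may differ): pure, stateless, no randomness.
    return v_uncond + cfg_scale * (v_cond - v_uncond)

# After all_gather, every rank holds the same (v_cond, v_uncond) pair,
# so every rank computes a bit-identical result; no broadcast needed.
out_rank0 = combine_cfg(0.75, 0.25, 4.0)
out_rank1 = combine_cfg(0.75, 0.25, 4.0)
assert out_rank0 == out_rank1 == 2.25  # 0.25 + 4.0 * 0.5
```

Because the function is pure and the gathered inputs are identical on every rank, broadcasting its output is redundant by construction.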
Correctness Analysis
| Property | Status | Reason |
|---|---|---|
| `_combine_cfg` deterministic | ✅ | Pure function, same inputs → same outputs |
| `update_past_key_values=False` | ✅ | KV cache frozen/read-only during denoising |
| Identical gen inputs | ✅ | All ranks use same `x_t` after initial broadcast |
| No side effects | ✅ | Redundant computation safe |
Highlights
- ✅ Clean mathematical reasoning (deterministic pure function)
- ✅ Leverages frozen KV cache property
- ✅ Minimal code change (33 lines, focused)
- ✅ Existing unit tests cover `_combine_cfg` correctness
Concerns
- Missing benchmark data: The PR description mentions the optimization but doesn't provide actual latency measurements. How much faster is this in practice?
- Pending E2E validation: "Pending E2E validation on multi-GPU setup"; please provide results before merge.
- Redundant computation trade-off: Non-CFG steps now compute on all ranks (vs rank 0 only). This increases compute but reduces communication. Verify this is a net win.
Minor Suggestions (non-blocking)
- Add benchmark: Even a simple latency comparison (e.g., "50 steps, cfg_parallel_size=3, before/after") would help quantify the improvement.
- Comment clarity: Consider adding a comment explaining why redundant computation is safe:
  `# Safe because update_past_key_values=False (KV cache is frozen)`
Pitfalls Check
| Directory | Pitfall | Status |
|---|---|---|
| diffusion/models/bagel/ | Communication pattern | ✅ Optimized |
| diffusion/models/bagel/ | Side effects | ✅ None (frozen KV) |
Recommendation
Approve with requirement: provide E2E validation results before merge. The optimization is sound, but empirical confirmation is needed.
Reviewed by OpenClaw with vllm-omni-skills 🦐
Multi-Category Review: Primary=vllm-omni-perf, Secondary=distributed patterns
…ect#1695) Signed-off-by: Ding Zuhao <e1583181@u.nus.edu> Co-authored-by: princepride <wangzhipeng628@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com>
Purpose
Optimize `_generate_image_parallel` in Bagel's CFG parallel denoising loop by eliminating all `broadcast` calls.

Before: Every timestep (e.g. 50 steps) required an `all_gather` + `broadcast`. Only rank 0 performed the CFG combine and `x_t` update, then broadcast the result. Outside the CFG interval, only rank 0 computed while ranks 1/2 idled.

After: `all_gather` → each rank independently runs `_combine_cfg` and updates `x_t`. No broadcast is needed since all ranks hold identical `gathered` tensors and `_combine_cfg` is a deterministic pure function. With `update_past_key_values=False` (the KV cache is frozen/read-only during denoising), identical inputs produce identical `v_t` across all ranks, so no communication is needed for the non-CFG steps either.

Result: Broadcasts reduced from N per loop (N = num_timesteps) to zero. Communication is now only `all_gather` during CFG interval steps, which is the theoretical minimum.

Test Plan

Existing unit tests for `_combine_cfg` cover correctness: `pytest tests/diffusion/models/bagel/test_combine_cfg.py`. The `_combine_cfg` function, `_forward_flow_single_branch` inputs, and the `update_past_key_values=False` semantics are all unchanged.

Test Result
Pending E2E validation on multi-GPU setup.
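Ahead of the multi-GPU run, the no-broadcast invariant can be sanity-checked with a pure-Python simulation. This is a sketch only: the per-branch velocity, the three-rank layout, and the combine rule are simplified stand-ins for Bagel's real `_forward_flow_single_branch` / `_combine_cfg`.

```python
# Simulate the "all_gather only" CFG denoising loop on 3 ranks.
NUM_RANKS = 3   # e.g. cond / uncond branches, one per rank (assumption)
NUM_STEPS = 5
CFG_SCALE = 4.0
DT = 0.1

def branch_velocity(x_t, rank):
    # Deterministic stand-in for a branch forward pass; with a frozen
    # KV cache, no hidden state mutates, so this stays a pure function.
    return (rank + 1) * 0.01 * x_t - 0.02

def combine_cfg(gathered):
    # Pure function of the gathered branch outputs (identical on every
    # rank after all_gather): standard CFG combine, simplified.
    cond, uncond = gathered[0], gathered[1]
    return uncond + CFG_SCALE * (cond - uncond)

def run():
    x = [1.0] * NUM_RANKS   # initial broadcast: all ranks share x_t
    broadcasts = 0          # never incremented inside the loop
    for _ in range(NUM_STEPS):
        # Each rank computes its own branch locally.
        local = [branch_velocity(x[r], r) for r in range(NUM_RANKS)]
        # all_gather: every rank now holds the identical list.
        gathered = list(local)
        # Every rank runs the same pure combine and x_t update.
        x = [x[r] + DT * combine_cfg(gathered) for r in range(NUM_RANKS)]
    return x, broadcasts

x_final, broadcasts = run()
# All ranks end with bit-identical x_t, with zero broadcasts in the loop.
assert all(v == x_final[0] for v in x_final)
assert broadcasts == 0
```

The E2E check on real hardware should confirm the same property: identical `x_t` across ranks at every step with only `all_gather` traffic.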