[Diffusion] Refactor CFG parallel for extensibility and performance #2063
princepride merged 6 commits into vllm-project:main
Conversation
I agree with the philosophy of this refactoring PR, as it brings better extensibility. However, I am afraid some diffusion schedulers are not deterministic, for example the DDPM solver. Although most diffusion models in vLLM-Omni use deterministic schedulers, such as flow matching, I suggest handling the randomness of such schedulers explicitly.
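To illustrate the concern, here is a minimal stand-in sketch (not vLLM-Omni code; the `scheduler_step` function and its noise model are hypothetical): once the broadcast is removed, each CFG-parallel rank runs the scheduler step locally, and a stochastic scheduler injects fresh noise every step, so both ranks must draw from identically seeded generators to stay in sync.

```python
import random

def scheduler_step(pred: float, sample: float, generator: random.Random) -> float:
    # Stand-in for a stochastic scheduler update (e.g. DDPM):
    # deterministic part + injected noise drawn from the generator.
    return sample - 0.5 * pred + generator.gauss(0.0, 1.0)

# Each CFG-parallel rank seeds its own generator with the same seed.
rank0_gen = random.Random(1234)
rank1_gen = random.Random(1234)

out0 = scheduler_step(0.5, 1.0, rank0_gen)
out1 = scheduler_step(0.5, 1.0, rank1_gen)
assert out0 == out1  # identical seeds -> identical noise -> ranks stay in sync
```

Without a shared seed, each rank's injected noise differs and the denoising trajectories diverge, which is exactly why an explicit generator parameter is needed here.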
Bagel, GLM-Image, dreamid-omni, nextstep1.1, and LTX-2 do not inherit from CFGParallelMixin.
Thanks for the thorough review! Here's what's been addressed:

1. Non-deterministic scheduler (generator support): done.
2 & 3. Doc updates: done.
| Model | Inherits Mixin? | Affected? |
|---|---|---|
| Bagel | No | No — fully independent CFG impl |
| GLM-Image | No | No — uses parallel_state directly |
| NextStep 1.1 | No | No — uses parallel_state directly |
| DreamID-Omni | Yes | No — only calls combine_cfg_noise() with plain tensors; _wrap/_unwrap preserves backward compat |
| LTX-2 | Yes | No — same as above; uses its own predict_noise_av_maybe_with_cfg |
DreamID-Omni and LTX-2 inherit CFGParallelMixin but only use combine_cfg_noise(), which now handles both plain Tensor and tuple[Tensor, ...] transparently. Verified with standalone unit tests covering both input types.
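The transparent handling of both plain tensors and tuples can be sketched in plain Python (an illustrative sketch only: the helper names mirror the `_wrap`/`_unwrap` described above, and floats stand in for tensors):

```python
def _wrap(x):
    # Normalize a single output into a 1-tuple so tuple and non-tuple
    # inputs can be processed uniformly.
    return x if isinstance(x, tuple) else (x,)

def _unwrap(xs, was_tuple: bool):
    # Restore the caller's original shape: tuple in -> tuple out,
    # plain value in -> plain value out.
    return xs if was_tuple else xs[0]

def combine_cfg_noise(pos, neg, scale: float):
    was_tuple = isinstance(pos, tuple)
    pos_t, neg_t = _wrap(pos), _wrap(neg)
    # Classic CFG formula applied per element: neg + scale * (pos - neg).
    out = tuple(n + scale * (p - n) for p, n in zip(pos_t, neg_t))
    return _unwrap(out, was_tuple)

print(combine_cfg_noise(2.0, 1.0, 3.0))                # 4.0
print(combine_cfg_noise((2.0, 0.0), (1.0, 0.0), 3.0))  # (4.0, 0.0)
```

This is why existing callers that pass plain tensors are unaffected: the wrap/unwrap round-trip is invisible to them.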
btw, I have an ongoing draft PR refactoring LTX-2 on top of the new cfg_parallel.
Key changes to CFGParallelMixin:

1. Remove redundant broadcast in scheduler_step_maybe_with_cfg: after all_gather, all ranks already have identical positive and negative predictions. CFG combine and scheduler step are deterministic, so all ranks compute locally with identical results; the broadcast in scheduler_step_maybe_with_cfg is eliminated.
2. All ranks return valid results from predict_noise_maybe_with_cfg: previously rank 0 returned the combined prediction and rank 1 returned None. Now all ranks compute the combine locally and return valid tensors, which simplifies downstream code.
3. Tuple output support via combine_cfg_noise override: predict_noise() can now return tuple[Tensor, ...] for multi-output models. The mixin normalizes internally via _wrap/_unwrap helpers. combine_cfg_noise() accepts tuples and applies CFG per element; multi-output models override combine_cfg_noise() for custom logic (e.g., CFG on video, positive-only on action).

Backward compatible: all existing callers use default parameters and see no behavior change. Bit-identical outputs verified on Qwen-Image-2512 (25B).

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
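The override hook in change 3 can be sketched as follows (a minimal illustration with floats standing in for tensors; `WorldModelCFG` and `cfg` are hypothetical names, not vLLM-Omni code):

```python
def cfg(pos, neg, scale):
    # Default CFG formula: neg + scale * (pos - neg).
    return neg + scale * (pos - neg)

class WorldModelCFG:
    def combine_cfg_noise(self, pos, neg, scale):
        # Multi-output model: (video_pred, action_pred) tuples.
        video_pos, action_pos = pos
        video_neg, _ = neg
        video = cfg(video_pos, video_neg, scale)  # CFG on the video head
        action = action_pos                       # positive-only on the action head
        return (video, action)

out = WorldModelCFG().combine_cfg_noise((2.0, 5.0), (1.0, 3.0), 3.0)
print(out)  # (4.0, 5.0)
```

Only the combine step is customized; the mixin's gather and scheduler-step machinery is reused unchanged.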
…uper() calls

Fixes the documented pattern where super().combine_cfg_noise((tensor,), ...) would return a plain tensor via _unwrap, causing tuple unpacking to fail. Updated examples pass plain tensors to super() and receive plain tensors back.

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
…ic scheduler support

Address review feedback from wtomin: after removing the broadcast in scheduler_step_maybe_with_cfg, non-deterministic schedulers (e.g., DDPM) would diverge across CFG-parallel ranks. Add a generator parameter to scheduler_step() and scheduler_step_maybe_with_cfg() so callers can pass a seeded generator to keep both ranks in sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
…rministic scheduler note

Address review feedback: update VideoAudioScheduler.step() signature to include a generator parameter, pass it through to sub-schedulers, and add a note about using explicit generators for non-deterministic schedulers (e.g., DDPM) in CFG parallel mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
- Remove the outer _unwrap() around combine_cfg_noise() calls in both the CFG-parallel and sequential paths. combine_cfg_noise() already unwraps internally, so the double _unwrap stripped the batch dimension when batch_size == 1 for single-output models.
- Only pass the generator kwarg to sched.step() when generator is not None, avoiding a TypeError for schedulers without that parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
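The kwarg-guarding fix can be sketched like this (illustrative names only, not the actual vLLM-Omni code): the generator is forwarded to the scheduler's step() only when it is set, so schedulers whose step() has no such parameter never see it.

```python
import random

class DeterministicScheduler:
    def step(self, pred, sample):             # no generator parameter
        return sample - 0.5 * pred

class StochasticScheduler:
    def step(self, pred, sample, generator):  # requires a generator
        return sample - 0.5 * pred + generator.random()

def scheduler_step(sched, pred, sample, generator=None):
    kwargs = {}
    if generator is not None:
        kwargs["generator"] = generator       # forward only when provided
    return sched.step(pred, sample, **kwargs)

print(scheduler_step(DeterministicScheduler(), 1.0, 2.0))  # 1.5
print(scheduler_step(StochasticScheduler(), 1.0, 2.0, random.Random(0)))
```

Passing `generator=None` unconditionally would raise a TypeError for the deterministic scheduler above, which is exactly the failure mode this change avoids.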
…llm-project#2063)

Signed-off-by: Yangshen Deng <yangshen.d@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Refactor CFGParallelMixin to improve performance, reduce code complexity, and enable extensibility for multi-output models (e.g., world models producing both video and action predictions).

Problem
The current CFGParallelMixin has three issues:

1. Redundant communication: every denoising step does all_gather + broadcast. After all_gather, both ranks already have identical positive and negative predictions; the broadcast in scheduler_step_maybe_with_cfg is unnecessary.
2. Code complexity: predict_noise_maybe_with_cfg returns a valid result only on rank 0 and None on rank 1, and scheduler_step_maybe_with_cfg has a rank-0-only step plus broadcast. All downstream code must handle None/rank branching.
3. Limited extensibility: predict_noise_maybe_with_cfg assumes a single tensor output. Models with multiple outputs (e.g., the DreamZero world model returning (video_pred, action_pred)) cannot use the mixin without overriding the entire method.

Changes
Change 1: Remove redundant broadcast
- After all_gather, all ranks compute the CFG combine + scheduler step locally
- Removes one broadcast per denoising step
- predict_noise_maybe_with_cfg returns a valid tensor on all ranks (not just rank 0)
- scheduler_step_maybe_with_cfg simplified: just calls scheduler_step on all ranks

Change 2: Tuple output support via combine_cfg_noise override

- predict_noise() can now return tuple[Tensor, ...] for multi-output models
- The mixin normalizes internally via _wrap/_unwrap helpers and processes both cases uniformly
- combine_cfg_noise() accepts tuples and applies the CFG formula per element by default
- Multi-output models can override combine_cfg_noise() for custom per-element logic

Correctness Argument
After all_gather, both ranks have bit-identical copies of the positive/negative predictions. The CFG combine and scheduler step are deterministic (pure arithmetic, no randomness), so the same inputs and the same ops yield the same results on both ranks.

Benchmark
Setup: cb6d012d (latest origin/main)

Performance
No performance regression. Broadcast cost is negligible on this hardware/model (PCIe Gen5, ~0.1ms vs ~250ms transformer forward per step). The value of this PR is code simplification and extensibility, not raw speed.
Correctness
Outputs compared between commits 972a8c57 and c218a5aa: bit-identical outputs confirm the refactor produces exactly the same results.
Test plan
tests/diffusion/distributed/test_cfg_parallel.py now verifies that all CFG ranks receive the same combined output and match the sequential CFG baseline

Next step