[Feature] Add CFG parallel to Omnigen2 by zzhuoxin1508 · Pull Request #2074 · vllm-project/vllm-omni

zzhuoxin1508 · 2026-03-22T14:12:21Z

Purpose

Add CFG parallel support to the OmniGen2 pipeline (cfg_parallel_size=2 for text-to-image, cfg_parallel_size=3 for image editing with a reference image).
Fall back to sequential CFG when image guidance is enabled but fewer than 3 CFG ranks are available (cfg_world_size < 3).
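
The fallback rule described above can be sketched as a small decision function. This is only an illustration: the name `select_cfg_mode` and its parameters are hypothetical and do not come from the PR. The rank thresholds (2 for text-to-image, 3 with image guidance) follow the description, and the `text_guidance_scale > 1.0` condition is taken from the review discussion further down.

```python
def select_cfg_mode(cfg_world_size: int,
                    text_guidance_scale: float,
                    use_image_guidance: bool) -> str:
    """Hypothetical helper: pick "parallel" or "sequential" CFG."""
    # Without text guidance there is a single branch, so nothing to split.
    if text_guidance_scale <= 1.0:
        return "sequential"
    # Text-only CFG has 2 branches (cond, uncond); image guidance adds a third.
    required_ranks = 3 if use_image_guidance else 2
    return "parallel" if cfg_world_size >= required_ranks else "sequential"


# Text-to-image on 2 ranks can run in parallel.
print(select_cfg_mode(2, 5.0, use_image_guidance=False))  # parallel
# Image editing on 2 ranks falls back to sequential CFG.
print(select_cfg_mode(2, 5.0, use_image_guidance=True))   # sequential
```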

Test Plan

text2image

python text_to_image.py \
  --model "OmniGen2/OmniGen2" \
  --prompt "A classroom with bright lighting and wooden desks." \
  --negative-prompt "(((deformed))), blurry, over saturation, bad anatomy, disfigured, poorly drawn face, mutation, mutated, (extra_limb), (ugly), (poorly drawn hands), fused fingers, messy drawing, broken legs censor, censored, censor_bar" \
  --num-inference-steps 50 \
  --seed 0 \
  --guidance-scale 5.0 \
  --cfg-parallel-size 2 \
  --output /workspace/outputs/image_t2icfg3.png

image2image

python image_edit.py \
  --image /workspace/image1.png \
  --model "OmniGen2/OmniGen2" \
  --prompt "Change the background to classroom." \
  --negative-prompt "(((deformed))), blurry, over saturation, bad anatomy, disfigured, poorly drawn face, mutation, mutated, (extra_limb), (ugly), (poorly drawn hands), fused fingers, messy drawing, broken legs censor, censored, censor_bar" \
  --num-inference-steps 50 \
  --seed 0 \
  --guidance-scale 5.0 \
  --guidance-scale-2 2.0 \
  --cfg-parallel-size 3 \
  --output /workspace/outputs/image_edit.png

Test Result

Text-to-Image (1024×1024)

cfg_parallel_size	GPUs	Generation Time	Speedup
1	1	54.4s	1.0x
2	2	27.3s	2.0x

Output images are visually identical across sequential and parallel modes.

Image Editing (input size 1696×2528)

cfg_parallel_size	GPUs	Generation Time	Speedup
1	1	90.2s	1.0x
3	3	36.6s	2.46x

Output images are visually identical across sequential and parallel modes.
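
As a quick check of the speedup columns, the figures can be recomputed from the reported generation times. The per-GPU efficiency (speedup divided by GPU count) is a derived metric added here for illustration; it is not reported in the PR, and the helper names are hypothetical.

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """Ratio of single-GPU time to multi-GPU time."""
    return sequential_s / parallel_s


def parallel_efficiency(sequential_s: float, parallel_s: float, num_gpus: int) -> float:
    """Speedup normalized by the number of GPUs (1.0 means ideal scaling)."""
    return speedup(sequential_s, parallel_s) / num_gpus


# (sequential seconds, parallel seconds, GPUs) copied from the tables above.
results = {
    "text-to-image": (54.4, 27.3, 2),
    "image editing": (90.2, 36.6, 3),
}

for name, (seq, par, gpus) in results.items():
    print(f"{name}: {speedup(seq, par):.2f}x speedup, "
          f"{parallel_efficiency(seq, par, gpus):.0%} per-GPU efficiency")
```

Under these numbers the two-GPU text-to-image run scales almost ideally, while the three-GPU editing run reaches roughly 82% per-GPU efficiency.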

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

zzhuoxin1508 · 2026-03-22T14:13:42Z

@nussejzz

zzhuoxin1508 · 2026-03-22T14:15:30Z

text2image output

image2image output

Co-authored-by: princepride <wangzhipeng628@gmail.com> Co-authored-by: Ding Zuhao <e1583181@u.nus.edu> Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

nussejzz

PTAL @zzhuoxin1508

nussejzz · 2026-03-22T14:39:15Z

+    ) -> torch.Tensor:
+        """CFG parallel denoising loop: each rank computes one CFG branch, returns latents.
+
+        Rank 0: cond branch  (prompt_embeds,          ref_latents)


Delete the spaces

lishunyang12

Left a few comments on the parallel path.

lishunyang12 · 2026-03-22T17:26:31Z

-                )
-            elif text_guidance_scale > 1.0:
-                model_pred_uncond = self.predict(
+                gathered = cfg_group.all_gather(local_pred, separate_tensors=True)


all_gather_into_tensor requires contiguous input. local_pred coming out of self.predict() may not be contiguous depending on the transformer output layout. Add local_pred = local_pred.contiguous() before the all-gather, same way you already do for latents at line 1291.

lishunyang12 · 2026-03-22T17:26:31Z

+
+        for i, t in enumerate(timesteps):
+            in_cfg_range = self.cfg_range[0] <= i / len(timesteps) <= self.cfg_range[1]
+            use_cfg_this_step = in_cfg_range and self.text_guidance_scale > 1.0


self.text_guidance_scale > 1.0 is always true inside _processing_parallel (the caller checks it in cfg_parallel_ready). This makes use_cfg_this_step equivalent to just in_cfg_range. Not a bug, but it's confusing — consider simplifying to use_cfg_this_step = in_cfg_range.
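
To make the step gating concrete, here is a minimal standalone sketch of the `in_cfg_range` test from the quoted diff. The helper name `cfg_active_steps` is hypothetical, and `cfg_range` is assumed to be a pair of fractions of the schedule, as the comparison in the diff suggests.

```python
def cfg_active_steps(num_steps: int, cfg_range: tuple[float, float]) -> list[bool]:
    """Hypothetical helper: for each step index, is CFG applied?"""
    lo, hi = cfg_range
    # Same comparison as the diff: lo <= i / len(timesteps) <= hi.
    return [lo <= i / num_steps <= hi for i in range(num_steps)]


# With 4 steps and a range of (0.0, 0.5), steps 0..2 use CFG and step 3 does not.
print(cfg_active_steps(4, (0.0, 0.5)))  # [True, True, True, False]
```

Because `i / num_steps` never reaches 1.0, a range of (0.0, 1.0) enables CFG on every step.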

lishunyang12 · 2026-03-22T17:26:31Z

-                model_pred_uncond = self.predict(
+                gathered = cfg_group.all_gather(local_pred, separate_tensors=True)
+                model_pred, model_pred_uncond = gathered[0], gathered[1]
+                if use_cfg_img_this_step and len(gathered) > 2:


Nit: len(gathered) > 2 is always true when use_cfg_img_this_step is true, since the caller guarantees cfg_world_size >= 3 whenever use_cfg_img. The double-check reads like there's a case where it could be false. Drop it or add a comment explaining it's a defensive check.
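
For reference, the two-branch combination applied after the gather can be sketched in plain Python. The rank order (index 0 is the cond branch, index 1 the uncond branch) and the formula follow the diff hunks quoted in this thread; the function name `combine_cfg` and the use of plain lists instead of tensors are assumptions made to keep the sketch self-contained. The three-branch image-guidance formula is not shown in the thread and is deliberately left out.

```python
def combine_cfg(gathered: list[list[float]], text_guidance_scale: float) -> list[float]:
    """Hypothetical helper: merge gathered cond/uncond predictions with text CFG."""
    # gathered[0] comes from rank 0 (cond), gathered[1] from rank 1 (uncond).
    cond, uncond = gathered[0], gathered[1]
    # uncond + scale * (cond - uncond), applied elementwise.
    return [u + text_guidance_scale * (c - u) for c, u in zip(cond, uncond)]


gathered = [[1.0, 2.0], [0.0, 1.0]]  # toy predictions, not real model output
print(combine_cfg(gathered, 5.0))  # [5.0, 6.0]
```

With a scale of 1.0 the result equals the cond prediction.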

lishunyang12 · 2026-03-22T17:26:31Z

+                    model_pred = model_pred_uncond + self.text_guidance_scale * (model_pred - model_pred_uncond)
+                latents = self.scheduler.step(model_pred, t, latents, return_dict=False)[0]
+            else:
+                # Outside CFG interval: all ranks use cond branch, no comm


Outside the CFG range every rank computes the same cond prediction independently — wasted FLOPs on ranks 1+. Consider having only rank 0 run predict and broadcasting model_pred, like you already do for the initial latents sync.

Thanks, but the data is the same on all cards. If we only ran it on rank 0, we'd just be adding an extra broadcast step, so it wouldn't really save any time.

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

zzhuoxin1508 · 2026-03-23T12:26:58Z

@princepride Could you please review this?

princepride

I noticed that this model also needs cfg_p=3. We currently have a PR that refactors cfg_p (#2063). Can you coordinate with its author and try to use our own implementation of cfg_p?

zzhuoxin1508 · 2026-03-23T12:36:17Z

Thanks, I'll look into it

TKONIY · 2026-03-23T14:49:00Z

There are similar problems in integrating the existing cfg_p (even after #2063) with OmniGen2 and DreamID-Omni: cfg_p currently supports at most 2 branches, while OmniGen2 has 3 branches and DreamID-Omni has 4.

Further refactoring is needed to support multi-branch (>2) diffusion models. It would be good to discuss a more general CFG parallel API and a plan to refactor DreamID-Omni and OmniGen2 onto it.

@zzhuoxin1508
@princepride
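
To illustrate what a more general multi-branch dispatch could look like, here is a round-robin assignment of N CFG branches to M ranks. This is purely a sketch for discussion: `assign_branches` is a hypothetical name, and it is not the design used in #2063 or in any later refactor.

```python
def assign_branches(num_branches: int, num_ranks: int) -> dict[int, list[int]]:
    """Hypothetical round-robin mapping: rank -> list of CFG branch indices."""
    assignment: dict[int, list[int]] = {rank: [] for rank in range(num_ranks)}
    for branch in range(num_branches):
        # Branch b goes to rank b mod M, so the load differs by at most one branch.
        assignment[branch % num_ranks].append(branch)
    return assignment


# 3 branches (as in OmniGen2 image editing) on 2 GPUs: rank 0 runs two branches.
print(assign_branches(3, 2))  # {0: [0, 2], 1: [1]}
# 4 branches (as in DreamID-Omni) on 2 GPUs: two branches per rank.
print(assign_branches(4, 2))  # {0: [0, 2], 1: [1, 3]}
```

With uneven splits such as 3 branches on 2 ranks, the step time is bounded by the rank that runs two branches, so the expected speedup would be below 2x.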

wtomin · 2026-03-30T06:32:42Z

A recent PR changed the structure of the diffusion features docs. Please take a look at #1928.

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

wtomin · 2026-04-02T07:38:22Z

Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2 (excluding --cfg-parallel-size=3 for now, because our nightly CI runs on two-GPU devices). For the L4 test, please refer to #1832.

Documentation incomplete:

Feature support table not updated for OmniGen2 CFG parallel
Usage example missing for --cfg-parallel-size=2/3 in examples/offline_inference/image_to_image/image_to_image.md and examples/online_serving/image_to_image/image_to_image.md

Can you also report the peak VRAM usage in your PR body?

zzhuoxin1508 · 2026-04-02T07:46:23Z

Thanks for the review!
I'm currently working on a CFG parallel refactor in PR #2423 (N-branch CFG dispatch). I'll rebase this PR on top of that once it's merged, then add the test, update the documentation and feature support table, and report peak VRAM usage in the PR @wtomin

zzhuoxin1508 · 2026-04-14T12:12:48Z

done in #2423

feat: add CFG parallelism to OmniGen2 pipeline

63e520e

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

zzhuoxin1508 marked this pull request as ready for review March 22, 2026 14:13

zzhuoxin1508 requested a review from hsliuustc0106 as a code owner March 22, 2026 14:13

feat: add CFG parallelism to OmniGen2 pipeline

76cc9c7

Co-authored-by: princepride <wangzhipeng628@gmail.com> Co-authored-by: Ding Zuhao <e1583181@u.nus.edu> Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

zzhuoxin1508 force-pushed the omnigen2-CFGparallel branch from 23c4ec5 to 76cc9c7 Compare March 22, 2026 14:23

nussejzz suggested changes Mar 22, 2026

lishunyang12 reviewed Mar 22, 2026

docs: add OmniGen2 CFG parallel to parallelism acceleration guide

de46e33

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

zzhuoxin1508 force-pushed the omnigen2-CFGparallel branch from 1784777 to de46e33 Compare March 23, 2026 10:41

Merge branch 'main' into omnigen2-CFGparallel

ab30b94

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

princepride requested changes Mar 23, 2026

wtomin mentioned this pull request Mar 23, 2026

[RFC]: Continuous Diffusion Model Acceleration Support #1217

Open

1 task

zzhuoxin1508 added 2 commits March 31, 2026 15:10

Delete docs/user_guide/diffusion/parallelism_acceleration.md

a45ead1

Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>

Merge branch 'vllm-project:main' into omnigen2-CFGparallel

9581777

wtomin mentioned this pull request Apr 1, 2026

[Test] Add OmniGen2 online serving expansion L4 tests for Ulysses-SP and CFG-Parallel #2326

Open

5 tasks

zzhuoxin1508 mentioned this pull request Apr 7, 2026

[Refactor] Extend CFG Parallel to support 3 or 4 branch dispatch across M GPUs #2423

Merged

5 tasks

zzhuoxin1508 closed this Apr 14, 2026
