[Feature] Add CFG parallel to Omnigen2 #2074

Closed
zzhuoxin1508 wants to merge 6 commits into
vllm-project:mainfrom
zzhuoxin1508:omnigen2-CFGparallel

Conversation

@zzhuoxin1508
Contributor

Purpose

  • Add CFG parallel support to OmniGen2 pipeline (cfg_parallel_size=2 for text-to-image, cfg_parallel_size=3 for image editing with ref image)
  • Fall back to sequential CFG when image guidance is enabled without sufficient ranks (cfg_world_size < 3)
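The rank requirements above can be sketched as a small selection function. This is a minimal illustration rather than the PR's implementation: `select_cfg_mode` and its return labels are hypothetical names, and treating surplus ranks as sufficient is an assumption. Only `cfg_world_size`, the 2-branch and 3-branch counts, and the sequential fallback come from the description.

```python
def select_cfg_mode(cfg_world_size: int, use_text_cfg: bool, use_img_cfg: bool) -> str:
    """Pick how CFG branches are evaluated (hypothetical helper, labels are illustrative)."""
    if not use_text_cfg:
        return "no_cfg"
    # Text-to-image needs cond + uncond (2 branches);
    # image guidance with a reference image adds a third branch.
    branches_needed = 3 if use_img_cfg else 2
    if cfg_world_size >= branches_needed:
        return "parallel"
    # Not enough ranks for one branch per rank: run the branches one after another.
    return "sequential"


print(select_cfg_mode(cfg_world_size=2, use_text_cfg=True, use_img_cfg=True))
```

With two ranks and image guidance enabled, the sketch reports the sequential fallback, matching the second bullet.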

Test Plan

  • text2image
    python text_to_image.py \
      --model "OmniGen2/OmniGen2" \
      --prompt "A classroom with bright lighting and wooden desks." \
      --negative-prompt "(((deformed))), blurry, over saturation, bad anatomy, disfigured, poorly drawn face, mutation, mutated, (extra_limb), (ugly), (poorly drawn hands), fused fingers, messy drawing, broken legs censor, censored, censor_bar" \
      --num-inference-steps 50 \
      --seed 0 \
      --guidance-scale 5.0 \
      --cfg-parallel-size 2 \
      --output /workspace/outputs/image_t2icfg3.png

  • image2image
    python image_edit.py \
      --image /workspace/image1.png \
      --model "OmniGen2/OmniGen2" \
      --prompt "Change the background to classroom." \
      --negative-prompt "(((deformed))), blurry, over saturation, bad anatomy, disfigured, poorly drawn face, mutation, mutated, (extra_limb), (ugly), (poorly drawn hands), fused fingers, messy drawing, broken legs censor, censored, censor_bar" \
      --num-inference-steps 50 \
      --seed 0 \
      --guidance-scale 5.0 \
      --guidance-scale-2 2.0 \
      --cfg-parallel-size 3 \
      --output /workspace/outputs/image_edit.png

Test Result

Text-to-Image (1024×1024)

| cfg_parallel_size | GPUs | Generation Time | Speedup |
|---|---|---|---|
| 1 | 1 | 54.4s | 1.0x |
| 2 | 2 | 27.3s | 2.0x |

Output images are visually identical across sequential and parallel modes.

Image Editing (input size 1696×2528)

| cfg_parallel_size | GPUs | Generation Time | Speedup |
|---|---|---|---|
| 1 | 1 | 90.2s | 1.0x |
| 3 | 3 | 36.6s | 2.46x |

Output images are visually identical across sequential and parallel modes.
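The speedup figures follow directly from the reported generation times. The sketch below recomputes them and adds parallel efficiency (speedup divided by GPU count), which the PR does not report; the helper names are illustrative.

```python
def speedup(baseline_s: float, parallel_s: float) -> float:
    """Ratio of single-GPU time to multi-GPU time."""
    return baseline_s / parallel_s


def parallel_efficiency(baseline_s: float, parallel_s: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return speedup(baseline_s, parallel_s) / num_gpus


# Timings copied from the two result tables in this PR description.
results = {
    "text-to-image": (54.4, 27.3, 2),
    "image editing": (90.2, 36.6, 3),
}
for name, (base, par, gpus) in results.items():
    print(f"{name}: {speedup(base, par):.2f}x, "
          f"efficiency {parallel_efficiency(base, par, gpus):.0%}")
```

By this measure the two-GPU text-to-image run is close to linear scaling, while the three-GPU editing run reaches roughly four fifths of the ideal 3x.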


Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508 zzhuoxin1508 marked this pull request as ready for review March 22, 2026 14:13
@zzhuoxin1508
Contributor Author

@nussejzz

@zzhuoxin1508
Contributor Author

text2image output: [image: image_t2icfg3]
image2image output: [image: image_editcfg3]

Co-authored-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Ding Zuhao <e1583181@u.nus.edu>
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508 zzhuoxin1508 force-pushed the omnigen2-CFGparallel branch from 23c4ec5 to 76cc9c7 Compare March 22, 2026 14:23
Contributor

@nussejzz nussejzz left a comment


) -> torch.Tensor:
"""CFG parallel denoising loop: each rank computes one CFG branch, returns latents.

Rank 0: cond branch (prompt_embeds, ref_latents)
Contributor


Delete the spaces

Contributor Author


done

Comment thread vllm_omni/diffusion/models/omnigen2/pipeline_omnigen2.py
Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments on the parallel path.

)
elif text_guidance_scale > 1.0:
model_pred_uncond = self.predict(
gathered = cfg_group.all_gather(local_pred, separate_tensors=True)
Collaborator


all_gather_into_tensor requires contiguous input. local_pred coming out of self.predict() may not be contiguous depending on the transformer output layout. Add local_pred = local_pred.contiguous() before the all-gather, same way you already do for latents at line 1291.
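The contiguity concern can be reproduced outside the pipeline. The sketch below uses a transposed view as a stand-in for a non-contiguous `self.predict()` output; whether the real transformer output is ever non-contiguous is not established in this thread.

```python
import torch

# A transposed view shares storage with its source, so its memory layout
# is no longer dense row-major: this is the case the review comment guards against.
local_pred = torch.arange(6, dtype=torch.float32).reshape(2, 3).t()
print(local_pred.is_contiguous())

# .contiguous() returns a tensor with the same values in a dense layout.
# When the input is already contiguous it returns the input itself, so the
# call is cheap to add unconditionally before a collective.
local_pred_c = local_pred.contiguous()
print(local_pred_c.is_contiguous())
```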

Contributor Author


done


for i, t in enumerate(timesteps):
in_cfg_range = self.cfg_range[0] <= i / len(timesteps) <= self.cfg_range[1]
use_cfg_this_step = in_cfg_range and self.text_guidance_scale > 1.0
Collaborator


self.text_guidance_scale > 1.0 is always true inside _processing_parallel (the caller checks it in cfg_parallel_ready). This makes use_cfg_this_step equivalent to just in_cfg_range. Not a bug, but it's confusing — consider simplifying to use_cfg_this_step = in_cfg_range.
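As a minimal sketch of the simplified check, the per-step flag reduces to the range test from the diff, `cfg_range[0] <= i / len(timesteps) <= cfg_range[1]`. The function name and the example `cfg_range` value are illustrative and not taken from the PR.

```python
def steps_in_cfg_range(num_steps: int, cfg_range: tuple[float, float]) -> list[int]:
    """Return the step indices on which CFG is applied.

    Inside the parallel path the guidance scale is already known to exceed 1.0,
    so the range test alone decides whether a step uses CFG.
    """
    lo, hi = cfg_range
    return [i for i in range(num_steps) if lo <= i / num_steps <= hi]


# Example: CFG active for the first half of a 10-step schedule (bounds inclusive).
print(steps_in_cfg_range(10, (0.0, 0.5)))
```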

Contributor Author


removed

model_pred_uncond = self.predict(
gathered = cfg_group.all_gather(local_pred, separate_tensors=True)
model_pred, model_pred_uncond = gathered[0], gathered[1]
if use_cfg_img_this_step and len(gathered) > 2:
Collaborator


Nit: len(gathered) > 2 is always true when use_cfg_img_this_step is true, since the caller guarantees cfg_world_size >= 3 whenever use_cfg_img. The double-check reads like there's a case where it could be false. Drop it or add a comment explaining it's a defensive check.
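To make the indexing concrete, here is a toy sketch in plain Python with lists standing in for tensors. `fake_all_gather` is a hypothetical stand-in for `cfg_group.all_gather(..., separate_tensors=True)`, assumed to return one entry per rank in rank order. The combination is the two-branch formula from this PR's diff, `model_pred_uncond + text_guidance_scale * (model_pred - model_pred_uncond)`.

```python
def fake_all_gather(per_rank_preds: list[list[float]]) -> list[list[float]]:
    """Hypothetical stand-in: returns one prediction per rank, indexed by rank."""
    return list(per_rank_preds)


def combine_cfg(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Two-branch classifier-free guidance, applied elementwise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Three ranks: the gathered list has exactly three entries, so a third-branch
# lookup cannot fail whenever the caller has guaranteed cfg_world_size >= 3.
gathered = fake_all_gather([[1.0, 2.0], [0.0, 1.0], [0.5, 1.5]])
model_pred, model_pred_uncond = gathered[0], gathered[1]
guided = combine_cfg(model_pred, model_pred_uncond, scale=5.0)
print(len(gathered), guided)
```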

Contributor Author


removed

model_pred = model_pred_uncond + self.text_guidance_scale * (model_pred - model_pred_uncond)
latents = self.scheduler.step(model_pred, t, latents, return_dict=False)[0]
else:
# Outside CFG interval: all ranks use cond branch, no comm
Collaborator


Outside the CFG range every rank computes the same cond prediction independently — wasted FLOPs on ranks 1+. Consider having only rank 0 run predict and broadcasting model_pred, like you already do for the initial latents sync.

Contributor Author


Thanks, but the data is the same on all cards. If we only ran it on rank 0, we'd just be adding an extra broadcast step; it wouldn't really save any time.
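The latency argument in this reply can be written as a simplified cost model. It assumes all ranks run `predict` concurrently at the same speed and ignores contention; the function names and timings are illustrative, not measurements from this PR.

```python
def step_time_redundant(t_predict: float) -> float:
    """Every rank runs predict at the same time, so the step costs one predict."""
    return t_predict


def step_time_broadcast(t_predict: float, t_broadcast: float) -> float:
    """Rank 0 runs predict while the other ranks wait, then the result is broadcast."""
    return t_predict + t_broadcast


# Illustrative numbers only: 1.0 s per predict, 0.05 s per broadcast.
print(step_time_redundant(1.0), step_time_broadcast(1.0, 0.05))
```

Under this model the broadcast variant cuts total FLOPs but never wall-clock time, since the idle ranks still wait for rank 0 to finish.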

Comment thread vllm_omni/diffusion/models/omnigen2/pipeline_omnigen2.py
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508 zzhuoxin1508 force-pushed the omnigen2-CFGparallel branch from 1784777 to de46e33 Compare March 23, 2026 10:41
Signed-off-by: zhou zhuoxin <zhouzhuoxin1508@outlook.com>
@zzhuoxin1508
Contributor Author

@princepride Could you please review this?

Collaborator

@princepride princepride left a comment


I noticed that this model also needs cfg_p=3. Right now we have a PR that refactors cfg_p: #2063. Can you coordinate with its author and try to use our own implementation of cfg_p?

@zzhuoxin1508
Contributor Author

> I noticed that this model also needs cfg_p=3. Right now we have a PR that refactors cfg_p: #2063. Can you coordinate with its author and try to use our own implementation of cfg_p?

Thanks, I'll look into it

@TKONIY
Contributor

TKONIY commented Mar 23, 2026

> > I noticed that this model also needs cfg_p=3. Right now we have a PR that refactors cfg_p: #2063. Can you coordinate with its author and try to use our own implementation of cfg_p?
>
> Thanks, I'll look into it

There are similar problems in integrating the existing cfg_p (even after #2063) with Omnigen2 and DreamID-Omni: cfg_p currently supports at most 2 branches, while Omnigen2 has 3 branches and DreamID-Omni has 4.

Further refactoring is needed to support multi-branch (>2) diffusion models. It would be good to discuss a more general CFG parallel API and the plan to refactor DreamID-Omni and Omnigen2 onto it.
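As one possible starting point for that discussion, the sketch below maps an arbitrary number of CFG branches onto the available ranks round-robin. It is purely hypothetical: `assign_branches` is not part of cfg_p or of any PR referenced here, and a real API would also have to define how the gathered predictions are combined.

```python
def assign_branches(num_branches: int, world_size: int) -> dict[int, list[int]]:
    """Round-robin map of rank -> branch indices (hypothetical, not the cfg_p API)."""
    plan: dict[int, list[int]] = {rank: [] for rank in range(world_size)}
    for branch in range(num_branches):
        plan[branch % world_size].append(branch)
    return plan


# 3 branches (as in Omnigen2) on 3 ranks: one branch per rank.
print(assign_branches(3, 3))
# 4 branches (as in DreamID-Omni) on 2 ranks: two branches per rank, run in sequence.
print(assign_branches(4, 2))
```

A mapping like this degrades gracefully when there are fewer ranks than branches, which is the situation the sequential fallback in this PR handles by hand.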

@zzhuoxin1508
@princepride

@wtomin
Collaborator

wtomin commented Mar 30, 2026

A recent PR changed the structure of the diffusion features docs. Please take a look at #1928.

@wtomin
Collaborator

wtomin commented Apr 2, 2026

Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2 (excluding --cfg-parallel-size=3 for now, because our nightly CI runs on two-GPU devices). For the L4 test, please refer to #1832.

Documentation incomplete:

  1. Feature support table not updated for OmniGen2 CFG parallel
  2. Usage example missing for --cfg-parallel-size=2/3 in examples/offline_inference/image_to_image/image_to_image.md and examples/online_serving/image_to_image/image_to_image.md

Can you also report the peak VRAM usage in your PR body?
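One way to collect the requested figure is PyTorch's peak-memory counters. This is a sketch under assumptions: `peak_vram_gib` is a hypothetical helper, it measures only memory allocated through PyTorch's caching allocator on the current device of the calling process, and with CFG parallel each rank would need to report its own value.

```python
import torch


def peak_vram_gib(fn) -> float:
    """Run fn() and return peak allocated CUDA memory in GiB (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        fn()
        return 0.0
    # Clear the running maximum so only allocations made during fn() are counted.
    torch.cuda.reset_peak_memory_stats()
    fn()
    # Wait for queued kernels so the counter reflects the finished workload.
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3


# Usage: wrap the generation call; a no-op stands in for the pipeline here.
print(f"peak VRAM: {peak_vram_gib(lambda: None):.2f} GiB")
```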

@zzhuoxin1508
Contributor Author

> Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2 (excluding --cfg-parallel-size=3 for now, because our nightly CI runs on two-GPU devices). For the L4 test, please refer to #1832.
>
> Documentation incomplete:
>
>   1. Feature support table not updated for OmniGen2 CFG parallel
>   2. Usage example missing for --cfg-parallel-size=2/3 in examples/offline_inference/image_to_image/image_to_image.md and examples/online_serving/image_to_image/image_to_image.md
>
> Can you also report the peak VRAM usage in your PR body?

Thanks for the review!
I'm currently working on a CFG parallel refactor in PR #2423 (N-branch CFG dispatch). I'll rebase this PR on top of that once it's merged, then add the test, update the documentation and feature support table, and report peak VRAM usage in the PR. @wtomin

@zzhuoxin1508
Contributor Author

done in #2423

6 participants