Support VAE parallel for Bagel #3982
Merged
Signed-off-by: siyuan.lei <siyuanlei37@gmail.com>
RuixiangMa
reviewed
May 29, 2026
id="parallel_hsdp_2",
marks=HSDP_2_FEATURE_MARKS,
),
# Tensor Parallelism (TP) + VAE Patch Parallelism (size=2)
Collaborator
The new TP + VAE-PP setup is not stage-local in deploy-config mode, so --tensor-parallel-size 2 also leaks into stage 0 while that stage is still pinned to devices: "0".
"tile_latent_stride_height": tile_latent_stride_height,
"tile_latent_stride_width": tile_latent_stride_width,
},
output_dtype=x.dtype,
Collaborator
The distributed encode() path uses x.dtype for its gather/broadcast buffers, so Bagel img2img can encode latents under autocast and still repack them into float32 buffers unnecessarily.
princepride
approved these changes
Jun 2, 2026
86MaxCao
pushed a commit
to 86MaxCao/vllm-omni
that referenced
this pull request
Jun 4, 2026
Signed-off-by: siyuan.lei <siyuanlei37@gmail.com>
akshatvishu
pushed a commit
to akshatvishu/vllm-omni
that referenced
this pull request
Jun 13, 2026
Signed-off-by: siyuan.lei <siyuanlei37@gmail.com> Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Purpose
Add VAE Patch Parallelism support for the Bagel (BAGEL-7B-MoT) diffusion model.
This PR lets Bagel split the latent into spatial tiles and distribute them across the DiT process group, so each rank only materializes the activations for its own tiles instead of the whole image — lowering per-GPU peak memory at high resolution.
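The split / distribute / merge idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`tile_starts`, `assign_round_robin`, `blend_merge`) are hypothetical and are not the vllm-omni API, the round-robin assignment and the triangular blending weights are assumptions chosen for simplicity, and the merge is shown in one dimension rather than on real latent tensors.

```python
# Illustrative sketch only: these helpers are hypothetical, not vllm-omni code.

def tile_starts(size, tile, stride):
    """Start offsets of tiles of length `tile` that together cover [0, size)."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # final tile sits flush with the far edge
    return starts


def assign_round_robin(tiles, world_size):
    """Map each rank to the tiles it would process (assumed policy)."""
    return {rank: tiles[rank::world_size] for rank in range(world_size)}


def blend_merge(size, starts, pieces):
    """Merge 1-D tiles, averaging overlapping samples with triangular weights."""
    total = [0.0] * size
    weight = [0.0] * size
    for start, values in zip(starts, pieces):
        for i, value in enumerate(values):
            # Weight ramps up from each tile edge towards the tile centre,
            # so neighbouring tiles fade into each other across the overlap.
            w = min(i + 1, len(values) - i)
            total[start + i] += w * value
            weight[start + i] += w
    return [t / w for t, w in zip(total, weight)]


# Split a 10x10 latent grid into 4x4 tiles with stride 3 (1 sample of overlap).
rows = tile_starts(10, tile=4, stride=3)
cols = tile_starts(10, tile=4, stride=3)
tiles = [(r, c) for r in rows for c in cols]

# Each of the 2 ranks only handles its own subset of tiles.
per_rank = assign_round_robin(tiles, world_size=2)

# Merging overlapping pieces of an unchanged signal reproduces the signal.
signal = [float(i) for i in range(10)]
pieces = [signal[s:s + 4] for s in rows]
merged = blend_merge(len(signal), rows, pieces)
```

Because each rank holds only the tiles in `per_rank[rank]`, its working set is a fraction of the full grid, which is the intuition behind the lower per-GPU peak memory. The weighted average in the overlap is what avoids visible seams when neighbouring tiles disagree slightly.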
Key points:
DistributedAutoEncoder(AutoEncoder, DistributedVaeMixin) in vllm_omni/diffusion/models/bagel/autoencoder.py, implementing split / exec / merge for both decode and encode (with overlap blending to avoid seams).
BagelPipeline now instantiates DistributedAutoEncoder so the DiT stage can run distributed e READMEs (scope, requirements, deploy YAML / CLI examples, verification via startup logs).
Scope:
BagelPipeline + DistributedAutoEncoder)
Test Plan
Hardware: 2× GPU, BAGEL-7B-MoT at /data/Bagel/BAGEL-7B-MoT.
End-to-end (single-stage DiT, text2img, 1024×1024, 20 steps, seed=42): compare tensor_parallel_size=2 only vs tensor_parallel_size=2 + vae_patch_parallel_size=2 (vae_use_tiling=true). Metrics: per-request peak GPU memory (Peak GPU memory (this request)) and generation latency (stage_0_gen_ms).
Single-stage deploy used for both runs (only vae_patch_parallel_size / vae_use_tiling differ).
Correctness: confirm "Bagel VAE decode running with distributed executor" appears in logs when enabled, and generated images are valid.
Test Result
End-to-end (1024×1024, 20 steps), inference phase, rank 0 (A100)
The benefit grows with resolution: negligible at 512×512 (Transformer-bound), ~3 GB at 1024×1024 end-to-end