vllm-project · princepride · Jun 2, 2026 · May 29, 2026 · May 29, 2026 · Jun 2, 2026
@@ -195,6 +195,101 @@ stages:
     devices: "0,1"
 ```
 
+### VAE Patch Parallelism
+
+[VAE Patch Parallelism](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism/vae_patch_parallel.html) splits Bagel VAE **decode/encode** tiles across multiple GPUs on the **DiT stage**, reducing **per-GPU peak memory during VAE decode**. Use it when high-resolution `text2img` or `img2img` runs out of memory in the VAE or shows large memory spikes during decode.
+
+**Bagel-specific notes:**
+
+- Implemented in `BagelPipeline` via `DistributedAutoEncoder` (DiT stage only).
+- **Single-stage** is the simplest path: one DiT stage running tensor parallelism (TP) together with VAE patch parallelism.
+- **Two-stage**: enable it on **stage 1 (DiT)** only; stage 0 (Thinker) keeps its encoder-only `VAEEncoder` and does not use VAE patch parallelism.
+- The DiT stage's `world_size` must be ≥ `vae_patch_parallel_size` (typically `tensor_parallel_size=2` on that stage). VAE patch parallelism (VAE PP) reuses the DiT process group; it does not start a separate VAE worker on a second GPU.
+
+**Single-stage via deploy YAML** (recommended for `end2end.py`):
+
+```yaml
+pipeline: bagel_single_stage
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_batched_tokens: 32768
+    max_num_seqs: 1
+    enforce_eager: true
+    trust_remote_code: true
+    enable_prefix_caching: false
+    devices: "0,1"
+    vae_use_tiling: true
+    parallel_config:
+      tensor_parallel_size: 2
+      vae_patch_parallel_size: 2
+    default_sampling_params:
+      seed: 52
+```
+
+```bash
+cd examples/offline_inference/bagel
+
+CUDA_VISIBLE_DEVICES=0,1 python end2end.py \
+    --model /path/to/BAGEL-7B-MoT \
+    --deploy-config /path/to/bagel_single_stage_vae_pp.yaml \
+    --modality text2img \
+    --prompts "A cute cat" \
+    --steps 10 \
+    --output ./out_vae_pp
+```
+
+**Single-stage via `Omni` kwargs** (same flags as online serving):
+
+```python
+from vllm_omni.entrypoints.omni import Omni
+
+omni = Omni(
+    model="ByteDance-Seed/BAGEL-7B-MoT",
+    deploy_config="vllm_omni/deploy/bagel_single_stage.yaml",
+    tensor_parallel_size=2,
+    vae_patch_parallel_size=2,
+    vae_use_tiling=True,
+)
+# Then call omni.generate(...) as in end2end.py
+```
+
+**Two-stage (VAE PP on DiT only):**
+
+```yaml
+stages:
+  - stage_id: 0
+    devices: "0"
+    # AR Thinker — no vae_patch_parallel here
+
+  - stage_id: 1
+    devices: "0,1"
+    vae_use_tiling: true
+    parallel_config:
+      tensor_parallel_size: 2
+      vae_patch_parallel_size: 2
+```
+
+```bash
+python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
+    --deploy-config /path/to/bagel_vae_pp.yaml \
+    --modality text2img \
+    --prompts "A cute cat"
+```
+
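+**Startup log check** (this message appears when `vae_patch_parallel_size > 1` and `vae_use_tiling` was not set, so tiling is auto-enabled):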
+
+```text
+INFO ... vae_patch_parallel_size=2 requires vae_use_tiling; automatically enabling it.
+```
+
+| Setting | Role |
+| :------ | :--- |
+| `parallel_config.tensor_parallel_size` | DiT world size / TP (must be ≥ `vae_patch_parallel_size`) |
+| `parallel_config.vae_patch_parallel_size` | Number of ranks for distributed VAE tiles (`1` = off) |
+| `vae_use_tiling` | Enable spatial tiling (auto-enabled when `vae_patch_parallel_size > 1`) |
+
 #### Hybrid Sharded Data Parallel (HSDP)
 
 For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, [`bagel.yaml`](../../../vllm_omni/deploy/bagel.yaml)). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.
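
The constraints listed in the VAE Patch Parallelism settings table above (`vae_patch_parallel_size` no larger than the DiT world size, tiling auto-enabled when it exceeds `1`) can be sketched in a few lines of Python. This is an illustrative sketch only: `StageParallelSettings` and `normalize_settings` are hypothetical names, not part of vllm-omni, and the sketch assumes the DiT world size equals `tensor_parallel_size`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StageParallelSettings:
    """Hypothetical container for one DiT stage's settings (not a vllm-omni class)."""

    tensor_parallel_size: int = 1
    vae_patch_parallel_size: int = 1
    vae_use_tiling: bool = False


def normalize_settings(settings: StageParallelSettings) -> StageParallelSettings:
    """Check the documented constraints and return settings with tiling resolved."""
    if settings.vae_patch_parallel_size < 1:
        raise ValueError("vae_patch_parallel_size must be >= 1")
    # Assumption: the DiT world size is simply tensor_parallel_size.
    world_size = settings.tensor_parallel_size
    if settings.vae_patch_parallel_size > world_size:
        raise ValueError(
            f"vae_patch_parallel_size={settings.vae_patch_parallel_size} "
            f"exceeds DiT world size {world_size}"
        )
    # Patch parallelism requires tiling, so mirror the documented auto-enable.
    if settings.vae_patch_parallel_size > 1 and not settings.vae_use_tiling:
        return replace(settings, vae_use_tiling=True)
    return settings


resolved = normalize_settings(
    StageParallelSettings(tensor_parallel_size=2, vae_patch_parallel_size=2)
)
print(resolved.vae_use_tiling)
```

Running a check like this on a deploy YAML before launch would catch a `vae_patch_parallel_size` larger than the stage's GPU count early, though the actual validation in vllm-omni may be stricter or report errors differently.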

@@ -68,6 +68,67 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --tensor-parallel-size
 
 Or set `tensor_parallel_size` per stage in a custom deploy YAML.
 
+### VAE Patch Parallelism
+
+[VAE Patch Parallelism](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism/vae_patch_parallel.html) distributes Bagel VAE **decode/encode** across multiple GPUs by splitting latent tiles. It lowers **per-GPU peak memory during VAE decode**, which helps high-resolution `text2img` / `img2img` requests when the VAE becomes the memory bottleneck.
+
+**Scope for Bagel:**
+
+| Topology | VAE patch parallel |
+| :------- | :----------------- |
+| **Single-stage** (DiT only) | Supported on stage 0 (`BagelPipeline` + `DistributedAutoEncoder`) |
+| **Two-stage** | Supported on **stage 1 (DiT)** only; stage 0 (Thinker) uses an encoder-only VAE and is unaffected |
+
+**Requirements:**
+
+- `vae_patch_parallel_size > 1` and a distributed VAE (`DistributedAutoEncoder` on the DiT pipeline).
+- The DiT process group must have at least `vae_patch_parallel_size` ranks. In practice this means the diffusion stage `world_size` must be ≥ 2 (commonly `tensor_parallel_size=2` on that stage).
+- `vae_use_tiling` must be enabled. If you set `vae_patch_parallel_size > 1` and omit tiling, the registry auto-enables `vae_use_tiling` at startup.
+
+VAE patch parallelism **reuses the DiT process group** (`dit_group`); it does not create a separate VAE-only worker pool. It does not replace single-GPU VAE tiling (`vae_patch_parallel_size=1`).
+
+**Online serving (single-stage, 2 GPUs):**
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1 vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
+    --deploy-config vllm_omni/deploy/bagel_single_stage.yaml \
+    --tensor-parallel-size 2 \
+    --vae-patch-parallel-size 2 \
+    --vae-use-tiling
+```
+
+**Online serving (two-stage, VAE PP on DiT stage 1):** use a custom deploy YAML, for example:
+
+```yaml
+stages:
+  - stage_id: 0
+    devices: "0"
+    # Thinker (AR) — no VAE patch parallel here
+
+  - stage_id: 1
+    devices: "0,1"
+    vae_use_tiling: true
+    parallel_config:
+      tensor_parallel_size: 2
+      vae_patch_parallel_size: 2
+```
+
+```bash
+vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
+    --deploy-config /path/to/bagel_vae_pp.yaml
+```
+
+**Verify at startup.** If `vae_use_tiling` was not set explicitly, the server log reports that tiling was auto-enabled:
+
+```text
+INFO ... vae_patch_parallel_size=2 requires vae_use_tiling; automatically enabling it.
+```
+
+| CLI flag | Default | Description |
+| :------- | :------ | :---------- |
+| `--vae-patch-parallel-size` | `1` | Number of DiT ranks used for VAE tile parallelism. Set to `2` or higher to enable. Must be ≤ the DiT process group size (typically equal to `--tensor-parallel-size` on the diffusion stage). |
+| `--vae-use-tiling` | off | Enable VAE spatial tiling. Required for VAE patch parallel (auto-enabled when `vae_patch_parallel_size > 1`). |
+
 #### Hybrid Sharded Data Parallel (HSDP)
 
 For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, [`bagel.yaml`](../../../vllm_omni/deploy/bagel.yaml)). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.
@@ -323,6 +384,8 @@ python openai_chat_client.py \
 | `stages[].gpu_memory_utilization` | per-stage | Fraction of GPU memory to use |
 | `stages[].enforce_eager` | per-stage | Disable CUDA graphs |
 | `stages[].tensor_parallel_size` | per-stage | TP degree for this stage |
+| `stages[].parallel_config.vae_patch_parallel_size` | per-stage (DiT) | VAE tile parallelism degree (DiT stage only) |
+| `stages[].vae_use_tiling` | per-stage (DiT) | Enable VAE tiling (required for VAE patch parallel) |
 | `connectors` | top-level | Define available connector instances (SHM, Mooncake) |
 | `platforms` | top-level | Platform-specific overrides (e.g. `xpu`) |
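
To make the tile-splitting idea in the VAE Patch Parallelism section concrete, here is a small Python sketch of how latent tiles might be divided among ranks. It is illustrative only and makes two assumptions that the docs do not state: tiles are non-overlapping squares, and tile indices are assigned round-robin. The real `DistributedAutoEncoder` may tile and schedule differently; `count_tiles` and `assign_tiles` are hypothetical helpers.

```python
import math


def count_tiles(latent_h: int, latent_w: int, tile_size: int) -> int:
    """Number of tiles covering the latent, assuming non-overlapping square tiles."""
    return math.ceil(latent_h / tile_size) * math.ceil(latent_w / tile_size)


def assign_tiles(num_tiles: int, num_ranks: int) -> list[list[int]]:
    """Round-robin tile indices over ranks (an assumed, illustrative policy)."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be >= 1")
    return [list(range(rank, num_tiles, num_ranks)) for rank in range(num_ranks)]


num_tiles = count_tiles(latent_h=128, latent_w=128, tile_size=64)
single_gpu = assign_tiles(num_tiles, num_ranks=1)
two_gpus = assign_tiles(num_tiles, num_ranks=2)
print(max(len(t) for t in single_gpu), max(len(t) for t in two_gpus))
```

With these toy numbers the latent splits into four tiles, and with `vae_patch_parallel_size=2` each rank handles two of them instead of all four.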
 

@@ -9,6 +9,8 @@
 - Ulysses-SP
 - Ring-Attention
 - Layerwise Offloading
+- Hybrid Sharded Data Parallel
+- Tensor Parallelism + VAE Patch Parallelism
 
 assert_diffusion_response validates successful generation and the expected
 512x512 resolution.
@@ -38,12 +40,28 @@
     BAGEL_CI_DEPLOY,
     updates={"stages": {0: {"devices": "0"}, 1: {"devices": "0,1"}}},
 )
+BAGEL_TP_VAE_PP_2_DEPLOY = modify_stage_config(
+    BAGEL_CI_DEPLOY,
+    updates={
+        "stages": {
+            0: {"devices": "0"},
+            1: {
+                "devices": "0,1",
+                "parallel_config": {
+                    "tensor_parallel_size": 2,
+                    "vae_patch_parallel_size": 2,
+                },
+            },
+        },
+    },
+)
 
 
 def _get_diffusion_feature_cases(model: str):
     """Return L4 diffusion feature cases for Bagel.
     TeaCache, Cache-DiT, CFG-Parallel,
-    Ulysses-SP, Ring-Attention, Layerwise Offloading.
+    Ulysses-SP, Ring-Attention, Layerwise Offloading,
+    Hybrid Sharded Data Parallel, Tensor Parallelism + VAE Patch Parallelism.
     """
 
     return [
@@ -135,6 +153,15 @@ def _get_diffusion_feature_cases(model: str):
             id="parallel_hsdp_2",
             marks=HSDP_2_FEATURE_MARKS,
         ),
+        # Tensor Parallelism (TP) + VAE Patch Parallelism (size=2)
+        pytest.param(
+            OmniServerParams(
+                model=model,
+                stage_config_path=BAGEL_TP_VAE_PP_2_DEPLOY,
+            ),
+            id="tp_vae_patch_parallel_2",
+            marks=PARALLEL_2_FEATURE_MARKS,
+        ),
     ]
 
 
@@ -157,6 +184,7 @@ def test_bagel(
     - Ring-Attention (degree=2)
     - Layerwise Offloading
     - Hybrid Sharded Data Parallel (size=2)
+    - Tensor Parallelism (TP) + VAE Patch Parallelism (size=2)
 
     Validation is delegated to assert_diffusion_response in tests/helpers/assertions.py,
     which checks output dimensions and basic correctness.
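
The nested `updates` passed to `modify_stage_config` for `BAGEL_TP_VAE_PP_2_DEPLOY` can be read as a per-stage recursive merge into the base deploy config. The helper's implementation is not shown in this diff, so the following is only a sketch under that assumption: `deep_merge`, `apply_stage_updates`, and the toy base config are hypothetical and exist only for illustration.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively to nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_stage_updates(config: dict, stage_updates: dict) -> dict:
    """Merge overrides keyed by stage_id into the matching entries of config['stages']."""
    result = copy.deepcopy(config)
    index_by_id = {stage["stage_id"]: i for i, stage in enumerate(result["stages"])}
    for stage_id, override in stage_updates.items():
        idx = index_by_id[stage_id]
        result["stages"][idx] = deep_merge(result["stages"][idx], override)
    return result


# Toy base config (hypothetical values, not the real BAGEL_CI_DEPLOY contents).
base = {
    "stages": [
        {"stage_id": 0, "devices": "0"},
        {"stage_id": 1, "devices": "1", "parallel_config": {"tensor_parallel_size": 1}},
    ]
}
updated = apply_stage_updates(
    base,
    {
        0: {"devices": "0"},
        1: {
            "devices": "0,1",
            "parallel_config": {"tensor_parallel_size": 2, "vae_patch_parallel_size": 2},
        },
    },
)
print(updated["stages"][1])
```

Under this reading, the test case leaves stage 0 (Thinker) on GPU 0 and gives stage 1 (DiT) both GPUs with TP and VAE patch parallelism at size 2, relying on `vae_use_tiling` being auto-enabled at startup.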