95 changes: 95 additions & 0 deletions examples/offline_inference/bagel/README.md
@@ -195,6 +195,101 @@ stages:
devices: "0,1"
```

### VAE Patch Parallelism

[VAE Patch Parallelism](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism/vae_patch_parallel.html) splits Bagel VAE **decode/encode** tiles across multiple GPUs on the **DiT stage**, reducing **per-GPU peak memory during VAE decode**. Use it when high-resolution `text2img` or `img2img` runs out of memory (OOM) in the VAE or shows large memory spikes during decode.
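
As a rough illustration of the idea, the sketch below cuts a latent grid into tiles and hands them out to ranks round-robin. The tile size, the round-robin policy, and the helper names `make_tiles` and `assign_tiles` are assumptions made for this example only; the actual `DistributedAutoEncoder` may partition, overlap, and blend tiles differently.

```python
from typing import List, Tuple

# (top, left, height, width) of one tile, in latent pixels
Tile = Tuple[int, int, int, int]


def make_tiles(height: int, width: int, tile: int) -> List[Tile]:
    """Cut a height x width latent into non-overlapping tiles.

    Tiles on the bottom/right edge may be smaller than `tile`.
    """
    tiles: List[Tile] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            tiles.append(
                (top, left, min(tile, height - top), min(tile, width - left))
            )
    return tiles


def assign_tiles(tiles: List[Tile], num_ranks: int) -> List[List[Tile]]:
    """Hypothetical policy: tile i is processed by rank i % num_ranks."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be >= 1")
    per_rank: List[List[Tile]] = [[] for _ in range(num_ranks)]
    for i, t in enumerate(tiles):
        per_rank[i % num_ranks].append(t)
    return per_rank


# A 128x128 latent with 64x64 tiles gives 4 tiles, 2 per rank on 2 GPUs.
tiles = make_tiles(height=128, width=128, tile=64)
per_rank = assign_tiles(tiles, num_ranks=2)
```

In this toy model each rank holds activations for only its share of the tiles, which is why the per-GPU peak during decode drops as `vae_patch_parallel_size` grows.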

**Bagel-specific notes:**

- Implemented in `BagelPipeline` via `DistributedAutoEncoder` (DiT stage only).
- **Single-stage** is the simplest path: one DiT process with TP + VAE patch parallel.
- **Two-stage**: enable on **stage 1 (DiT)** only; stage 0 (Thinker) keeps encoder-only `VAEEncoder` and does not use VAE patch parallel.
- The DiT stage's `world_size` must be ≥ `vae_patch_parallel_size` (typically achieved with `tensor_parallel_size=2` on that stage). VAE PP reuses the DiT process group; it is not a standalone second-GPU VAE worker.
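
The sizing rule in the last note can be written down as a small pre-flight check. This is a minimal sketch, not code from vLLM-Omni: the function name `check_vae_patch_parallel` and its error messages are hypothetical.

```python
def check_vae_patch_parallel(world_size: int, vae_patch_parallel_size: int) -> None:
    """Raise ValueError if a DiT stage cannot host the requested VAE PP degree.

    Hypothetical helper: `world_size` is the number of ranks in the DiT
    process group, which VAE patch parallel reuses.
    """
    if vae_patch_parallel_size < 1:
        raise ValueError("vae_patch_parallel_size must be >= 1 (1 turns it off)")
    if world_size < vae_patch_parallel_size:
        raise ValueError(
            f"DiT world_size={world_size} is smaller than "
            f"vae_patch_parallel_size={vae_patch_parallel_size}"
        )


# Two DiT ranks (e.g. tensor_parallel_size=2) can host VAE PP of size 2.
check_vae_patch_parallel(world_size=2, vae_patch_parallel_size=2)
```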

**Single-stage via deploy YAML** (recommended for `end2end.py`):

```yaml
pipeline: bagel_single_stage
async_chunk: false

stages:
  - stage_id: 0
    max_num_batched_tokens: 32768
    max_num_seqs: 1
    enforce_eager: true
    trust_remote_code: true
    enable_prefix_caching: false
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
    default_sampling_params:
      seed: 52
```

```bash
cd examples/offline_inference/bagel

CUDA_VISIBLE_DEVICES=0,1 python end2end.py \
    --model /path/to/BAGEL-7B-MoT \
    --deploy-config /path/to/bagel_single_stage_vae_pp.yaml \
    --modality text2img \
    --prompts "A cute cat" \
    --steps 10 \
    --output ./out_vae_pp
```

**Single-stage via `Omni` kwargs** (the same settings as the online-serving CLI flags):

```python
from vllm_omni.entrypoints.omni import Omni

omni = Omni(
    model="ByteDance-Seed/BAGEL-7B-MoT",
    deploy_config="vllm_omni/deploy/bagel_single_stage.yaml",
    tensor_parallel_size=2,
    vae_patch_parallel_size=2,
    vae_use_tiling=True,
)
# Then call omni.generate(...) as in end2end.py
```

**Two-stage (VAE PP on DiT only):**

```yaml
stages:
  - stage_id: 0
    devices: "0"
    # AR Thinker: no vae_patch_parallel here

  - stage_id: 1
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
```

```bash
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --deploy-config /path/to/bagel_vae_pp.yaml \
    --modality text2img \
    --prompts "A cute cat"
```

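**Startup log check** (this line is expected when `vae_patch_parallel_size > 1` and `vae_use_tiling` was not set explicitly):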

```text
INFO ... vae_patch_parallel_size=2 requires vae_use_tiling; automatically enabling it.
```

| Setting | Role |
| :------ | :--- |
| `parallel_config.tensor_parallel_size` | DiT world size / TP (must be ≥ `vae_patch_parallel_size`) |
| `parallel_config.vae_patch_parallel_size` | Number of ranks for distributed VAE tiles (`1` = off) |
| `vae_use_tiling` | Enable spatial tiling (auto-enabled when `vae_patch_parallel_size > 1`) |
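
The interaction between the last two settings can be sketched as follows. This is an illustrative model of the auto-enable rule described in the table, not the registry's actual code; the function name `resolve_vae_use_tiling` and the logger name are hypothetical, and the log text simply mirrors the startup message shown above.

```python
import logging

logger = logging.getLogger("vae_pp_sketch")  # hypothetical logger name


def resolve_vae_use_tiling(vae_patch_parallel_size: int, vae_use_tiling: bool) -> bool:
    """Return the effective tiling flag for a DiT stage.

    Assumed rule: VAE patch parallel (size > 1) always needs tiling, so
    tiling is switched on when the user left it off.
    """
    if vae_patch_parallel_size > 1 and not vae_use_tiling:
        logger.info(
            "vae_patch_parallel_size=%d requires vae_use_tiling; "
            "automatically enabling it.",
            vae_patch_parallel_size,
        )
        return True
    return vae_use_tiling


effective = resolve_vae_use_tiling(vae_patch_parallel_size=2, vae_use_tiling=False)
```

Under this model the log line only appears when tiling was omitted; setting `vae_use_tiling: true` explicitly, as in the YAML examples, would skip the message.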

#### Hybrid Sharded Data Parallel (HSDP)

For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, [`bagel.yaml`](../../../vllm_omni/deploy/bagel.yaml)). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.
63 changes: 63 additions & 0 deletions examples/online_serving/bagel/README.md
@@ -68,6 +68,67 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --tensor-parallel-size

Or set `tensor_parallel_size` per stage in a custom deploy YAML.

### VAE Patch Parallelism

[VAE Patch Parallelism](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism/vae_patch_parallel.html) distributes Bagel VAE **decode/encode** across multiple GPUs by splitting latent tiles. It lowers **per-GPU peak memory during VAE decode**, which helps high-resolution `text2img` / `img2img` when the VAE becomes the memory bottleneck.

**Scope for Bagel:**

| Topology | VAE patch parallel |
| :------- | :----------------- |
| **Single-stage** (DiT only) | Supported on stage 0 (`BagelPipeline` + `DistributedAutoEncoder`) |
| **Two-stage** | Supported on **stage 1 (DiT)** only; stage 0 (Thinker) uses an encoder-only VAE and is unaffected |

**Requirements:**

- `vae_patch_parallel_size > 1` and a distributed VAE (`DistributedAutoEncoder` on the DiT pipeline).
- The DiT process group must have at least `vae_patch_parallel_size` ranks. In practice this means the diffusion stage `world_size` must be ≥ 2 (commonly `tensor_parallel_size=2` on that stage).
- `vae_use_tiling` must be enabled. If you set `vae_patch_parallel_size > 1` and omit tiling, the registry auto-enables `vae_use_tiling` at startup.

VAE patch parallel **reuses the DiT process group** (`dit_group`); it does not create a separate VAE-only worker pool. It complements rather than replaces single-GPU VAE tiling (`vae_patch_parallel_size=1`).

**Online serving (single-stage, 2 GPUs):**

```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config vllm_omni/deploy/bagel_single_stage.yaml \
    --tensor-parallel-size 2 \
    --vae-patch-parallel-size 2 \
    --vae-use-tiling
```

**Online serving (two-stage, VAE PP on DiT stage 1):** use a custom deploy YAML, for example:

```yaml
stages:
  - stage_id: 0
    devices: "0"
    # Thinker (AR): no VAE patch parallel here

  - stage_id: 1
    devices: "0,1"
    vae_use_tiling: true
    parallel_config:
      tensor_parallel_size: 2
      vae_patch_parallel_size: 2
```

```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --deploy-config /path/to/bagel_vae_pp.yaml
```

**Verify it is active** (check server logs at startup; this line is expected when `vae_use_tiling` was not set explicitly):

```text
INFO ... vae_patch_parallel_size=2 requires vae_use_tiling; automatically enabling it.
```

| CLI flag | Default | Description |
| :------- | :------ | :---------- |
| `--vae-patch-parallel-size` | `1` | Number of DiT ranks used for VAE tile parallelism. Set to `2` or higher to enable. Must not exceed the DiT process group size (typically set equal to `--tensor-parallel-size` on the diffusion stage). |
| `--vae-use-tiling` | off | Enable VAE spatial tiling. Required for VAE patch parallel (auto-enabled when `vae_patch_parallel_size > 1`). |
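
If you launch the server from a script, the flags above can be assembled programmatically. The helper below is a hypothetical sketch (the name `build_serve_command` is not part of vLLM-Omni); it uses only the flags shown in this section and assumes tensor parallelism is the only parallelism on the diffusion stage, so the DiT group size equals `--tensor-parallel-size`.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str,
    port: int,
    tensor_parallel_size: int,
    vae_patch_parallel_size: int,
    deploy_config: Optional[str] = None,
    vae_use_tiling: bool = True,
) -> List[str]:
    """Build the `vllm serve` argv for single-stage VAE patch parallel."""
    # Assumption: DiT process group size == tensor_parallel_size.
    if vae_patch_parallel_size > tensor_parallel_size:
        raise ValueError(
            "vae_patch_parallel_size must not exceed the DiT process group size"
        )
    cmd = ["vllm", "serve", model, "--omni", "--port", str(port)]
    if deploy_config is not None:
        cmd += ["--deploy-config", deploy_config]
    cmd += [
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--vae-patch-parallel-size", str(vae_patch_parallel_size),
    ]
    if vae_use_tiling:
        cmd.append("--vae-use-tiling")
    return cmd


cmd = build_serve_command(
    model="ByteDance-Seed/BAGEL-7B-MoT",
    port=8091,
    tensor_parallel_size=2,
    vae_patch_parallel_size=2,
    deploy_config="vllm_omni/deploy/bagel_single_stage.yaml",
)
print(shlex.join(cmd))  # shell-quoted command line, ready to copy or run
```

The resulting list can be passed directly to `subprocess.run`, with `CUDA_VISIBLE_DEVICES` set in the environment as in the shell example.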

#### Hybrid Sharded Data Parallel (HSDP)

For larger Bagel deployments on multiple GPUs, you can enable HSDP (Hybrid Sharded Data Parallel) by modifying the stage configuration (for example, [`bagel.yaml`](../../../vllm_omni/deploy/bagel.yaml)). HSDP shards transformer weights across GPUs to reduce per-GPU memory usage.
@@ -323,6 +384,8 @@ python openai_chat_client.py \
| `stages[].gpu_memory_utilization` | per-stage | Fraction of GPU memory to use |
| `stages[].enforce_eager` | per-stage | Disable CUDA graphs |
| `stages[].tensor_parallel_size` | per-stage | TP degree for this stage |
| `stages[].parallel_config.vae_patch_parallel_size` | per-stage (DiT) | VAE tile parallelism degree (DiT stage only) |
| `stages[].vae_use_tiling` | per-stage (DiT) | Enable VAE tiling (required for VAE patch parallel) |
| `connectors` | top-level | Define available connector instances (SHM, Mooncake) |
| `platforms` | top-level | Platform-specific overrides (e.g. `xpu`) |

30 changes: 29 additions & 1 deletion tests/e2e/online_serving/test_bagel_expansion.py
@@ -9,6 +9,8 @@
- Ulysses-SP
- Ring-Attention
- Layerwise Offloading
- Hybrid Sharded Data Parallel
- Tensor Parallelism + VAE Patch Parallelism

assert_diffusion_response validates successful generation and the expected
512x512 resolution.
@@ -38,12 +40,28 @@
    BAGEL_CI_DEPLOY,
    updates={"stages": {0: {"devices": "0"}, 1: {"devices": "0,1"}}},
)
BAGEL_TP_VAE_PP_2_DEPLOY = modify_stage_config(
    BAGEL_CI_DEPLOY,
    updates={
        "stages": {
            0: {"devices": "0"},
            1: {
                "devices": "0,1",
                "parallel_config": {
                    "tensor_parallel_size": 2,
                    "vae_patch_parallel_size": 2,
                },
            },
        },
    },
)


def _get_diffusion_feature_cases(model: str):
"""Return L4 diffusion feature cases for Bagel.
TeaCache, Cache-DiT, CFG-Parallel,
Ulysses-SP, Ring-Attention, Layerwise Offloading.
Ulysses-SP, Ring-Attention, Layerwise Offloading,
Hybrid Sharded Data Parallel, Tensor Parallelism, VAE Patch Parallelism.
"""

return [
@@ -135,6 +153,15 @@ def _get_diffusion_feature_cases(model: str):
            id="parallel_hsdp_2",
            marks=HSDP_2_FEATURE_MARKS,
        ),
        # Tensor Parallelism (TP) + VAE Patch Parallelism (size=2)

Collaborator

The new TP + VAE-PP setup is not stage-local in deploy-config mode, so `--tensor-parallel-size 2` also leaks into stage 0 while that stage is still pinned to `devices: "0"`.

Contributor Author

LGTM
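
The concern raised in this thread can be illustrated with a small check. This is a sketch under stated assumptions: it models a global TP value as applying to every stage that does not set its own `parallel_config.tensor_parallel_size`, which is one reading of the "leak" described above. How vLLM-Omni actually merges CLI and per-stage settings is not shown in this diff, and `find_tp_device_mismatches` is a hypothetical helper.

```python
from typing import Dict, List


def find_tp_device_mismatches(stages: Dict[int, dict], global_tp: int) -> List[int]:
    """Return ids of stages whose effective TP exceeds their pinned device count.

    Assumption: a stage without its own tensor_parallel_size inherits
    `global_tp` (e.g. from --tensor-parallel-size).
    """
    mismatched = []
    for stage_id, cfg in sorted(stages.items()):
        num_devices = len(cfg["devices"].split(","))
        stage_tp = cfg.get("parallel_config", {}).get(
            "tensor_parallel_size", global_tp
        )
        if stage_tp > num_devices:
            mismatched.append(stage_id)
    return mismatched


# Same layout as BAGEL_TP_VAE_PP_2_DEPLOY: stage 0 on one GPU, stage 1 on two.
stages = {
    0: {"devices": "0"},
    1: {
        "devices": "0,1",
        "parallel_config": {"tensor_parallel_size": 2, "vae_patch_parallel_size": 2},
    },
}
leaky = find_tp_device_mismatches(stages, global_tp=2)      # global TP=2 reaches stage 0
stage_local = find_tp_device_mismatches(stages, global_tp=1)  # TP set only on stage 1
```

Setting TP per stage in the deploy YAML, as the test config does, keeps stage 0 at a single device in this model.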

        pytest.param(
            OmniServerParams(
                model=model,
                stage_config_path=BAGEL_TP_VAE_PP_2_DEPLOY,
            ),
            id="tp_vae_patch_parallel_2",
            marks=PARALLEL_2_FEATURE_MARKS,
        ),
    ]


@@ -157,6 +184,7 @@ def test_bagel(
- Ring-Attention (degree=2)
- Layerwise Offloading
- Hybrid Sharded Data Parallel (size=2)
- Tensor Parallelism (TP) + VAE Patch Parallelism (size=2)

Validation is delegated to assert_diffusion_response in tests/helpers/assertions.py,
which checks output dimensions and basic correctness.