Merged
85 commits
ef7c35a
udpate diffusion config
wtomin Dec 4, 2025
641b626
update usp
wtomin Dec 5, 2025
15de6d0
update usp
wtomin Dec 8, 2025
5e4f865
set omni diffusion config
wtomin Dec 8, 2025
c104832
ulysses Attention
wtomin Dec 9, 2025
2657aad
impr
wtomin Dec 9, 2025
60616b6
updates
wtomin Dec 9, 2025
f590c37
test script for ulysses sp
wtomin Dec 9, 2025
883116a
updates test
wtomin Dec 9, 2025
a158c4a
fix import errors
wtomin Dec 9, 2025
bd0a851
update test script
wtomin Dec 9, 2025
1b22061
set ring and ulysses group
wtomin Dec 9, 2025
69e216b
update pg
wtomin Dec 9, 2025
2b3b897
fix test shape
wtomin Dec 9, 2025
4d3f970
destroy comm group
wtomin Dec 9, 2025
28502a2
remove redundant arg
wtomin Dec 9, 2025
c54be3b
new test parameter
wtomin Dec 9, 2025
f6c28c9
update test sp
wtomin Dec 9, 2025
1d76925
allow sp is None
wtomin Dec 9, 2025
19fa174
default config
wtomin Dec 9, 2025
61e48fe
revert utils changes
wtomin Dec 10, 2025
085ba7f
update env func
wtomin Dec 10, 2025
8f7a8c2
remove redundant
wtomin Dec 10, 2025
bd576f0
fix num_gpus default value
wtomin Dec 10, 2025
aa9826e
rm redundant package check
wtomin Dec 10, 2025
b216460
correct e2e
wtomin Dec 10, 2025
d3856a1
replace by vllm groupcoordinator
wtomin Dec 10, 2025
f368def
correct name
wtomin Dec 10, 2025
21282d3
update gpu worker: set tp size and config
wtomin Dec 10, 2025
255a2e2
Revert "replace by vllm groupcoordinator"
wtomin Dec 10, 2025
d1e4f0b
vllmconfig and omnidiffusion config share dp and tp
wtomin Dec 10, 2025
34e86a6
get tp from vllm parallel_state
wtomin Dec 10, 2025
4939ed2
sequence_parallel_size updates
wtomin Dec 10, 2025
fcf97c4
fix vllm.distributed.parallel_state import error
wtomin Dec 10, 2025
a560b82
remove local rank in sp example
wtomin Dec 10, 2025
bd2b0b0
split rotary_embed
wtomin Dec 10, 2025
bfd6026
correct field
wtomin Dec 10, 2025
43f8933
fix ci
wtomin Dec 10, 2025
36aa2b9
init model and device with context manager
wtomin Dec 10, 2025
2f23822
remove get_world_size
wtomin Dec 10, 2025
2252998
shutdown device and comm group
wtomin Dec 10, 2025
42eeb6a
set vllm_config as context manager
wtomin Dec 10, 2025
c9d7a14
record inference speed
wtomin Dec 10, 2025
de93d34
different save path
wtomin Dec 10, 2025
19bc245
ring attention not supported yet
wtomin Dec 10, 2025
696dc0a
merge two scripts into one
wtomin Dec 11, 2025
cb0ef5d
update test script
wtomin Dec 11, 2025
cc068b3
constrain test cases
wtomin Dec 12, 2025
eb7804e
smaller head size
wtomin Dec 12, 2025
6109c14
fix DOC check error
wtomin Dec 12, 2025
e68c434
no backward check
wtomin Dec 12, 2025
c1b9864
fix logging
wtomin Dec 12, 2025
aa7c063
update qwen_image transformer
wtomin Dec 12, 2025
0442526
solve conflicts
wtomin Dec 12, 2025
f050c0b
update test ut
wtomin Dec 15, 2025
01a20ea
fix ut and adapt to npu
wtomin Dec 15, 2025
253e442
update device_count and set_device
wtomin Dec 15, 2025
9cd123d
test comm
wtomin Dec 15, 2025
ccee08c
fix comm test error
wtomin Dec 15, 2025
54b1c67
set default sp config
wtomin Dec 15, 2025
3f7050c
test pipeline
wtomin Dec 15, 2025
07fe23d
cache & parallel support: qwen-image edit
wtomin Dec 15, 2025
c4421d5
correct example script path in doc
wtomin Dec 15, 2025
7f683d3
remove cache support
wtomin Dec 15, 2025
e2c98f3
update docs
wtomin Dec 15, 2025
e2658d0
updates
wtomin Dec 15, 2025
6239b06
e2e test
wtomin Dec 15, 2025
a26dc01
fix image edit shape
wtomin Dec 15, 2025
8c9e400
fix ci
wtomin Dec 16, 2025
597af26
fix mkdocs
wtomin Dec 16, 2025
74da8cb
fix docs
wtomin Dec 16, 2025
0624d59
fix docs
wtomin Dec 16, 2025
8f630d2
fix image edit example
wtomin Dec 16, 2025
0ecd312
args name degree to size except for ring&ulysses degrees
wtomin Dec 16, 2025
ed46752
rm attention npu
wtomin Dec 16, 2025
e06b049
fix ci
wtomin Dec 16, 2025
eb6818c
rm simple test
wtomin Dec 16, 2025
94026b0
rm pipeline test
wtomin Dec 16, 2025
db5c134
fix pre-commit
wtomin Dec 16, 2025
14f17e2
fix ci
wtomin Dec 17, 2025
1c55873
extend time out minutes
wtomin Dec 17, 2025
adf5ae7
test sp pipeline
wtomin Dec 17, 2025
9d14371
change docs structure
wtomin Dec 17, 2025
824f452
remove ring degree
wtomin Dec 17, 2025
b99f48c
fix docs
wtomin Dec 17, 2025
19 changes: 18 additions & 1 deletion .buildkite/pipeline.yml
Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@ steps:
queue: "cpu_queue_premerge"

- label: "Diffusion Model Test"
timeout_in_minutes: 15
timeout_in_minutes: 20
depends_on: image-build
commands:
- pytest -s -v tests/e2e/offline_inference/test_t2i_model.py
@@ -49,6 +49,23 @@ steps:
volumes:
- "/fsx/hf_cache:/fsx/hf_cache"

- label: "Diffusion Parallelism Test"
timeout_in_minutes: 15
depends_on: image-build
commands:
- pytest -s -v tests/e2e/offline_inference/test_sequence_parallel.py
agents:
queue: "gpu_4_queue" # g6.12xlarge instance on AWS, has 4 L4 GPU
plugins:
- docker#v5.2.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
always-pull: true
propagate-environment: true
environment:
- "HF_HOME=/fsx/hf_cache"
volumes:
- "/fsx/hf_cache:/fsx/hf_cache"

- label: "Omni Model Test"
timeout_in_minutes: 15
depends_on: image-build
6 changes: 4 additions & 2 deletions docs/.nav.yml
@@ -22,8 +22,10 @@ nav:
- configuration/*
- Diffusion Acceleration:
- Overview: user_guide/diffusion_acceleration.md
- TeaCache: user_guide/teacache.md
- Cache-DiT: user_guide/cache_dit_acceleration.md
- Acceleration Methods:
- TeaCache: user_guide/acceleration/teacache.md
- Cache-DiT: user_guide/acceleration/cache_dit_acceleration.md
- Parallelism Acceleration: user_guide/acceleration/parallelism_acceleration.md
- Models:
- models/supported_models.md
- Developer Guide:
4 changes: 3 additions & 1 deletion docs/configuration/README.md
@@ -12,4 +12,6 @@ For introduction, please check [Introduction for stage config](./stage_configs.m

## Optimization Features

- **[TeaCache Configuration](../user_guide/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
- **[TeaCache Configuration](../user_guide/acceleration/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
- **[Cache-DiT Configuration](../user_guide/acceleration/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models
- **[Parallelism Configuration](../user_guide/acceleration/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for DiT models
2 changes: 1 addition & 1 deletion docs/mkdocs/hooks/generate_examples.py
@@ -195,7 +195,7 @@ def generate(self) -> str:
main_file_rel = self.main_file.relative_to(ROOT_DIR)
content += f'{code_fence}{self.main_file.suffix[1:]}\n--8<-- "{main_file_rel}"\n{code_fence}\n'
else:
with open(self.main_file) as f:
with open(self.main_file, encoding="utf-8") as f:
# Skip the title from md snippets as it's been included above
main_content = f.readlines()[1:]
content += self.fix_relative_links("".join(main_content))
128 changes: 128 additions & 0 deletions docs/user_guide/acceleration/parallelism_acceleration.md
@@ -0,0 +1,128 @@
# Parallelism Acceleration Guide

This guide explains how to use the parallelism methods in vLLM-Omni to speed up diffusion model inference and reduce the memory requirement on each device.

## Overview

The following parallelism methods are currently supported in vLLM-Omni:

1. DeepSpeed Ulysses Sequence Parallel (Ulysses-SP) ([paper](https://arxiv.org/pdf/2309.14509)): Ulysses-SP splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
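
To make the data movement concrete, here is a minimal single-process sketch of the two all-to-all exchanges that Ulysses-SP relies on. It is not vLLM-Omni's implementation: the function names are hypothetical, ranks are simulated with plain Python lists, and each "token" is just a list with one labelled entry per attention head.

```python
def seq_to_head_all_to_all(seq_shards, num_heads):
    """Simulated all-to-all: sequence-sharded layout -> head-sharded layout.

    seq_shards[r] is the slice of the sequence held by rank r. Each token is
    a list with one entry per attention head.
    """
    world_size = len(seq_shards)
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by the number of ranks")
    heads_per_rank = num_heads // world_size
    head_shards = []
    for rank in range(world_size):
        lo, hi = rank * heads_per_rank, (rank + 1) * heads_per_rank
        # Walking the source shards in rank order restores sequence order.
        head_shards.append([token[lo:hi] for shard in seq_shards for token in shard])
    return head_shards


def head_to_seq_all_to_all(head_shards):
    """Simulated inverse all-to-all: head-sharded layout -> sequence-sharded layout."""
    world_size = len(head_shards)
    seq_len = len(head_shards[0])
    if seq_len % world_size != 0:
        raise ValueError("sequence length must be divisible by the number of ranks")
    tokens_per_rank = seq_len // world_size
    seq_shards = []
    for rank in range(world_size):
        start = rank * tokens_per_rank
        shard = []
        for t in range(start, start + tokens_per_rank):
            token = []
            for source in head_shards:
                # Re-assemble all heads of token t from every rank's head slice.
                token.extend(source[t])
            shard.append(token)
        seq_shards.append(shard)
    return seq_shards


world_size, seq_len, num_heads = 2, 4, 4
tokens = [[f"t{t}h{h}" for h in range(num_heads)] for t in range(seq_len)]
per_rank = seq_len // world_size
seq_shards = [tokens[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

# After the first exchange, rank 0 sees every token but only heads 0 and 1.
head_shards = seq_to_head_all_to_all(seq_shards, num_heads)
print(head_shards[0])

# Attention would run here on the local heads; the second exchange restores
# the original sequence-sharded layout.
restored = head_to_seq_all_to_all(head_shards)
print(restored == seq_shards)
```

In this layout each rank attends over the full sequence for its own heads, which is why the number of heads has to be divisible by the Ulysses degree in this sketch.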


The following table shows which models currently support each parallelism method:


| Model | Model Identifier | Ulysses-SP |
|-------|-----------------|-----------|
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ |

### Sequence Parallelism

#### Ulysses-SP

##### Quick Start

An example of using Ulysses-SP is shown below:
```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

# Number of GPUs used for Ulysses sequence parallelism
ulysses_degree = 2

omni = Omni(
    model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree),
)

outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
```

See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.

##### Benchmarks
!!! note "Benchmark Disclaimer"
    These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:

    - Specific model and use case
    - Hardware configuration
    - Careful parameter tuning
    - Different inference settings (e.g., number of steps, image resolution)


To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps on NVIDIA H800 GPUs. The attention backend is `sdpa`.

| Configuration | Ulysses degree | Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 112.5s | 1.0x |
| Ulysses-SP | 2 | 65.2s | 1.73x |
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |
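
The speedup column is the baseline time divided by the measured time. The sketch below recomputes it from the table and also derives a per-device scaling efficiency (speedup divided by Ulysses degree). The efficiency figure is an added illustration, not part of the original benchmark, and it is measured against the diffusers baseline rather than a single-GPU vLLM-Omni run.

```python
# Timings copied from the table above (seconds).
baseline_s = 112.5
ulysses_timings_s = {2: 65.2, 4: 39.6, 8: 30.8}


def scaling_rows(baseline, timings):
    """Return (degree, speedup, efficiency) tuples, rounded to two decimals."""
    rows = []
    for degree, seconds in sorted(timings.items()):
        speedup = baseline / seconds
        # Efficiency is a derived metric: 1.0 would mean perfectly linear scaling.
        efficiency = speedup / degree
        rows.append((degree, round(speedup, 2), round(efficiency, 2)))
    return rows


rows = scaling_rows(baseline_s, ulysses_timings_s)
for degree, speedup, efficiency in rows:
    print(f"degree={degree}: speedup={speedup}x, efficiency={efficiency}")
```

The falling efficiency at higher degrees is consistent with communication and non-parallel work taking a larger share of each step as the per-device sequence shard shrinks.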

##### How to parallelize a new model

If a diffusion model is already deployed in vLLM-Omni and supports single-device inference, follow the steps below to parallelize it with Ulysses-SP.

First, edit the `forward` function of the transformer model in `xxx_model_transformer.py` so that the inputs (image hidden states, positional embeddings, etc.) are split into chunks along the sequence dimension. Taking `qwen_image_transformer.py` as an example:

```diff
class QwenImageTransformer2DModel(nn.Module):
    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor = None,
        ...
    ):
+       if self.parallel_config.sequence_parallel_size > 1:
+           hidden_states = torch.chunk(hidden_states, get_sequence_parallel_world_size(), dim=-2)[
+               get_sequence_parallel_rank()
+           ]

        hidden_states = self.img_in(hidden_states)

        ...
        image_rotary_emb = self.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)

+       def get_rotary_emb_chunk(freqs):
+           freqs = torch.chunk(freqs, get_sequence_parallel_world_size(), dim=0)[get_sequence_parallel_rank()]
+           return freqs

+       if self.parallel_config.sequence_parallel_size > 1:
+           img_freqs, txt_freqs = image_rotary_emb
+           img_freqs = get_rotary_emb_chunk(img_freqs)
+           image_rotary_emb = (img_freqs, txt_freqs)
```
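
One caveat with indexing the result of `torch.chunk` by rank: per the PyTorch documentation, when the dimension is not divisible by the number of chunks, the last chunk is smaller and fewer chunks than requested may be returned. The sketch below mirrors that documented sizing rule in plain Python so a model author can check a sequence length up front. The helper names and the example sequence length are hypothetical.

```python
import math


def chunk_sizes(seq_len, world_size):
    """Chunk sizes along one dimension, following torch.chunk's documented rule."""
    if seq_len <= 0 or world_size <= 0:
        raise ValueError("seq_len and world_size must be positive")
    split = math.ceil(seq_len / world_size)
    full, rest = divmod(seq_len, split)
    return [split] * full + ([rest] if rest else [])


def is_evenly_shardable(seq_len, world_size):
    """True when every rank would receive a shard of identical length."""
    sizes = chunk_sizes(seq_len, world_size)
    return len(sizes) == world_size and len(set(sizes)) == 1


# Hypothetical sequence length that divides evenly across 8 ranks.
print(chunk_sizes(16384, 8), is_evenly_shardable(16384, 8))

# 13 tokens across 6 ranks yields only 5 chunks, so rank 5 would have no shard.
print(chunk_sizes(13, 6), is_evenly_shardable(13, 6))
```

A check like this before sharding gives a clear error instead of an index error or mismatched shard shapes on some ranks.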

Next, at the end of the `forward` function, call `get_sp_group().all_gather` to gather the chunked outputs across devices and concatenate them along the sequence dimension.


```diff
class QwenImageTransformer2DModel(nn.Module):
    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor = None,
        ...
    ):
        # Use only the image part (hidden_states) from the dual-stream blocks
        hidden_states = self.norm_out(hidden_states, temb)
        output = self.proj_out(hidden_states)

+       if self.parallel_config.sequence_parallel_size > 1:
+           output = get_sp_group().all_gather(output, dim=-2)
        return Transformer2DModelOutput(sample=output)
```
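
The shard-then-gather pattern can be checked in isolation. The following single-process sketch assumes the gather concatenates per-rank outputs in rank order along the sequence dimension, as described above. Ranks are simulated with lists, `project` is a hypothetical stand-in for a per-token layer such as the output projection, and none of the names come from vLLM-Omni.

```python
def shard_sequence(tokens, world_size, rank):
    """Return the contiguous slice of the sequence owned by `rank`."""
    if len(tokens) % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    per_rank = len(tokens) // world_size
    return tokens[rank * per_rank:(rank + 1) * per_rank]


def all_gather_sequence(per_rank_outputs):
    """Concatenate per-rank outputs in rank order along the sequence dimension."""
    gathered = []
    for output in per_rank_outputs:
        gathered.extend(output)
    return gathered


def project(tokens):
    """Stand-in for a per-token layer (each token is processed independently)."""
    return [2 * t + 1 for t in tokens]


world_size = 4
full_sequence = list(range(8))

# Each simulated rank processes only its own shard.
per_rank_outputs = [
    project(shard_sequence(full_sequence, world_size, rank))
    for rank in range(world_size)
]

# Gathering in rank order reproduces the unsharded result.
gathered = all_gather_sequence(per_rank_outputs)
print(gathered == project(full_sequence))
```

This equivalence only holds directly for per-token layers. Attention mixes tokens, which is why Ulysses-SP inserts all-to-all exchanges around it.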

Finally, set the parallel configuration, pass it to `Omni`, and start parallel inference:
```diff
from vllm_omni import Omni
+from vllm_omni.diffusion.data import DiffusionParallelConfig
+ulysses_degree = 2

omni = Omni(
    model="Qwen/Qwen-Image",
+    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree),
)

outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
```
77 changes: 62 additions & 15 deletions docs/user_guide/diffusion_acceleration.md
@@ -1,40 +1,46 @@
# Diffusion Acceleration Overview

vLLM-Omni supports various cache acceleration methods to speed up diffusion model inference with minimal quality degradation. These methods intelligently cache intermediate computations to avoid redundant work across diffusion timesteps.
vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods**, which cache intermediate computations to avoid redundant work across diffusion timesteps, and **parallelism methods**, which distribute the computation across multiple devices.

## Supported Acceleration Methods

vLLM-Omni currently supports two main cache acceleration backends:

1. **[TeaCache](teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
2. **[Cache-DiT](cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
- **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
- **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
- **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
1. **[TeaCache](acceleration/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
2. **[Cache-DiT](acceleration/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
- **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
- **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
- **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking

Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality.

vLLM-Omni also supports sequence parallelism (SP) for diffusion models, which currently includes:

1. [Ulysses-SP](acceleration/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.

## Quick Comparison

### Cache Methods

| Method | Configuration | Description | Best For |
|--------|--------------|-------------|----------|
| **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality |
| **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control |

## Supported Models

The following table shows which models are currently supported by each cache backend:
The following table shows which models are currently supported by each acceleration method:

| Model | Model Identifier | TeaCache | Cache-DiT |
|-------|-----------------|----------|-----------|
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ❌ | ✅ |
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP |
|-------|-----------------|----------|-----------|-----------|
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ | ✅ | ❌ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ❌ | ✅ | ✅ |


## Performance Benchmarks

The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Image-Edit** models with 50 inference steps:
The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Image-Edit** models generating 1024x1024 images with 50 inference steps:

!!! note "Benchmark Disclaimer"
These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
@@ -55,6 +61,14 @@ The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Im
| **Qwen/Qwen-Image-Edit** | None | No acceleration | 51.5s | 1.0x | Baseline (diffusers) |
| **Qwen/Qwen-Image-Edit** | Cache-DiT | Default (Fn=1, Bn=0, W=4, TaylorSeer disabled, SCM disabled) | 21.6s | **2.38x** | - |

To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps on NVIDIA H800 GPUs. The attention backend is `sdpa`.

| Configuration | Ulysses degree | Generation Time | Speedup |
|---------------|----------------|---------|---------|
| **Baseline (diffusers)** | - | 112.5s | 1.0x |
| Ulysses-SP | 2 | 65.2s | 1.73x |
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |

## Quick Start

@@ -92,9 +106,42 @@ omni = Omni(
outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
```

### Using Ulysses-SP

Run text-to-image:
```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

# Number of GPUs used for Ulysses sequence parallelism
ulysses_degree = 2

omni = Omni(
    model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree),
)

outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
```


Run image-to-image:
```python
from PIL import Image

from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

ulysses_degree = 2

# Placeholder path: replace with the image you want to edit
input_image = Image.open("input.png")

omni = Omni(
    model="Qwen/Qwen-Image-Edit",
    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree),
)

outputs = omni.generate(prompt="turn this cat to a dog",
                        pil_image=input_image, num_inference_steps=50)
```

## Documentation

For detailed information on each acceleration method:

- **[TeaCache Guide](teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[TeaCache Guide](acceleration/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](acceleration/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[Sequence Parallelism](acceleration/parallelism_acceleration.md#sequence-parallelism)** - Guide to configuring sequence parallelism
11 changes: 11 additions & 0 deletions examples/offline_inference/image_to_image/image_edit.py
@@ -24,6 +24,7 @@
import torch
from PIL import Image

from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.utils.platform_utils import detect_device_type, is_npu

@@ -94,6 +95,13 @@ def parse_args() -> argparse.Namespace:
"Default: None (no cache acceleration)."
),
)
parser.add_argument(
"--ulysses_degree",
type=int,
default=1,
help="Number of GPUs used for ulysses sequence parallelism.",
)

return parser.parse_args()


@@ -115,6 +123,7 @@ def main():
vae_use_slicing = is_npu()
vae_use_tiling = is_npu()

parallel_config = DiffusionParallelConfig(ulysses_degree=args.ulysses_degree)
# Configure cache based on backend type
cache_config = None
if args.cache_backend == "cache_dit":
@@ -145,6 +154,7 @@ def main():
vae_use_tiling=vae_use_tiling,
cache_backend=args.cache_backend,
cache_config=cache_config,
parallel_config=parallel_config,
)
print("Pipeline loaded")

@@ -154,6 +164,7 @@ def main():
print(f" Model: {args.model}")
print(f" Inference steps: {args.num_inference_steps}")
print(f" Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
print(f" Parallel configuration: ulysses_degree={args.ulysses_degree}")
print(f" Input image size: {input_image.size}")
print(f"{'=' * 60}\n")
