Merged
33 commits
0dc25f3
feat: add Ring Attention support for sequence parallelism
mxuax Dec 29, 2025
263c41d
fix: merge Returns docstring into single description
mxuax Dec 29, 2025
7bb7d39
fix: put Returns description on single line for griffe
mxuax Dec 29, 2025
d78ef5c
fix: put Returns description on single line for griffe
mxuax Dec 29, 2025
c6858e6
chore: sync and trigger CI rebuild
mxuax Dec 30, 2025
04bf625
fix: merge Returns docstring into single description
mxuax Dec 30, 2025
e611333
Merge branch 'main' into usp
mxuax Dec 30, 2025
e22f68d
refactor: clean up ring attention backend
mxuax Dec 30, 2025
02490dc
Merge branch 'usp' of https://github.com/mxuax/vllm-omni-ring-attn in…
mxuax Dec 30, 2025
fe6d67f
vllm-omni-ring-attn\tests\e2e\offline_inference\test_sequence_paralle…
mxuax Dec 30, 2025
e235261
remove backward in ring_flash_attn.py and ring_pytorch_attn.py
mxuax Dec 30, 2025
e4bdd84
modify test file
mxuax Dec 30, 2025
b30971a
fix doc string
mxuax Dec 30, 2025
26d6106
modify test image return type error
mxuax Dec 30, 2025
98dd6b0
modify test image
mxuax Dec 30, 2025
d8a09b7
modify test image
mxuax Dec 30, 2025
bb29cf7
modify test image
mxuax Dec 30, 2025
70ef57d
Merge branch 'main' into usp
ZJY0516 Dec 30, 2025
3f4e265
modify ring_pytorch_attn default backends to be efficient spda
mxuax Dec 31, 2025
727b9af
Merge branch 'usp' of https://github.com/mxuax/vllm-omni-ring-attn in…
mxuax Dec 31, 2025
a491391
add debug lines for ci
mxuax Dec 31, 2025
c30c755
add debug lines for ci
mxuax Dec 31, 2025
b7b0bad
fixed bug test_sp wrongly access output.request_
mxuax Dec 31, 2025
aaa41c0
add shm-size: 8gb in pipeline.yml for ring communication requirements
mxuax Dec 31, 2025
86769b3
modify test_comm.py and add it to pipeline.yml to check the p2p commu…
mxuax Dec 31, 2025
ae3bd1c
modify the flash call
mxuax Dec 31, 2025
d537d16
modify pytorch_attn for continuous tensor passing
mxuax Dec 31, 2025
b349eae
finalize test parameer
mxuax Dec 31, 2025
c0f1db7
Merge branch 'main' into usp
mxuax Dec 31, 2025
698e7f2
Accelerate Diffusion Parallelism Test
mxuax Dec 31, 2025
6c77f7e
Merge branch 'usp' of https://github.com/mxuax/vllm-omni-ring-attn in…
mxuax Dec 31, 2025
1aea35f
fix time limitation
mxuax Dec 31, 2025
ed6182d
Merge branch 'main' into usp
hsliuustc0106 Dec 31, 2025
1 change: 1 addition & 0 deletions .buildkite/pipeline.yml
@@ -71,6 +71,7 @@ steps:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
always-pull: true
propagate-environment: true
shm-size: "8gb"
environment:
- "HF_HOME=/fsx/hf_cache"
volumes:
123 changes: 110 additions & 13 deletions docs/user_guide/acceleration/parallelism_acceleration.md
@@ -8,27 +8,29 @@ The following parallelism methods are currently supported in vLLM-Omni:

1. DeepSpeed Ulysses Sequence Parallel (DeepSpeed Ulysses-SP) ([arxiv paper](https://arxiv.org/pdf/2309.14509)): Ulysses-SP splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.

2. [Ring-Attention](#ring-attention): splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
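
As a rough illustration of what "splitting along the sequence dimension" means for both methods, the sketch below partitions token positions into contiguous shards, one per GPU, and shows the head split that Ulysses-SP relies on during attention. `shard_sequence` and `heads_per_rank` are hypothetical helpers written for this example, not vLLM-Omni functions, and the sketch assumes even divisibility.

```python
def shard_sequence(seq_len: int, sp_size: int) -> list[range]:
    """Split token positions 0..seq_len-1 into sp_size contiguous shards."""
    if sp_size < 1:
        raise ValueError("sp_size must be >= 1")
    if seq_len % sp_size != 0:
        # Assumption of this sketch: no padding, so the split must be even.
        raise ValueError("seq_len must be divisible by sp_size")
    shard_len = seq_len // sp_size
    return [range(r * shard_len, (r + 1) * shard_len) for r in range(sp_size)]


def heads_per_rank(num_heads: int, ulysses_degree: int) -> int:
    """Heads each rank attends over after the Ulysses all-to-all.

    Ulysses-SP trades the sequence shard for a head shard inside attention,
    so the head count has to divide evenly across the Ulysses group.
    """
    if num_heads % ulysses_degree != 0:
        raise ValueError("num_heads must be divisible by ulysses_degree")
    return num_heads // ulysses_degree


if __name__ == "__main__":
    for rank, shard in enumerate(shard_sequence(seq_len=4096, sp_size=4)):
        print(f"rank {rank}: tokens {shard.start}..{shard.stop - 1}")
    print("heads per rank:", heads_per_rank(num_heads=24, ulysses_degree=4))
```

Ring-Attention keeps each rank on its token shard for the whole attention computation, while Ulysses-SP temporarily regroups so that each rank sees the full sequence for a subset of heads.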


The following table shows which models are currently supported by parallelism method:

### ImageGen

| Model | Model Identifier | Ulysses-SP |
|-------|------------------|-----------|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ |
| Model | Model Identifier | Ulysses-SP | Ring-SP |
|-------|------------------|-----------|---------|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ❌ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ❌ |
| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ❌ |
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ |
| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ | ❌ |

### VideoGen

| Model | Model Identifier | Ulysses-SP |
|-------|------------------|-----------|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ |
| Model | Model Identifier | Ulysses-SP | Ring-SP |
|-------|------------------|-----------|---------|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ❌ |

### Sequence Parallelism

@@ -80,6 +82,101 @@ To measure the parallelism methods, we run benchmarks with **Qwen/Qwen-Image** m
| Ulysses-SP | 4 | 39.6s | 2.84x |
| Ulysses-SP | 8 | 30.8s | 3.65x |

#### Ring-Attention

Ring-Attention ([arxiv paper](https://arxiv.org/abs/2310.01889)) splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results. Unlike Ulysses-SP, which uses all-to-all communication, Ring-Attention keeps the sequence dimension sharded throughout the computation and circulates Key/Value blocks through a ring topology.
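
To make the accumulation step concrete, here is a minimal single-process sketch of the idea, assuming one attention head, no masking, and plain Python lists instead of tensors. It is not the backend added in this PR: the function names are hypothetical, and the "send to the next rank" step is simulated by rotating a list rather than by real P2P communication. Each rank keeps its Query shard, sees one Key/Value block per step, and folds it into running softmax statistics.

```python
import math


def split_rows(rows, parts):
    """Split a list of rows into equal contiguous shards (assumes even split)."""
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]


def full_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v on a single device."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        denom = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / denom
                    for c in range(len(v[0]))])
    return out


def ring_attention(q_shards, k_shards, v_shards):
    """Simulate ring attention: Q stays put, K/V blocks travel around the ring."""
    world = len(q_shards)
    scale = 1.0 / math.sqrt(len(q_shards[0][0]))
    dim_v = len(v_shards[0][0])
    # Running softmax state per rank and per query row.
    run_max = [[-math.inf] * len(qs) for qs in q_shards]
    run_sum = [[0.0] * len(qs) for qs in q_shards]
    run_out = [[[0.0] * dim_v for _ in qs] for qs in q_shards]
    kv = list(zip(k_shards, v_shards))  # kv[r] is the block rank r holds now

    for _ in range(world):
        for rank in range(world):
            k_blk, v_blk = kv[rank]
            for i, qi in enumerate(q_shards[rank]):
                scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
                new_max = max(run_max[rank][i], max(scores))
                # Rescale everything accumulated under the previous maximum.
                fix = math.exp(run_max[rank][i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[rank][i] = run_sum[rank][i] * fix + sum(weights)
                run_out[rank][i] = [
                    o * fix + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                    for c, o in enumerate(run_out[rank][i])
                ]
                run_max[rank][i] = new_max
        # Pass each K/V block to the next rank (real runs use P2P send/recv).
        kv = kv[-1:] + kv[:-1]

    # Normalise once at the end; the output stays sharded by rank.
    return [[[o / run_sum[r][i] for o in row] for i, row in enumerate(run_out[r])]
            for r in range(world)]
```

Concatenating the per-rank outputs of `ring_attention` should match `full_attention` on the unsharded inputs up to floating-point error, which is a convenient correctness check for any real implementation as well.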

##### Offline Inference

An example offline inference script using Ring-Attention is shown below:
```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

ring_degree = 2  # number of GPUs in the ring

omni = Omni(
    model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ring_degree=ring_degree),
)

outputs = omni.generate(
    prompt="A cat sitting on a windowsill",
    num_inference_steps=50,
    width=2048,
    height=2048,
)
```

See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.


##### Online Serving

You can enable Ring-Attention in online serving for diffusion models via `--ring`:

```bash
# Text-to-image (requires >= 2 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --ring 2
```

##### Benchmarks
!!! note "Benchmark Disclaimer"
    These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:

    - Specific model and use case
    - Hardware configuration
    - Careful parameter tuning
    - Different inference settings (e.g., number of steps, image resolution)


To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating images (**1024x1024** as long sequence input) with 50 inference steps. The hardware devices are NVIDIA A100 GPUs, and the attention backend is `flash_attn`.

| Configuration | Ring degree | Generation Time | Speedup |
|---------------|-------------|-----------------|---------|
| **Baseline (diffusers)** | - | 45.2s | 1.0x |
| Ring-Attention | 2 | 29.9s | 1.51x |
| Ring-Attention | 4 | 23.3s | 1.94x |
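
The speedup column is the baseline time divided by the measured time. The short sketch below recomputes it from the Ring-Attention rows and also derives a parallel efficiency figure (speedup divided by ring degree). The efficiency metric is an addition for illustration and assumes the diffusers baseline ran on a single GPU, which the table does not state explicitly.

```python
def speedup(baseline_s: float, time_s: float) -> float:
    """Baseline generation time divided by the parallel generation time."""
    return baseline_s / time_s


def parallel_efficiency(baseline_s: float, time_s: float, num_gpus: int) -> float:
    """Speedup per GPU; 1.0 would mean perfect linear scaling."""
    return speedup(baseline_s, time_s) / num_gpus


if __name__ == "__main__":
    baseline_s = 45.2                 # Baseline (diffusers) row
    ring_times_s = {2: 29.9, 4: 23.3}  # ring degree -> generation time
    for degree, time_s in ring_times_s.items():
        s = speedup(baseline_s, time_s)
        e = parallel_efficiency(baseline_s, time_s, degree)
        print(f"ring_degree={degree}: speedup {s:.2f}x, efficiency {e:.0%}")
```

Rounded to two decimals this reproduces the 1.51x and 1.94x entries. Under the single-GPU baseline assumption, efficiency falls from roughly 76% at degree 2 to roughly 48% at degree 4.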


#### Hybrid Ulysses + Ring

You can combine Ulysses-SP and Ring-Attention for larger-scale parallelism. The total sequence-parallel size equals `ulysses_degree × ring_degree`.
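
The relationship between the two degrees can be sketched as a small sanity check to run before launching. `SequenceParallelLayout` is a hypothetical helper written for this example and is not part of vLLM-Omni. The divisibility checks are assumptions of the sketch: head-wise splitting for Ulysses-SP and an even token split across all sequence-parallel ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceParallelLayout:
    """Hypothetical description of a hybrid Ulysses + Ring configuration."""

    ulysses_degree: int = 1
    ring_degree: int = 1

    @property
    def sp_size(self) -> int:
        # Total sequence-parallel size, as stated in the text above.
        return self.ulysses_degree * self.ring_degree

    def describe(self, num_gpus: int, seq_len: int, num_heads: int) -> dict:
        """Validate the layout and report per-GPU work, or raise ValueError."""
        if self.ulysses_degree < 1 or self.ring_degree < 1:
            raise ValueError("degrees must be >= 1")
        if self.sp_size > num_gpus:
            raise ValueError(f"need {self.sp_size} GPUs, only {num_gpus} available")
        if num_heads % self.ulysses_degree != 0:
            raise ValueError("num_heads must be divisible by ulysses_degree")
        if seq_len % self.sp_size != 0:
            raise ValueError("seq_len must be divisible by ulysses_degree * ring_degree")
        return {
            "sp_size": self.sp_size,
            "tokens_per_gpu": seq_len // self.sp_size,
            "heads_per_gpu_in_attention": num_heads // self.ulysses_degree,
        }


if __name__ == "__main__":
    layout = SequenceParallelLayout(ulysses_degree=2, ring_degree=2)
    print(layout.describe(num_gpus=4, seq_len=4096, num_heads=24))
```

The sequence length and head count in the usage lines are placeholder values, not properties of Qwen-Image.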

##### Offline Inference

```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

# Hybrid: 2 Ulysses × 2 Ring = 4 GPUs total
omni = Omni(
    model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ulysses_degree=2, ring_degree=2),
)

outputs = omni.generate(
    prompt="A cat sitting on a windowsill",
    num_inference_steps=50,
    width=2048,
    height=2048,
)
```

##### Online Serving

```bash
# Text-to-image (requires >= 4 GPUs)
vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 --ring 2
```

##### Benchmarks
!!! note "Benchmark Disclaimer"
    These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:

    - Specific model and use case
    - Hardware configuration
    - Careful parameter tuning
    - Different inference settings (e.g., number of steps, image resolution)


To measure the parallelism methods, we run benchmarks with the **Qwen/Qwen-Image** model generating images (**1024x1024** as long sequence input) with 50 inference steps. The hardware devices are NVIDIA A100 GPUs, and the attention backend is `flash_attn`.

| Configuration | Ulysses degree | Ring degree | Generation Time | Speedup |
|---------------|----------------|-------------|-----------------|---------|
| **Baseline (diffusers)** | - | - | 45.2s | 1.0x |
| Hybrid Ulysses + Ring | 2 | 2 | 24.3s | 1.87x |


##### How to parallelize a new model

If a diffusion model has been deployed in vLLM-Omni and supports single-card inference, you can refer to the following instructions to parallelize it with [Ulysses-SP](https://arxiv.org/pdf/2309.14509).
25 changes: 25 additions & 0 deletions docs/user_guide/diffusion_acceleration.md
@@ -17,6 +17,7 @@ Both methods can provide significant speedups (typically **1.5x-2.0x**) while ma
vLLM-Omni also supports sequence parallelism (SP) for diffusion models, which includes:

1. [Ulysses-SP](acceleration/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
2. [Ring-Attention](acceleration/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.

## Quick Comparison

@@ -33,6 +34,14 @@ The following table shows which models are currently supported by each accelerat

### ImageGen

<<<<<<< HEAD
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention |
|-------|------------------|----------|-----------|------------|----------------|
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ | ✅ | ❌ | ❌ |
| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | - |
| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ❌ | ✅ | ✅ | - |
=======
| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP |
|-------|------------------|----------|-----------|-----------|
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ |
@@ -151,6 +160,22 @@ outputs = omni.generate(prompt="turn this cat to a dog",
pil_image=input_image, num_inference_steps=50)
```

### Using Ring-Attention

Run text-to-image:
```python
from vllm_omni import Omni
from vllm_omni.diffusion.data import DiffusionParallelConfig

ring_degree = 2  # number of GPUs in the ring

omni = Omni(
    model="Qwen/Qwen-Image",
    parallel_config=DiffusionParallelConfig(ring_degree=ring_degree),
)

outputs = omni.generate(
    prompt="A cat sitting on a windowsill",
    num_inference_steps=50,
    width=2048,
    height=2048,
)
```

## Documentation

For detailed information on each acceleration method:
12 changes: 8 additions & 4 deletions examples/offline_inference/image_to_image/image_edit.py
@@ -159,7 +159,12 @@ def parse_args() -> argparse.Namespace:
        default=1,
        help="Number of GPUs used for ulysses sequence parallelism.",
    )

    parser.add_argument(
        "--ring_degree",
        type=int,
        default=1,
        help="Number of GPUs used for ring sequence parallelism.",
    )
    parser.add_argument("--layers", type=int, default=4, help="Number of layers to decompose the input image into.")
    parser.add_argument(
        "--resolution",
@@ -268,8 +273,7 @@ def main():
# Enable VAE memory optimizations on NPU
vae_use_slicing = is_npu()
vae_use_tiling = is_npu()

parallel_config = DiffusionParallelConfig(ulysses_degree=args.ulysses_degree)
parallel_config = DiffusionParallelConfig(ulysses_degree=args.ulysses_degree, ring_degree=args.ring_degree)
# Configure cache based on backend type
cache_config = None
if args.cache_backend == "cache_dit":
@@ -315,7 +319,7 @@ def main():
print(f" Image {idx + 1} size: {img.size}")
else:
print(f" Input image size: {input_image.size}")
print(f" Parallel configuration: ulysses_degree={args.ulysses_degree}")
print(f" Parallel configuration: ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}")
print(f"{'=' * 60}\n")

try:
12 changes: 9 additions & 3 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -66,7 +66,12 @@ def parse_args() -> argparse.Namespace:
        default=1,
        help="Number of GPUs used for ulysses sequence parallelism.",
    )

    parser.add_argument(
        "--ring_degree",
        type=int,
        default=1,
        help="Number of GPUs used for ring sequence parallelism.",
    )
    return parser.parse_args()


@@ -108,7 +113,8 @@ def main():
# (e.g., QwenImagePipeline or FluxPipeline)
}

parallel_config = DiffusionParallelConfig(ulysses_degree=args.ulysses_degree)
# assert args.ring_degree == 1, "Ring attention is not supported yet"
parallel_config = DiffusionParallelConfig(ulysses_degree=args.ulysses_degree, ring_degree=args.ring_degree)
omni = Omni(
model=args.model,
vae_use_slicing=vae_use_slicing,
@@ -124,7 +130,7 @@ def main():
print(f" Model: {args.model}")
print(f" Inference steps: {args.num_inference_steps}")
print(f" Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
print(f" Parallel configuration: ulysses_degree={args.ulysses_degree}")
print(f" Parallel configuration: ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}")
print(f" Image size: {args.width}x{args.height}")
print(f"{'=' * 60}\n")
