vllm-project · hsliuustc0106 · Dec 17, 2025 · Dec 4, 2025 · Dec 5, 2025 · Dec 8, 2025
@@ -16,7 +16,7 @@ steps:
       queue: "cpu_queue_premerge"
 
   - label: "Diffusion Model Test"
-    timeout_in_minutes: 15
+    timeout_in_minutes: 20
     depends_on: image-build
     commands:
       - pytest -s -v tests/e2e/offline_inference/test_t2i_model.py
@@ -49,6 +49,23 @@ steps:
           volumes:
             - "/fsx/hf_cache:/fsx/hf_cache"
 
+  - label: "Diffusion Parallelism Test"
+    timeout_in_minutes: 15
+    depends_on: image-build
+    commands:
+      - pytest -s -v tests/e2e/offline_inference/test_sequence_parallel.py
+    agents:
+      queue: "gpu_4_queue" # g6.12xlarge instance on AWS with 4 L4 GPUs
+    plugins:
+      - docker#v5.2.0:
+          image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+          always-pull: true
+          propagate-environment: true
+          environment:
+            - "HF_HOME=/fsx/hf_cache"
+          volumes:
+            - "/fsx/hf_cache:/fsx/hf_cache"
+
   - label: "Omni Model Test"
     timeout_in_minutes: 15
     depends_on: image-build

@@ -22,8 +22,10 @@ nav:
     - configuration/*
   - Diffusion Acceleration:
     - Overview: user_guide/diffusion_acceleration.md
-    - TeaCache: user_guide/teacache.md
-    - Cache-DiT: user_guide/cache_dit_acceleration.md
+    - Acceleration Methods:
+      - TeaCache: user_guide/acceleration/teacache.md
+      - Cache-DiT: user_guide/acceleration/cache_dit_acceleration.md
+      - Parallelism Acceleration: user_guide/acceleration/parallelism_acceleration.md
   - Models:
     - models/supported_models.md
 - Developer Guide:

@@ -12,4 +12,6 @@ For introduction, please check [Introduction for stage config](./stage_configs.m
 
 ## Optimization Features
 
-- **[TeaCache Configuration](../user_guide/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
+- **[TeaCache Configuration](../user_guide/acceleration/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
+- **[Cache-DiT Configuration](../user_guide/acceleration/cache_dit_acceleration.md)** - Enable Cache-DiT as the cache acceleration backend for DiT models
+- **[Parallelism Configuration](../user_guide/acceleration/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for DiT models
@@ -195,7 +195,7 @@ def generate(self) -> str:
             main_file_rel = self.main_file.relative_to(ROOT_DIR)
             content += f'{code_fence}{self.main_file.suffix[1:]}\n--8<-- "{main_file_rel}"\n{code_fence}\n'
         else:
-            with open(self.main_file) as f:
+            with open(self.main_file, encoding="utf-8") as f:
                 # Skip the title from md snippets as it's been included above
                 main_content = f.readlines()[1:]
             content += self.fix_relative_links("".join(main_content))

@@ -0,0 +1,128 @@
+# Parallelism Acceleration Guide
+
+This guide explains how to use the parallelism methods in vLLM-Omni to speed up diffusion model inference and to reduce the memory requirement on each device.
+
+## Overview
+
+The following parallelism methods are currently supported in vLLM-Omni:
+
+1. DeepSpeed Ulysses Sequence Parallel (Ulysses-SP) ([paper](https://arxiv.org/pdf/2309.14509)): Ulysses-SP splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
+
+
+The following table shows which models are currently supported by each parallelism method:
+
+
+| Model | Model Identifier |  Ulysses-SP |
+|-------|-----------------|-----------|
+| **Qwen-Image** | `Qwen/Qwen-Image` |  ✅ |
+| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ |
+| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ |
+
+### Sequence Parallelism
+
+#### Ulysses-SP
+
+##### Quick Start
+
+An example of using Ulysses-SP is shown below:
+```python
+from vllm_omni import Omni
+from vllm_omni.diffusion.data import DiffusionParallelConfig
+ulysses_degree = 2
+
+omni = Omni(
+    model="Qwen/Qwen-Image",
+    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
+)
+
+outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+```
+
+See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example.
+
+##### Benchmarks
+!!! note "Benchmark Disclaimer"
+    These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
+
+    - Specific model and use case
+    - Hardware configuration
+    - Careful parameter tuning
+    - Different inference settings (e.g., number of steps, image resolution)
+
+
+To measure the parallelism methods, we ran benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA H800 GPUs, and the attention backend is `sdpa`.
+
+| Configuration | Ulysses degree |Generation Time | Speedup |
+|---------------|----------------|---------|---------|
+| **Baseline (diffusers)** | - | 112.5s | 1.0x |
+| Ulysses-SP  |  2  |  65.2s | 1.73x |
+| Ulysses-SP  |  4  | 39.6s | 2.84x |
+| Ulysses-SP  |  8  | 30.8s | 3.65x |
+
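The Speedup column above is the baseline time divided by the measured time. As a rough sanity check, the sketch below recomputes it and also derives a scaling efficiency (speedup divided by the Ulysses degree). The efficiency figure is not part of the original table, and because the baseline is diffusers rather than single-GPU vLLM-Omni, it mixes framework differences with parallel scaling. The helper names are illustrative only.

```python
# Generation times copied from the benchmark table (seconds).
BASELINE_SECONDS = 112.5
ULYSSES_SECONDS = {2: 65.2, 4: 39.6, 8: 30.8}  # ulysses_degree -> generation time


def speedup(baseline: float, measured: float) -> float:
    """Speedup relative to the single-device baseline."""
    return baseline / measured


def scaling_efficiency(baseline: float, measured: float, degree: int) -> float:
    """Fraction of ideal linear scaling achieved when using `degree` devices."""
    return speedup(baseline, measured) / degree


for degree, seconds in ULYSSES_SECONDS.items():
    s = speedup(BASELINE_SECONDS, seconds)
    e = scaling_efficiency(BASELINE_SECONDS, seconds, degree)
    print(f"ulysses_degree={degree}: speedup={s:.2f}x, efficiency={e:.0%}")
```

With these numbers the efficiency falls from about 86% at degree 2 to about 46% at degree 8, which suggests that communication and other non-parallel work take up a growing share of the time as the degree increases.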
+##### How to parallelize a new model
+
+If a diffusion model is already deployed in vLLM-Omni and supports single-card inference, follow the instructions below to parallelize it with Ulysses-SP.
+
+First, edit the `forward` function of the `TransformerModel` in `xxx_model_transformer.py` so that the inputs (image hidden states, positional embeddings, etc.) are split into chunks along the sequence dimension. Taking `qwen_image_transformer.py` as an example:
+
+```diff
+class QwenImageTransformer2DModel(nn.Module):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        encoder_hidden_states: torch.Tensor = None,
+        ...
+    ):
++   if self.parallel_config.sequence_parallel_size > 1:
++       hidden_states = torch.chunk(hidden_states, get_sequence_parallel_world_size(), dim=-2)[
++           get_sequence_parallel_rank()
++       ]
+
+    hidden_states = self.img_in(hidden_states)
+
+    ...
+    image_rotary_emb = self.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)
+
++   def get_rotary_emb_chunk(freqs):
++       freqs = torch.chunk(freqs, get_sequence_parallel_world_size(), dim=0)[get_sequence_parallel_rank()]
++       return freqs
+
++   if self.parallel_config.sequence_parallel_size > 1:
++       img_freqs, txt_freqs = image_rotary_emb
++       img_freqs = get_rotary_emb_chunk(img_freqs)
++       image_rotary_emb = (img_freqs, txt_freqs)
+```
+
+Next, at the end of the `forward` function, call `get_sp_group().all_gather` to gather the chunked outputs across devices and concatenate them along the sequence dimension.
+
+
+```diff
+class QwenImageTransformer2DModel(nn.Module):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        encoder_hidden_states: torch.Tensor = None,
+        ...
+    ):
+    # Use only the image part (hidden_states) from the dual-stream blocks
+    hidden_states = self.norm_out(hidden_states, temb)
+    output = self.proj_out(hidden_states)
+
++   if self.parallel_config.sequence_parallel_size > 1:
++       output = get_sp_group().all_gather(output, dim=-2)
+    return Transformer2DModelOutput(sample=output)
+```
+
+Finally, create the parallel configuration, pass it to `Omni`, and start parallel inference:
+```diff
+from vllm_omni import Omni
++from vllm_omni.diffusion.data import DiffusionParallelConfig
+ulysses_degree = 2
+
+omni = Omni(
+    model="Qwen/Qwen-Image",
++    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
+)
+
+outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
+```
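The chunk-then-gather pattern in the steps above can be illustrated without any GPUs. The sketch below is a plain-Python stand-in, not the vLLM-Omni API: `chunk_sequence`, `all_gather`, and `run_sharded` are hypothetical helpers that simulate the ranks in a loop, and it assumes the sequence length divides evenly by the world size. It shows that splitting along the sequence dimension at the start of `forward` and gathering in rank order at the end reproduces the single-device result for any per-token operation.

```python
from typing import Callable, List


def chunk_sequence(tokens: List[float], world_size: int) -> List[List[float]]:
    """Split a sequence into `world_size` equal contiguous shards, one per rank."""
    if len(tokens) % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    shard_len = len(tokens) // world_size
    return [tokens[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]]) -> List[float]:
    """Concatenate the per-rank shards in rank order, as an all-gather would."""
    return [token for shard in shards for token in shard]


def run_sharded(
    tokens: List[float], world_size: int, per_token_fn: Callable[[float], float]
) -> List[float]:
    """Simulate every rank processing only its own shard, then gather the outputs."""
    shards = chunk_sequence(tokens, world_size)
    processed = [[per_token_fn(t) for t in shard] for shard in shards]
    return all_gather(processed)


def scale_and_shift(x: float) -> float:
    """Stand-in for a layer that acts on each token independently."""
    return 2.0 * x + 1.0


tokens = [float(i) for i in range(8)]
single_device = [scale_and_shift(t) for t in tokens]
assert run_sharded(tokens, world_size=4, per_token_fn=scale_and_shift) == single_device
```

Attention is not a per-token operation, since every token attends to all others. That is why Ulysses-SP adds all-to-all communication inside the attention layer instead of relying on chunking alone.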
@@ -1,40 +1,46 @@
 # Diffusion Acceleration Overview
 
-vLLM-Omni supports various cache acceleration methods to speed up diffusion model inference with minimal quality degradation. These methods intelligently cache intermediate computations to avoid redundant work across diffusion timesteps.
+vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods**, which cache intermediate computations to avoid redundant work across diffusion timesteps, and **parallelism methods**, which distribute the computation across multiple devices.
 
 ## Supported Acceleration Methods
 
 vLLM-Omni currently supports two main cache acceleration backends:
 
-1. **[TeaCache](teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
-2. **[Cache-DiT](cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
-   - **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
-   - **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
-   - **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
+1. **[TeaCache](acceleration/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
+2. **[Cache-DiT](acceleration/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
+    - **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
+    - **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
+    - **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
 
 Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality.
 
+vLLM-Omni also supports sequence parallelism (SP) for diffusion models, which includes:
+
+1. [Ulysses-SP](acceleration/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
+
 ## Quick Comparison
 
+### Cache Methods
+
 | Method | Configuration | Description | Best For |
 |--------|--------------|-------------|----------|
 | **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality |
 | **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control |
 
 ## Supported Models
 
-The following table shows which models are currently supported by each cache backend:
+The following table shows which models are currently supported by each acceleration method:
 
-| Model | Model Identifier | TeaCache | Cache-DiT |
-|-------|-----------------|----------|-----------|
-| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ |
-| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ | ✅ |
-| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ❌ | ✅ |
+| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP |
+|-------|-----------------|----------|-----------|-----------|
+| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ |
+| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ❌ | ✅ | ❌ |
+| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ❌ | ✅ | ✅ |
 
 
 ## Performance Benchmarks
 
-The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Image-Edit** models with 50 inference steps:
+The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Image-Edit** models generating 1024x1024 images with 50 inference steps:
 
 !!! note "Benchmark Disclaimer"
     These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on:
@@ -55,6 +61,14 @@ The following benchmarks were measured on **Qwen/Qwen-Image** and **Qwen/Qwen-Im
 | **Qwen/Qwen-Image-Edit** | None | No acceleration | 51.5s | 1.0x | Baseline (diffusers) |
 | **Qwen/Qwen-Image-Edit** | Cache-DiT | Default (Fn=1, Bn=0, W=4, TaylorSeer disabled, SCM disabled) | 21.6s | **2.38x** | - |
 
+To measure the parallelism methods, we ran benchmarks with the **Qwen/Qwen-Image** model generating **2048x2048** images (a long-sequence input) with 50 inference steps. The hardware is NVIDIA H800 GPUs, and the attention backend is `sdpa`.
+
+| Configuration | Ulysses degree |Generation Time | Speedup |
+|---------------|----------------|---------|---------|
+| **Baseline (diffusers)** | - | 112.5s | 1.0x |
+| Ulysses-SP  |  2  |  65.2s | 1.73x |
+| Ulysses-SP  |  4  | 39.6s | 2.84x |
+| Ulysses-SP  |  8  | 30.8s | 3.65x |
 
 ## Quick Start
 
@@ -92,9 +106,42 @@ omni = Omni(
 outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)
 ```
 
+### Using Ulysses-SP
+
+Run text-to-image:
+```python
+from vllm_omni import Omni
+from vllm_omni.diffusion.data import DiffusionParallelConfig
+ulysses_degree = 2
+
+omni = Omni(
+    model="Qwen/Qwen-Image",
+    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
+)
+
+outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50, width=2048, height=2048)
+```
+
+
+Run image-to-image:
+```python
+from vllm_omni import Omni
+from vllm_omni.diffusion.data import DiffusionParallelConfig
+ulysses_degree = 2
+
+omni = Omni(
+    model="Qwen/Qwen-Image-Edit",
+    parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree)
+)
+
+# `input_image` is a PIL image that you have loaded beforehand.
+outputs = omni.generate(prompt="turn this cat into a dog",
+        pil_image=input_image, num_inference_steps=50)
+```
+
 ## Documentation
 
 For detailed information on each acceleration method:
 
-- **[TeaCache Guide](teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
-- **[Cache-DiT Acceleration Guide](cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
+- **[TeaCache Guide](acceleration/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
+- **[Cache-DiT Acceleration Guide](acceleration/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
+- **[Sequence Parallelism](acceleration/parallelism_acceleration.md#sequence-parallelism)** - Guidance on configuring sequence parallelism
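The Ulysses-SP description in this overview (split along the sequence dimension, then use all-to-all so that each device computes a subset of attention heads) can be pictured as a change in which (token, head) pairs each rank holds. The sketch below is a bookkeeping simulation only, with hypothetical helper names and no real communication. It assumes that both the sequence length and the number of heads divide evenly by the number of ranks.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]         # (token index, attention head index)
Layout = Dict[int, Set[Pair]]  # rank -> (token, head) pairs held by that rank


def sequence_sharded(seq_len: int, num_heads: int, world_size: int) -> Layout:
    """Outside attention: each rank holds one contiguous token shard for all heads."""
    assert seq_len % world_size == 0, "seq_len must divide evenly across ranks"
    shard = seq_len // world_size
    return {
        rank: {(t, h) for t in range(rank * shard, (rank + 1) * shard) for h in range(num_heads)}
        for rank in range(world_size)
    }


def all_to_all_to_heads(layout: Layout, num_heads: int) -> Layout:
    """First all-to-all: regroup so each rank holds every token for its own heads."""
    world_size = len(layout)
    assert num_heads % world_size == 0, "num_heads must divide evenly across ranks"
    heads_per_rank = num_heads // world_size
    result: Layout = {rank: set() for rank in layout}
    for pairs in layout.values():
        for token, head in pairs:
            result[head // heads_per_rank].add((token, head))
    return result


def all_to_all_to_sequence(layout: Layout, seq_len: int) -> Layout:
    """Second all-to-all: return to the sequence-sharded layout after attention."""
    shard = seq_len // len(layout)
    result: Layout = {rank: set() for rank in layout}
    for pairs in layout.values():
        for token, head in pairs:
            result[token // shard].add((token, head))
    return result


before = sequence_sharded(seq_len=8, num_heads=4, world_size=2)
during = all_to_all_to_heads(before, num_heads=4)
after = all_to_all_to_sequence(during, seq_len=8)

# During attention, rank 0 sees every token but only heads 0 and 1.
assert during[0] == {(t, h) for t in range(8) for h in (0, 1)}
assert after == before
```

In this picture each rank can run ordinary attention over the full sequence for its own heads, which is why the sketch requires the number of heads to be divisible by the number of ranks.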
@@ -24,6 +24,7 @@
 import torch
 from PIL import Image
 
+from vllm_omni.diffusion.data import DiffusionParallelConfig
 from vllm_omni.entrypoints.omni import Omni
 from vllm_omni.utils.platform_utils import detect_device_type, is_npu
 
@@ -94,6 +95,13 @@ def parse_args() -> argparse.Namespace:
             "Default: None (no cache acceleration)."
         ),
     )
+    parser.add_argument(
+        "--ulysses_degree",
+        type=int,
+        default=1,
+        help="Number of GPUs used for Ulysses sequence parallelism.",
+    )
+
     return parser.parse_args()
 
 
@@ -115,6 +123,7 @@ def main():
     vae_use_slicing = is_npu()
     vae_use_tiling = is_npu()
 
+    parallel_config = DiffusionParallelConfig(ulysses_degree=args.ulysses_degree)
     # Configure cache based on backend type
     cache_config = None
     if args.cache_backend == "cache_dit":
@@ -145,6 +154,7 @@ def main():
         vae_use_tiling=vae_use_tiling,
         cache_backend=args.cache_backend,
         cache_config=cache_config,
+        parallel_config=parallel_config,
     )
     print("Pipeline loaded")
 
@@ -154,6 +164,7 @@ def main():
     print(f"  Model: {args.model}")
     print(f"  Inference steps: {args.num_inference_steps}")
     print(f"  Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
+    print(f"  Parallel configuration: ulysses_degree={args.ulysses_degree}")
     print(f"  Input image size: {input_image.size}")
     print(f"{'=' * 60}\n")
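The `--ulysses_degree` flag added above accepts any integer, including zero or negative values. One possible refinement, sketched here with only the standard library, is an `argparse` type callable that rejects values below 1 at parse time. `positive_int` and `build_parser` are hypothetical names, and the diff does not show whether `DiffusionParallelConfig` already validates this value itself.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse `type=` callable that accepts only integers >= 1."""
    number = int(value)  # a ValueError here is reported by argparse as an invalid value
    if number < 1:
        raise argparse.ArgumentTypeError(f"expected an integer >= 1, got {value!r}")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser containing only the flag discussed here."""
    parser = argparse.ArgumentParser(description="Ulysses degree validation sketch")
    parser.add_argument(
        "--ulysses_degree",
        type=positive_int,
        default=1,
        help="Number of GPUs used for Ulysses sequence parallelism.",
    )
    return parser


args = build_parser().parse_args(["--ulysses_degree", "2"])
print(f"ulysses_degree={args.ulysses_degree}")
```

Passing `--ulysses_degree 0` to this parser makes `argparse` print a usage error and exit, so the mistake surfaces before the pipeline starts loading.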