diff --git a/docs/.nav.yml b/docs/.nav.yml index af80816e409..a4939961e89 100644 --- a/docs/.nav.yml +++ b/docs/.nav.yml @@ -54,20 +54,28 @@ nav: - Features: - Sleep Mode: features/sleep_mode.md - Diffusion Features: - - Acceleration Overview: user_guide/diffusion_acceleration.md - - TeaCache: user_guide/diffusion/teacache.md - - Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md + - Overview: user_guide/diffusion_features.md + - Feature Compatibility: user_guide/feature_compatibility.md + - Cache Acceleration: + - TeaCache: user_guide/diffusion/cache_acceleration/teacache.md + - Cache-DiT: user_guide/diffusion/cache_acceleration/cache_dit.md - Quantization: - Overview: user_guide/diffusion/quantization/overview.md - FP8: user_guide/diffusion/quantization/fp8.md - Int8: user_guide/diffusion/quantization/int8.md - GGUF: user_guide/diffusion/quantization/gguf.md - - Step Execution: user_guide/diffusion/step_execution.md - - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md + - Parallelism: + - Overview: user_guide/diffusion/parallelism/overview.md + - CFG Parallel: user_guide/diffusion/parallelism/cfg_parallel.md + - Expert Parallel: user_guide/diffusion/parallelism/expert_parallel.md + - Hybrid Sharded Data Parallel: user_guide/diffusion/parallelism/hsdp.md + - Sequence Parallel: user_guide/diffusion/parallelism/sequence_parallel.md + - Tensor Parallel: user_guide/diffusion/parallelism/tensor_parallel.md + - VAE Patch Parallel: user_guide/diffusion/parallelism/vae_patch_parallel.md - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md - LoRA: user_guide/diffusion/lora.md - - Hybrid Sharded Data Parallel: design/feature/hsdp.md - Custom Pipeline: features/custom_pipeline.md + - Step Execution: user_guide/diffusion/step_execution.md - ComfyUI: features/comfyui.md - Developer Guide: - General: @@ -92,6 +100,8 @@ nav: - design/feature/cfg_parallel.md - design/feature/sequence_parallel.md - design/feature/tensor_parallel.md + - 
design/feature/vae_parallel.md + - design/feature/hsdp.md - design/feature/cache_dit.md - design/feature/teacache.md - design/feature/async_chunk_design.md diff --git a/docs/configuration/README.md b/docs/configuration/README.md index 7e32806ea2e..b5761a7f1bc 100644 --- a/docs/configuration/README.md +++ b/docs/configuration/README.md @@ -20,7 +20,4 @@ For introduction, please check [Introduction for stage config](./stage_configs.m ## Optimization Features -- **[TeaCache Configuration](../user_guide/diffusion/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss -- **[Cache-DiT Configuration](../user_guide/diffusion/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models -- **[Parallelism Configuration](../user_guide/diffusion/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for for DiT models -- **[CPU Offloading](../user_guide/diffusion/cpu_offload_diffusion.md)** - Enable CPU offloading (model-level and layerwise) for for DiT models +- **[Diffusion Features Overview](../user_guide/diffusion_features.md)** - Complete overview of all diffusion model features and supported models diff --git a/docs/design/feature/cache_dit.md b/docs/design/feature/cache_dit.md index a5395995638..237a958774d 100644 --- a/docs/design/feature/cache_dit.md +++ b/docs/design/feature/cache_dit.md @@ -256,7 +256,7 @@ cache_config={ } ``` -Check the [user guide for cache_dit](../../user_guide/diffusion/cache_dit_acceleration.md) for more adjustable parameters. +Check the [user guide for cache_dit](../../user_guide/diffusion/cache_acceleration/cache_dit.md) for more adjustable parameters. --- diff --git a/docs/design/feature/teacache.md b/docs/design/feature/teacache.md index 775fb635b2b..9fa315cee77 100644 --- a/docs/design/feature/teacache.md +++ b/docs/design/feature/teacache.md @@ -369,7 +369,7 @@ images = omni.generate( 2. 
**Compare performance** - Measure speedup vs baseline (expect 1.5x-2.0x) 3. **Verify output quality** - Visually compare cached vs uncached outputs (should be nearly identical) -See more detailed examples in [user guide for teacache](../../user_guide/diffusion/teacache.md). +See more detailed examples in [user guide for teacache](../../user_guide/diffusion/cache_acceleration/teacache.md). --- diff --git a/docs/user_guide/diffusion/cache_acceleration/cache_dit.md b/docs/user_guide/diffusion/cache_acceleration/cache_dit.md new file mode 100644 index 00000000000..dec52b9d6b6 --- /dev/null +++ b/docs/user_guide/diffusion/cache_acceleration/cache_dit.md @@ -0,0 +1,285 @@ +# Cache-DiT Guide + + +## Table of Contents + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Acceleration Methods](#acceleration-methods) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Summary](#summary) +- [Additional Resources](#additional-resources) + +--- + +## Overview + +Cache-DiT accelerates diffusion transformer models through intelligent caching mechanisms, providing significant speedup with minimal quality loss. It supports multiple acceleration techniques that can be combined for optimal performance: + +- **DBCache**: Dual Block Cache for reducing redundant computations +- **TaylorSeer**: Taylor expansion-based forecasting for faster inference +- **SCM**: Step Computation Masking for selective step computation + +See the list of supported models in [Supported Models](../../diffusion_features.md#supported-models).
+ +--- + +## Quick Start + +### Basic Usage + +Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`: + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams + +omni = Omni( + model="Qwen/Qwen-Image", + cache_backend="cache_dit", # Enable Cache-DiT with defaults +) + +outputs = omni.generate( + "a beautiful landscape", + OmniDiffusionSamplingParams(num_inference_steps=50), +) +``` + +**Note**: When `cache_config` is not provided, Cache-DiT uses optimized default values. See the [Configuration Parameters](#configuration-parameters) section for details. + +### Custom Configuration + +To customize cache-dit settings, provide a `cache_config` dictionary, for example: + +```python +omni = Omni( + model="Qwen/Qwen-Image", + cache_backend="cache_dit", + cache_config={ + "Fn_compute_blocks": 1, + "Bn_compute_blocks": 0, + "max_warmup_steps": 4, + "residual_diff_threshold": 0.12, + }, +) +``` + +--- + +## Example Script + +### Offline Inference + +Use the example script under `examples/offline_inference/text_to_image`: + +```bash +cd examples/offline_inference/text_to_image +python text_to_image.py \ + --model Qwen/Qwen-Image \ + --prompt "a cup of coffee on the table" \ + --cache-backend cache_dit \ + --num-inference-steps 50 +``` + +See the [text_to_image.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/text_to_image.py) for detailed configuration options. 
+ +The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer: + +```python +omni = Omni( + model="Qwen/Qwen-Image", + cache_backend="cache_dit", + cache_config={ + # Scheme: Hybrid DBCache + SCM + TaylorSeer + "Fn_compute_blocks": 1, # Optimized for single-transformer models + "Bn_compute_blocks": 0, # Number of backward compute blocks + "max_warmup_steps": 4, # Maximum warmup steps (works for few-step models) + "residual_diff_threshold": 0.24, # Higher threshold for more aggressive caching + "max_continuous_cached_steps": 3, # Limit to prevent precision degradation + # TaylorSeer parameters [cache-dit only] + "enable_taylorseer": False, # Disabled by default (not suitable for few-step models) + "taylorseer_order": 1, # TaylorSeer polynomial order + # SCM (Step Computation Masking) parameters [cache-dit only] + "scm_steps_mask_policy": None, # SCM mask policy: None (disabled), "slow", "medium", "fast", "ultra" + "scm_steps_policy": "dynamic", # SCM steps policy: "dynamic" or "static" + } +) +``` + +You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements. + +For image-to-image tasks, use the example script under `examples/offline_inference/image_to_image`: + +```bash +cd examples/offline_inference/image_to_image +python image_edit.py \ + --model Qwen/Qwen-Image-Edit \ + --prompt "make the sky more colorful" \ + --image path/to/input/image.jpg \ + --cache-backend cache_dit \ + --num-inference-steps 50 \ + --cache-dit-max-continuous-cached-steps 3 \ + --cache-dit-residual-diff-threshold 0.24 \ + --cache-dit-enable-taylorseer +``` + +See the [image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py) for detailed configuration options. 
+ +### Online Serving + +```bash +# Default configuration (recommended) +vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit + +# Custom configuration +vllm serve Qwen/Qwen-Image --omni --port 8091 \ + --cache-backend cache_dit \ + --cache-config '{"Fn_compute_blocks": 1, "residual_diff_threshold": 0.12}' +``` + +--- + +## Acceleration Methods + +For a comprehensive illustration, see the Cache-DiT [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/). + +### 1. DBCache (Dual Block Cache) + +DBCache caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computation with minimal quality loss. + +**Example Configuration**: + +```python +cache_config={ + "Fn_compute_blocks": 8, # Use first 8 blocks for difference computation + "Bn_compute_blocks": 0, # No additional fusion blocks + "max_warmup_steps": 8, # Cache after 8 warmup steps + "residual_diff_threshold": 0.12, # Below the 0.24 default: less caching, higher quality + "max_cached_steps": -1, # No limit on cached steps +} +``` + +**Performance Tips**: + +- Default `Fn_compute_blocks=1` works well for most cases. Some models (e.g., [FLUX.2-klein](https://github.com/wtomin/vllm-omni/blob/main/vllm_omni/diffusion/cache/cache_dit_backend.py#L363)) use a larger value for `Fn_compute_blocks` for balanced performance. +- Increase `residual_diff_threshold` above the default 0.24 for faster inference with a slight quality trade-off, or decrease it (e.g., 0.12-0.15) for higher quality. +- Default `max_warmup_steps=4` is optimized for few-step models. Increase to 6-8 for runs with more inference steps if needed. + +### 2. TaylorSeer + +TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality.
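
To make the forecasting idea concrete, here is a minimal first-order sketch in plain Python. It is illustrative only and is not the cache-dit implementation: the function name `taylor_forecast` and the finite-difference slope estimate are assumptions chosen for clarity.

```python
def taylor_forecast(h_prev, h_curr, step_gap, steps_ahead):
    """First-order Taylor forecast of a hidden state (illustrative sketch).

    h_prev and h_curr are hidden states from two fully computed steps that
    are `step_gap` steps apart. The names are hypothetical, not cache-dit API.
    """
    # Finite-difference estimate of the per-step derivative.
    slope = [(c - p) / step_gap for p, c in zip(h_prev, h_curr)]
    # Extrapolate: h(t + k) ~= h(t) + k * dh/dt
    return [c + steps_ahead * s for c, s in zip(h_curr, slope)]


h_prev = [1.0, 2.0, 4.0]
h_curr = [1.5, 2.0, 3.0]
forecast = taylor_forecast(h_prev, h_curr, step_gap=1, steps_ahead=2)
print(forecast)
```

A higher `taylorseer_order` would add higher-order difference terms to the same extrapolation, which is why it costs more per forecast.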
+ +**Example Configuration**: + +```python +cache_config={ + "enable_taylorseer": True, + "taylorseer_order": 1, # First-order Taylor expansion +} +``` + +**Performance Tips**: + +- TaylorSeer is **not suitable for few-step distilled models**. +- Use `taylorseer_order=1` for most cases (good balance of speed and quality). +- Combine with DBCache for maximum acceleration. +- Higher orders (2-3) may improve quality but reduce speed gains. + +### 3. SCM (Step Computation Masking) + +SCM allows you to specify which steps must be computed and which can use cached results, similar to LeMiCa/EasyCache-style acceleration. + +`scm_steps_mask_policy` options (number of compute steps out of 28): + +| Policy | Compute Steps | Speed | Quality | +|--------|--------------|-------|---------| +| `None` (default) | All | Baseline | Best | +| `"slow"` | 18 / 28 | Moderate | High | +| `"medium"` | 15 / 28 | Balanced | Good | +| `"fast"` | 11 / 28 | Fast | Moderate | +| `"ultra"` | 8 / 28 | Fastest | Lower | + +**Example Configuration**: + +```python +cache_config={ + "scm_steps_mask_policy": "medium", # Balanced speed/quality + "scm_steps_policy": "dynamic", # Use dynamic cache +} +``` + +**Performance Tips**: + +- SCM is disabled by default. Enable it by setting a policy value if you need additional acceleration. +- Start with the `"medium"` policy and adjust based on quality requirements. +- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised. +- The `"dynamic"` policy generally provides better quality than `"static"`. +- The SCM mask is automatically regenerated when `num_inference_steps` changes during inference. + +--- + +## Configuration Parameters + +The `cache_config` passed to the `Omni` constructor accepts the arguments of `DBCacheConfig` ([Cache-DiT API Reference](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)).
Key parameters are listed below: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) | +| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion | +| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) | +| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) | +| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) | +| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) | +| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) | +| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) | +| `taylorseer_order` | int | 1 | Taylor expansion order | +| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") | +| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") | + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- Production deployments requiring fast inference +- Diffusion transformer models (DiT architecture) +- Scenarios where 1.5x-3x speedup is valuable + +**Not for:** + +- Non-DiT architectures (use model-specific acceleration instead) +- Models already using few-step distillation (< 10 steps) + +--- + +## Troubleshooting + +### Common Issue 1: Quality Degradation + +**Symptoms**: Generated images have visible artifacts or lower quality + +**Solution**: +```python +# Reduce aggressiveness - use more conservative settings +cache_config={ + "residual_diff_threshold": 0.20, # Lower threshold (closer to default 0.24) + "Fn_compute_blocks": 8, # Use more blocks for better decisions + 
"max_warmup_steps": 6, # Longer warmup + "scm_steps_mask_policy": "slow", # More compute steps +} +``` + +--- + +## Summary + +Using Cache-DiT acceleration: + +1. ✅ **Enable Cache-DiT** - Set `cache_backend="cache_dit"` to get 1.5x-3x speedup with optimized defaults +2. ✅ **(Optional) Customize** - Adjust `cache_config` parameters for specific speed/quality trade-offs diff --git a/docs/user_guide/diffusion/cache_acceleration/teacache.md b/docs/user_guide/diffusion/cache_acceleration/teacache.md new file mode 100644 index 00000000000..026b86ec7f9 --- /dev/null +++ b/docs/user_guide/diffusion/cache_acceleration/teacache.md @@ -0,0 +1,194 @@ +# TeaCache Guide + + +## Table of Content + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Summary](#summary) + +--- + +## Overview + +TeaCache accelerates diffusion model inference by caching transformer computations when consecutive timesteps are similar, providing **1.5x-2.0x speedup** with minimal quality loss. It dynamically decides whether to reuse cached outputs based on input similarity, making it ideal for production deployments where inference speed matters without sacrificing generation quality. + +See supported models list in [Supported Models](../../diffusion_features.md#supported-models). 
+ +--- + +## Quick Start + + + +### Basic Usage + + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams + +omni = Omni( + model="Qwen/Qwen-Image", + cache_backend="tea_cache", +) + +outputs = omni.generate( + "A cat sitting on a windowsill", + OmniDiffusionSamplingParams(num_inference_steps=50), +) +``` + +### Custom Configuration + +```python +omni = Omni( + model="Qwen/Qwen-Image", + cache_backend="tea_cache", + cache_config={ + "rel_l1_thresh": 0.2, # Controls speed/quality tradeoff + }, +) +``` + +### Using Environment Variable + +You can also enable TeaCache via an environment variable: + +```bash +export DIFFUSION_CACHE_BACKEND=tea_cache +``` + +Then initialize without explicitly setting `cache_backend`: + +```python +from vllm_omni import Omni + +omni = Omni( + model="Qwen/Qwen-Image", + cache_config={"rel_l1_thresh": 0.2} +) +``` + +--- + +## Example Script + +### Offline Inference + +Run the Python scripts under `examples/offline_inference/text_to_image/` or `examples/offline_inference/image_to_image/` from the command line: + +```bash +# Text-to-image example +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Qwen/Qwen-Image \ + --cache-backend tea_cache + +# Image-to-image example +python examples/offline_inference/image_to_image/image_edit.py \ + --model Qwen/Qwen-Image-Edit \ + --image input.png \ + --prompt "Edit description" \ + --cache-backend tea_cache \ + --tea-cache-rel-l1-thresh 0.25 +``` + +See [text_to_image.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/text_to_image.py) or [image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py) for detailed configuration options.
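
For intuition about what `rel_l1_thresh` and `coefficients` control, the reuse decision can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the vLLM-Omni code: the class `ToyTeaCache` and its helpers are hypothetical names, the polynomial is assumed to be ordered highest degree first, and the demo uses identity-like coefficients rather than any model's tuned values.

```python
def relative_l1(prev, curr):
    """Relative L1 distance between two equally sized feature vectors."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev)
    return num / den


def rescale(x, coefficients):
    """Evaluate a polynomial at x with Horner's rule (highest degree first)."""
    result = 0.0
    for coeff in coefficients:
        result = result * x + coeff
    return result


class ToyTeaCache:
    """Hypothetical stand-in for the caching decision, for illustration only."""

    def __init__(self, rel_l1_thresh, coefficients):
        self.rel_l1_thresh = rel_l1_thresh
        self.coefficients = coefficients
        self.prev = None
        self.accumulated = 0.0

    def should_compute(self, features):
        """Return True to run the transformer, False to reuse cached output."""
        if self.prev is None:
            self.prev = features
            return True  # Nothing is cached on the first step.
        distance = relative_l1(self.prev, features)
        self.accumulated += rescale(distance, self.coefficients)
        self.prev = features
        if self.accumulated < self.rel_l1_thresh:
            return False  # Inputs barely changed: reuse the cache.
        self.accumulated = 0.0  # Drift is too large: recompute and reset.
        return True


# Identity-like polynomial (p(x) = x) so the raw distance is used directly.
cache = ToyTeaCache(rel_l1_thresh=0.2, coefficients=[0.0, 0.0, 0.0, 1.0, 0.0])
steps = [[1.0, 1.0], [1.05, 1.0], [1.1, 1.0], [1.5, 1.2]]
decisions = [cache.should_compute(s) for s in steps]
print(decisions)
```

In this sketch, raising the threshold lets more consecutive steps fall under it, so more steps are served from the cache.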
+ +### Online Serving + +```bash +# Default configuration +vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend tea_cache + +# Custom configuration +vllm serve Qwen/Qwen-Image --omni --port 8091 \ + --cache-backend tea_cache \ + --cache-config '{"rel_l1_thresh": 0.2}' +``` + +--- + +## Configuration Parameters + +In `OmniDiffusionConfig` + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `rel_l1_thresh` | float | `0.2` | Similarity threshold for cache reuse. Lower values prioritize quality (less caching), higher values prioritize speed (more caching). Suggested range: 0.1-0.8 | +| `coefficients` | list[float] \| None | `None` | Polynomial coefficients for rescaling L1 distance. Must contain exactly 5 elements if provided. If `None`, uses model-specific defaults based on transformer type. | + +Users can find the default model coefficients in [`vllm_omni/diffusion/cache/teacache/config.py`](https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/diffusion/cache/teacache/config.py), for example: + +```python +_MODEL_COEFFICIENTS = { + # Qwen-Image transformer coefficients from ComfyUI-TeaCache + # Tuned specifically for Qwen's dual-stream transformer architecture + # Used for all Qwen-Image Family pipelines, in general + "QwenImageTransformer2DModel": [ + -4.50000000e02, + 2.80000000e02, + -4.50000000e01, + 3.20000000e00, + -2.00000000e-02, + ], + ... 
+} +``` + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- Production deployments requiring faster inference, tolerant of minimal quality loss +- Scenarios where 1.5-2x speedup is valuable +- Useful for single-card acceleration + +**Not for:** + +- Maximum quality requirements where no degradation is acceptable +- Very short inference runs (< 20 steps) where caching overhead may outweigh benefits + + +--- + +## Troubleshooting + +### Common Issue 1: Quality Degradation + +**Symptoms**: Generated images show artifacts, reduced detail, or inconsistent quality compared to non-cached results + +**Solution**: + +```python +# Lower the threshold for more conservative caching +cache_config={"rel_l1_thresh": 0.1} +``` + +### Common Issue 2: Limited Speedup + +**Symptoms**: Actual speedup is less than expected (< 1.3x) + +**Solutions**: +1. Increase the threshold to enable more aggressive caching: + ```python + cache_config={"rel_l1_thresh": 0.8} + ``` +2. Ensure you're using sufficient inference steps (35+ recommended) +3. Check that your model architecture is supported (see Supported Models section) + +--- + + +## Summary + +1. ✅ **Enable TeaCache** - Set `cache_backend="tea_cache"` to get 1.5x-2.0x speedup with optimized defaults +2. ✅ **(Optional) Customize** - Adjust thresholds and polynomial coefficients for specific speed/quality trade-offs diff --git a/docs/user_guide/diffusion/cache_dit_acceleration.md b/docs/user_guide/diffusion/cache_dit_acceleration.md deleted file mode 100644 index c51ecca1e1c..00000000000 --- a/docs/user_guide/diffusion/cache_dit_acceleration.md +++ /dev/null @@ -1,228 +0,0 @@ -# Cache-DiT Acceleration Guide - -This guide explains how to use cache-dit acceleration in vLLM-Omni to speed up diffusion model inference. - -## Overview - -Cache-dit is a library that accelerates diffusion transformer models through intelligent caching mechanisms. 
It supports multiple acceleration techniques that can be combined for optimal performance: - -- **DBCache**: Dual Block Cache for reducing redundant computations -- **TaylorSeer**: Taylor expansion-based forecasting for faster inference -- **SCM**: Step Computation Masking for selective step computation - -## Quick Start - -### Basic Usage - -Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`. Cache-dit will use its recommended default parameters: - -```python -from vllm_omni.entrypoints.omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -# Simplest way: just enable cache-dit with default parameters -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="cache_dit", -) - -images = omni.generate( - "a beautiful landscape", - OmniDiffusionSamplingParams(num_inference_steps=50), -) -``` - -**Default Parameters**: When `cache_config` is not provided, cache-dit uses optimized default values. See the [Configuration Reference](#configuration-reference) section for a complete list of all parameters and their default values. 
- -### Custom Configuration - -To customize cache-dit settings, provide a `cache_config` dictionary, for example: - -```python -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="cache_dit", - cache_config={ - "Fn_compute_blocks": 1, - "Bn_compute_blocks": 0, - "max_warmup_steps": 4, - "residual_diff_threshold": 0.12, - }, -) -``` - -## Online Serving (OpenAI-Compatible) - -Enable Cache-DiT for online serving by passing `--cache-backend cache_dit` when starting the server: - -```bash -# Use Cache-DiT default (recommended) parameters -vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit -``` - -To customize Cache-DiT settings for online serving, pass a JSON string via `--cache-config`: - -```bash -vllm serve Qwen/Qwen-Image --omni --port 8091 \ - --cache-backend cache_dit \ - --cache-config '{"Fn_compute_blocks": 1, "Bn_compute_blocks": 0, "max_warmup_steps": 4, "residual_diff_threshold": 0.12}' -``` - -## Acceleration Methods - -For comprehensive illustration, please view cache-dit [User_Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/) - -### 1. DBCache (Dual Block Cache) - -DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality. - -**Key Parameters**: - -- `Fn_compute_blocks` (int, default: 1): Number of **first n** transformer blocks used to compute stable feature differences. Higher values provide more accurate caching decisions but increase computation. -- `Bn_compute_blocks` (int, default: 0): Number of **last n** transformer blocks used for additional fusion. These blocks act as an auto-scaler for approximate hidden states. -- `max_warmup_steps` (int, default: 4): Number of initial steps where caching is disabled to ensure the model learns sufficient features before caching begins. Optimized for few-step distilled models. 
-- `residual_diff_threshold` (float, default: 0.24): Threshold for residual difference. Higher values lead to faster performance but may reduce precision. Default uses a relatively higher threshold for more aggressive caching. -- `max_cached_steps` (int, default: -1): Maximum number of cached steps. Set to -1 for unlimited caching. -- `max_continuous_cached_steps` (int, default: 3): Maximum number of consecutive cached steps. Limits consecutive caching to prevent precision degradation. - -**Example Configuration**: - -```python -cache_config={ - "Fn_compute_blocks": 8, # Use first 8 blocks for difference computation - "Bn_compute_blocks": 0, # No additional fusion blocks - "max_warmup_steps": 8, # Cache after 8 warmup steps - "residual_diff_threshold": 0.12, # Higher threshold for faster inference - "max_cached_steps": -1, # No limit on cached steps -} -``` - -**Performance Tips**: - -- Default `Fn_compute_blocks=1` works well for most cases. Increase to 8-12 for larger models or when more accuracy is needed -- Increase `residual_diff_threshold` (e.g., 0.12-0.15) for faster inference with slight quality trade-off, or decrease from default 0.24 for higher quality -- Default `max_warmup_steps=4` is optimized for few-step models. Increase to 6-8 for more steps if needed - -### 2. TaylorSeer - -TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality. - -**Key Parameters**: - -- `enable_taylorseer` (bool, default: False): Enable TaylorSeer acceleration -- `taylorseer_order` (int, default: 1): Order of Taylor expansion. Higher orders provide better accuracy but require more computation. 
- -**Example Configuration**: - -```python -cache_config={ - "enable_taylorseer": True, - "taylorseer_order": 1, # First-order Taylor expansion -} -``` - -**Performance Tips**: - -- Use `taylorseer_order=1` for most cases (good balance of speed and quality) -- Combine with DBCache for maximum acceleration -- Higher orders (2-3) may improve quality but reduce speed gains - -### 3. SCM (Step Computation Masking) - -SCM allows you to specify which steps must be computed and which can use cached results, similar to LeMiCa/EasyCache style acceleration. - -**Key Parameters**: - -- `scm_steps_mask_policy` (str | None, default: None): Predefined mask policy. Options: - - `None`: SCM disabled (default) - - `"slow"`: More compute steps, higher quality (18 compute steps out of 28) - - `"medium"`: Balanced (15 compute steps out of 28) - - `"fast"`: More cache steps, faster inference (11 compute steps out of 28) - - `"ultra"`: Maximum speed (8 compute steps out of 28) -- `scm_steps_policy` (str, default: "dynamic"): Policy for cached steps: - - `"dynamic"`: Use dynamic cache for masked steps (recommended) - - `"static"`: Use static cache for masked steps - -**Example Configuration**: - -```python -cache_config={ - "scm_steps_mask_policy": "medium", # Balanced speed/quality - "scm_steps_policy": "dynamic", # Use dynamic cache -} -``` - -**Performance Tips**: - -- SCM is disabled by default (`scm_steps_mask_policy=None`). 
Enable it by setting a policy value if you need additional acceleration -- Start with `"medium"` policy and adjust based on quality requirements -- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised -- `"dynamic"` policy generally provides better quality than `"static"` -- SCM mask is automatically regenerated when `num_inference_steps` changes during inference - -## Configuration Reference - -### DiffusionCacheConfig Parameters - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) | -| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion | -| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) | -| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) | -| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) | -| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) | -| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) | -| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) | -| `taylorseer_order` | int | 1 | Taylor expansion order | -| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") | -| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") | - -## Example: Accelerate Text-to-Image Generation with CacheDiT - -See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example with cache-dit acceleration. 
- -```bash -# Enable cache-dit with hybrid acceleration -cd examples/offline_inference/text_to_image -python text_to_image.py \ - --model Qwen/Qwen-Image \ - --prompt "a cup of coffee on the table" \ - --cache-backend cache_dit \ - --num-inference-steps 50 -``` - - -The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer: - -```python -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="cache_dit", - cache_config={ - # Scheme: Hybrid DBCache + SCM + TaylorSeer - # DBCache - "Fn_compute_blocks": 8, - "Bn_compute_blocks": 0, - "max_warmup_steps": 4, - "residual_diff_threshold": 0.12, - # TaylorSeer - "enable_taylorseer": True, - "taylorseer_order": 1, - # SCM - "scm_steps_mask_policy": "fast", # Set to None to disable SCM - "scm_steps_policy": "dynamic", - }, -) -``` - -You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements. - -To test another model, you can modify `--model` with the target model identifier like `Tongyi-MAI/Z-Image-Turbo` and update `cache_config` according the model architecture (e.g., number of transformer blocks). 
- - -## Additional Resources - -- [Cache-DiT User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/) -- [Cache-DiT Benchmark](https://cache-dit.readthedocs.io/en/latest/benchmark/HYBRID_CACHE/) -- [DBCache Technical Details](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/) diff --git a/docs/user_guide/diffusion/parallelism/cfg_parallel.md b/docs/user_guide/diffusion/parallelism/cfg_parallel.md new file mode 100644 index 00000000000..5541106680a --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/cfg_parallel.md @@ -0,0 +1,169 @@ +# CFG-Parallel Guide + + +## Table of Contents + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Summary](#summary) + +--- + +## Overview + +CFG-Parallel accelerates diffusion models by distributing the positive and negative classifier-free guidance (CFG) passes across different GPUs, providing up to ~1.8x speedup when CFG is enabled. It's ideal for image editing tasks that require guidance scales greater than 1.0. + +See the list of supported models in [Supported Models](../../diffusion_features.md#supported-models).
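
As background on why two passes exist at all, the textbook CFG combination can be sketched as follows. This is the standard formula, not necessarily the exact one each pipeline uses (some add extra rescaling), and the helper name `cfg_combine` is hypothetical. Under CFG-Parallel, the two inputs would be produced on separate ranks before being combined.

```python
def cfg_combine(pos, neg, scale):
    """Standard classifier-free guidance: neg + scale * (pos - neg).

    pos and neg are the predictions from the positive and negative branches.
    Illustrative helper only, not a vLLM-Omni function.
    """
    return [n + scale * (p - n) for p, n in zip(pos, neg)]


pos = [1.0, 2.0]  # prediction conditioned on the prompt
neg = [0.5, 1.0]  # prediction conditioned on the negative prompt

guided = cfg_combine(pos, neg, scale=4.0)
unguided = cfg_combine(pos, neg, scale=1.0)
print(guided, unguided)
```

With a scale of 1.0 the result equals the positive branch alone, so the negative pass contributes nothing. That is consistent with the guidance-scale requirement above: there is no second pass worth parallelizing unless the scale exceeds 1.0.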
+ +--- + +## Quick Start + +### Basic Usage + +Simplest working example: + +```python +from vllm_omni import Omni +from vllm_omni.diffusion.data import DiffusionParallelConfig +from vllm_omni.inputs.data import OmniDiffusionSamplingParams +from PIL import Image + +omni = Omni( + model="Qwen/Qwen-Image-Edit", + parallel_config=DiffusionParallelConfig(cfg_parallel_size=2), # Enable CFG-Parallel +) + +input_image = Image.open("input.png").convert("RGB") +outputs = omni.generate( + { + "prompt": "turn this cat to a dog", + "negative_prompt": "low quality, blurry", + "multi_modal_data": {"image": input_image}, + }, + OmniDiffusionSamplingParams( + true_cfg_scale=4.0, + num_inference_steps=50, + ), +) +``` + +--- + +## Example Script + +### Offline Inference + +Use python script under `examples/offline_inference/image_to_image/image_edit.py`: + +```bash +cd examples/offline_inference/image_to_image/ +python image_edit.py \ + --model "Qwen/Qwen-Image-Edit" \ + --image "input.png" \ + --prompt "turn this cat to a dog" \ + --negative-prompt "low quality, blurry" \ + --cfg-scale 4.0 \ + --output "edited_image.png" \ + --cfg-parallel-size 2 +``` + +### Online Serving + +Enable CFG-Parallel in online serving: + +```bash +# Default configuration +vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 --cfg-parallel-size 2 + +``` + +--- + +## Configuration Parameters + +In `DiffusionParallelConfig` + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `cfg_parallel_size` | int | 1 | Number of GPUs for CFG parallelism. Set to 2 to enable CFG-Parallel (rank 0 for positive, rank 1 for negative branch) | + + +!!! info + Most models support `cfg_parallel_size=2` (positive branch on rank 0, negative branch on rank 1). **Bagel** is an exception: it supports `cfg_parallel_size=3`, which adds a third branch on rank 2 for full three-way CFG parallelism. 
+ + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- Tasks requiring classifier-free guidance +- Multi-GPU setups (at least 2 GPUs available) +- Combining with other parallelism methods (sequence/tensor parallel) + +**Not for:** + +- Single GPU setups +- Models that don't support CFG-Parallel (check [supported models](../../diffusion_features.md#supported-models)) +- Workloads without negative prompts or classifier-free guidance +- Very short inference runs (< 10 steps) where parallelism overhead may outweigh benefits + +### Expected Performance + +| Configuration | Speedup | Quality | Use Case | +|--------------|---------|---------|----------| +| CFG-Parallel (2 GPUs) | 1.5~1.8x | No degradation | Large model, VRAM limited | + +--- + +## Troubleshooting + +### Common Issue 1: No Speedup with CFG-Parallel + +**Symptoms**: CFG-Parallel enabled but no performance improvement + +**Solutions**: + +1. **Ensure CFG scale is set correctly:** +```python +# Bad: No CFG effect +sampling_params = OmniDiffusionSamplingParams(num_inference_steps=50) + +# Good: CFG-Parallel will work +sampling_params = OmniDiffusionSamplingParams( + num_inference_steps=50, + true_cfg_scale=4.0 # Must be > 1.0 +) +``` + +2. **Add negative prompt:** +```python +outputs = omni.generate( + { + "prompt": "beautiful landscape", + "negative_prompt": "low quality, blurry", # Required for best results + "multi_modal_data": {"image": input_image} + }, + sampling_params +) +``` + +3. **Check model support:** + - Verify your model in [supported models](../../diffusion_features.md#supported-models) + - Some models don't support CFG-Parallel + +--- + +## Summary + +1. ✅ **Enable CFG-Parallel** - Set `cfg_parallel_size=2` in `DiffusionParallelConfig` to get speedup when using CFG +2. ✅ **Set CFG Scale** - Ensure `true_cfg_scale > 1.0` in `OmniDiffusionSamplingParams` for CFG-Parallel to take effect +3. 
✅ **Check Model Support** - Verify your model supports CFG-Parallel in [supported models](../../diffusion_features.md#supported-models) diff --git a/docs/user_guide/diffusion/parallelism/expert_parallel.md b/docs/user_guide/diffusion/parallelism/expert_parallel.md new file mode 100644 index 00000000000..7d26d1e5c4f --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/expert_parallel.md @@ -0,0 +1,87 @@ +# Expert Parallelism Guide + + +## Table of Content + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Summary](#summary) + +--- + +## Overview + +Unlike Tensor Parallelism which shards every layer's weights, Expert Parallelism (EP) only shards the MoE expert MLP blocks. This significantly reduces the memory footprint of MoE models (e.g., HunyuanImage3.0) while maintaining constant dense-equivalent compute efficiency. + +During the forward pass, a gating mechanism routes tokens to their designated experts, requiring all-to-all communication to dispatch tokens to the correct ranks and combine results. + +See supported models list in [Supported Models](../../diffusion_features.md#supported-models). + +!!! note "EP Size Constraint" + The effective EP size equals `tp × sp × cfg × dp`. At least one of TP/SP/CFG/DP must be set when EP is enabled. 
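The EP size constraint can be sketched as a small check. This is only an illustration of the rule stated in the note: `effective_ep_size` is a hypothetical helper, not a vLLM-Omni function, and it assumes "set" means "greater than 1".

```python
from math import prod


def effective_ep_size(tp=1, sp=1, cfg=1, dp=1):
    """Return tp * sp * cfg * dp, rejecting the case where all degrees are 1.

    Hypothetical helper for illustration only; not part of vLLM-Omni.
    """
    size = prod((tp, sp, cfg, dp))
    if size == 1:
        raise ValueError("EP needs at least one of TP/SP/CFG/DP to be > 1")
    return size


ep_for_tp8 = effective_ep_size(tp=8)           # e.g. tensor_parallel_size=8
ep_for_tp2_sp2 = effective_ep_size(tp=2, sp=2)
```

Under this reading, enabling EP with `tensor_parallel_size=8` gives an EP size of 8, while enabling EP with every degree left at 1 is rejected.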
+ +--- + +## Quick Start + +### Basic Usage + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams +from vllm_omni.diffusion.data import DiffusionParallelConfig + +omni = Omni( + model="tencent/HunyuanImage-3.0", + parallel_config=DiffusionParallelConfig( + tensor_parallel_size=8, + enable_expert_parallel=True, + ), +) + +outputs = omni.generate( + "A brown and white dog is running on the grass", + OmniDiffusionSamplingParams( + num_inference_steps=50, + width=1024, + height=1024, + ), +) +``` + +--- + +## Configuration Parameters + +In `DiffusionParallelConfig`: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `enable_expert_parallel` | bool | False | Enable Expert Parallelism for MoE models | + +EP size is derived automatically as `tp × sp × cfg × dp`; set at least one of those degrees above 1 to define the EP degree. + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- MoE models (e.g., HunyuanImage3.0) with large numbers of experts +- Memory-constrained multi-GPU setups where only expert blocks need sharding + +**Not for:** + +- Dense models (no MoE layers), where EP has no effect +- Single GPU setups + +--- + +## Summary + +1. ✅ **Enable EP** - Set `enable_expert_parallel=True` in `DiffusionParallelConfig` for MoE models +2.
✅ **Set parallelism degree** - At least one of `tensor_parallel_size` / `ulysses_degree` / `cfg_parallel_size` must be > 1 to define the EP size diff --git a/docs/user_guide/diffusion/parallelism/hsdp.md b/docs/user_guide/diffusion/parallelism/hsdp.md new file mode 100644 index 00000000000..96a357c86b3 --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/hsdp.md @@ -0,0 +1,149 @@ +# HSDP Guide + + +## Table of Content + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Summary](#summary) + +--- + +## Overview + +HSDP (Hybrid Sharded Data Parallel) shards model weights across GPUs to reduce per-GPU memory usage. This enables inference of large models (e.g., Wan2.2 14B) on GPUs with limited memory. + +Unlike Tensor Parallelism which splits computation, HSDP uses PyTorch's FSDP2 to shard and redistribute weights at runtime. Each GPU only holds a fraction of the model weights, and weights are gathered on-demand during forward passes. + +See supported models list in [Supported Models](../../diffusion_features.md#supported-models). + +**Operating Modes:** + +- **Standalone Mode**: HSDP alone without other parallelism. Must specify `hsdp_shard_size` explicitly. +- **Combined Mode**: HSDP overlays on top of other parallelism (Ulysses-SP, CFG-Parallel). HSDP dimensions must match world_size. 
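The two operating modes can be pictured with a small sizing check built around the documented constraint `hsdp_replicate_size × hsdp_shard_size == world_size`. This is a sketch under stated assumptions: `resolve_hsdp_shard_size` is a hypothetical helper, and the rule used for the automatic case (`-1` resolves to `world_size // replicate_size`) is an assumption rather than the confirmed vLLM-Omni behavior.

```python
def resolve_hsdp_shard_size(world_size, shard_size=-1, replicate_size=1,
                            tensor_parallel_size=1):
    """Validate HSDP sizing and return the shard size that would be used.

    Hypothetical helper for illustration only; not part of vLLM-Omni.
    """
    if tensor_parallel_size != 1:
        raise ValueError("HSDP cannot be combined with tensor parallelism")
    if shard_size == -1:
        # Assumed auto rule: other parallelism fixes world_size, and the
        # shard size fills whatever the replica groups leave over.
        if world_size <= 1:
            raise ValueError("auto shard size needs other parallelism > 1")
        shard_size = world_size // replicate_size
    if replicate_size * shard_size != world_size:
        raise ValueError("replicate_size * shard_size must equal world_size")
    return shard_size


standalone = resolve_hsdp_shard_size(world_size=4, shard_size=4)
combined_auto = resolve_hsdp_shard_size(world_size=4)  # e.g. ulysses_degree=4
two_replicas = resolve_hsdp_shard_size(world_size=8, replicate_size=2)
```

In this sketch the standalone case passes the shard size explicitly, while the combined case leaves it at `-1` and lets the world size implied by the other parallelism settings decide.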
+ +--- + +## Quick Start + +### Basic Usage + +Simplest working example (standalone HSDP, shard across 4 GPUs): + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams +from vllm_omni.diffusion.data import DiffusionParallelConfig + +omni = Omni( + model="Wan-AI/Wan2.2-T2V-A14B-Diffusers", + parallel_config=DiffusionParallelConfig( + use_hsdp=True, + hsdp_shard_size=4, # Shard across 4 GPUs + ), +) + +outputs = omni.generate( + "A cat playing piano", + OmniDiffusionSamplingParams(num_inference_steps=50), +) +``` + +### Combined with Sequence Parallel + +```python +omni = Omni( + model="Wan-AI/Wan2.2-T2V-A14B-Diffusers", + parallel_config=DiffusionParallelConfig( + ulysses_degree=4, # Sequence parallel + use_hsdp=True, # HSDP overlays on SP + ), +) +``` + +--- + +## Example Script + +### Offline Inference + +Use Python script under `examples/offline_inference/image_to_video/`: + +```bash +# Standalone HSDP: shard across 4 GPUs +python examples/offline_inference/image_to_video/image_to_video.py \ + --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --use-hsdp \ + --hsdp-shard-size 4 + +# Combined HSDP + Sequence Parallel +python examples/offline_inference/image_to_video/image_to_video.py \ + --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --ulysses-degree 4 \ + --use-hsdp +``` + +### Online Serving + +**Standalone HSDP** (shard model across 4 GPUs): + +```bash +vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091 \ + --use-hsdp --hsdp-shard-size 4 +``` + +**Combined with Sequence Parallel**: + +```bash +vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091 \ + --use-hsdp --usp 4 +``` + +--- + +## Configuration Parameters + +In `DiffusionParallelConfig`: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `use_hsdp` | bool | False | Enable HSDP | +| `hsdp_shard_size` | int | -1 | Number of GPUs to shard weights across. 
`-1` = auto (requires other parallelism > 1) | +| `hsdp_replicate_size` | int | 1 | Number of replica groups. Each group holds a full sharded copy | + +**Constraints:** + +- `hsdp_replicate_size × hsdp_shard_size == world_size` +- HSDP cannot be used with Tensor Parallelism (`tensor_parallel_size` must be 1) + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- Very large models (e.g., Wan2.2 14B) +- Multi-GPU setups where memory reduction is the primary goal +- Combining with Sequence Parallelism for large video models + +**Not for:** + +- Models that fit comfortably in single-GPU memory +- Use cases requiring Tensor Parallelism (HSDP and TP are mutually exclusive) + +### Adding HSDP Support to New Models + +For detailed instructions on adding HSDP support to new models, see the [HSDP Contributing Guide](../../../design/feature/hsdp.md). + +--- + +## Summary + +1. ✅ **Enable HSDP** - Set `use_hsdp=True` and `hsdp_shard_size` to reduce per-GPU memory for large models +2. ✅ **Combine with SP** - Use together with `ulysses_degree` for video models requiring both memory reduction and sequence parallelism +3. ⚠️ **Incompatible with TP** - `tensor_parallel_size` must be 1 when HSDP is enabled diff --git a/docs/user_guide/diffusion/parallelism/overview.md b/docs/user_guide/diffusion/parallelism/overview.md new file mode 100644 index 00000000000..90d0b9660ef --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/overview.md @@ -0,0 +1,16 @@ +# Parallelism Acceleration Guide + +This guide covers the parallelism methods in vLLM-Omni for speeding up diffusion model inference and reducing per-device memory requirements. 
+ +## Supported Methods + +| Method | Description | +|--------|-------------| +| **[Tensor Parallelism](tensor_parallel.md)** | Shards DiT weights across GPUs to reduce per-GPU memory | +| **[Sequence Parallelism](sequence_parallel.md)** | Splits sequence dimension across GPUs (Ulysses-SP, Ring-Attention, or hybrid) for high-resolution images and videos | +| **[CFG-Parallel](cfg_parallel.md)** | Runs CFG positive/negative branches on separate GPUs for ~1.8x speedup on guided generation | +| **[VAE Patch Parallelism](vae_patch_parallel.md)** | Distributes VAE decode spatially across GPUs to reduce peak VAE memory | +| **[HSDP](hsdp.md)** | Shards full model weights via PyTorch FSDP2 to enable large-model inference on memory-constrained GPUs | +| **[Expert Parallelism](expert_parallel.md)** | Shards MoE expert blocks across GPUs for MoE models (e.g. HunyuanImage3.0) | + +See [Supported Models](../../diffusion_features.md#supported-models) for per-model compatibility. diff --git a/docs/user_guide/diffusion/parallelism/sequence_parallel.md b/docs/user_guide/diffusion/parallelism/sequence_parallel.md new file mode 100644 index 00000000000..e69b541f2ed --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/sequence_parallel.md @@ -0,0 +1,233 @@ +# Sequence Parallelism Guide + + +## Table of Content + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Summary](#summary) + +--- + +## Overview + +Sequence parallelism splits the input along the sequence dimension across multiple GPUs, allowing each device to process only a portion of the sequence. vLLM-Omni provides 1.5x-3.6x speedup for large images and videos using DeepSpeed Ulysses, Ring-Attention, or hybrid approaches. Use sequence parallelism when generating high-resolution images/videos that don't fit on a single GPU or require faster inference. 
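As a rough illustration of what splitting along the sequence dimension means for Ulysses-style parallelism, the sketch below tracks only per-GPU tensor shapes. It follows the general description in the DeepSpeed Ulysses paper, not vLLM-Omni's implementation; `ulysses_local_shapes` is a hypothetical helper, and the default (non-UAA) divisibility requirements are assumed.

```python
def ulysses_local_shapes(seq_len, num_heads, degree):
    """Return per-GPU (tokens, heads) outside and inside attention.

    Hypothetical helper for illustration only; not part of vLLM-Omni.
    """
    if seq_len % degree or num_heads % degree:
        raise ValueError("seq_len and num_heads must be divisible by degree")
    # Outside attention: each GPU holds a slice of the sequence, all heads.
    outside_attention = (seq_len // degree, num_heads)
    # After the all-to-all: each GPU holds the full sequence, a slice of heads.
    inside_attention = (seq_len, num_heads // degree)
    return outside_attention, inside_attention


outside, inside = ulysses_local_shapes(seq_len=4096, num_heads=24, degree=2)
```

A head count that the degree does not divide, such as 30 heads with a degree of 4, fails this check; that is the situation the experimental UAA mode is meant to cover.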
+ +See supported models list in [Diffusion Features - Supported Models](../../diffusion_features.md#supported-models). + +**Supported Methods:** + +- **DeepSpeed Ulysses Sequence Parallel (Ulysses-SP)** ([paper](https://arxiv.org/pdf/2309.14509)): Uses all-to-all communication for subset of attention heads per device +- **Ring-Attention** ([paper](https://arxiv.org/abs/2310.01889)): Uses ring-based P2P communication with sharded sequence dimension throughout +- **Hybrid Ulysses + Ring**: Combines both for larger scale parallelism (`ulysses_degree × ring_degree`) + +--- + +## Quick Start + +### Basic Usage - Ulysses-SP + +Simplest working example with Ulysses Sequence Parallel: + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams +from vllm_omni.diffusion.data import DiffusionParallelConfig + +omni = Omni( + model="Qwen/Qwen-Image", + parallel_config=DiffusionParallelConfig(ulysses_degree=2) # Enable Ulysses-SP +) + +outputs = omni.generate( + "A cat sitting on a windowsill", + OmniDiffusionSamplingParams(num_inference_steps=50, width=1024, height=1024), +) +``` + +!!! note "Experimental UAA mode" + `ulysses_mode="advanced_uaa"` is an experimental extension to Ulysses-SP. It lets Ulysses attention handle arbitrary sequence lengths and arbitrary attention head counts without relying on `attention_mask`-based token padding. + + In hybrid Ulysses + Ring mode, Ring still requires every rank in the same ring group to observe the same post-Ulysses sequence length. If that condition is not met, vLLM-Omni raises a validation error instead of entering the ring kernel with inconsistent shapes. + +To enable the experimental UAA mode, use a model/configuration that requires it. 
For example, `Tongyi-MAI/Z-Image-Turbo` has 30 attention heads, so `ulysses_degree=4` requires UAA because 30 is not divisible by 4: + +```python +omni = Omni( + model="Tongyi-MAI/Z-Image-Turbo", + parallel_config=DiffusionParallelConfig( + ulysses_degree=4, + ulysses_mode="advanced_uaa", + ), +) +``` + +### Alternative Methods + +**Ring-Attention** (better for very long sequences): + +```python +omni = Omni( + model="Qwen/Qwen-Image", + parallel_config=DiffusionParallelConfig(ring_degree=2) # Enable Ring-Attention +) +``` + +**Hybrid Ulysses + Ring** (for larger scale): + +```python +omni = Omni( + model="Qwen/Qwen-Image", + parallel_config=DiffusionParallelConfig(ulysses_degree=2, ring_degree=2) # 4 GPUs total +) +``` + +--- + +## Example Script + +### Offline Inference + +Use Python script under `examples/offline_inference/text_to_image/text_to_image.py`: + +**Ulysses-SP:** + +```bash +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Qwen/Qwen-Image \ + --prompt "A cat sitting on a windowsill" \ + --ulysses-degree 2 \ + --width 1024 --height 1024 +``` + +**Ring-Attention:** + +```bash +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Qwen/Qwen-Image \ + --prompt "A cat sitting on a windowsill" \ + --ring-degree 2 \ + --width 1024 --height 1024 +``` + +**Hybrid Ulysses + Ring:** + +```bash +# Hybrid: 2 Ulysses × 2 Ring = 4 GPUs total +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Qwen/Qwen-Image \ + --prompt "A cat sitting on a windowsill" \ + --ulysses-degree 2 --ring-degree 2 \ + --width 1024 --height 1024 +``` + +### Online Serving + +**Ulysses-SP:** + +```bash +# Text-to-image (requires >= 2 GPUs) +vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 +``` + +**Ulysses-SP with UAA mode** (for models with non-divisible head counts): + +```bash +vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091 --usp 4 --ulysses-mode advanced_uaa +``` + +**Ring-Attention:** + +```bash +# 
Text-to-image (requires >= 2 GPUs) +vllm serve Qwen/Qwen-Image --omni --port 8091 --ring 2 +``` + +**Hybrid Ulysses + Ring:** + +```bash +# Text-to-image (requires >= 4 GPUs) +vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 --ring 2 +``` + +--- + +## Configuration Parameters + +In `DiffusionParallelConfig`: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `ulysses_degree` | int | 1 | Number of GPUs for Ulysses-SP. Uses all-to-all communication. | +| `ring_degree` | int | 1 | Number of GPUs for Ring-Attention. Uses P2P ring communication. | +| `ulysses_mode` | str | `"default"` | Ulysses attention mode. Set to `"advanced_uaa"` to handle arbitrary sequence lengths and head counts without padding. | + +**Notes:** +- Total sequence parallel size equals `ulysses_degree × ring_degree` +- Degrees should evenly divide the sequence length for optimal performance (or use `ulysses_mode="advanced_uaa"` for Ulysses-SP) + + +## Best Practices + +### When to Use + +**Good for:** + +- Large images (1024x1024 or higher) or videos +- Setups with fast, high-bandwidth inter-GPU links (e.g., NVLink) + +**Not for:** + +- Small images (<1024px), where overhead exceeds the benefit; use a single GPU with cache acceleration instead + + +--- + +## Troubleshooting + +### Common Issue 1: Performance Not Scaling + +**Symptoms**: Adding GPUs doesn't improve speed proportionally, or a higher parallelism degree is slower + +**Diagnosis:** +```bash +# Check GPU topology +nvidia-smi topo -m + +``` + +**Solutions:** + +1. Check inter-GPU communication - NVLink is better than PCIe +2. Reduce parallelism degree if over-parallelized: +```python +# If 4 GPUs is slower than 2 +parallel_config=DiffusionParallelConfig(ulysses_degree=2) +``` +3.
Try to switch between Ring-Attention and Ulysses-SP + +- Ring-Attention has advantages, like communication-computation overlap, but the block-wise loop overhead is relatively higher, especially for short sequences +- Ulysses-SP: can benefit from larger bandwidth (such as NVLink), with two major constraints, the sequence length should be divisible by usp size, and the number of heads should be divisible by usp size (or use `ulysses_mode="advanced_uaa"`) + + +### Common Issue 2: Out of Memory (OOM) + +**Symptoms**: CUDA OOM errors or process crashes with memory errors + +**Solutions:** + +1. Increase parallelism degree to split sequence more: +```python +parallel_config=DiffusionParallelConfig(ulysses_degree=4) # From 2 +``` +2. Combine with other parallelism method, e.g., tensor parallel, and memory optimization methods, e.g., cpu offloading. + + +## Summary + +1. ✅ **Enable Sequence Parallelism** - Set `ulysses_degree` or `ring_degree` for long sequence generation +2. ✅ **UAA mode** - Use `ulysses_mode="advanced_uaa"` when head count is not divisible by `ulysses_degree` +3. ✅ **Troubleshooting** - Check GPU topology with `nvidia-smi topo -m`, reduce degree if performance doesn't scale diff --git a/docs/user_guide/diffusion/parallelism/tensor_parallel.md b/docs/user_guide/diffusion/parallelism/tensor_parallel.md new file mode 100644 index 00000000000..8e6851412cf --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/tensor_parallel.md @@ -0,0 +1,151 @@ +# Tensor Parallelism Guide + + +## Table of Content + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Summary](#summary) + +--- + +## Overview + +Tensor Parallelism (TP) shards some model weights across multiple GPUs, usually the Linear layers. This enables running large models that don't fit on a single GPU. 
It's essential for memory-constrained setups or very large models. + +See supported models list in [Supported Models](../../diffusion_features.md#supported-models). + +!!! note "TP Limitations for Diffusion Models" + We currently implement Tensor Parallelism (TP) only for the DiT (Diffusion Transformer) blocks. This is because the `text_encoder` component in vLLM-Omni uses the original Transformers implementation, which does not yet support TP. + + - Good news: The text_encoder typically has minimal impact on overall inference performance. + - Bad news: When TP is enabled, every TP process retains a full copy of the text_encoder weights, leading to significant GPU memory waste. + + We are actively refactoring this design to address this. For details and progress, please refer to [Issue #771](https://github.com/vllm-project/vllm-omni/issues/771). + +--- + +## Quick Start + + +### Basic Usage + +Simplest working example: + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams +from vllm_omni.diffusion.data import DiffusionParallelConfig + +omni = Omni( + model="Tongyi-MAI/Z-Image-Turbo", + parallel_config=DiffusionParallelConfig(tensor_parallel_size=2), # Enable TP +) + +outputs = omni.generate( + "a cat reading a book", + OmniDiffusionSamplingParams(num_inference_steps=9), +) +``` + +--- + +## Example Script + +### Offline Inference + +Use Python script under `examples/offline_inference`, and enable TP: + +```bash +# Text-to-Image with Qwen-Image +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Qwen/Qwen-Image \ + --tensor-parallel-size 2 + +# Image Editing with Qwen-Image-Edit +python examples/offline_inference/image_to_image/image_edit.py \ + --model Qwen/Qwen-Image-Edit \ + --image input.png \ + --prompt "Edit description" \ + --tensor-parallel-size 2 +``` + +### Online Serving + +You can enable tensor parallelism in online serving via `--tensor-parallel-size`: + +```bash +# 
Text-to-Image with Qwen-Image on 2 GPUs +vllm serve Qwen/Qwen-Image --omni --port 8091 \ + --tensor-parallel-size 2 + +# Text-to-Image with Z-Image (TP=2 only) +vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091 \ + --tensor-parallel-size 2 +``` + +--- + +## Configuration Parameters + +In `DiffusionParallelConfig`: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `tensor_parallel_size` | int | 1 | Number of GPUs to shard model weights across. Must evenly divide the number of attention heads. | + + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- Large models that don't fit on a single GPU, especially models with large DiT blocks (transformer layers) +- Memory-constrained environments + +**Not for:** + +- Workloads that need maximum throughput and already fit in memory +- TP sizes that are incompatible with the model's dimensions (e.g., Z-Image with `num_heads=30` currently supports only `tensor_parallel_size=2`) + + +## Troubleshooting + +### Common Issue 1: Out of Memory (OOM) + +**Symptoms**: CUDA OOM errors during model loading or inference, process crashes with memory errors + +**Solution**: +```python +# Step 1: Enable TP with smallest degree +parallel_config=DiffusionParallelConfig(tensor_parallel_size=2) + +# Step 2: If still OOM, increase TP degree +parallel_config=DiffusionParallelConfig(tensor_parallel_size=4) + +``` + +### Common Issue 2: Divisibility Error + +**Symptoms**: Error like "Model dimension X not divisible by tensor_parallel_size Y" + +**Solutions**: +1. Check model-specific constraints (e.g., Z-Image only supports TP=2) +2. Use a smaller TP size that divides model dimensions +3.
⚠️ **Text encoder not sharded** - Known limitation diff --git a/docs/user_guide/diffusion/parallelism/vae_patch_parallel.md b/docs/user_guide/diffusion/parallelism/vae_patch_parallel.md new file mode 100644 index 00000000000..4e8513eabf4 --- /dev/null +++ b/docs/user_guide/diffusion/parallelism/vae_patch_parallel.md @@ -0,0 +1,200 @@ +# VAE Patch Parallelism Guide + + +## Table of Content + +- [Overview](#overview) +- [Quick Start](#quick-start) +- [Example Script](#example-script) +- [Configuration Parameters](#configuration-parameters) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) +- [Summary](#summary) + +--- + +## Overview + +VAE Patch Parallelism distributes the VAE (Variational AutoEncoder) decode/encode computation across multiple GPUs by splitting the latent space into spatial tiles or patches. Each GPU processes a subset of tiles in parallel, significantly reducing peak memory consumption during the VAE decode stage while maintaining output quality. + +This is particularly useful for: +- **High-resolution image generation** where VAE decode can become a memory bottleneck +- **Memory-constrained environments** where the VAE decode activation peak exceeds available VRAM +- **Multi-GPU setups** where you want to leverage distributed resources for the VAE stage + +See supported models list in [Supported Models](../../diffusion_features.md#supported-models). + + +VAE Patch Parallelism uses two strategies based on image size: + +| Strategy | Use Case | How It Works | Overlap Handling | Output Quality | +|----------|----------|--------------|------------------|----------------| +| **Tiled Decode** | Large images (triggers VAE tiling) | Distributes existing VAE tiling computation across ranks. Each rank decodes a subset of overlapping tiles. 
| Uses VAE's native `blend_v` and `blend_h` functions to seamlessly merge overlapping regions | Bit-identical (same logic as single-GPU tiling) | +| **Patch Decode** | Small images (no VAE tiling) | Splits latent into spatial patches with halos. Each rank decodes one patch with boundary context. | Halo regions provide edge context; core regions are directly stitched without blending | Near-identical (diff < 0.5%, visually imperceptible) | + + +VAE Patch Parallelism **reuses the DiT process group** (`dit_group`) and does not initialize a separate ProcessGroup. This means: + +- **Shared ranks**: VAE patch parallelism uses the same GPU ranks as DiT parallelism (Tensor Parallel, Sequence Parallel, etc.) +- **Combined usage**: VAE patch parallelism is typically used together with other parallelism methods +- **Configuration alignment**: The `vae_patch_parallel_size` should be no greater than the size of your DiT process group + +--- + +## Quick Start + +### Basic Usage + +Simplest working example: + +```python +from vllm_omni import Omni +from vllm_omni.inputs.data import OmniDiffusionSamplingParams +from vllm_omni.diffusion.data import DiffusionParallelConfig + +# TP=2 for DiT, VAE patch parallel also uses these 2 GPUs +omni = Omni( + model="Tongyi-MAI/Z-Image-Turbo", + parallel_config=DiffusionParallelConfig( + tensor_parallel_size=2, # Enable tensor parallelism for DiT + vae_patch_parallel_size=2, # Enable VAE patch parallelism + ), + vae_use_tiling=True, # Required for VAE patch parallelism +) + +outputs = omni.generate( + "a futuristic city at sunset, high resolution, 8k", + OmniDiffusionSamplingParams( + num_inference_steps=9, + height=1152, # High resolution benefits from VAE patch parallel + width=1152, + ), +) +``` + +--- + +## Example Script + +### Offline Inference + +Use Python script under `examples/offline_inference/text_to_image/`: + +```bash +# Text-to-Image with Z-Image +python examples/offline_inference/text_to_image/text_to_image.py \ + --model 
Tongyi-MAI/Z-Image-Turbo \ + --prompt "a futuristic city at sunset" \ + --height 1152 \ + --width 1152 \ + --tensor-parallel-size 2 \ + --vae-patch-parallel-size 2 \ + --vae-use-tiling +``` + +### Online Serving + +You can enable VAE patch parallelism in online serving via `--vae-patch-parallel-size`: + +```bash +# Text-to-Image with Z-Image, TP=2 + VAE patch parallel=2 +vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091 \ + --tensor-parallel-size 2 \ + --vae-patch-parallel-size 2 \ + --vae-use-tiling +``` + +--- + +## Configuration Parameters + +In `DiffusionParallelConfig`: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `vae_patch_parallel_size` | int | 1 | Number of GPUs for VAE patch/tile parallelism. Set to 2 or higher to enable. Should typically match `tensor_parallel_size` as they share the same process group. | + +Additional requirements: + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `vae_use_tiling` | bool | False | Must be `True` when using VAE patch parallelism. | + +!!! note "Automatic VAE Tiling" + When `vae_patch_parallel_size > 1` and the model has a distributed VAE (`DistributedVaeMixin`), the system automatically sets `vae_use_tiling=True` if not already enabled. + +--- + +## Best Practices + +### When to Use + +**Good for:** + +- High-resolution image generation and long video generation +- Memory-constrained setups where VAE decode causes OOM +- Multi-GPU environments + +**Not for:** + +- Low-resolution images/videos where VAE decode is not a bottleneck +- Single GPU setups, which should use plain VAE tiling rather than parallel VAE tiling +- Models that do not support VAE patch parallelism + +--- + +## Troubleshooting + +### Common Issue 1: Model Does Not Support VAE Patch Parallelism + +**Symptoms**: +``` +WARNING: vae_patch_parallel_size=2 is set but VAE patch parallelism is NOT enabled for xxxPipeline; ignoring.
+``` + +**Root Cause**: VAE Patch Parallelism requires the model's VAE to implement `DistributedVaeMixin`. At startup, `vllm_omni/diffusion/registry.py` checks whether the instantiated pipeline has a `.vae` attribute that is an instance of `DistributedVaeMixin`. If it does not, the setting is ignored and a warning is logged: + +```python +vae_pp_size = od_config.parallel_config.vae_patch_parallel_size +is_distributed_vae = hasattr(model, "vae") and isinstance(model.vae, DistributedVaeMixin) +if vae_pp_size > 1 and not is_distributed_vae: + logger.warning( + "vae_patch_parallel_size=%d is set but VAE patch parallelism is NOT enabled for %s; ignoring.", + vae_pp_size, + od_config.model_class_name, + ) +``` + +**Solutions**: + +1. **Use a supported model** (recommended): check [Supported Models](../../diffusion_features.md#supported-models) for the VAE-Patch-Parallel column. + +2. To add support for a new model, implement `DistributedVaeMixin` on its VAE class (contributions are welcome). + + +### Common Issue 2: `vae_patch_parallel_size` Exceeds DiT Process Group Size + +**Symptoms**: A warning is logged and `vae_patch_parallel_size` is reduced to the DiT process group size + +**Root Cause**: VAE Patch Parallelism reuses the DiT process group. + +**Recommendation**: Always set `vae_patch_parallel_size` to be no greater than your DiT process group size. + +The DiT process group size is: +```text +dit_parallel_size = data_parallel_size + × cfg_parallel_size + × sequence_parallel_size + × pipeline_parallel_size + × tensor_parallel_size + +``` +_sequence_parallel_size = ulysses_degree × ring_degree_ + +--- + +## Summary + +1. ✅ **Enable VAE Patch Parallelism** - Set `vae_patch_parallel_size` in `DiffusionParallelConfig` together with `vae_use_tiling=True` to reduce VAE decode peak memory +2. ✅ **Use for Large Outputs** - VAE patch parallelism benefits are most apparent for high-resolution images and long videos +3.
✅ **Combine with other parallelism methods** - Suggest to use together with Tensor Parallel or CFG-Parallel for maximum memory savings diff --git a/docs/user_guide/diffusion/parallelism_acceleration.md b/docs/user_guide/diffusion/parallelism_acceleration.md deleted file mode 100644 index b6c8aed5935..00000000000 --- a/docs/user_guide/diffusion/parallelism_acceleration.md +++ /dev/null @@ -1,477 +0,0 @@ -# Parallelism Acceleration Guide - -This guide includes how to use parallelism methods in vLLM-Omni to speed up diffusion model inference as well as reduce the memory requirement on each device. - -## Overview - -The following parallelism methods are currently supported in vLLM-Omni: - -1. DeepSpeed Ulysses Sequence Parallel (DeepSpeed Ulysses-SP) ([arxiv paper](https://arxiv.org/pdf/2309.14509)): Ulysses-SP splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads. - -2. [Ring-Attention](#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded - -3. Classifier-Free-Guidance Parallel (CFG-Parallel): CFG-Parallel runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step. - -4. [Tensor Parallelism](#tensor-parallelism): Tensor parallelism shards model weights across devices. This can reduce per-GPU memory usage. Note that for diffusion models we currently shard the majority of layers within the DiT. - -5. [VAE Patch Parallelism](#vae-patch-parallelism): VAE patch parallelism shards VAE decode spatially across ranks. This can reduce the peak memory of VAE decode and (depending on resolution and communication overhead) speed up VAE decode. - -6. [HSDP](#hsdp): Hybrid Sharded Data Parallel shards model weights across GPUs using PyTorch FSDP2. 
This reduces per-GPU memory usage, enabling inference of large models on GPUs with limited memory. - -7. [Expert Parallel](#expert-parallelism): Expert Parallelism shards the Experts of a Mixture-of-Experts (MoE) layer across multiple devices. During the forward, a gating mechanism routes tokens to their designated experts, necessitating cross-cards communication(all-to-all) to dispatch tokens to the correct ranks and combine the results. This parallelism allows for massive scaling of model parameters without a proportional increase in the computational load per device. - -The following table shows which models are currently supported by parallelism method: - -### ImageGen - -| Model | Model Identifier | Ulysses-SP | Ring-SP | CFG-Parallel | Tensor-Parallel | VAE-Patch-Parallel | Expert-Parallel | HSDP | -|--------------------------|--------------------------------------|:----------:|:-------:|:------------:|:---------------:|:------------------:|:---------------:|:----:| -| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ✅ | ✅ | ❌ | ✅ | ❌ | N/A | ❌ | -| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ✅ | ✅ | ❌ | ✅ | ❌ | N/A | ❌ | -| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ❌ | ❌ | ❌ | ❌ | N/A | ❌ | -| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ | ✅ | N/A | ❌ | -| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ❌ | N/A | ❌ | -| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ | ❌ | N/A | ❌ | -| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ✅ | ✅ | ✅ | ✅ | ❌ | N/A | ❌ | -| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ✅ (TP=2 only) | ✅ | N/A | ❌ | -| **Stable-Diffusion3.5** | `stabilityai/stable-diffusion-3.5` | ❌ | ❌ | ❌ | ✅ | ✅ | N/A | ❌ | -| **FLUX.2-klein** | `black-forest-labs/FLUX.2-klein-4B` | ✅ | ✅ | ❌ | ✅ | ❌ | N/A | ✅ | -| **FLUX.1-dev** | `black-forest-labs/FLUX.1-dev` | ❌ | ❌ | ✅ | ✅ | ❌ | N/A | ✅ | -| **FLUX.2-dev** | `black-forest-labs/FLUX.2-dev` | ❌ | ❌ | ❌ | ✅ | ❌ | 
N/A | ✅ | -| **HunyuanImage3.0** | `tencent/HunyuanImage-3.0`, `tencent/HunyuanImage-3.0-Instruct` | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | -| **Bagel** | `ByteDance-Seed/BAGEL-7B-MoT` | ✅ | ✅ | ✅ | ✅ | ❌ | N/A | ❌ | -| **DreamID-Omni** | `XuGuo699/DreamID-Omni` | ❌ | ❌ | ✅ | ❌ | ❌ | N/A | ❌ | -| **FLUX.1-Kontext-dev** | `black-forest-labs/FLUX.1-Kontext-dev` | ❌ | ❌ | ❌ | ✅ | ❌ | N/A | ✅ | -| **OmniGen2** | `OmniGen2/OmniGen2` | ❌ | ❌ | ❌ | ✅ | ❌ | N/A | ❌ | - -!!! note "TP Limitations for Diffusion Models" - We currently implement Tensor Parallelism (TP) only for the DiT (Diffusion Transformer) blocks. This is because the `text_encoder` component in vLLM-Omni uses the original Transformers implementation, which does not yet support TP. - - - Good news: The text_encoder typically has minimal impact on overall inference performance. - - Bad news: When TP is enabled, every TP process retains a full copy of the text_encoder weights, leading to significant GPU memory waste. - - We are actively refactoring this design to address this. For details and progress, please refer to [Issue #771](https://github.com/vllm-project/vllm-omni/issues/771). - - -!!! note "Why Z-Image is TP=2 only" - Z-Image Turbo is currently limited to `tensor_parallel_size` of **1 or 2** due to model shape divisibility constraints. - For example, the model has `n_heads=30` and a final projection out dimension of `64`, so valid TP sizes must divide both 30 and 64; the only common divisors are **1 and 2**. 
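
The divisibility constraint in the note above can be checked with a few lines of Python. This is an illustrative sketch, not vLLM-Omni code: `valid_tp_sizes` is a hypothetical helper, and its only inputs are the two dimensions quoted in the note (`n_heads=30` and a final projection output dimension of `64`).

```python
def valid_tp_sizes(*dims: int) -> list[int]:
    """Return every size that evenly divides all of the given dimensions."""
    smallest = min(dims)
    return [
        size
        for size in range(1, smallest + 1)
        if all(dim % size == 0 for dim in dims)
    ]


# Dimensions taken from the note: 30 attention heads, projection out dim 64.
print(valid_tp_sizes(30, 64))  # prints [1, 2]
```

The same check can be applied to any other pair of shapes to see which tensor parallel sizes would divide both evenly.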
- -### VideoGen - -| Model | Model Identifier | Ulysses-SP | Ring-Attention | Tensor-Parallel | HSDP | VAE-Patch-Parallel | -|-------|------------------|:----------:|:--------------:|:---------------:|:----:| :----:| -| **Wan2.1** | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | ✅ | ✅ | ✅ | ✅ | ✅ | -| **Wan2.1** | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | ✅ | ✅ | ✅ | ✅ | ✅ | -| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ✅ | ✅ | ✅ | ✅ | ✅ | -| **LTX-2** | `Lightricks/LTX-2` | ✅ | ✅ | ✅ | ❌ | ❌ | - -### Tensor Parallelism - -Tensor parallelism splits model parameters across GPUs. In vLLM-Omni, tensor parallelism is configured via `DiffusionParallelConfig.tensor_parallel_size`. - -#### Offline Inference - -```python -from vllm_omni import Omni -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="Tongyi-MAI/Z-Image-Turbo", - parallel_config=DiffusionParallelConfig(tensor_parallel_size=2), -) - -outputs = omni.generate( - "a cat reading a book", - OmniDiffusionSamplingParams( - num_inference_steps=9, - width=512, - height=512, - ), -) -``` - -### VAE Patch Parallelism - -VAE patch parallelism distributes the VAE decode workload across multiple ranks by splitting the latent spatially. It is configured via `DiffusionParallelConfig.vae_patch_parallel_size` and can be combined with other parallelism methods (e.g., TP). - -!!! note "Enablement and feature gate" - - VAE patch parallelism is currently **enabled only for validated pipelines** (check [ImageGen](#imagegen) and [VideoGen](#videogen) for more information). - - If `vae_patch_parallel_size > 1` is set for a validated pipeline, vLLM-Omni will automatically enable `vae_use_tiling` as a safety gate. (We use `vae_use_tiling` because it indicates the VAE supports diffusers tiling parameters like `tile_latent_min_size` and `tile_overlap_factor`.) 
- -#### Offline Inference - -```python -from vllm_omni import Omni -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="Tongyi-MAI/Z-Image-Turbo", - parallel_config=DiffusionParallelConfig( - tensor_parallel_size=2, - vae_patch_parallel_size=2, - ), - vae_use_tiling=True, -) - -outputs = omni.generate( - prompt="a cat reading a book", - num_inference_steps=9, - width=1024, - height=1024, -) -``` - -#### How it works (method selection) - -VAE patch parallelism automatically selects between two internal decode methods based on whether diffusers tiling would kick in: - -- `_distributed_tiled_decode`: Used when the latent spatial size exceeds `vae.tile_latent_min_size` (i.e., diffusers tiled decode). Each rank decodes a subset of tiles; rank0 gathers and runs the same overlap+blend+stitch logic as diffusers. This matches the single-rank diffusers tiled output. - -- `_distributed_patch_decode`: Used when diffusers tiling would not kick in. Each rank decodes a grid patch expanded with a latent-space halo; then rank0 gathers the cropped core patches and stitches them into the full image. This path has no blending and can introduce small numerical differences compared to the non-parallel decode. - -### Sequence Parallelism - -#### Ulysses-SP - -!!! note "Experimental UAA mode" - `ulysses_mode="advanced_uaa"` is an experimental extension to Ulysses-SP. It lets Ulysses attention handle arbitrary sequence lengths and arbitrary attention head counts without relying on `attention_mask`-based token padding. - - In hybrid Ulysses + Ring mode, Ring still requires every rank in the same ring group to observe the same post-Ulysses sequence length. If that condition is not met, vLLM-Omni raises a validation error instead of entering the ring kernel with inconsistent shapes. 
- -##### Offline Inference - -An example of offline inference script using [Ulysses-SP](https://arxiv.org/pdf/2309.14509) is shown below: -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig -ulysses_degree = 2 - -omni = Omni( - model="Qwen/Qwen-Image", - parallel_config=DiffusionParallelConfig(ulysses_degree=2) -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048), -) -``` - -See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example. - -To enable the experimental UAA mode explicitly, use a model/configuration that actually requires it. For example, `Tongyi-MAI/Z-Image-Turbo` has 30 attention heads, so `ulysses_degree=4` requires UAA because 30 is not divisible by 4: - -```python -omni = Omni( - model="Tongyi-MAI/Z-Image-Turbo", - parallel_config=DiffusionParallelConfig( - ulysses_degree=4, - ulysses_mode="advanced_uaa", - ), -) -``` - -##### Online Serving - -You can enable Ulysses-SP in online serving for diffusion models via `--usp`: - -```bash -# Text-to-image (requires >= 2 GPUs) -vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 - -# Experimental UAA mode for a model with 30 attention heads -vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091 --usp 4 --ulysses-mode advanced_uaa -``` - -##### Benchmarks -!!! note "Benchmark Disclaimer" - These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. 
Actual performance may vary based on: - - - Specific model and use case - - Hardware configuration - - Careful parameter tuning - - Different inference settings (e.g., number of steps, image resolution) - - -To measure the parallelism methods, we run benchmarks with **Qwen/Qwen-Image** model generating images (**2048x2048** as long sequence input) with 50 inference steps. The hardware devices are NVIDIA H800 GPUs. `sdpa` is the attention backends. - -| Configuration | Ulysses degree |Generation Time | Speedup | -|---------------|----------------|---------|---------| -| **Baseline (diffusers)** | - | 112.5s | 1.0x | -| Ulysses-SP | 2 | 65.2s | 1.73x | -| Ulysses-SP | 4 | 39.6s | 2.84x | -| Ulysses-SP | 8 | 30.8s | 3.65x | - -#### Ring-Attention - -Ring-Attention ([arxiv paper](https://arxiv.org/abs/2310.01889)) splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results. Unlike Ulysses-SP which uses all-to-all communication, Ring-Attention keeps the sequence dimension sharded throughout the computation and circulates Key/Value blocks through a ring topology. - -##### Offline Inference - -An example of offline inference script using Ring-Attention is shown below: -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig -ring_degree = 2 - -omni = Omni( - model="Qwen/Qwen-Image", - parallel_config=DiffusionParallelConfig(ring_degree=2) -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048), -) -``` - -See `examples/offline_inference/text_to_image/text_to_image.py` for a complete working example. 
- - -##### Online Serving - -You can enable Ring-Attention in online serving for diffusion models via `--ring`: - -```bash -# Text-to-image (requires >= 2 GPUs) -vllm serve Qwen/Qwen-Image --omni --port 8091 --ring 2 -``` - -##### Benchmarks -!!! note "Benchmark Disclaimer" - These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on: - - - Specific model and use case - - Hardware configuration - - Careful parameter tuning - - Different inference settings (e.g., number of steps, image resolution) - - -To measure the parallelism methods, we run benchmarks with **Qwen/Qwen-Image** model generating images (**1024x1024** as long sequence input) with 50 inference steps. The hardware devices are NVIDIA A100 GPUs. `flash_attn` is the attention backends. - -| Configuration | Ring degree |Generation Time | Speedup | -|---------------|----------------|---------|---------| -| **Baseline (diffusers)** | - | 45.2s | 1.0x | -| Ring-Attention | 2 | 29.9s | 1.51x | -| Ring-Attention | 4 | 23.3s | 1.94x | - - -#### Hybrid Ulysses + Ring - -You can combine both Ulysses-SP and Ring-Attention for larger scale parallelism. The total sequence parallel size equals `ulysses_degree × ring_degree`. - -!!! note "Experimental UAA in hybrid mode" - `ulysses_mode="advanced_uaa"` can also be used with hybrid Ulysses + Ring, but this does not remove Ring's shape requirement. Every rank in the same ring group must still have the same post-Ulysses sequence length. 
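
The sizing rules in this section can be sketched in plain Python. This is a hedged illustration rather than vLLM-Omni code: `sequence_parallel_size` and `needs_uaa` are hypothetical helpers that only encode two facts stated in this guide, namely that the total sequence parallel size is `ulysses_degree × ring_degree`, and that a head count not divisible by `ulysses_degree` calls for the experimental UAA mode.

```python
def sequence_parallel_size(ulysses_degree: int, ring_degree: int) -> int:
    """Total number of ranks used for sequence parallelism in hybrid mode."""
    return ulysses_degree * ring_degree


def needs_uaa(num_heads: int, ulysses_degree: int) -> bool:
    """True when the heads cannot be split evenly across the Ulysses ranks."""
    return num_heads % ulysses_degree != 0


# Hybrid 2 x 2 configuration: four GPUs in total.
print(sequence_parallel_size(ulysses_degree=2, ring_degree=2))

# Z-Image-Turbo is described as having 30 attention heads.
print(needs_uaa(num_heads=30, ulysses_degree=4))
print(needs_uaa(num_heads=30, ulysses_degree=2))
```

Note that this sketch does not model Ring's additional requirement that every rank in a ring group sees the same post-Ulysses sequence length.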
- -##### Offline Inference - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig - -# Hybrid: 2 Ulysses × 2 Ring = 4 GPUs total -omni = Omni( - model="Qwen/Qwen-Image", - parallel_config=DiffusionParallelConfig( - ulysses_degree=2, - ring_degree=2, - ) -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048), -) -``` - -##### Online Serving - -```bash -# Text-to-image (requires >= 4 GPUs) -vllm serve Qwen/Qwen-Image --omni --port 8091 --usp 2 --ring 2 -``` - -##### Benchmarks -!!! note "Benchmark Disclaimer" - These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on: - - - Specific model and use case - - Hardware configuration - - Careful parameter tuning - - Different inference settings (e.g., number of steps, image resolution) - - -To measure the parallelism methods, we run benchmarks with **Qwen/Qwen-Image** model generating images (**1024x1024** as long sequence input) with 50 inference steps. The hardware devices are NVIDIA A100 GPUs. `flash_attn` is the attention backends. - -| Configuration | Ulysses degree | Ring degree | Generation Time | Speedup | -|---------------|----------------|-------------|-----------------|---------| -| **Baseline (diffusers)** | - | - | 45.2s | 1.0x | -| Hybrid Ulysses + Ring | 2 | 2 | 24.3s | 1.87x | - - -### CFG-Parallel - -#### Offline Inference - -CFG-Parallel is enabled through `DiffusionParallelConfig(cfg_parallel_size=2)`, which runs one rank for the positive branch and one rank for the negative branch. 
- -An example of offline inference using CFG-Parallel (image-to-image) is shown below: - -```python -from vllm_omni import Omni -from vllm_omni.diffusion.data import DiffusionParallelConfig - -image_path = "path_to_image.png" -omni = Omni( - model="Qwen/Qwen-Image-Edit", - parallel_config=DiffusionParallelConfig(cfg_parallel_size=2), -) -input_image = Image.open(image_path).convert("RGB") - -outputs = omni.generate( - { - "prompt": "turn this cat to a dog", - "negative_prompt": "low quality, blurry", - "multi_modal_data": {"image": input_image}, - }, - OmniDiffusionSamplingParams( - true_cfg_scale=4.0, - num_inference_steps=50, - ), -) -``` - -Notes: - -- CFG-Parallel is only effective when a `negative_prompt` is provided AND a guidance scale (or `cfg_scale`) is greater than 1. - -See `examples/offline_inference/image_to_image/image_edit.py` for a complete working example. -```bash -cd examples/offline_inference/image_to_image/ -python image_edit.py \ - --model "Qwen/Qwen-Image-Edit" \ - --image "qwen_image_output.png" \ - --prompt "turn this cat to a dog" \ - --negative-prompt "low quality, blurry" \ - --cfg-scale 4.0 \ - --output "edited_image.png" \ - --cfg-parallel-size 2 -``` - -#### Online Serving - -You can enable CFG-Parallel in online serving for diffusion models via `--cfg-parallel-size`: - -```bash -vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 --cfg-parallel-size 2 -``` - -### HSDP - -HSDP (Hybrid Sharded Data Parallel) shards model weights across GPUs to reduce per-GPU memory usage. This enables inference of large models (e.g., Wan2.2 14B) on GPUs with limited memory. - -Unlike Tensor Parallelism which splits computation, HSDP uses PyTorch's FSDP2 to shard and redistribute weights at runtime. Each GPU only holds a fraction of the model weights, and weights are gathered on-demand during forward passes. 
- -#### Configuration - -HSDP is configured via `DiffusionParallelConfig`: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `use_hsdp` | bool | False | Enable HSDP | -| `hsdp_shard_size` | int | -1 | Number of GPUs to shard weights across. -1 = auto (requires other parallelism > 1) | -| `hsdp_replicate_size` | int | 1 | Number of replica groups. Each group holds a full sharded copy | - -**Constraints:** - -- `hsdp_replicate_size × hsdp_shard_size == world_size` -- HSDP cannot be used with Tensor Parallelism (`tensor_parallel_size` must be 1) - -#### Operating Modes - -HSDP can work in two modes: - -- **Standalone Mode**: HSDP alone without other parallelism. Must specify `hsdp_shard_size` explicitly. -- **Combined Mode**: HSDP overlays on top of other parallelism (Ulysses Sequence Parallel, CFG Parallel). HSDP dimensions must match world_size. - -#### Offline Inference - -**Standalone HSDP** (shard across 4 GPUs, no other parallelism): - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="Wan-AI/Wan2.2-T2V-A14B-Diffusers", - parallel_config=DiffusionParallelConfig( - use_hsdp=True, - hsdp_shard_size=4, # Shard across 4 GPUs - ), -) - -outputs = omni.generate( - "A cat playing piano", - OmniDiffusionSamplingParams(num_inference_steps=50), -) -``` - -**Combined HSDP + Sequence Parallel**: - -```python -omni = Omni( - model="Wan-AI/Wan2.2-T2V-A14B-Diffusers", - parallel_config=DiffusionParallelConfig( - ulysses_degree=4, # Sequence parallel - use_hsdp=True, # HSDP overlays on SP - ), -) -``` - -#### Online Serving - -**Standalone HSDP** (shard model across 4 GPUs): - -```bash -vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091 --use-hsdp --hsdp-shard-size 4 -``` - -**Combined with Sequence Parallel**: - -```bash -vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091 
--use-hsdp --usp 4 -``` - -#### Adding HSDP Support to New Models - -For detailed instructions on adding HSDP support to new models, see the [HSDP Contributing Guide](../../design/feature/hsdp.md). - -### Expert Parallelism - -Unlike Tensor Parallelism which shards every layer's weights, EP only shards the MoE expert MLP blocks. This significantly reduces the memory footprint of MoE models (e.g., HunyuanImage3.0) while maintaining constant dense-equivalent compute efficiency. Expert Parallelism is enabled via `DiffusionParallelConfig.enable_expert_parallel`. And `self.ep = tp * sp * cfg * dp` for now, so at least one of TP/SP/CFG/DP should set when EP enabled. - -#### Offline Inference - -```python -from vllm_omni import Omni -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="tencent/HunyuanImage-3.0", - parallel_config=DiffusionParallelConfig(tensor_parallel_size=8, enable_expert_parallel=True), -) - -outputs = omni.generate( - "A brown and white dog is running on the grass", - OmniDiffusionSamplingParams( - num_inference_steps=50, - width=1024, - height=1024, - ), -) -``` diff --git a/docs/user_guide/diffusion/step_execution.md b/docs/user_guide/diffusion/step_execution.md index 99c2878506e..f8c9fa8ddb2 100644 --- a/docs/user_guide/diffusion/step_execution.md +++ b/docs/user_guide/diffusion/step_execution.md @@ -46,7 +46,7 @@ its stepwise request state machine. For normal diffusion inference, leave it disabled unless your workflow depends on this mode. If you are looking for general diffusion speedups, see -[Diffusion Acceleration Overview](../diffusion_acceleration.md). +[Diffusion Features Overview](../diffusion_features.md). 
## Troubleshooting diff --git a/docs/user_guide/diffusion/teacache.md b/docs/user_guide/diffusion/teacache.md deleted file mode 100644 index c90076e202d..00000000000 --- a/docs/user_guide/diffusion/teacache.md +++ /dev/null @@ -1,146 +0,0 @@ -# TeaCache Configuration Guide - -TeaCache speeds up diffusion model inference by caching transformer computations when consecutive timesteps are similar. This typically provides **1.5x-2.0x speedup** with minimal quality loss. - -## Quick Start - -Enable TeaCache by setting `cache_backend` to `"tea_cache"`: - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__ -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="tea_cache", - cache_config={ - "rel_l1_thresh": 0.2 # Optional, defaults to 0.2 - } -) -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams( - num_inference_steps=50, - ), -) -``` - -### Using Environment Variable - -You can also enable TeaCache via environment variable: - -```bash -export DIFFUSION_CACHE_BACKEND=tea_cache -``` - -Then initialize without explicitly setting `cache_backend`: - -```python -from vllm_omni import Omni - -omni = Omni( - model="Qwen/Qwen-Image", - cache_config={"rel_l1_thresh": 0.2} # Optional -) -``` - -## Online Serving (OpenAI-Compatible) - -Enable TeaCache for online serving by passing `--cache-backend tea_cache` when starting the server: - -```bash -vllm serve Qwen/Qwen-Image --omni --port 8091 \ - --cache-backend tea_cache \ - --cache-config '{"rel_l1_thresh": 0.2}' -``` - -## Configuration Parameters - -### `rel_l1_thresh` (float, default: `0.2`) - -Controls the balance between speed and quality. Lower values prioritize quality, higher values prioritize speed. 
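
To make the speed/quality trade-off concrete, here is a simplified sketch of threshold-based step caching. It is an illustration under stated assumptions, not the actual TeaCache or vLLM-Omni implementation: `relative_l1` and `plan_steps` are hypothetical helpers, the feature vectors are made-up numbers, and the accumulate-then-reset rule is an assumed simplification of how a relative-L1 threshold can gate recomputation.

```python
def relative_l1(prev: list[float], curr: list[float]) -> float:
    """Absolute change between two feature vectors, relative to the previous magnitude."""
    change = sum(abs(c - p) for p, c in zip(prev, curr))
    magnitude = sum(abs(p) for p in prev)
    return change / magnitude if magnitude else float("inf")


def plan_steps(features: list[list[float]], rel_l1_thresh: float = 0.2) -> list[bool]:
    """Return one flag per step: True = run the transformer, False = reuse the cache."""
    decisions = []
    accumulated = 0.0
    prev = None
    for curr in features:
        if prev is None:
            decisions.append(True)  # the first step always computes
        else:
            accumulated += relative_l1(prev, curr)
            if accumulated < rel_l1_thresh:
                decisions.append(False)  # drift is still small: reuse cached output
            else:
                decisions.append(True)  # drift too large: recompute and reset
                accumulated = 0.0
        prev = curr
    return decisions


# Toy per-step features (assumed values, for illustration only).
toy_features = [[1.0, 1.0], [1.05, 1.0], [1.1, 1.0], [1.5, 1.0], [1.5, 1.0]]
print(plan_steps(toy_features, rel_l1_thresh=0.2))
print(plan_steps(toy_features, rel_l1_thresh=0.6))
```

With these toy inputs, raising the threshold from 0.2 to 0.6 reduces the number of recomputed steps from two to one, which matches the direction given in the parameter description: higher values favor speed.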
- -**Recommended values:** - -- `0.2` - **~1.5x speedup** with minimal quality loss (recommended) -- `0.4` - **~1.8x speedup** with slight quality loss -- `0.6` - **~2.0x speedup** with noticeable quality loss -- `0.8` - **~2.25x speedup** with significant quality loss - -## Examples - -### Python API - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="tea_cache", - cache_config={"rel_l1_thresh": 0.2} -) -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams( - num_inference_steps=50, - ), -) -``` - -## Performance Tuning - -Start with the default `rel_l1_thresh=0.2` and adjust based on your needs: - -- **Maximum quality**: Use `0.1-0.2` -- **Balanced**: Use `0.2-0.4` (recommended) -- **Maximum speed**: Use `0.6-0.8` (may reduce quality) - -## Troubleshooting - -### Quality Degradation - -If you notice quality issues, lower the threshold: - -```python -cache_config={"rel_l1_thresh": 0.1} # More conservative caching -``` - -## Supported Models - -### ImageGen - - - -| Architecture | Models | Example HF Models | -|--------------|--------|-------------------| -| `QwenImagePipeline` | Qwen-Image | `Qwen/Qwen-Image` | -| `QwenImageEditPipeline` | Qwen-Image-Edit | `Qwen/Qwen-Image-Edit` | -| `QwenImageEditPlusPipeline` | Qwen-Image-Edit-2509 | `Qwen/Qwen-Image-Edit-2509` | -| `QwenImageLayeredPipeline` | Qwen-Image-Layered | `Qwen/Qwen-Image-Layered` | -| `BagelForConditionalGeneration` | BAGEL (DiT-only) | `ByteDance-Seed/BAGEL-7B-MoT` | -| `HunyuanImage3Pipeline` | HunyuanImage3 | `tencent/HunyuanImage-3.0-Instruct` | - -### VideoGen - -No VideoGen models are supported by TeaCache yet. 
- -### Coming Soon - - - -| Architecture | Models | Example HF Models | -|--------------|--------|-------------------| -| `FluxPipeline` | Flux | - | -| `CogVideoXPipeline` | CogVideoX | - | diff --git a/docs/user_guide/diffusion_acceleration.md b/docs/user_guide/diffusion_acceleration.md deleted file mode 100644 index 5edfa22376f..00000000000 --- a/docs/user_guide/diffusion_acceleration.md +++ /dev/null @@ -1,380 +0,0 @@ -# Diffusion Acceleration Overview - -vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods** that intelligently cache intermediate computations to avoid redundant work across diffusion timesteps, **parallelism methods** that distribute the computation across multiple devices, and **quantization methods** that reduce memory footprint while preserving accuracy. - -## Supported Acceleration Methods - -vLLM-Omni currently supports two main cache acceleration backends: - -1. **[TeaCache](diffusion/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar -2. **[Cache-DiT](diffusion/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques: - - **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences - - **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference - - **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking - -Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality. - -vLLM-Omni also supports quantization methods: - -3. **[Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8 or Int8, providing ~1.28x speedup with minimal quality loss. Supports per-layer skip for sensitive layers. 
- -vLLM-Omni also supports parallelism methods for diffusion models, including: - -1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads. - -2. [Ring-Attention](diffusion/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded. - -3. [CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step. - -4. [Tensor Parallelism](diffusion/parallelism_acceleration.md#tensor-parallelism) - shards DiT weights across devices to reduce per-GPU memory usage. - -5. [VAE Patch Parallelism](diffusion/parallelism_acceleration.md#vae-patch-parallelism) - shards VAE decode/encode spatially across ranks to reduce VAE peak memory (and can speed up VAE decode). - -6. [HSDP](diffusion/parallelism_acceleration.md#hsdp) - Hybrid Sharded Data Parallel shards model weights across GPUs to reduce per-GPU memory usage, enabling inference of large models on limited GPU memory. 
- -## Quick Comparison - -### Cache Methods - -| Method | Configuration | Description | Best For | -|--------|--------------|-------------|----------| -| **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality | -| **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control | - -### Quantization Methods - -| Method | Configuration | Description | Best For | -|--------|--------------|-------------|----------| -| **FP8** | `quantization="fp8"` | FP8 W8A8 on Ada/Hopper, weight-only on older GPUs | Memory reduction, inference speedup | -| **Int8** | `quantization="int8"` | Int8 W8A8 | Memory reduction, inference speedup | - -## Supported Models - -The following table shows which models are currently supported by each acceleration method: - -### ImageGen - -| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel | Tensor-Parallel | VAE-Patch-Parallel | -|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:------------:|:---------------:|:------------------:| -| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | -| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | -| **Ovis-Image** | `OvisAI/Ovis-Image` | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | -| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| **Qwen-Image-Edit** | `Qwen/Qwen-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | -| **Qwen-Image-Edit-2509** | `Qwen/Qwen-Image-Edit-2509` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | -| **Qwen-Image-Layered** | `Qwen/Qwen-Image-Layered` | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | -| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ (TP=2 only) | ✅ | -| **Stable-Diffusion3.5** | 
`stabilityai/stable-diffusion-3.5` | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | -| **Bagel** | `ByteDance-Seed/BAGEL-7B-MoT` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | -| **FLUX.1-dev** | `black-forest-labs/FLUX.1-dev` | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | -| **NextStep-1.1** | `stepfun-ai/NextStep-1.1` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | -| **FLUX.2-klein** | `black-forest-labs/FLUX.2-klein-4B` | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | -| **FLUX.2-dev** | `black-forest-labs/FLUX.2-dev` | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | -| **FLUX.1-Kontext-dev** | `black-forest-labs/FLUX.1-Kontext-dev` | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | -| **GLM-Image** | `zai-org/GLM-Image` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | - -### VideoGen - -| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel | HSDP | VAE-Patch-Parallel | -|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:------------:|:----:|:----:| -| **Wan2.1-T2V** | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| **Wan2.1-T2V** | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| **LTX-2** | `Lightricks/LTX-2` | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | -| **DreamID-Omni** | `XuGuo699/DreamID-Omni` | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | - -### Quantization - -| Model | Model Identifier | FP8 | Int8 | -|-------|------------------|:---:|:---:| -| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ | -| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ | -| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ | - - -### AudioGen - -| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel | -|------------------------|------------------------------------------|:--------:|:---------:|:----------:|:--------------:|:------------:| -| **Stable-Audio-Open** | `stabilityai/stable-audio-open-1.0` | ✅ | | | | | - -## Performance Benchmarks - -The following benchmarks were measured on **Qwen/Qwen-Image** and 
**Qwen/Qwen-Image-Edit** models generating 1024x1024 images with 50 inference steps: - -!!! note "Benchmark Disclaimer" - These benchmarks are provided for **general reference only**. The configurations shown use default or common parameter settings and have not been exhaustively optimized for maximum performance. Actual performance may vary based on: - - - Specific model and use case - - Hardware configuration - - Careful parameter tuning - - Different inference settings (e.g., number of steps, image resolution) - - For optimal performance in your specific scenario, we recommend experimenting with different parameter configurations as described in the detailed guides below. - -| Model | Cache Backend | Cache Config | Generation Time | Speedup | Notes | -|-------|---------------|--------------|----------------|---------|-------| -| **Qwen/Qwen-Image** | None | None | 20.0s | 1.0x | Baseline (diffusers) | -| **Qwen/Qwen-Image** | TeaCache | `rel_l1_thresh=0.2` | 10.47s | **1.91x** | Recommended default setting | -| **Qwen/Qwen-Image** | Cache-DiT | DBCache + TaylorSeer (Fn=1, Bn=0, W=8, TaylorSeer order=1) | 10.8s | **1.85x** | - | -| **Qwen/Qwen-Image** | Cache-DiT | DBCache + TaylorSeer + SCM (Fn=8, Bn=0, W=4, TaylorSeer order=1, SCM fast) | 14.0s | **1.43x** | - | -| **Qwen/Qwen-Image-Edit** | None | No acceleration | 51.5s | 1.0x | Baseline (diffusers) | -| **Qwen/Qwen-Image-Edit** | Cache-DiT | Default (Fn=1, Bn=0, W=4, TaylorSeer disabled, SCM disabled) | 21.6s | **2.38x** | - | - -To measure the parallelism methods, we run benchmarks with **Qwen/Qwen-Image** model generating images (**2048x2048** as long sequence input) with 50 inference steps. The hardware devices are NVIDIA H800 GPUs. `sdpa` is the attention backends. 
- -| Configuration | Ulysses degree |Generation Time | Speedup | -|---------------|----------------|---------|---------| -| **Baseline (diffusers)** | - | 112.5s | 1.0x | -| Ulysses-SP | 2 | 65.2s | 1.73x | -| Ulysses-SP | 4 | 39.6s | 2.84x | -| Ulysses-SP | 8 | 30.8s | 3.65x | - -## Quick Start - -### Using TeaCache - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="tea_cache", - cache_config={"rel_l1_thresh": 0.2} # Optional, defaults to 0.2 -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams( - num_inference_steps=50, - ), -) -``` - -### Using Cache-DiT - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -omni = Omni( - model="Qwen/Qwen-Image", - cache_backend="cache_dit", - cache_config={ - "Fn_compute_blocks": 1, - "Bn_compute_blocks": 0, - "max_warmup_steps": 8, - "enable_taylorseer": True, - "taylorseer_order": 1, - } -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams( - num_inference_steps=50, - ), -) -``` - -### Using Ulysses-SP - -Run text-to-image: -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig -ulysses_degree = 2 - -omni = Omni( - model="Qwen/Qwen-Image", - parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree) -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048), -) -``` - - -Run image-to-image: -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig -ulysses_degree = 2 - -omni = Omni( - model="Qwen/Qwen-Image-Edit", - 
parallel_config=DiffusionParallelConfig(ulysses_degree=ulysses_degree) -) - -outputs = omni.generate( - { - "prompt": "turn this cat to a dog", - "multi_modal_data": {"image": input_image} - }, - OmniDiffusionSamplingParams(num_inference_steps=50), -) -``` - -### Using Ring-Attention - -Run text-to-image: -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig -ring_degree = 2 - -omni = Omni( - model="Qwen/Qwen-Image", - parallel_config=DiffusionParallelConfig(ring_degree=2) -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50, width=2048, height=2048), -) -``` - -### Using Tensor Parallelism - -```python -from vllm_omni import Omni -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="Tongyi-MAI/Z-Image-Turbo", - parallel_config=DiffusionParallelConfig(tensor_parallel_size=2), -) - -outputs = omni.generate( - prompt="a cat reading a book", - num_inference_steps=9, - width=512, - height=512, -) -``` - -### Using VAE Patch Parallelism - -```python -from vllm_omni import Omni -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="Tongyi-MAI/Z-Image-Turbo", - parallel_config=DiffusionParallelConfig( - tensor_parallel_size=2, - vae_patch_parallel_size=2, - ), -) - -outputs = omni.generate( - prompt="a cat reading a book", - num_inference_steps=9, - width=1024, - height=1024, -) -``` - -### Using HSDP - -HSDP (Hybrid Sharded Data Parallel) shards model weights across GPUs to reduce per-GPU memory usage. This enables inference of large models (e.g., Wan2.2 14B) on GPUs with limited memory. 
- -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig - -omni = Omni( - model="Wan-AI/Wan2.2-T2V-A14B-Diffusers", - parallel_config=DiffusionParallelConfig( - use_hsdp=True, - hsdp_replicate_size=1, # Number of replica groups - hsdp_shard_size=8, # Number of GPUs to shard across (or -1 for auto) - ), -) - -outputs = omni.generate( - "A cat playing piano", - OmniDiffusionSamplingParams(num_inference_steps=50), -) -``` - -### Using CFG-Parallel - -Run image-to-image: - -CFG-Parallel splits the CFG positive/negative branches across GPUs. Use it when you set a non-trivial `true_cfg_scale`. - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams -from vllm_omni.diffusion.data import DiffusionParallelConfig -cfg_parallel_size = 2 - -omni = Omni( - model="Qwen/Qwen-Image-Edit", - parallel_config=DiffusionParallelConfig(cfg_parallel_size=cfg_parallel_size) -) - -outputs = omni.generate( - { - "prompt": "turn this cat to a dog", - "multi_modal_data": {"image": input_image} - }, - OmniDiffusionSamplingParams(num_inference_steps=50, true_cfg_scale=4.0), -) -``` - -### Using FP8 Quantization - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -omni = Omni( - model="", - quantization="fp8", -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50), -) -``` - -### Using Int8 Quantization - -```python -from vllm_omni import Omni -from vllm_omni.inputs.data import OmniDiffusionSamplingParams - -omni = Omni( - model="", - quantization="int8", -) - -outputs = omni.generate( - "A cat sitting on a windowsill", - OmniDiffusionSamplingParams(num_inference_steps=50), -) -``` - -## Documentation - -For detailed information on each acceleration method: - -- **[TeaCache Guide](diffusion/teacache.md)** - Complete 
TeaCache documentation, configuration options, and best practices -- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters -- **[Quantization Guide](diffusion/quantization/overview.md)** - Quantization for DiT models with per-layer control -- **[Tensor Parallelism](diffusion/parallelism_acceleration.md#tensor-parallelism)** - Guidance on how to enable TP for diffusion models. -- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration. -- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks. -- **[VAE Patch Parallelism](diffusion/parallelism_acceleration.md#vae-patch-parallelism)** - Guidance on how to reduce VAE memory via patch/tile parallelism. -- **[HSDP](diffusion/parallelism_acceleration.md#hsdp)** - Hybrid Sharded Data Parallel for memory-efficient inference of large models. diff --git a/docs/user_guide/diffusion_features.md b/docs/user_guide/diffusion_features.md new file mode 100644 index 00000000000..7e325c1edc8 --- /dev/null +++ b/docs/user_guide/diffusion_features.md @@ -0,0 +1,190 @@ +# Diffusion Advanced Features + +## Table of Contents + +- [Overview](#overview) +- [Supported Features](#supported-features) +- [Supported Models](#supported-models) +- [Feature Compatibility](#feature-compatibility) +- [Learn More](#learn-more) + +## Overview + +vLLM-Omni supports various advanced features for diffusion models: + +- Acceleration: **cache methods**, **parallelism methods** +- Memory optimization: **cpu offloading**, **quantization** +- Extensions: **LoRA inference** + +## Supported Features + +### Acceleration + +#### Lossy Acceleration + +Cache methods trade minimal quality for significant speedup. 
Quality loss is typically imperceptible with proper tuning. + +| Method | Description | Best For | +|--------|-------------|----------| +| **[TeaCache](diffusion/cache_acceleration/teacache.md)** | Adaptive caching using modulated inputs | Quick setup, balanced quality/speed on a single GPU | +| **[Cache-DiT](diffusion/cache_acceleration/cache_dit.md)** | Multiple caching techniques: DBCache, TaylorSeer, SCM | Fine-grained control, tunable quality-speed tradeoff | + + +#### Lossless Acceleration + +Parallelism methods distribute computation across GPUs without quality loss (mathematically equivalent to single-GPU execution). + +| Method | Description | Best For | +|--------|-------------|----------| +| **[Ulysses-SP](diffusion/parallelism/sequence_parallel.md)** | Sequence parallelism via all-to-all communication | High-resolution images (>1536px) or long videos with 2-8 GPUs | +| **[Ring-Attention](diffusion/parallelism/sequence_parallel.md)** | Sequence parallelism via ring-based communication | Videos, very long sequences, memory-constrained setups, with 2-8 GPUs | +| **[CFG-Parallel](diffusion/parallelism/cfg_parallel.md)** | Splits CFG positive/negative branches across devices | Image editing with CFG (true_cfg_scale > 1) on 2 GPUs | +| **[Tensor Parallelism](diffusion/parallelism/tensor_parallel.md)** | Shards model weights across devices | Large models that do not fit on a single GPU, with 2+ GPUs | +| **[HSDP](diffusion/parallelism/hsdp.md)** | Weight sharding via FSDP2, redistributed on-demand at runtime | Very large models (14B+) on limited VRAM, combinable with SP | +| **[Expert Parallelism](diffusion/parallelism/expert_parallel.md)** | Shards MoE expert MLP blocks across devices | MoE diffusion models (e.g., HunyuanImage3.0) | + +**Note:** Some acceleration methods can be combined for better performance.
See the [Feature Compatibility table](#feature-compatibility) and the [Feature Compatibility tutorial](feature_compatibility.md) for detailed configuration examples. + +### Memory Optimization + +Memory optimization methods reduce GPU memory usage, enabling inference on resource-constrained hardware or with larger models. + +| Method | Description | Best For | +|--------|-------------|----------| +| **[CPU Offload](diffusion/cpu_offload_diffusion.md)** | Offloads model components to CPU memory | Limited VRAM, large models on consumer GPUs | +| **[Quantization](diffusion/quantization/overview.md)** | Lowers DiT layer precision from BF16 to FP8/INT8/etc. | Limited VRAM, minimal accuracy loss | +| **[VAE Patch Parallelism](diffusion/parallelism/vae_patch_parallel.md)** | Distributes VAE decode tiling across GPUs | High-resolution generation with reduced VAE memory peak | + +### Extensions + +Extension methods add specialized capabilities to diffusion models beyond standard inference. + +| Method | Description | Best For | +|--------|-------------|----------| +| **[LoRA Inference](diffusion/lora.md)** | Enables inference with Low-Rank Adaptation (LoRA) adapter weights | Reinforcement learning extensions | + + +### Quantization Methods + +| Method | Configuration | Description | Best For | +|--------|--------------|-------------|----------| +| **[FP8](diffusion/quantization/fp8.md)** | `quantization="fp8"` | FP8 W8A8 on Ada/Hopper, weight-only on older GPUs | Memory reduction, inference speedup | +| **[INT8](diffusion/quantization/int8.md)** | `quantization="int8"` | INT8 weight-only, no calibration or pre-quantized checkpoint needed | Memory reduction, broad GPU compatibility | +| **[GGUF](diffusion/quantization/gguf.md)** | `quantization="gguf"` | Native GGUF transformer-only weights (Q4, Q8, etc.)
| Memory reduction on consumer GPUs | + +## Supported Models + +The following tables show which models support each feature: + +- **🔀SP (Ulysses & Ring)**: Includes both Ulysses-SP and Ring-Attention methods +- ✅ = Fully supported +- ❌ = Not supported + +> Notes: + +> 1. CPU Offload has two methods: Module-wise (default for models with DiT + text encoder) and Layerwise. The tables below show **Layerwise support** only. +> 2. The **💾Quantization** column is collapsed for readability. See [Quantization Overview](diffusion/quantization/overview.md) for per-method (FP8, GGUF, …) and per-model support details. + +### ImageGen + +| Model | ⚡TeaCache | ⚡Cache-DiT | 🔀SP (Ulysses & Ring) | 🔀CFG-Parallel | 🔀Tensor-Parallel | 🔀HSDP | 💾CPU Offload (Layerwise) | 💾VAE-Patch-Parallel | 💾Quantization | +|-------|:----------:|:-----------:|:---------------------:|:--------------:|:-----------------:|:------:|:------------------------:|:--------------------:|:--------------:| +| **Bagel** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **FLUX.1-dev** | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | +| **FLUX.2-klein** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | +| **FLUX.1-Kontext-dev** | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | +| **FLUX.2-dev** | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | +| **GLM-Image** | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **HunyuanImage3** | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | +| **LongCat-Image** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **LongCat-Image-Edit** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **MammothModa2(T2I)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| **Nextstep_1(T2I)** | ❓ | ❓ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **OmniGen2** | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| **Ovis-Image** | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | +| **Qwen-Image** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | +| **Qwen-Image-2512** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | +| **Qwen-Image-Edit** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | +| **Qwen-Image-Edit-2509** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | +| **Qwen-Image-Layered** 
| ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | +| **Stable-Diffusion3.5** | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **Z-Image** | ✅ | ✅ | ✅ | ❓ | ✅ (TP=2 only) | ❌ | ❌ | ✅ | ✅ | + +> Notes: +> 1. Nextstep_1(T2I) does not support cache acceleration methods such as TeaCache or Cache-DiT. +> 2. `Tongyi-MAI/Z-Image-Turbo` is a distilled model with minimal NFEs; CFG-Parallel is not necessary. + +### VideoGen + +| Model | ⚡TeaCache | ⚡Cache-DiT | 🔀SP (Ulysses & Ring) | 🔀CFG-Parallel | 🔀Tensor-Parallel | 🔀HSDP | 💾CPU Offload (Layerwise) | 💾VAE-Patch-Parallel | 💾Quantization | +|-------|:----------:|:-----------:|:---------------------:|:--------------:|:-----------------:|:------:|:------------------------:|:--------------------:|:--------------:| +| **Wan2.2** | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | +| **LTX-2** | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | +| **Helios** | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | +| **HunyuanVideo-1.5 T2V I2V** | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| **DreamID-Omni** | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | + +### AudioGen + +| Model | ⚡TeaCache | ⚡Cache-DiT | 🔀SP (Ulysses & Ring) | 🔀CFG-Parallel | 🔀Tensor-Parallel | 🔀HSDP | 💾CPU Offload (Layerwise) | 💾VAE-Patch-Parallel | 💾Quantization | +|-------|:----------:|:-----------:|:---------------------:|:--------------:|:-----------------:|:------:|:------------------------:|:--------------------:|:--------------:| +| **Stable-Audio-Open** | ❌ | ❌ | ❓ | ❓ | ❌ | ❌ | ❌ | ❌ | ✅ | + + +## Feature Compatibility + +**Legend:** + +- ✅: Functionality is supported +- ❌: No support plan +- ❓: Not verified yet and Not Recommended + +| | ⚡TeaCache | ⚡Cache-DiT | 🔀Ulysses-SP | 🔀Ring-Attn | 🔀CFG-Parallel | 🔀Tensor Parallel | 🔀HSDP | 🔀Expert Parallel | 💾CPU Offloading (Layerwise) | 💾CPU Offloading (Module-wise) | 💾VAE Patch Parallel | 💾FP8 Quant | 🔧LoRA Inference | +|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| +| **⚡TeaCache** | | | | | | | | | | | | | | +| **⚡Cache-DiT** | ❌ | | | | | | | | | | 
| | | +| **🔀Ulysses-SP** | ✅ | ✅ | | | | | | | | | | | | +| **🔀Ring-Attn** | ✅ | ✅ | ✅ | | | | | | | | | | | +| **🔀CFG-Parallel** | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | +| **🔀Tensor Parallel** | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | +| **🔀HSDP** | ❓ | ❓ | ❓ | ❓ | ❓ | ❌ | | | | | | | | +| **🔀Expert Parallel** | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | | | | | | | +| **💾CPU Offloading (Layerwise)** | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | +| **💾CPU Offloading (Module-wise)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | ❓ | ❌ | | | | | +| **💾VAE Patch Parallel** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | | | | +| **💾FP8 Quant** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ✅ | ✅ | | | +| **🔧LoRA Inference** | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | | + +!!! info + + 1. Tensor Parallel and HSDP are not compatible. + 2. TeaCache and Cache-DiT are not compatible. + 3. CPU Offloading (Layerwise) and CPU Offloading (Module-wise) are not compatible. + 4. CPU Offloading (Layerwise) currently supports single-GPU setups only. + 5. FP8 Quant is shown as a representative example of the quantization methods.
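
The mutually exclusive pairs called out in the info box can be checked before launching a job. The following is a minimal sketch, not part of vLLM-Omni: the feature labels and the `find_conflicts` helper are hypothetical names chosen for illustration, and only the three pairs listed in the info box are encoded.

```python
# Hypothetical labels for illustration only; these are not vLLM-Omni identifiers.
# Only the pairs explicitly listed as incompatible in the info box are encoded.
INCOMPATIBLE_PAIRS = {
    frozenset({"tensor_parallel", "hsdp"}),
    frozenset({"teacache", "cache_dit"}),
    frozenset({"cpu_offload_layerwise", "cpu_offload_modulewise"}),
}


def find_conflicts(enabled):
    """Return the sorted list of incompatible pairs present in `enabled`."""
    enabled = set(enabled)
    return sorted(
        tuple(sorted(pair)) for pair in INCOMPATIBLE_PAIRS if pair <= enabled
    )
```

Calling `find_conflicts({"teacache", "ulysses_sp"})` yields an empty list, while enabling both cache backends reports the conflicting pair. Combinations marked ❓ in the matrix are deliberately left out of this sketch, since they are unverified rather than known to fail.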
+ + +## Learn More + +**Cache Acceleration:** + +- **[TeaCache Configuration Guide](diffusion/cache_acceleration/teacache.md)** - Parameter tuning, performance tips, troubleshooting +- **[Cache-DiT Advanced Guide](diffusion/cache_acceleration/cache_dit.md)** - DBCache, TaylorSeer, SCM techniques and optimization + +**Parallelism Methods:** + +- **[Parallelism Overview](diffusion/parallelism/overview.md)** - Tensor Parallelism, Sequence Parallelism, CFG Parallelism, HSDP, and Expert Parallelism + +**Memory Optimization:** + +- **[CPU Offload Guide](diffusion/cpu_offload_diffusion.md)** - Offload model components to CPU, reduce GPU memory usage +- **[VAE Patch Parallelism Guide](diffusion/parallelism/vae_patch_parallel.md)** - Distribute VAE decode tiling across GPUs for high-resolution images +- **[Quantization Overview](diffusion/quantization/overview.md)** - Overview of quantization methods for diffusion models + +**Extensions:** + +- **[LoRA Inference Guide](diffusion/lora.md)** - Low-Rank Adaptation for style customization and fine-tuning + +**Advanced Topics:** + +- **[Feature Compatibility](feature_compatibility.md)** - How to combine multiple features for maximum performance diff --git a/docs/user_guide/feature_compatibility.md b/docs/user_guide/feature_compatibility.md new file mode 100644 index 00000000000..ea05fe81ec1 --- /dev/null +++ b/docs/user_guide/feature_compatibility.md @@ -0,0 +1,218 @@ +# Feature Compatibility + +This guide explains the compatibility matrix of different diffusion features in vLLM-Omni. You can use cache methods together with parallelism methods and other features to achieve optimal speed and efficiency. 
+ +## Overview + +vLLM-Omni supports combining: + +- **Cache methods** (TeaCache, Cache-DiT) with **Parallelism methods** (Ulysses-SP, Ring-Attention, CFG-Parallel, Tensor Parallelism) +- **Multiple parallelism methods** together (e.g., Ulysses-SP + Ring-Attention, CFG-Parallel + Sequence Parallelism) +- **LoRA adapters** with most acceleration features +- **CPU offloading** with other memory optimization features + +See the [feature compatibility matrix](diffusion_features.md#feature-compatibility) for the full list of supported combinations. + +## Common Combinations + +### 1. Cache + Sequence Parallelism (Recommended) + +Best for: **Large images (>1536px) or videos** + +Combines cache acceleration with sequence parallelism for maximum speedup on workloads that are too large or too slow for a single device.
+ +**CFG-Parallel + Ulysses-SP:** + +```bash +python examples/offline_inference/image_to_image/image_edit.py \ + --model Qwen/Qwen-Image-Edit \ + --prompt "transform into autumn scene" \ + --negative-prompt "low quality" \ + --image input.png \ + --cache-backend cache_dit \ + --cfg-parallel-size 2 \ + --ulysses-degree 2 \ + --cfg-scale 4.0 +``` + +### 4. Hybrid Ulysses + Ring + Vae tiling + +Best for: **Very large images or videos on multiple devices** + +Combines Ulysses-SP (all-to-all) with Ring-Attention (ring P2P) for scalable parallelism. + +```bash +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Qwen/Qwen-Image \ + --prompt "Epic fantasy landscape" \ + --cache-backend cache_dit \ + --ulysses-degree 2 \ + --ring-degree 2 \ + --num-inference-steps 50 \ + --width 2048 \ + --height 2048 \ + --vae-use-tiling +``` + +### 5. Cache + Tensor Parallelism + +Best for: **Large models that don't fit in single GPU memory** + +Reduces per-GPU memory usage while maintaining cache acceleration. 
+ +```bash +python examples/offline_inference/text_to_image/text_to_image.py \ + --model Tongyi-MAI/Z-Image-Turbo \ + --prompt "A cat reading a book" \ + --cache-backend tea_cache \ + --tensor-parallel-size 2 \ + --num-inference-steps 9 +``` + +## Online Serving + +### Cache + Sequence Parallelism + +```bash +# TeaCache + Ulysses-SP +vllm serve Qwen/Qwen-Image --omni --port 8091 \ + --cache-backend tea_cache \ + --cache-config '{"rel_l1_thresh": 0.2}' \ + --usp 2 + +# Cache-DiT + Ring-Attention +vllm serve Qwen/Qwen-Image --omni --port 8091 \ + --cache-backend cache_dit \ + --cache-config '{"Fn_compute_blocks": 1, "max_warmup_steps": 8}' \ + --ring 2 +``` + +### Cache + CFG-Parallel + +```bash +vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 \ + --cache-backend cache_dit \ + --cfg-parallel-size 2 +``` + +### Multiple Parallelism Methods + +```bash +# CFG-Parallel + Ulysses-SP (4 GPUs total) +vllm serve Qwen/Qwen-Image-Edit --omni --port 8091 \ + --cache-backend cache_dit \ + --cfg-parallel-size 2 \ + --usp 2 + +# Hybrid Ulysses + Ring (4 GPUs total) +vllm serve Qwen/Qwen-Image --omni --port 8091 \ + --cache-backend cache_dit \ + --usp 2 \ + --ring 2 +``` + + +## Limitations + +### Incompatibilities + +- **TeaCache + Cache-DiT**: These two cache methods cannot be used together. Only one cache backend can be active at a time. Attempting to enable both will result in an error. + +### Partial Support + +- **Tensor Parallelism — Text Encoder Not Sharded**: TP currently shards only the DiT blocks. Each TP rank retains a **full copy of the text encoder weights**, leading to significant GPU memory overhead proportional to the TP degree. Tracked in [Issue #771](https://github.com/vllm-project/vllm-omni/issues/771). + +- **CPU Offloading — Two Modes Are Mutually Exclusive**: Model-level offload (`enable_cpu_offload`) and layerwise offload (`enable_layerwise_offload`) cannot be used simultaneously.
If both are set, layerwise offload takes priority and model-level offload is silently ignored. + +- **CPU Offloading — VAE stays on GPU**: Both offloading strategies keep the VAE on GPU at all times. For high-resolution generation, VAE decode can still cause OOM. Mitigate by combining with `vae_use_tiling=True` or VAE Patch Parallelism. + +- **VAE Patch Parallelism — DistributedVaeExecutor Required**: VAE Patch Parallelism is only enabled for models that have `DistributedVaeExecutor`. Unsupported models silently ignore `vae_patch_parallel_size` and fall back to sequential VAE tiling. + +### Configuration Constraints + +- **GPU Count Must Match Parallel Degrees**: Total GPU count must satisfy: + ``` + total_gpus = ulysses_degree × ring_degree × cfg_parallel_size × tensor_parallel_size + ``` + Any mismatch will cause a configuration error at startup. + +- **VAE Patch Parallel Size ≤ DiT Process Group Size**: `vae_patch_parallel_size` reuses the DiT process group and cannot exceed it. Larger values are automatically clamped with a warning. + +- **Model-Specific TP Constraints**: Some models impose divisibility constraints on TP size. For example, Z-Image Turbo (`num_heads=30`) only supports `tensor_parallel_size=2`. Check [Supported Models](diffusion_features.md#supported-models) for per-model constraints. + +## Troubleshooting + +### Performance Not Scaling + +**Symptoms:** Adding more GPUs doesn't improve speed proportionally + +**Solutions:** +1. Check the GPU interconnect topology (use `nvidia-smi topo -m`) +2. Reduce parallelism degree if communication overhead is high +3. For very long sequences, prefer Ring-Attention over Ulysses-SP +4. Ensure batch size is large enough to saturate GPUs + +### Out of Memory with Parallelism + +**Symptoms:** OOM errors when combining methods + +**Solutions:** +1. Enable Tensor Parallelism to shard weights +2. Reduce resolution or batch size +3.
Combine with memory-efficient methods, such as CPU offloading + +### Configuration Errors + +**Symptoms:** Errors about invalid parallel configuration + +**Solutions:** +1. Verify that the total GPU count matches `ulysses × ring × cfg × tp` +2. Check that the model supports all enabled methods +3. Ensure divisibility constraints are met (e.g., Z-Image TP=1 or 2 only) + +## See Also + +- [Diffusion Features Overview](diffusion_features.md) - Overview of all diffusion features and supported models
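
The GPU-count rule from Configuration Constraints can be verified before starting a job. The sketch below is a standalone illustration under the assumption that only the product rule stated in this guide applies; `required_gpus` and `check_gpu_count` are hypothetical helper names and are not part of vLLM-Omni.

```python
from math import prod


def required_gpus(ulysses_degree=1, ring_degree=1,
                  cfg_parallel_size=1, tensor_parallel_size=1):
    """Return the GPU count implied by the four parallel degrees."""
    degrees = (ulysses_degree, ring_degree, cfg_parallel_size, tensor_parallel_size)
    if any(d < 1 for d in degrees):
        raise ValueError(f"parallel degrees must be >= 1, got {degrees}")
    return prod(degrees)


def check_gpu_count(available_gpus, **degrees):
    """Raise ValueError unless available_gpus equals the product of the degrees."""
    needed = required_gpus(**degrees)
    if needed != available_gpus:
        raise ValueError(
            f"configuration needs {needed} GPUs but {available_gpus} are available"
        )
    return needed
```

For instance, `check_gpu_count(4, cfg_parallel_size=2, ulysses_degree=2)` passes, which matches the "CFG-Parallel + Ulysses-SP (4 GPUs total)" serving example, whereas the same degrees with 8 available GPUs raise `ValueError`.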