[skip ci][Doc] Refine the Diffusion Features User Guide #1928
Merged
44 commits
0156891
rename
wtomin 5e1b891
add cfg parallel doc
wtomin f14b0d8
vae patch parallel
wtomin 0e9e7f4
tensor parallel
wtomin 52afc87
sequence parallel
wtomin 4e3871a
moving to directrory
wtomin 2e11c58
hsdp
wtomin 4d5efa8
ep doc
wtomin 00e8fed
mv to directory
wtomin ffa6670
rename
wtomin 3a13dea
cache-dit v1
wtomin 3fc7c6d
cache-dit v2
wtomin 89eb5ee
teacache v1
wtomin ef86f03
teacache v2
wtomin aa072f1
feature compat
wtomin aa2afb4
path correction
wtomin abfd99d
update feature compatibility
wtomin 6e6873f
updates
wtomin 55604ab
path correction
wtomin f4951f6
path correction
wtomin 3273c84
design doc add
wtomin 43a3a07
updates
wtomin 5738793
update cache-dit
wtomin 46eebc9
update vae patch paralell
wtomin 48ce2d0
update cli arg
wtomin 95b4325
update doc
wtomin daf8665
update nav yaml
wtomin f42b208
add wan2.2
wtomin f4646fe
add EP
wtomin 588e896
update table
wtomin 895072c
replace symbol
wtomin f6d76f5
udpate main
wtomin d859ab1
udpates
wtomin 15a6291
updates
wtomin b59fd8f
updates
wtomin 87f56f8
fix warning
wtomin 2370a83
readability
wtomin dcbbb24
update docs
wtomin 93bb14b
udpates
wtomin ac9244f
Apply suggestion from @wtomin
wtomin 675a6de
update layerwise compatibility
wtomin 32321d4
fix module-wise offload
wtomin 68c58a0
fix layerwise offload
wtomin 40afe73
remove vae allowlist
wtomin
285 changes: 285 additions & 0 deletions
docs/user_guide/diffusion/cache_acceleration/cache_dit.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,285 @@ | ||
| # Cache-DiT Guide | ||
|
|
||
|
|
||
| ## Table of Contents | ||
|
|
||
| - [Overview](#overview) | ||
| - [Quick Start](#quick-start) | ||
| - [Example Script](#example-script) | ||
| - [Acceleration Methods](#acceleration-methods) | ||
| - [Configuration Parameters](#configuration-parameters) | ||
| - [Best Practices](#best-practices) | ||
| - [Troubleshooting](#troubleshooting) | ||
| - [Summary](#summary) | ||
| - [Additional Resources](#additional-resources) | ||
|
|
||
| --- | ||
|
|
||
| ## Overview | ||
|
|
||
| Cache-DiT accelerates diffusion transformer models through intelligent caching mechanisms, providing significant speedup with minimal quality loss. It supports multiple acceleration techniques that can be combined for optimal performance: | ||
|
|
||
| - **DBCache**: Dual Block Cache for reducing redundant computations | ||
| - **TaylorSeer**: Taylor expansion-based forecasting for faster inference | ||
| - **SCM**: Step Computation Masking for selective step computation | ||
|
|
||
| See the list of supported models in [Supported Models](../../diffusion_features.md#supported-models). | ||
|
|
||
| --- | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Basic Usage | ||
|
|
||
| Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`: | ||
|
|
||
| ```python | ||
| from vllm_omni import Omni | ||
| from vllm_omni.inputs.data import OmniDiffusionSamplingParams | ||
|
|
||
| omni = Omni( | ||
| model="Qwen/Qwen-Image", | ||
| cache_backend="cache_dit", # Enable Cache-DiT with defaults | ||
| ) | ||
|
|
||
| outputs = omni.generate( | ||
| "a beautiful landscape", | ||
| OmniDiffusionSamplingParams(num_inference_steps=50), | ||
| ) | ||
| ``` | ||
|
|
||
| **Note**: When `cache_config` is not provided, Cache-DiT uses optimized default values. See the [Configuration Parameters](#configuration-parameters) section for details. | ||
|
|
||
| ### Custom Configuration | ||
|
|
||
| To customize cache-dit settings, provide a `cache_config` dictionary, for example: | ||
|
|
||
| ```python | ||
| omni = Omni( | ||
| model="Qwen/Qwen-Image", | ||
| cache_backend="cache_dit", | ||
| cache_config={ | ||
| "Fn_compute_blocks": 1, | ||
| "Bn_compute_blocks": 0, | ||
| "max_warmup_steps": 4, | ||
| "residual_diff_threshold": 0.12, | ||
| }, | ||
| ) | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Example Script | ||
|
|
||
| ### Offline Inference | ||
|
|
||
| Use the example script under `examples/offline_inference/text_to_image`: | ||
|
|
||
| ```bash | ||
| cd examples/offline_inference/text_to_image | ||
| python text_to_image.py \ | ||
| --model Qwen/Qwen-Image \ | ||
| --prompt "a cup of coffee on the table" \ | ||
| --cache-backend cache_dit \ | ||
| --num-inference-steps 50 | ||
| ``` | ||
|
|
||
| See [text_to_image.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/text_to_image.py) for detailed configuration options. | ||
|
|
||
| The script exposes a hybrid cache-dit configuration that can combine DBCache, SCM, and TaylorSeer. With the values shown below, only DBCache is active, because TaylorSeer and SCM are both disabled: | ||
|
|
||
| ```python | ||
| omni = Omni( | ||
| model="Qwen/Qwen-Image", | ||
| cache_backend="cache_dit", | ||
| cache_config={ | ||
| # Scheme: Hybrid DBCache + SCM + TaylorSeer | ||
| "Fn_compute_blocks": 1, # Optimized for single-transformer models | ||
| "Bn_compute_blocks": 0, # Number of backward compute blocks | ||
| "max_warmup_steps": 4, # Maximum warmup steps (works for few-step models) | ||
| "residual_diff_threshold": 0.24, # Higher threshold for more aggressive caching | ||
| "max_continuous_cached_steps": 3, # Limit to prevent precision degradation | ||
| # TaylorSeer parameters [cache-dit only] | ||
| "enable_taylorseer": False, # Disabled by default (not suitable for few-step models) | ||
| "taylorseer_order": 1, # TaylorSeer polynomial order | ||
| # SCM (Step Computation Masking) parameters [cache-dit only] | ||
| "scm_steps_mask_policy": None, # SCM mask policy: None (disabled), "slow", "medium", "fast", "ultra" | ||
| "scm_steps_policy": "dynamic", # SCM steps policy: "dynamic" or "static" | ||
| } | ||
| ) | ||
| ``` | ||
|
|
||
| You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements. | ||
|
|
||
| For image-to-image tasks, use the example script under `examples/offline_inference/image_to_image`: | ||
|
|
||
| ```bash | ||
| cd examples/offline_inference/image_to_image | ||
| python image_edit.py \ | ||
| --model Qwen/Qwen-Image-Edit \ | ||
| --prompt "make the sky more colorful" \ | ||
| --image path/to/input/image.jpg \ | ||
| --cache-backend cache_dit \ | ||
| --num-inference-steps 50 \ | ||
| --cache-dit-max-continuous-cached-steps 3 \ | ||
| --cache-dit-residual-diff-threshold 0.24 \ | ||
| --cache-dit-enable-taylorseer | ||
| ``` | ||
|
|
||
| See [image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py) for detailed configuration options. | ||
|
|
||
| ### Online Serving | ||
|
|
||
| ```bash | ||
| # Default configuration (recommended) | ||
| vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit | ||
|
|
||
| # Custom configuration | ||
| vllm serve Qwen/Qwen-Image --omni --port 8091 \ | ||
| --cache-backend cache_dit \ | ||
| --cache-config '{"Fn_compute_blocks": 1, "residual_diff_threshold": 0.12}' | ||
| ``` | ||
|
|
||
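When the server is launched from a script, hand-quoting the JSON passed to `--cache-config` is easy to get wrong. The following is a minimal sketch, assuming only the flags shown in the commands above, that builds the argument with `json.dumps` and prints a shell-safe command line. The variable names are illustrative and are not part of vLLM-Omni.

```python
import json
import shlex

# Same values as the "Custom configuration" command above.
cache_config = {"Fn_compute_blocks": 1, "residual_diff_threshold": 0.12}

# Build the argument list; json.dumps guarantees valid JSON for --cache-config.
command = [
    "vllm", "serve", "Qwen/Qwen-Image", "--omni", "--port", "8091",
    "--cache-backend", "cache_dit",
    "--cache-config", json.dumps(cache_config),
]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

Passing `command` as a list to `subprocess.run` would avoid shell quoting altogether.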
| --- | ||
|
|
||
| ## Acceleration Methods | ||
|
|
||
| For a comprehensive explanation of each method, see the Cache-DiT [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/). | ||
|
|
||
| ### 1. DBCache (Dual Block Cache) | ||
|
|
||
| DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality. | ||
|
|
||
| **Example Configuration**: | ||
|
|
||
| ```python | ||
| cache_config={ | ||
| "Fn_compute_blocks": 8, # Use first 8 blocks for difference computation | ||
| "Bn_compute_blocks": 0, # No additional fusion blocks | ||
| "max_warmup_steps": 8, # Cache after 8 warmup steps | ||
| "residual_diff_threshold": 0.12, # Lower threshold for more conservative caching (higher quality) | ||
| "max_cached_steps": -1, # No limit on cached steps | ||
| } | ||
| ``` | ||
|
|
||
| **Performance Tips**: | ||
|
|
||
| - The default `Fn_compute_blocks=1` works well for most cases. Some models (e.g., [FLUX.2-klein](https://github.com/wtomin/vllm-omni/blob/main/vllm_omni/diffusion/cache/cache_dit_backend.py#L363)) use a larger `Fn_compute_blocks` value for balanced performance. | ||
| - Increase `residual_diff_threshold` above the default 0.24 for faster inference with a slight quality trade-off, or decrease it (e.g., 0.12-0.15) for higher quality. | ||
| - The default `max_warmup_steps=4` is optimized for few-step models. Increase it to 6-8 for models that run more inference steps. | ||
|
|
||
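The interaction between `residual_diff_threshold`, `max_warmup_steps`, and `max_continuous_cached_steps` can be sketched in a few lines. This is a simplified illustration, not Cache-DiT's implementation: the relative L1 metric, the function names, and the use of plain Python lists in place of tensors are all assumptions made for the example.

```python
def relative_l1_diff(current, previous):
    """Summed absolute change, relative to the summed magnitude of `previous`."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1e-12  # guard against division by zero
    return num / den


def plan_steps(residuals, threshold=0.24, max_warmup_steps=4,
               max_continuous_cached_steps=3):
    """Return one "compute" or "cache" decision per step (illustrative only).

    `residuals` stands in for the output of the first Fn blocks at each step.
    """
    decisions = []
    previous = None
    cached_run = 0  # length of the current run of consecutive cached steps
    for step, residual in enumerate(residuals):
        can_cache = (
            step >= max_warmup_steps                      # warmup steps always compute
            and previous is not None
            and cached_run < max_continuous_cached_steps  # cap consecutive reuse
            and relative_l1_diff(residual, previous) < threshold
        )
        if can_cache:
            decisions.append("cache")
            cached_run += 1
        else:
            decisions.append("compute")
            cached_run = 0
        previous = residual
    return decisions


# Identical residuals at every step: caching is limited only by warmup and the run cap.
print(plan_steps([[1.0, 1.0]] * 10))
```

In this sketch a higher threshold lets more steps pass the comparison, which is why raising `residual_diff_threshold` trades quality for speed.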
| ### 2. TaylorSeer | ||
|
|
||
| TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality. | ||
|
|
||
| **Example Configuration**: | ||
|
|
||
| ```python | ||
| cache_config={ | ||
| "enable_taylorseer": True, | ||
| "taylorseer_order": 1, # First-order Taylor expansion | ||
| } | ||
| ``` | ||
|
|
||
| **Performance Tips**: | ||
|
|
||
| - TaylorSeer is **not suitable for few-step distilled models**. | ||
| - Use `taylorseer_order=1` for most cases (good balance of speed and quality). | ||
| - Combine with DBCache for maximum acceleration. | ||
| - Higher orders (2-3) may improve quality but reduce speed gains. | ||
|
|
||
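To make the forecasting idea concrete, here is a minimal sketch of a first-order Taylor (linear) extrapolation from two fully computed steps. It illustrates the general technique only; the function name and the list-based features are assumptions and do not reflect Cache-DiT's internal API.

```python
def first_order_forecast(prev, last, target_step):
    """Linearly extrapolate features to `target_step`.

    `prev` and `last` are (step, values) pairs taken from two fully computed
    steps, with prev's step earlier than last's step.
    """
    (t0, f0), (t1, f1) = prev, last
    # Finite-difference estimate of the first derivative with respect to the step index.
    slope = [(b - a) / (t1 - t0) for a, b in zip(f0, f1)]
    # First-order Taylor expansion around the most recent computed step.
    return [b + s * (target_step - t1) for b, s in zip(f1, slope)]


# Features computed at steps 0 and 2, forecast for the skipped step 3.
print(first_order_forecast((0, [1.0, 2.0]), (2, [2.0, 2.0]), target_step=3))
```

A higher `taylorseer_order` would add higher-order difference terms to the same expansion, which costs more bookkeeping per forecast.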
| ### 3. SCM (Step Computation Masking) | ||
|
|
||
| SCM lets you specify which steps must be computed and which can reuse cached results, similar to LeMiCa/EasyCache-style acceleration. | ||
|
|
||
| `scm_steps_mask_policy` options (number of compute steps out of 28): | ||
|
|
||
| | Policy | Compute Steps | Speed | Quality | | ||
| |--------|--------------|-------|---------| | ||
| | `None` (default) | All | Baseline | Best | | ||
| | `"slow"` | 18 / 28 | Moderate | High | | ||
| | `"medium"` | 15 / 28 | Balanced | Good | | ||
| | `"fast"` | 11 / 28 | Fast | Moderate | | ||
| | `"ultra"` | 8 / 28 | Fastest | Lower | | ||
|
|
||
| **Example Configuration**: | ||
|
|
||
| ```python | ||
| cache_config={ | ||
| "scm_steps_mask_policy": "medium", # Balanced speed/quality | ||
| "scm_steps_policy": "dynamic", # Use dynamic cache | ||
| } | ||
| ``` | ||
|
|
||
| **Performance Tips**: | ||
|
|
||
| - SCM is disabled by default. Enable it by setting a policy value if you need additional acceleration. | ||
| - Start with `"medium"` policy and adjust based on quality requirements. | ||
| - Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised. | ||
| - Setting `scm_steps_policy` to `"dynamic"` generally provides better quality than `"static"`. | ||
| - SCM mask is automatically regenerated when `num_inference_steps` changes during inference. | ||
|
|
||
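A step mask is simply one flag per inference step. The sketch below uses the compute-step counts from the table above and spreads them evenly over the schedule. The even spacing and the helper name are hypothetical, chosen only to show the shape of a mask; the masks Cache-DiT actually generates for each policy may place compute steps differently.

```python
# Compute-step counts per policy, taken from the table above (out of 28 steps).
POLICY_COMPUTE_STEPS = {"slow": 18, "medium": 15, "fast": 11, "ultra": 8}


def even_steps_mask(num_inference_steps, compute_steps):
    """Hypothetical mask: 1 = compute the step, 0 = reuse cached results."""
    if not 1 <= compute_steps <= num_inference_steps:
        raise ValueError("compute_steps must be between 1 and num_inference_steps")
    mask = [0] * num_inference_steps
    for i in range(compute_steps):
        # Evenly spaced indices; index 0 is always computed.
        mask[i * num_inference_steps // compute_steps] = 1
    return mask


mask = even_steps_mask(28, POLICY_COMPUTE_STEPS["medium"])
print(mask)
print(f"{sum(mask)} of {len(mask)} steps are computed")
```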
| --- | ||
|
|
||
| ## Configuration Parameters | ||
|
|
||
| The `cache_config` passed to the `Omni` constructor accepts the arguments of `DBCacheConfig` ([Cache-DiT API Reference](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)). Key parameters are listed below: | ||
|
|
||
| | Parameter | Type | Default | Description | | ||
| |-----------|------|---------|-------------| | ||
| | `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) | | ||
| | `Bn_compute_blocks` | int | 0 | Last n blocks for fusion | | ||
| | `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) | | ||
| | `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) | | ||
| | `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) | | ||
| | `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) | | ||
| | `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) | | ||
| | `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) | | ||
| | `taylorseer_order` | int | 1 | Taylor expansion order | | ||
| | `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") | | ||
| | `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") | | ||
|
|
||
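A small helper can make overrides explicit and catch misspelled keys before they reach the constructor. This is a sketch that assumes the defaults in the table above. `build_cache_config` is a hypothetical helper, not part of vLLM-Omni, and it rejects any key not listed in this guide, which is stricter than `DBCacheConfig` itself.

```python
# Defaults copied from the parameter table above.
DOCUMENTED_DEFAULTS = {
    "Fn_compute_blocks": 1,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "max_cached_steps": -1,
    "max_continuous_cached_steps": 3,
    "residual_diff_threshold": 0.24,
    "num_inference_steps": None,
    "enable_taylorseer": False,
    "taylorseer_order": 1,
    "scm_steps_mask_policy": None,
    "scm_steps_policy": "dynamic",
}


def build_cache_config(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(DOCUMENTED_DEFAULTS))
    if unknown:
        raise KeyError(f"Unknown cache_config keys: {unknown}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


cache_config = build_cache_config(residual_diff_threshold=0.12, max_warmup_steps=8)
print(cache_config)
```

The resulting dictionary can be passed as `cache_config=` to `Omni`. Remove the unknown-key check if you rely on `DBCacheConfig` arguments that this guide does not list.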
| --- | ||
|
|
||
| ## Best Practices | ||
|
|
||
| ### When to Use | ||
|
|
||
| **Good for:** | ||
|
|
||
| - Production deployments requiring fast inference | ||
| - Diffusion transformer models (DiT architecture) | ||
| - Scenarios where 1.5x-3x speedup is valuable | ||
|
|
||
| **Not for:** | ||
|
|
||
| - Non-DiT architectures (use model-specific acceleration instead) | ||
| - Models already using few-step distillation (< 10 steps) | ||
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Common Issue 1: Quality Degradation | ||
|
|
||
| **Symptoms**: Generated images have visible artifacts or lower quality | ||
|
|
||
| **Solution**: | ||
| ```python | ||
| # Reduce aggressiveness - use more conservative settings | ||
| cache_config={ | ||
| "residual_diff_threshold": 0.20, # Lower than the default 0.24 for more conservative caching | ||
| "Fn_compute_blocks": 8, # Use more blocks for better decisions | ||
| "max_warmup_steps": 6, # Longer warmup | ||
| "scm_steps_mask_policy": "slow", # More compute steps | ||
| } | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Summary | ||
|
|
||
| Using Cache-DiT acceleration: | ||
|
|
||
| 1. ✅ **Enable Cache-DiT** - Set `cache_backend="cache_dit"` to get 1.5x-3x speedup with optimized defaults | ||
| 2. ✅ **(Optional) Customize** - Adjust `cache_config` parameters for specific speed/quality trade-offs | ||