Merged
44 commits
0156891
rename
wtomin Mar 16, 2026
5e1b891
add cfg parallel doc
wtomin Mar 16, 2026
f14b0d8
vae patch parallel
wtomin Mar 16, 2026
0e9e7f4
tensor parallel
wtomin Mar 16, 2026
52afc87
sequence parallel
wtomin Mar 16, 2026
4e3871a
moving to directrory
wtomin Mar 16, 2026
2e11c58
hsdp
wtomin Mar 16, 2026
4d5efa8
ep doc
wtomin Mar 16, 2026
00e8fed
mv to directory
wtomin Mar 16, 2026
ffa6670
rename
wtomin Mar 16, 2026
3a13dea
cache-dit v1
wtomin Mar 16, 2026
3fc7c6d
cache-dit v2
wtomin Mar 16, 2026
89eb5ee
teacache v1
wtomin Mar 16, 2026
ef86f03
teacache v2
wtomin Mar 16, 2026
aa072f1
feature compat
wtomin Mar 16, 2026
aa2afb4
path correction
wtomin Mar 16, 2026
abfd99d
update feature compatibility
wtomin Mar 16, 2026
6e6873f
updates
wtomin Mar 16, 2026
55604ab
path correction
wtomin Mar 16, 2026
f4951f6
path correction
wtomin Mar 16, 2026
3273c84
design doc add
wtomin Mar 16, 2026
43a3a07
updates
wtomin Mar 17, 2026
5738793
update cache-dit
wtomin Mar 17, 2026
46eebc9
update vae patch paralell
wtomin Mar 17, 2026
48ce2d0
update cli arg
wtomin Mar 17, 2026
95b4325
update doc
wtomin Mar 17, 2026
daf8665
update nav yaml
wtomin Mar 17, 2026
f42b208
add wan2.2
wtomin Mar 20, 2026
f4646fe
add EP
wtomin Mar 23, 2026
588e896
update table
wtomin Mar 23, 2026
895072c
replace symbol
wtomin Mar 23, 2026
f6d76f5
udpate main
wtomin Mar 26, 2026
d859ab1
udpates
wtomin Mar 26, 2026
15a6291
updates
wtomin Mar 26, 2026
b59fd8f
updates
wtomin Mar 26, 2026
87f56f8
fix warning
wtomin Mar 27, 2026
2370a83
readability
wtomin Mar 27, 2026
dcbbb24
update docs
wtomin Mar 27, 2026
93bb14b
udpates
wtomin Mar 27, 2026
ac9244f
Apply suggestion from @wtomin
wtomin Mar 27, 2026
675a6de
update layerwise compatibility
wtomin Mar 27, 2026
32321d4
fix module-wise offload
wtomin Mar 30, 2026
68c58a0
fix layerwise offload
wtomin Mar 30, 2026
40afe73
remove vae allowlist
wtomin Mar 30, 2026
22 changes: 16 additions & 6 deletions docs/.nav.yml
@@ -54,20 +54,28 @@ nav:
- Features:
- Sleep Mode: features/sleep_mode.md
- Diffusion Features:
- Acceleration Overview: user_guide/diffusion_acceleration.md
- TeaCache: user_guide/diffusion/teacache.md
- Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
- Overview: user_guide/diffusion_features.md
- Feature Compatibility: user_guide/feature_compatibility.md
- Cache Acceleration:
- TeaCache: user_guide/diffusion/cache_acceleration/teacache.md
- Cache-DiT: user_guide/diffusion/cache_acceleration/cache_dit.md
- Quantization:
- Overview: user_guide/diffusion/quantization/overview.md
- FP8: user_guide/diffusion/quantization/fp8.md
- Int8: user_guide/diffusion/quantization/int8.md
- GGUF: user_guide/diffusion/quantization/gguf.md
- Step Execution: user_guide/diffusion/step_execution.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- Parallelism:
- Overview: user_guide/diffusion/parallelism/overview.md
- CFG Parallel: user_guide/diffusion/parallelism/cfg_parallel.md
- Expert Parallel: user_guide/diffusion/parallelism/expert_parallel.md
- Hybrid Sharded Data Parallel: user_guide/diffusion/parallelism/hsdp.md
- Sequence Parallel: user_guide/diffusion/parallelism/sequence_parallel.md
- Tensor Parallel: user_guide/diffusion/parallelism/tensor_parallel.md
- VAE Patch Parallel: user_guide/diffusion/parallelism/vae_patch_parallel.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- LoRA: user_guide/diffusion/lora.md
- Hybrid Sharded Data Parallel: design/feature/hsdp.md
- Custom Pipeline: features/custom_pipeline.md
- Step Execution: user_guide/diffusion/step_execution.md
- ComfyUI: features/comfyui.md
- Developer Guide:
- General:
@@ -92,6 +100,8 @@ nav:
- design/feature/cfg_parallel.md
- design/feature/sequence_parallel.md
- design/feature/tensor_parallel.md
- design/feature/vae_parallel.md
- design/feature/hsdp.md
- design/feature/cache_dit.md
- design/feature/teacache.md
- design/feature/async_chunk_design.md
5 changes: 1 addition & 4 deletions docs/configuration/README.md
@@ -20,7 +20,4 @@ For introduction, please check [Introduction for stage config](./stage_configs.m

## Optimization Features

-- **[TeaCache Configuration](../user_guide/diffusion/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
-- **[Cache-DiT Configuration](../user_guide/diffusion/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models
-- **[Parallelism Configuration](../user_guide/diffusion/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for for DiT models
-- **[CPU Offloading](../user_guide/diffusion/cpu_offload_diffusion.md)** - Enable CPU offloading (model-level and layerwise) for for DiT models
+- **[Diffusion Features Overview](../user_guide/diffusion_features.md)** - Complete overview of all diffusion model features and supported models
2 changes: 1 addition & 1 deletion docs/design/feature/cache_dit.md
@@ -256,7 +256,7 @@ cache_config={
}
```

-Check the [user guide for cache_dit](../../user_guide/diffusion/cache_dit_acceleration.md) for more adjustable parameters.
+Check the [user guide for cache_dit](../../user_guide/diffusion/cache_acceleration/cache_dit.md) for more adjustable parameters.

---

2 changes: 1 addition & 1 deletion docs/design/feature/teacache.md
@@ -369,7 +369,7 @@ images = omni.generate(
2. **Compare performance** - Measure speedup vs baseline (expect 1.5x-2.0x)
3. **Verify output quality** - Visually compare cached vs uncached outputs (should be nearly identical)

-See more detailed examples in [user guide for teacache](../../user_guide/diffusion/teacache.md).
+See more detailed examples in [user guide for teacache](../../user_guide/diffusion/cache_acceleration/teacache.md).

---

285 changes: 285 additions & 0 deletions docs/user_guide/diffusion/cache_acceleration/cache_dit.md
@@ -0,0 +1,285 @@
# Cache-DiT Guide


## Table of Contents

- [Overview](#overview)
- [Quick Start](#quick-start)
- [Example Script](#example-script)
- [Acceleration Methods](#acceleration-methods)
- [Configuration Parameters](#configuration-parameters)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [Summary](#summary)
- [Additional Resources](#additional-resources)

---

## Overview

Cache-DiT accelerates diffusion transformer models through intelligent caching mechanisms, providing significant speedup with minimal quality loss. It supports multiple acceleration techniques that can be combined for optimal performance:

- **DBCache**: Dual Block Cache for reducing redundant computations
- **TaylorSeer**: Taylor expansion-based forecasting for faster inference
- **SCM**: Step Computation Masking for selective step computation

See the list of supported models in [Supported Models](../../diffusion_features.md#supported-models).

---

## Quick Start

### Basic Usage

Enable cache-dit acceleration by simply setting `cache_backend="cache_dit"`:

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",  # Enable Cache-DiT with defaults
)

outputs = omni.generate(
    "a beautiful landscape",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

**Note**: When `cache_config` is not provided, Cache-DiT uses optimized default values. See the [Configuration Parameters](#configuration-parameters) section for details.

### Custom Configuration

To customize cache-dit settings, provide a `cache_config` dictionary, for example:

```python
from vllm_omni import Omni

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        "Fn_compute_blocks": 1,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.12,
    },
)
```

---

## Example Script

### Offline Inference

Use the example script under `examples/offline_inference/text_to_image`:

```bash
cd examples/offline_inference/text_to_image
python text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --cache-backend cache_dit \
    --num-inference-steps 50
```

See [text_to_image.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/text_to_image/text_to_image.py) for detailed configuration options.

The script uses cache-dit acceleration with a hybrid configuration combining DBCache, SCM, and TaylorSeer:

```python
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="cache_dit",
    cache_config={
        # Scheme: Hybrid DBCache + SCM + TaylorSeer
        "Fn_compute_blocks": 1,  # Optimized for single-transformer models
        "Bn_compute_blocks": 0,  # Number of backward compute blocks
        "max_warmup_steps": 4,  # Maximum warmup steps (works for few-step models)
        "residual_diff_threshold": 0.24,  # Higher threshold for more aggressive caching
        "max_continuous_cached_steps": 3,  # Limit to prevent precision degradation
        # TaylorSeer parameters [cache-dit only]
        "enable_taylorseer": False,  # Disabled by default (not suitable for few-step models)
        "taylorseer_order": 1,  # TaylorSeer polynomial order
        # SCM (Step Computation Masking) parameters [cache-dit only]
        "scm_steps_mask_policy": None,  # SCM mask policy: None (disabled), "slow", "medium", "fast", "ultra"
        "scm_steps_policy": "dynamic",  # SCM steps policy: "dynamic" or "static"
    },
)
```

You can customize the configuration by modifying the `cache_config` dictionary to use only specific methods (e.g., DBCache only, DBCache + SCM, etc.) based on your quality and speed requirements.

For image-to-image tasks, use the example script under `examples/offline_inference/image_to_image`:

```bash
cd examples/offline_inference/image_to_image
python image_edit.py \
    --model Qwen/Qwen-Image-Edit \
    --prompt "make the sky more colorful" \
    --image path/to/input/image.jpg \
    --cache-backend cache_dit \
    --num-inference-steps 50 \
    --cache-dit-max-continuous-cached-steps 3 \
    --cache-dit-residual-diff-threshold 0.24 \
    --cache-dit-enable-taylorseer
```

See [image_edit.py](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py) for detailed configuration options.

### Online Serving

```bash
# Default configuration (recommended)
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend cache_dit

# Custom configuration
vllm serve Qwen/Qwen-Image --omni --port 8091 \
    --cache-backend cache_dit \
    --cache-config '{"Fn_compute_blocks": 1, "residual_diff_threshold": 0.12}'
```
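
When the custom configuration is assembled programmatically, the JSON passed to `--cache-config` can be generated instead of written by hand. The following is a minimal sketch that uses only the Python standard library. The helper name `build_serve_command` is hypothetical; the flags are the ones shown in the commands above.

```python
import json
import shlex


def build_serve_command(model: str, port: int, cache_config: dict) -> str:
    """Return a shell-safe `vllm serve` command line (hypothetical helper)."""
    args = [
        "vllm", "serve", model, "--omni",
        "--port", str(port),
        "--cache-backend", "cache_dit",
        # json.dumps emits valid JSON (double quotes, lowercase true/false/null).
        "--cache-config", json.dumps(cache_config),
    ]
    # shlex.join quotes each argument so the JSON survives shell parsing.
    return shlex.join(args)


command = build_serve_command(
    "Qwen/Qwen-Image",
    8091,
    {"Fn_compute_blocks": 1, "residual_diff_threshold": 0.12},
)
print(command)
```

Generating the JSON this way avoids common hand-editing mistakes such as single-quoted keys or Python-style `None` and `False` literals, which are not valid JSON.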

---

## Acceleration Methods

For a comprehensive explanation of each method, see the Cache-DiT [User Guide](https://cache-dit.readthedocs.io/en/latest/user_guide/OVERVIEWS/).

### 1. DBCache (Dual Block Cache)

DBCache intelligently caches intermediate transformer block outputs when the residual differences between consecutive steps are small, reducing redundant computations without sacrificing quality.

**Example Configuration**:

```python
cache_config={
    "Fn_compute_blocks": 8,  # Use first 8 blocks for difference computation
    "Bn_compute_blocks": 0,  # No additional fusion blocks
    "max_warmup_steps": 8,  # Cache after 8 warmup steps
    "residual_diff_threshold": 0.12,  # Below the 0.24 default: less aggressive caching, higher quality
    "max_cached_steps": -1,  # No limit on cached steps
}
```

**Performance Tips**:

- The default `Fn_compute_blocks=1` works well for most cases. Some models (e.g., [FLUX.2-klein](https://github.com/wtomin/vllm-omni/blob/main/vllm_omni/diffusion/cache/cache_dit_backend.py#L363)) use a larger `Fn_compute_blocks` value for a better balance of speed and quality.
- Raise `residual_diff_threshold` above the default 0.24 for faster inference with a slight quality trade-off, or lower it (e.g., 0.12-0.15) for higher quality.
- The default `max_warmup_steps=4` is optimized for few-step models. Increase it to 6-8 for models that run more inference steps.
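
To show how `max_warmup_steps`, `residual_diff_threshold`, and `max_continuous_cached_steps` interact, here is a simplified sketch of a per-step caching decision. It is not Cache-DiT's implementation: the function names, the relative L1 metric, and the toy residuals are assumptions made for illustration only.

```python
from typing import List, Sequence


def relative_l1_diff(current: Sequence[float], previous: Sequence[float]) -> float:
    """Sum of absolute changes divided by the previous residual's magnitude."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1e-8  # guard against division by zero
    return num / den


def plan_steps(
    residuals: Sequence[Sequence[float]],
    threshold: float = 0.24,
    max_warmup_steps: int = 4,
    max_continuous_cached_steps: int = 3,
) -> List[str]:
    """Label each step "compute" or "cache" (illustrative policy only)."""
    plan: List[str] = []
    cached_run = 0      # number of consecutive cached steps so far
    reference = None    # residual from the most recent fully computed step
    for step, residual in enumerate(residuals):
        can_cache = (
            step >= max_warmup_steps
            and reference is not None
            and cached_run < max_continuous_cached_steps
            and relative_l1_diff(residual, reference) < threshold
        )
        if can_cache:
            plan.append("cache")
            cached_run += 1
        else:
            plan.append("compute")
            cached_run = 0
            reference = residual  # refresh the reference on every full compute
    return plan


# Toy one-element residuals: large changes early, small changes later.
toy = [[1.0], [2.0], [3.0], [4.0], [4.1], [4.2], [4.3], [4.4], [6.0]]
print(plan_steps(toy))
```

In this toy run the first four steps are always computed (warmup), the next three are cached because the residual barely changes, and the eighth is forced to compute by the consecutive-cache limit.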

### 2. TaylorSeer

TaylorSeer uses Taylor expansion to forecast future hidden states, allowing the model to skip some computation steps while maintaining quality.

**Example Configuration**:

```python
cache_config={
    "enable_taylorseer": True,
    "taylorseer_order": 1,  # First-order Taylor expansion
}
```

**Performance Tips**:

- TaylorSeer is **not suitable for few-step distilled models**.
- Use `taylorseer_order=1` for most cases (good balance of speed and quality).
- Combine with DBCache for maximum acceleration.
- Higher orders (2-3) may improve quality but reduce speed gains.
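
As a rough illustration of the forecasting idea (not Cache-DiT's implementation), a first-order forecast estimates the derivative from the two most recent fully computed steps and extrapolates linearly. The function name and the scalar toy values below are assumptions; real hidden states are tensors.

```python
def taylor_forecast_order1(h_prev, h_curr, step_prev, step_curr, step_target):
    """First-order Taylor forecast: h(t) ~ h(t_curr) + h'(t_curr) * (t - t_curr)."""
    # Finite-difference estimate of the derivative from two computed steps.
    slope = (h_curr - h_prev) / (step_curr - step_prev)
    return h_curr + slope * (step_target - step_curr)


# Hidden state computed at steps 10 and 12; forecast the skipped step 13.
forecast = taylor_forecast_order1(
    h_prev=2.0, h_curr=3.0, step_prev=10, step_curr=12, step_target=13
)
print(forecast)
```

Higher orders add further finite-difference terms, which requires keeping more computed steps in memory.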

### 3. SCM (Step Computation Masking)

SCM lets you specify which steps must be computed and which can reuse cached results, similar to LeMiCa/EasyCache-style acceleration.

`scm_steps_mask_policy` options (number of compute steps out of 28):

| Policy | Compute Steps | Speed | Quality |
|--------|--------------|-------|---------|
| `None` (default) | All | Baseline | Best |
| `"slow"` | 18 / 28 | Moderate | High |
| `"medium"` | 15 / 28 | Balanced | Good |
| `"fast"` | 11 / 28 | Fast | Moderate |
| `"ultra"` | 8 / 28 | Fastest | Lower |

**Example Configuration**:

```python
cache_config={
    "scm_steps_mask_policy": "medium",  # Balanced speed/quality
    "scm_steps_policy": "dynamic",  # Use dynamic cache
}
```

**Performance Tips**:

- SCM is disabled by default. Enable it by setting a policy value if you need additional acceleration.
- Start with `"medium"` policy and adjust based on quality requirements.
- Use `"fast"` or `"ultra"` for maximum speed when quality can be slightly compromised.
- `"dynamic"` policy generally provides better quality than `"static"`.
- SCM mask is automatically regenerated when `num_inference_steps` changes during inference.
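
The policy table implies an upper bound on the speedup from SCM alone: if cached steps cost nothing, a 28-step run would speed up by 28 divided by the number of compute steps. The sketch below works through that arithmetic under this idealized assumption; measured speedups will be lower because cached steps still do some work.

```python
TOTAL_STEPS = 28
# Compute steps per policy, taken from the policy table in this section.
COMPUTE_STEPS = {"slow": 18, "medium": 15, "fast": 11, "ultra": 8}


def ideal_speedup(policy: str) -> float:
    """Upper bound on speedup, assuming cached steps cost nothing."""
    return TOTAL_STEPS / COMPUTE_STEPS[policy]


for policy in COMPUTE_STEPS:
    print(f"{policy}: at most {ideal_speedup(policy):.2f}x")
```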

---

## Configuration Parameters

The `cache_config` dictionary passed to the `Omni` constructor accepts the arguments of `DBCacheConfig` ([Cache-DiT API Reference](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)). Key parameters are listed below:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `Fn_compute_blocks` | int | 1 | First n blocks for difference computation (optimized for single-transformer models) |
| `Bn_compute_blocks` | int | 0 | Last n blocks for fusion |
| `max_warmup_steps` | int | 4 | Steps before caching starts (optimized for few-step distilled models) |
| `max_cached_steps` | int | -1 | Max cached steps (-1 = unlimited) |
| `max_continuous_cached_steps` | int | 3 | Max consecutive cached steps (prevents precision degradation) |
| `residual_diff_threshold` | float | 0.24 | Residual difference threshold (higher for more aggressive caching) |
| `num_inference_steps` | int \| None | None | Initial inference steps for SCM mask generation (optional, auto-refreshed during inference) |
| `enable_taylorseer` | bool | False | Enable TaylorSeer acceleration (not suitable for few-step distilled models) |
| `taylorseer_order` | int | 1 | Taylor expansion order |
| `scm_steps_mask_policy` | str \| None | None | SCM mask policy (None, "slow", "medium", "fast", "ultra") |
| `scm_steps_policy` | str | "dynamic" | SCM computation policy ("dynamic" or "static") |
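
For reference, the defaults in the table can be written out as a dictionary and merged with user overrides. This is only a sketch, assuming that keys left out of `cache_config` fall back to the documented defaults. `DOCUMENTED_DEFAULTS` and `resolve_cache_config` are hypothetical names, not part of vLLM-Omni or Cache-DiT, and the values are copied from the table. Because the table lists only key parameters, a real check would need the full `DBCacheConfig` field list.

```python
# Defaults copied from the parameter table (hypothetical constant name).
DOCUMENTED_DEFAULTS = {
    "Fn_compute_blocks": 1,
    "Bn_compute_blocks": 0,
    "max_warmup_steps": 4,
    "max_cached_steps": -1,
    "max_continuous_cached_steps": 3,
    "residual_diff_threshold": 0.24,
    "num_inference_steps": None,
    "enable_taylorseer": False,
    "taylorseer_order": 1,
    "scm_steps_mask_policy": None,
    "scm_steps_policy": "dynamic",
}


def resolve_cache_config(overrides=None):
    """Merge overrides onto the documented defaults (illustrative helper)."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(DOCUMENTED_DEFAULTS))
    if unknown:
        # Keys missing from the table are most likely typos in this sketch.
        raise ValueError(f"Keys not in the documented table: {unknown}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


config = resolve_cache_config({"residual_diff_threshold": 0.12})
print(config["residual_diff_threshold"], config["max_warmup_steps"])
```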

---

## Best Practices

### When to Use

**Good for:**

- Production deployments requiring fast inference
- Diffusion transformer models (DiT architecture)
- Scenarios where 1.5x-3x speedup is valuable

**Not for:**

- Non-DiT architectures (use model-specific acceleration instead)
- Models already using few-step distillation (< 10 steps)

---

## Troubleshooting

### Common Issue 1: Quality Degradation

**Symptoms**: Generated images have visible artifacts or lower quality

**Solution**:
```python
# Reduce aggressiveness - use more conservative settings
cache_config={
    "residual_diff_threshold": 0.20,  # Lower threshold (below the default 0.24)
    "Fn_compute_blocks": 8,  # Use more blocks for better decisions
    "max_warmup_steps": 6,  # Longer warmup
    "scm_steps_mask_policy": "slow",  # More compute steps
}
```

---

## Summary

Using Cache-DiT acceleration:

1. ✅ **Enable Cache-DiT** - Set `cache_backend="cache_dit"` to get 1.5x-3x speedup with optimized defaults
2. ✅ **(Optional) Customize** - Adjust `cache_config` parameters for specific speed/quality trade-offs