65 changes: 64 additions & 1 deletion docs/user_guide/diffusion_features.md
@@ -12,7 +12,7 @@

vLLM-Omni supports various advanced features for diffusion models:

- Acceleration: **cache methods**, **parallelism methods**
- Acceleration: **cache methods**, **parallelism methods**, **startup optimizations**
- Memory optimization: **cpu offloading**, **quantization**
- Extensions: **LoRA inference**
- Execution modes: **step execution**
@@ -44,6 +44,12 @@ Parallelism methods distribute computation across GPUs without quality loss (mat
| **[HSDP](diffusion/parallelism/hsdp.md)** | Weight sharding via FSDP2, redistributed on-demand at runtime | Very large models (14B+) on limited VRAM, combinable with SP |
| **[Expert Parallelism](diffusion/parallelism/expert_parallel.md)** | Shards MoE expert MLP blocks across devices | MoE diffusion models (e.g., HunyuanImage3.0) |

#### Startup Optimization

| Method | Description | Best For |
|--------|-------------|----------|
| **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** | Loads safetensors shards in parallel using a thread pool | All diffusion models; measured 5-6x faster weight loading (see [Benchmarks](#benchmarks)) |

**Note:** Some acceleration methods can be combined together for optimized performance. See [Feature Compatibility Table](#feature-compatibility) and [Feature Compatibility Tutorial](feature_compatibility.md) for detailed configuration examples.

### Memory Optimization
@@ -178,6 +184,59 @@ The following tables show which models support each feature:
6. Step Execution is not compatible with cache backends (TeaCache, Cache-DiT) or LoRA.


## Multi-Thread Weight Loading

Large diffusion models can take several minutes to load weights at startup (e.g., ~3 min for Qwen-Image, ~5 min for Wan2.2 I2V 14B). Multi-thread weight loading speeds up this process by loading safetensors shards in parallel using a thread pool instead of sequentially.

This optimization is **enabled by default** with 4 threads. No configuration is needed for the default behavior.
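
The idea can be illustrated with a minimal sketch. This is not vLLM-Omni's loader: `load_shards_parallel` and `read_raw_shard` are hypothetical names, and the stand-in loader reads raw bytes so the example runs with the standard library only. With real checkpoints, a function such as `safetensors.torch.load_file` could be passed as `load_shard` instead.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def read_raw_shard(path: Path) -> Dict[str, bytes]:
    """Stand-in loader (assumption): returns the file's raw bytes keyed by file name."""
    return {path.name: path.read_bytes()}


def load_shards_parallel(
    shard_paths: Iterable[Path],
    load_shard: Callable[[Path], Dict[str, object]],
    num_threads: int = 4,
) -> Dict[str, object]:
    """Load every shard on a thread pool and merge the results into one dict."""
    paths = sorted(shard_paths)
    merged: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max(1, num_threads)) as pool:
        # pool.map yields results in input order, so the merge is deterministic
        # even though the reads themselves overlap.
        for tensors in pool.map(load_shard, paths):
            merged.update(tensors)
    return merged
```

Threads suit this workload because file reads spend most of their time waiting on I/O rather than holding the interpreter lock, so several shards can be in flight at once.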

### Configuration

| Parameter | CLI Flag | Default | Description |
|-----------|----------|---------|-------------|
| `enable_multithread_weight_load` | `--disable-multithread-weight-load` | `True` (enabled) | Pass the flag to disable multi-thread loading |
| `num_weight_load_threads` | `--num-weight-load-threads` | `4` | Number of threads for parallel weight loading |
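
The table pairs a *disable* flag with an *enabled-by-default* parameter. A minimal `argparse` sketch of how such a mapping can work is shown here; `build_parser` is a hypothetical helper for illustration and is not vLLM-Omni's actual argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser (assumption): mirrors the two options in the table."""
    parser = argparse.ArgumentParser()
    # store_false makes the destination default to True and flips it to False
    # when the flag is passed.
    parser.add_argument(
        "--disable-multithread-weight-load",
        dest="enable_multithread_weight_load",
        action="store_false",
    )
    parser.add_argument("--num-weight-load-threads", type=int, default=4)
    return parser


defaults = build_parser().parse_args([])
disabled = build_parser().parse_args(["--disable-multithread-weight-load"])
print(defaults.enable_multithread_weight_load, defaults.num_weight_load_threads)
print(disabled.enable_multithread_weight_load)
```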

!!! tip
    The default of 4 threads balances speed and disk I/O contention. On fast NVMe storage you may benefit from more threads (e.g., 8). On HDD or network storage, the default of 4 avoids saturating I/O bandwidth.

### Online Serving

```bash
# Default (multi-thread enabled, 4 threads)
vllm serve Qwen/Qwen-Image --omni --port 8091

# Custom thread count
vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --num-weight-load-threads 8

# Disable multi-thread loading
vllm serve Qwen/Qwen-Image --omni --disable-multithread-weight-load
```

### Offline Inference

```python
from vllm_omni import Omni

# Default (multi-thread enabled, 4 threads)
omni = Omni(model="Qwen/Qwen-Image")

# Custom thread count
omni = Omni(
    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    num_weight_load_threads=8,
)
```

### Benchmarks

Measured on NVIDIA H800:

| Model | Before (sequential) | After (multi-thread) | Speedup |
|-------|--------|-------|---------|
| **Qwen/Qwen-Image** (53.7 GiB) | 168s | 27s | **6.2x** |
| **Wan-AI/Wan2.2-I2V-A14B-Diffusers** (64.5 GiB) | 283s | 56s | **5.1x** |
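
As a sanity check on the table, the speedup column and the implied read throughput can be recomputed from the sizes and times. The sketch below assumes the reported times are dominated by reading weights, which the benchmark does not state, so the throughput figures are only a rough lower bound. `summarize` is a hypothetical helper.

```python
# (model, checkpoint size in GiB, seconds before, seconds after), copied from the table.
ROWS = [
    ("Qwen/Qwen-Image", 53.7, 168, 27),
    ("Wan-AI/Wan2.2-I2V-A14B-Diffusers", 64.5, 283, 56),
]


def summarize(size_gib: float, before_s: float, after_s: float) -> dict:
    """Return the speedup and the implied GiB/s before and after."""
    return {
        "speedup": before_s / after_s,
        "gib_per_s_before": size_gib / before_s,
        "gib_per_s_after": size_gib / after_s,
    }


for name, size, before, after in ROWS:
    s = summarize(size, before, after)
    print(
        f"{name}: {s['speedup']:.1f}x, "
        f"{s['gib_per_s_before']:.2f} -> {s['gib_per_s_after']:.2f} GiB/s"
    )
```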

## Learn More

**Cache Acceleration:**
@@ -203,6 +262,10 @@ The following tables show which models support each feature:

- **[Step Execution Guide](diffusion/step_execution.md)** - Per-step denoise execution with mid-request abort support

**Startup Optimization:**

- **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** - Speed up model startup by loading safetensors shards in parallel

**Advanced Topics:**

- **[Feature Compatibility](feature_compatibility.md)** - How to combine multiple features for maximum performance