From a99bbefc0b25c811d52512df2724b83f7a58fd31 Mon Sep 17 00:00:00 2001
From: samithuang <285365963@qq.com>
Date: Thu, 2 Apr 2026 08:33:05 +0000
Subject: [PATCH] [Docs] Add multi-thread weight loading documentation

Document the multi-thread weight loading startup optimization introduced
in PR #1504, including configuration, CLI flags, usage examples, and
benchmark results.

Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>
---
 docs/user_guide/diffusion_features.md | 65 ++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/docs/user_guide/diffusion_features.md b/docs/user_guide/diffusion_features.md
index 9cd407d377a..416c8ac6169 100644
--- a/docs/user_guide/diffusion_features.md
+++ b/docs/user_guide/diffusion_features.md
@@ -12,7 +12,7 @@
 
 vLLM-Omni supports various advanced features for diffusion models:
 
-- Acceleration: **cache methods**, **parallelism methods**
+- Acceleration: **cache methods**, **parallelism methods**, **startup optimizations**
 - Memory optimization: **cpu offloading**, **quantization**
 - Extensions: **LoRA inference**
 - Execution modes: **step execution**
@@ -44,6 +44,12 @@ Parallelism methods distribute computation across GPUs without quality loss (mat
 | **[HSDP](diffusion/parallelism/hsdp.md)** | Weight sharding via FSDP2, redistributed on-demand at runtime | Very large models (14B+) on limited VRAM, combinable with SP |
 | **[Expert Parallelism](diffusion/parallelism/expert_parallel.md)** | Shards MoE expert MLP blocks across devices | MoE diffusion models (e.g., HunyuanImage3.0) |
 
+#### Startup Optimization
+
+| Method | Description | Best For |
+|--------|-------------|----------|
+| **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** | Loads safetensors shards in parallel using a thread pool | All diffusion models; reduces startup from minutes to seconds |
+
 **Note:** Some acceleration methods can be combined together for optimized performance. See [Feature Compatibility Table](#feature-compatibility) and [Feature Compatibility Tutorial](feature_compatibility.md) for detailed configuration examples.
 
 ### Memory Optimization
@@ -178,6 +184,59 @@ The following tables show which models support each feature:
 
 6. Step Execution is not compatible with cache backends (TeaCache, Cache-DiT) or LoRA.
 
+## Multi-Thread Weight Loading
+
+Large diffusion models can take several minutes to load weights at startup (e.g., ~3 min for Qwen-Image, ~5 min for Wan2.2 I2V 14B). Multi-thread weight loading speeds up this process by loading safetensors shards in parallel using a thread pool instead of sequentially.
+
+This optimization is **enabled by default** with 4 threads. No configuration is needed for the default behavior.
+
+### Configuration
+
+| Parameter | CLI Flag | Default | Description |
+|-----------|----------|---------|-------------|
+| `enable_multithread_weight_load` | `--disable-multithread-weight-load` | `True` (enabled) | Pass the flag to disable multi-thread loading |
+| `num_weight_load_threads` | `--num-weight-load-threads` | `4` | Number of threads for parallel weight loading |
+
+!!! tip
+    The default of 4 threads balances speed and disk I/O contention. On fast NVMe storage you may benefit from more threads (e.g., 8). On HDD or network storage, the default of 4 avoids saturating I/O bandwidth.
+
+### Online Serving
+
+```bash
+# Default (multi-thread enabled, 4 threads)
+vllm serve Qwen/Qwen-Image --omni --port 8091
+
+# Custom thread count
+vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --num-weight-load-threads 8
+
+# Disable multi-thread loading
+vllm serve Qwen/Qwen-Image --omni --disable-multithread-weight-load
+```
+
+### Offline Inference
+
+```python
+from vllm_omni import Omni
+
+# Default (multi-thread enabled, 4 threads)
+omni = Omni(model="Qwen/Qwen-Image")
+
+# Custom thread count
+omni = Omni(
+    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+    num_weight_load_threads=8,
+)
+```
+
+### Benchmarks
+
+Measured on NVIDIA H800:
+
+| Model | Before | After | Speedup |
+|-------|--------|-------|---------|
+| **Qwen/Qwen-Image** (53.7 GiB) | 168s | 27s | **6.2x** |
+| **Wan-AI/Wan2.2-I2V-A14B-Diffusers** (64.5 GiB) | 283s | 56s | **5.1x** |
+
 ## Learn More
 
 **Cache Acceleration:**
@@ -203,6 +262,10 @@ The following tables show which models support each feature:
 
 - **[Step Execution Guide](diffusion/step_execution.md)** - Per-step denoise execution with mid-request abort support
 
+**Startup Optimization:**
+
+- **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** - Speed up model startup by loading safetensors shards in parallel
+
 **Advanced Topics:**
 
 - **[Feature Compatibility](feature_compatibility.md)** - How to combine multiple features for maximum performance
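
The patch describes multi-thread weight loading as loading safetensors shards in parallel with a thread pool instead of sequentially. As a rough illustration of that idea only (not the code from PR #1504), the sketch below uses Python's standard `ThreadPoolExecutor`. The helper name `load_shards_parallel` and its `load_shard` callback are hypothetical; the default of 4 threads mirrors the documented `num_weight_load_threads` default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def load_shards_parallel(
    shard_paths: Iterable[str],
    load_shard: Callable[[str], Dict[str, T]],
    num_threads: int = 4,  # assumption: mirrors the documented default
) -> Dict[str, T]:
    """Hypothetical helper: load each shard on a thread pool, then merge.

    `load_shard` maps one shard path to a {tensor_name: tensor} dict.
    """
    paths = list(shard_paths)
    if not paths:
        return {}

    # Never start more workers than there are shards, and always at least one.
    workers = max(1, min(num_threads, len(paths)))

    merged: Dict[str, T] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the merge is deterministic
        # even though the shards themselves are read concurrently.
        for shard in pool.map(load_shard, paths):
            merged.update(shard)
    return merged
```

With real checkpoints, `load_shard` could be a function such as `safetensors.torch.load_file`; that pairing is an assumption for illustration, not something the patch states. Threads help here because shard reads are dominated by file I/O, which is consistent with the tip's note that the best thread count depends on the storage backend.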