From a99bbefc0b25c811d52512df2724b83f7a58fd31 Mon Sep 17 00:00:00 2001
From: samithuang <285365963@qq.com>
Date: Thu, 2 Apr 2026 08:33:05 +0000
Subject: [PATCH] [Docs] Add multi-thread weight loading documentation

Document the multi-thread weight loading startup optimization
introduced in PR #1504, including configuration, CLI flags,
usage examples, and benchmark results.

Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>
---
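Note (illustrative only, not part of the applied change): the documented
behavior, loading safetensors shards through a thread pool instead of one
after another, can be sketched as below. This is a minimal sketch under
stated assumptions, not the implementation from PR #1504. The names
`load_shards_parallel` and `load_shard` are hypothetical; in practice
`load_shard` could wrap something like `safetensors.torch.load_file`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def load_shards_parallel(
    shard_paths: Iterable[str],
    load_shard: Callable[[str], Dict[str, Any]],  # hypothetical per-shard loader
    num_threads: int = 4,  # mirrors the documented default of 4 threads
) -> Dict[str, Any]:
    """Load every shard and merge the per-shard dicts into one state dict."""
    paths = list(shard_paths)
    if num_threads <= 1 or len(paths) <= 1:
        # Sequential fallback, comparable to disabling multi-thread loading.
        results = [load_shard(path) for path in paths]
    else:
        # pool.map returns results in input order, so merging is deterministic.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            results = list(pool.map(load_shard, paths))
    merged: Dict[str, Any] = {}
    for shard in results:
        merged.update(shard)
    return merged
```

Threads fit this workload because shard loading is dominated by file I/O,
which is also why the best thread count depends on the storage device.
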
 docs/user_guide/diffusion_features.md | 65 ++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/docs/user_guide/diffusion_features.md b/docs/user_guide/diffusion_features.md
index 9cd407d377a..416c8ac6169 100644
--- a/docs/user_guide/diffusion_features.md
+++ b/docs/user_guide/diffusion_features.md
@@ -12,7 +12,7 @@
 
 vLLM-Omni supports various advanced features for diffusion models:
 
-- Acceleration: **cache methods**, **parallelism methods**
+- Acceleration: **cache methods**, **parallelism methods**, **startup optimizations**
 - Memory optimization: **cpu offloading**, **quantization**
 - Extensions: **LoRA inference**
 - Execution modes: **step execution**
@@ -44,6 +44,12 @@ Parallelism methods distribute computation across GPUs without quality loss (mat
 | **[HSDP](diffusion/parallelism/hsdp.md)** | Weight sharding via FSDP2, redistributed on-demand at runtime | Very large models (14B+) on limited VRAM, combinable with SP |
 | **[Expert Parallelism](diffusion/parallelism/expert_parallel.md)** | Shards MoE expert MLP blocks across devices | MoE diffusion models (e.g., HunyuanImage3.0) |
 
+#### Startup Optimization
+
+| Method | Description | Best For |
+|--------|-------------|----------|
+| **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** | Loads safetensors shards in parallel using a thread pool | All diffusion models; cuts measured startup from minutes to under a minute (5.1-6.2x in the benchmarks below) |
+
 **Note:** Some acceleration methods can be combined together for optimized performance. See [Feature Compatibility Table](#feature-compatibility) and [Feature Compatibility Tutorial](feature_compatibility.md) for detailed configuration examples.
 
 ### Memory Optimization
@@ -178,6 +184,59 @@ The following tables show which models support each feature:
     6. Step Execution is not compatible with cache backends (TeaCache, Cache-DiT) or LoRA.
 
 
+## Multi-Thread Weight Loading
+
+Large diffusion models can take several minutes to load weights at startup (e.g., ~3 min for Qwen-Image, ~5 min for Wan2.2-I2V-A14B). Multi-thread weight loading shortens this step by loading safetensors shards in parallel through a thread pool instead of one after another.
+
+This optimization is **enabled by default** with 4 threads. No configuration is needed for the default behavior.
+
+### Configuration
+
+| Parameter | CLI Flag | Default | Description |
+|-----------|----------|---------|-------------|
+| `enable_multithread_weight_load` | `--disable-multithread-weight-load` | `True` (enabled) | Multi-thread loading is on by default; pass the CLI flag to turn it off |
+| `num_weight_load_threads` | `--num-weight-load-threads` | `4` | Number of threads for parallel weight loading |
+
+!!! tip
+    The default of 4 threads balances speed and disk I/O contention. On fast NVMe storage you may benefit from more threads (e.g., 8). On HDD or network storage, the default of 4 avoids saturating I/O bandwidth.
+
+### Online Serving
+
+```bash
+# Default (multi-thread enabled, 4 threads)
+vllm serve Qwen/Qwen-Image --omni --port 8091
+
+# Custom thread count
+vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --num-weight-load-threads 8
+
+# Disable multi-thread loading
+vllm serve Qwen/Qwen-Image --omni --disable-multithread-weight-load
+```
+
+### Offline Inference
+
+```python
+from vllm_omni import Omni
+
+# Default (multi-thread enabled, 4 threads)
+omni = Omni(model="Qwen/Qwen-Image")
+
+# Custom thread count
+omni = Omni(
+    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+    num_weight_load_threads=8,
+)
+```
+
+### Benchmarks
+
+Measured on NVIDIA H800:
+
+| Model | Before (sequential) | After (multi-thread) | Speedup |
+|-------|--------|-------|---------|
+| **Qwen/Qwen-Image** (53.7 GiB) | 168s | 27s | **6.2x** |
+| **Wan-AI/Wan2.2-I2V-A14B-Diffusers** (64.5 GiB) | 283s | 56s | **5.1x** |
+
 ## Learn More
 
 **Cache Acceleration:**
@@ -203,6 +262,10 @@ The following tables show which models support each feature:
 
 - **[Step Execution Guide](diffusion/step_execution.md)** - Per-step denoise execution with mid-request abort support
 
+**Startup Optimization:**
+
+- **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** - Speed up model startup by loading safetensors shards in parallel
+
 **Advanced Topics:**
 
 - **[Feature Compatibility](feature_compatibility.md)** - How to combine multiple features for maximum performance