vllm-project · SamitHuang · Apr 9, 2026 · Apr 2, 2026
@@ -12,7 +12,7 @@
 
 vLLM-Omni supports various advanced features for diffusion models:
 
-- Acceleration: **cache methods**, **parallelism methods**
+- Acceleration: **cache methods**, **parallelism methods**, **startup optimizations**
 - Memory optimization: **cpu offloading**, **quantization**
 - Extensions: **LoRA inference**
 - Execution modes: **step execution**
@@ -44,6 +44,12 @@ Parallelism methods distribute computation across GPUs without quality loss (mat
 | **[HSDP](diffusion/parallelism/hsdp.md)** | Weight sharding via FSDP2, redistributed on-demand at runtime | Very large models (14B+) on limited VRAM, combinable with SP |
 | **[Expert Parallelism](diffusion/parallelism/expert_parallel.md)** | Shards MoE expert MLP blocks across devices | MoE diffusion models (e.g., HunyuanImage3.0) |
 
+#### Startup Optimization
+
+| Method | Description | Best For |
+|--------|-------------|----------|
+| **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** | Loads safetensors shards in parallel using a thread pool | All diffusion models; reduces startup from minutes to seconds |
+
 **Note:** Some acceleration methods can be combined together for optimized performance. See [Feature Compatibility Table](#feature-compatibility) and [Feature Compatibility Tutorial](feature_compatibility.md) for detailed configuration examples.
 
 ### Memory Optimization
@@ -178,6 +184,59 @@ The following tables show which models support each feature:
     6. Step Execution is not compatible with cache backends (TeaCache, Cache-DiT) or LoRA.
 
 
+## Multi-Thread Weight Loading
+
+Large diffusion models can take several minutes to load weights at startup (e.g., ~3 min for Qwen-Image, ~5 min for Wan2.2 I2V 14B). Multi-thread weight loading speeds up this process by loading safetensors shards in parallel using a thread pool instead of sequentially.
+
+This optimization is **enabled by default** with 4 threads. No configuration is needed for the default behavior.
+
+### Configuration
+
+| Parameter | CLI Flag | Default | Description |
+|-----------|----------|---------|-------------|
+| `enable_multithread_weight_load` | `--disable-multithread-weight-load` | `True` (enabled) | Pass the flag to disable multi-thread loading |
+| `num_weight_load_threads` | `--num-weight-load-threads` | `4` | Number of threads for parallel weight loading |
+
+!!! tip
+    The default of 4 threads balances speed and disk I/O contention. On fast NVMe storage you may benefit from more threads (e.g., 8). On HDD or network storage, the default of 4 avoids saturating I/O bandwidth.
+
+### Online Serving
+
+```bash
+# Default (multi-thread enabled, 4 threads)
+vllm serve Qwen/Qwen-Image --omni --port 8091
+
+# Custom thread count
+vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --num-weight-load-threads 8
+
+# Disable multi-thread loading
+vllm serve Qwen/Qwen-Image --omni --disable-multithread-weight-load
+```
+
+### Offline Inference
+
+```python
+from vllm_omni import Omni
+
+# Default (multi-thread enabled, 4 threads)
+omni = Omni(model="Qwen/Qwen-Image")
+
+# Custom thread count
+omni = Omni(
+    model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+    num_weight_load_threads=8,
+)
+```
+
+### Benchmarks
+
+Measured on NVIDIA H800:
+
+| Model | Before | After | Speedup |
+|-------|--------|-------|---------|
+| **Qwen/Qwen-Image** (53.7 GiB) | 168s | 27s | **6.2x** |
+| **Wan-AI/Wan2.2-I2V-A14B-Diffusers** (64.5 GiB) | 283s | 56s | **5.1x** |
+
 ## Learn More
 
 **Cache Acceleration:**
@@ -203,6 +262,10 @@ The following tables show which models support each feature:
 
 - **[Step Execution Guide](diffusion/step_execution.md)** - Per-step denoise execution with mid-request abort support
 
+**Startup Optimization:**
+
+- **[Multi-Thread Weight Loading](#multi-thread-weight-loading)** - Speed up model startup by loading safetensors shards in parallel
+
 **Advanced Topics:**
 
 - **[Feature Compatibility](feature_compatibility.md)** - How to combine multiple features for maximum performance