From 0f6ca30ad851b05be4614e2e0b2f21abc574bcc0 Mon Sep 17 00:00:00 2001
From: zjy0516
Date: Mon, 12 Jan 2026 23:58:15 +0800
Subject: [PATCH 1/2] refactor diffusion doc

Signed-off-by: zjy0516
---
 docs/.nav.yml | 11 +++++------
 .../cache_dit_acceleration.md | 0
 .../diffusion}/cpu_offload_diffusion.md | 2 +-
 .../parallelism_acceleration.md | 0
 .../{acceleration => diffusion}/teacache.md | 0
 5 files changed, 6 insertions(+), 7 deletions(-)
 rename docs/user_guide/{acceleration => diffusion}/cache_dit_acceleration.md (100%)
 rename docs/{features => user_guide/diffusion}/cpu_offload_diffusion.md (93%)
 rename docs/user_guide/{acceleration => diffusion}/parallelism_acceleration.md (100%)
 rename docs/user_guide/{acceleration => diffusion}/teacache.md (100%)

diff --git a/docs/.nav.yml b/docs/.nav.yml
index abe4865af33..196e6b8f409 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -26,17 +26,16 @@ nav:
 - Configuration:
 - configuration/README.md
 - configuration/*
- - Diffusion Acceleration:
+ - Diffusion Features:
 - Overview: user_guide/diffusion_acceleration.md
- - Acceleration Methods:
- - TeaCache: user_guide/acceleration/teacache.md
- - Cache-DiT: user_guide/acceleration/cache_dit_acceleration.md
- - Parallelism Acceleration: user_guide/acceleration/parallelism_acceleration.md
+ - TeaCache: user_guide/diffusion/teacache.md
+ - Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
+ - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
+ - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
 - Models:
 - models/supported_models.md
 - Features:
 - Sleep Mode: features/sleep_mode.md
- - CPU Offloading for Diffusion Model: features/cpu_offload_diffusion.md
 - Developer Guide:
 - General:
 - contributing/README.md
diff --git a/docs/user_guide/acceleration/cache_dit_acceleration.md b/docs/user_guide/diffusion/cache_dit_acceleration.md
similarity index 100%
rename from docs/user_guide/acceleration/cache_dit_acceleration.md
rename to docs/user_guide/diffusion/cache_dit_acceleration.md
diff --git a/docs/features/cpu_offload_diffusion.md b/docs/user_guide/diffusion/cpu_offload_diffusion.md
similarity index 93%
rename from docs/features/cpu_offload_diffusion.md
rename to docs/user_guide/diffusion/cpu_offload_diffusion.md
index aaa4243a3a2..533b6b3b964 100644
--- a/docs/features/cpu_offload_diffusion.md
+++ b/docs/user_guide/diffusion/cpu_offload_diffusion.md
@@ -23,7 +23,7 @@ if __name__ == "__main__":
 m = Omni(model="Qwen/Qwen-Image",enable_cpu_offload=True)
 ```

-- **CLI**: pass `--dit-cpu-offload` to the diffusion service entrypoint.
+- **CLI**: pass `--enable-cpu-offload` to the diffusion service entrypoint.

 ## Known Limitations
 - Cold start latency increases for over one minute for some models(e.g., Qwen-Image)
diff --git a/docs/user_guide/acceleration/parallelism_acceleration.md b/docs/user_guide/diffusion/parallelism_acceleration.md
similarity index 100%
rename from docs/user_guide/acceleration/parallelism_acceleration.md
rename to docs/user_guide/diffusion/parallelism_acceleration.md
diff --git a/docs/user_guide/acceleration/teacache.md b/docs/user_guide/diffusion/teacache.md
similarity index 100%
rename from docs/user_guide/acceleration/teacache.md
rename to docs/user_guide/diffusion/teacache.md

From 4892b8378228fc093093ab8a5b6358d3c915e2be Mon Sep 17 00:00:00 2001
From: zjy0516
Date: Tue, 13 Jan 2026 14:24:03 +0800
Subject: [PATCH 2/2] update

Signed-off-by: zjy0516
---
 docs/.nav.yml | 12 ++++++------
 docs/api/README.md | 1 +
 docs/configuration/README.md | 6 +++---
 docs/user_guide/diffusion_acceleration.md | 18 +++++++++---------
 4 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/docs/.nav.yml b/docs/.nav.yml
index 196e6b8f409..7493e71e8af 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -26,16 +26,16 @@ nav:
 - Configuration:
 - configuration/README.md
 - configuration/*
- - Diffusion Features:
- - Overview: user_guide/diffusion_acceleration.md
- - TeaCache: user_guide/diffusion/teacache.md
- - Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
- - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
 - Models:
 - models/supported_models.md
 - Features:
 - Sleep Mode: features/sleep_mode.md
+ - Diffusion Features:
+ - Overview: user_guide/diffusion_acceleration.md
+ - TeaCache: user_guide/diffusion/teacache.md
+ - Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
+ - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
+ - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
 - Developer Guide:
 - General:
 - contributing/README.md
diff --git a/docs/api/README.md b/docs/api/README.md
index 4fa85cdc663..a9d751bce25 100644
--- a/docs/api/README.md
+++ b/docs/api/README.md
@@ -82,6 +82,7 @@ Model execution components.
 - [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_code_predictor_mtp.Qwen3OmniMoeTalkerCodePredictor][]
 - [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeModel][]
 - [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeTalkerForConditionalGeneration][]
+- [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_talker.Qwen3OmniMoeTalkerSharedExpertWrapper][]
 - [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3MoeLLMForCausalLM][]
 - [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3MoeLLMModel][]
 - [vllm_omni.model_executor.models.qwen3_omni.qwen3_omni_moe_thinker.Qwen3OmniMoeConditionalGenerationMixin][]
diff --git a/docs/configuration/README.md b/docs/configuration/README.md
index 40439d51121..37e28bd0c57 100644
--- a/docs/configuration/README.md
+++ b/docs/configuration/README.md
@@ -16,6 +16,6 @@ For introduction, please check [Introduction for stage config](./stage_configs.m

 ## Optimization Features

-- **[TeaCache Configuration](../user_guide/acceleration/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
-- **[Cache-DiT Configuration](../user_guide/acceleration/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models
-- **[Parallelism Configuration](../user_guide/acceleration/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for for DiT models
+- **[TeaCache Configuration](../user_guide/diffusion/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
+- **[Cache-DiT Configuration](../user_guide/diffusion/cache_dit_acceleration.md)** - Enable Cache-DiT as cache acceleration backends for DiT models
+- **[Parallelism Configuration](../user_guide/diffusion/parallelism_acceleration.md)** - Enable parallelism (e.g., sequence parallelism) for for DiT models
diff --git a/docs/user_guide/diffusion_acceleration.md b/docs/user_guide/diffusion_acceleration.md
index 8f78ae32e50..cf04c6228a6 100644
--- a/docs/user_guide/diffusion_acceleration.md
+++ b/docs/user_guide/diffusion_acceleration.md
@@ -6,8 +6,8 @@ vLLM-Omni supports various cache acceleration methods to speed up diffusion mode

 vLLM-Omni currently supports two main cache acceleration backends:

-1. **[TeaCache](acceleration/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
-2. **[Cache-DiT](acceleration/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
+1. **[TeaCache](diffusion/teacache.md)** - Hook-based adaptive caching that caches transformer computations when consecutive timesteps are similar
+2. **[Cache-DiT](diffusion/cache_dit_acceleration.md)** - Library-based acceleration using multiple techniques:
 - **DBCache** (Dual Block Cache): Caches intermediate transformer block outputs based on residual differences
 - **TaylorSeer**: Uses Taylor expansion-based forecasting for faster inference
 - **SCM** (Step Computation Masking): Selectively computes steps based on adaptive masking
@@ -16,11 +16,11 @@ Both methods can provide significant speedups (typically **1.5x-2.0x**) while ma

 vLLM-Omni also supports parallelism methods for diffusion models, including:

-1. [Ulysses-SP](acceleration/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
+1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.

-2. [Ring-Attention](acceleration/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.
+2. [Ring-Attention](diffusion/parallelism_acceleration.md#ring-attention) - splits the input along the sequence dimension and uses ring-based P2P communication to accumulate attention results, keeping the sequence dimension sharded.

-3. [CFG-Parallel](acceleration/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.
+3. [CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel) - runs the positive/negative prompts of classifier-free guidance (CFG) on different devices, then merges on a single device to perform the scheduler step.

 ## Quick Comparison

@@ -197,7 +197,7 @@ outputs = omni.generate(prompt="turn this cat to a dog",

 For detailed information on each acceleration method:

-- **[TeaCache Guide](acceleration/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
-- **[Cache-DiT Acceleration Guide](acceleration/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
-- **[Sequence Parallelism](acceleration/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
-- **[CFG-Parallel](acceleration/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.
+- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
+- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
+- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
+- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.