diff --git a/skills/vllm-omni-add-diffusion-model/SKILL.md b/skills/vllm-omni-add-diffusion-model/SKILL.md
new file mode 100644
index 0000000..353422e
--- /dev/null
+++ b/skills/vllm-omni-add-diffusion-model/SKILL.md
@@ -0,0 +1,308 @@
+---
+name: add-diffusion-model
+description: Add a new diffusion model (text-to-image, text-to-video, image-to-video, text-to-audio, image editing) to vLLM-Omni. Use when integrating a new diffusion model, porting a diffusers pipeline or a custom model repo to vllm-omni, creating a new DiT transformer adapter, or adding diffusion model support.
+---
+
+# Adding a Diffusion Model to vLLM-Omni
+
+## Overview
+
+This skill guides you through adding a new diffusion model to vLLM-Omni. The model may come from HuggingFace Diffusers (structured pipeline) or from a private/custom repo. The workflow differs significantly depending on the source.
+
+## Prerequisites
+
+Before starting, determine:
+
+1. **Model category**: Text-to-Image, Text-to-Video, Image-to-Video, Image Editing, Text-to-Audio, or Omni
+2. **Reference source**: Diffusers pipeline, custom repo, or a combination
+3. **Model HuggingFace ID** or local checkpoint path
+4. **Architecture**: Scheduler, text encoder, VAE, transformer/backbone
+
+## Step 0: Classify the Migration Path
+
+Check the model's HF repo for `model_index.json`. This determines your path:
+
+| Scenario | How to identify | Migration path |
+|----------|----------------|----------------|
+| **Already supported** | `_class_name` in `model_index.json` matches a key in `_DIFFUSION_MODELS` in `registry.py` | Skip to Step 5 (test) and Step 7 (docs) |
+| **Diffusers-based** | Has standard `model_index.json` with `_diffusers_version`, subfolders for `transformer/`, `vae/`, etc. | Follow **Path A** below |
+| **Custom/private repo** | No diffusers `model_index.json`, weights in non-standard format, custom model code in a separate git repo | Follow **Path B** below |
+| **Hybrid** | Has some diffusers components (VAE) but custom transformer/fusion | Mix of Path A and Path B |
+
+## Path A: Diffusers-Based Model
+
+For models with a standard diffusers layout. See [references/transformer-adaptation.md](references/transformer-adaptation.md) for detailed code patterns.
+
+### A1. Analyze `model_index.json`
+
+Identify components: `transformer`, `scheduler`, `vae`, `text_encoder`, `tokenizer`.
+
+### A2. Create model directory
+
+```
+vllm_omni/diffusion/models/your_model_name/
+├── __init__.py
+├── pipeline_your_model.py
+└── your_model_transformer.py
+```
+
+### A3. Adapt transformer
+
+1. Copy from diffusers source. Remove mixins (`ModelMixin`, `ConfigMixin`, `AttentionModuleMixin`).
+2. Replace attention with `vllm_omni.diffusion.attention.layer.Attention` (QKV shape: `[B, seq, heads, head_dim]`).
+3. Add `od_config: OmniDiffusionConfig | None = None` to `__init__`.
+4. Add `load_weights()` method mapping diffusers weight names to vllm-omni names.
+5. Add class attributes: `_repeated_blocks`, `_layerwise_offload_blocks_attr`.
+
+### A4. Adapt pipeline
+
+Inherit from `nn.Module`. The key contract:
+
+```python
+class YourPipeline(nn.Module):
+    def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
+        # Load VAE, text encoder, tokenizer via from_pretrained()
+        # Instantiate transformer (weights loaded later via weights_sources)
+        self.weights_sources = [
+            DiffusersPipelineLoader.ComponentSource(
+                model_or_path=od_config.model, subfolder="transformer",
+                prefix="transformer.", fall_back_to_pt=True)]
+
+    def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput:
+        # Encode prompt → prepare latents → denoise loop → VAE decode
+        return DiffusionOutput(output=output)
+
+    def load_weights(self, weights):
+        return AutoWeightsLoader(self).load_weights(weights)
+```
+
+Add post/pre-process functions in the same pipeline file. Register them in `registry.py`.
+
+### A5. Register, test, docs → continue at Step 4 below.
+
+---
+
+## Path B: Custom/Private Repo Model
+
+For models without a diffusers pipeline — weights in custom format, model code in a private repo. Real examples: DreamID-Omni, BAGEL, HunyuanImage3.
+
+### B1. Understand the reference repo
+
+Study the original model's code to identify:
+- **Model architecture files** (transformers, fusion modules, embeddings)
+- **Weight format** (safetensors, `.pth`, custom checkpoint structure)
+- **Weight loading helpers** (custom init functions, checkpoint loaders)
+- **Pre/post-processing** (image/audio transforms, tokenization, VAE encode/decode)
+- **External dependencies** (packages not on PyPI)
+- **Config format** (JSON config files, hardcoded dicts)
+
+### B2. Decide what lives WHERE
+
+This is the key design decision for custom models. Follow these placement rules:
+
+| Code type | Where to place | Example |
+|-----------|---------------|---------|
+| **Pipeline orchestration** (init, forward, denoise loop) | `vllm_omni/diffusion/models/<name>/pipeline_<name>.py` | Always required |
+| **Custom transformer/backbone** (ported and adapted to vllm-omni) | `vllm_omni/diffusion/models/<name>/<name>_transformer.py` or similar | `wan2_2.py`, `fusion.py`, `bagel_transformer.py` |
+| **Custom sub-models** (VAE, fusion, autoencoder) | `vllm_omni/diffusion/models/<name>/` as separate files | `autoencoder.py`, `fusion.py` |
+| **External dependency code** (original repo utilities) | **External repo**, installed via download script or pip | `dreamid_omni` package via git clone |
+| **Hardcoded model configs** | Module-level dicts in pipeline file | `VIDEO_CONFIG`, `AUDIO_CONFIG` dicts |
+| **Download/setup script** | `examples/offline_inference/<name>/download_<name>.py` | `download_dreamid_omni.py` |
+| **Custom `model_index.json`** | Generated by download script, placed at model root | Minimal: `{"_class_name": "YourPipeline", ...}` |
+
+### B3. Handle external dependencies
+
+If the model's code lives in a separate git repo:
+
+**Option 1: Import with a clear error message** (recommended for models with external utils)
+
+```python
+try:
+    from external_model.utils import init_vae, load_checkpoint
+except ImportError as exc:
+    raise ImportError(
+        "Failed to import from dependency 'external_model'. "
+        "Please run the download script first."
+    ) from exc
+```
+
+**Option 2: Port the code directly** (preferred when feasible)
+
+Copy the essential model files into `vllm_omni/diffusion/models/<name>/` and adapt them. This avoids external dependencies. BAGEL does this — `autoencoder.py` and `bagel_transformer.py` are ported directly.
+
+**Decision criteria**: Port if the code is self-contained and won't diverge. Use external deps if the model repo is actively maintained and the code is complex.
+
+### B4. Handle custom weight loading
+
+Custom models have two patterns for weight loading:
+
+**Pattern 1: Bypass standard loader** (DreamID-Omni style)
+
+When the original model has complex custom init functions that load weights in `__init__`:
+
+```python
+class CustomPipeline(nn.Module):
+    def __init__(self, *, od_config, prefix=""):
+        super().__init__()
+        model = od_config.model
+        # Load everything eagerly via custom helpers; set self.device first (nn.Module defines no .device)
+        self.vae = custom_init_vae(model, device=self.device)
+        self.text_encoder = custom_init_text_encoder(model, device=self.device)
+        self.transformer = CustomFusionModel(CONFIG)
+        load_custom_checkpoint(self.transformer,
+            checkpoint_path=os.path.join(model, "model.safetensors"))
+        # NO weights_sources defined — bypasses standard loader
+
+    def load_weights(self, weights):
+        pass  # No-op — all weights loaded in __init__
+```
+
+**Pattern 2: Use standard loader with custom `load_weights`** (BAGEL style)
+
+When weights are in safetensors format but need name remapping:
+
+```python
+class CustomPipeline(nn.Module):
+    def __init__(self, *, od_config, prefix=""):
+        super().__init__()
+        # Instantiate model architecture without weights
+        self.bagel = BagelModel(config)
+        self.vae = AutoEncoder(ae_params)
+
+        # Point loader at the safetensors in the model root
+        self.weights_sources = [
+            DiffusersPipelineLoader.ComponentSource(
+                model_or_path=od_config.model,
+                subfolder=None,  # weights at root, not in subfolder
+                prefix="",
+                fall_back_to_pt=False,
+            )
+        ]
+
+    def load_weights(self, weights):
+        # Custom name remapping for non-diffusers weight names
+        params = dict(self.named_parameters())
+        loaded = set()
+        for name, tensor in weights:
+            # Remap original weight names to vllm-omni module names
+            name = self._remap_weight_name(name)
+            if name in params:
+                default_weight_loader(params[name], tensor)
+                loaded.add(name)
+        return loaded
+```
+
+### B5. Create the `model_index.json`
+
+Custom models need a `model_index.json` at the model root so that vllm-omni can discover them. It can be minimal:
+
+```json
+{
+    "_class_name": "YourModelPipeline",
+    "custom_key": "path/to/custom_weights.safetensors"
+}
+```
+
+The `_class_name` must match a key in `_DIFFUSION_MODELS` in `registry.py`. Additional keys are model-specific (accessed via `od_config.model_config`).
+
+If the model's weights come from multiple HF repos, write a **download script** that:
+1. Downloads from each repo
+2. Assembles into a single directory
+3. Generates `model_index.json`
+4. Installs any external dependencies (git clone + `.pth` file)
+
+Place at: `examples/offline_inference/<name>/download_<name>.py`
+
+### B6. Handle multi-modal inputs
+
+If the model accepts images, audio, or other multi-modal inputs, implement the protocol classes from `vllm_omni/diffusion/models/interface.py`:
+
+```python
+from vllm_omni.diffusion.models.interface import SupportImageInput, SupportAudioInput
+
+class MyPipeline(nn.Module, SupportImageInput, SupportAudioInput):
+    # Protocol markers — the engine uses these to enable proper input routing
+    pass
+```
+
+Preprocessing for custom models is typically done **inside `forward()`** rather than via registered pre-process functions, since the logic is often tightly coupled to the model.
+
+### B7. Continue at Step 4 below.
+
+---
+
+## Common Steps (Both Paths)
+
+### Step 4: Register Model in registry.py
+
+Edit `vllm_omni/diffusion/registry.py`:
+
+```python
+_DIFFUSION_MODELS = {
+    "YourModelPipeline": ("your_model_name", "pipeline_your_model", "YourModelPipeline"),
+}
+_DIFFUSION_POST_PROCESS_FUNCS = {
+    "YourModelPipeline": "get_your_model_post_process_func",  # if applicable
+}
+_DIFFUSION_PRE_PROCESS_FUNCS = {
+    "YourModelPipeline": "get_your_model_pre_process_func",  # if applicable
+}
+```
+
+The registry key is the `_class_name` from `model_index.json`. The tuple is `(folder_name, module_file, class_name)`.
+
+Create `__init__.py` exporting the pipeline class and any factory functions.
+
+### Step 5: Run, Test, Debug
+
+Use the appropriate existing example script:
+
+| Category | Script |
+|----------|--------|
+| Text-to-Image | `examples/offline_inference/text_to_image/text_to_image.py` |
+| Text-to-Video | `examples/offline_inference/text_to_video/text_to_video.py` |
+| Image-to-Video | `examples/offline_inference/image_to_video/image_to_video.py` |
+| Image Editing | `examples/offline_inference/image_to_image/image_edit.py` |
+| Text-to-Audio | `examples/offline_inference/text_to_audio/text_to_audio.py` |
+
+For custom/Omni models that don't fit these categories, create a dedicated example script.
+
+**Validation**: The run completes without errors, the output is meaningful, and its quality matches the reference implementation.
+
+See [references/troubleshooting.md](references/troubleshooting.md) for common errors.
+
+### Step 6: Add Example Scripts
+
+For Omni or custom models, create:
+- `examples/offline_inference/your_model_name/` — offline script + README
+- `examples/online_serving/your_model_name/` — server script + client
+- Download script if weights require assembly from multiple sources
+
+### Step 7: Update Documentation
+
+Required updates:
+1. `docs/user_guide/diffusion/parallelism_acceleration.md` — parallelism support table
+2. `docs/user_guide/diffusion/teacache.md` — if TeaCache supported
+3. `docs/user_guide/diffusion/cache_dit_acceleration.md` — if Cache-DiT supported
+4. `examples/offline_inference/xxx/README.md` — offline example docs
+5. `examples/online_serving/xxx/README.md` — online serving docs
+
+### Step 8: Add E2E Tests (Recommended)
+
+Create `tests/e2e/online_serving/test_your_model_expansion.py`.
+
+## Iterative Development Tips
+
+1. **Start minimal**: Basic generation first, no parallelism/caching
+2. **Use `--enforce-eager`**: Disable torch.compile during debugging
+3. **Use small models**: Test with smaller variants first
+4. **Check tensor shapes**: Most errors are reshape mismatches in attention
+5. **Add parallelism incrementally**: TP → CFG parallel → SP (same order as `references/parallelism-patterns.md`)
+6. **For custom models**: Get the model running with the original code first, then progressively replace components with vllm-omni equivalents
+
+## Reference Files
+
+- [Transformer Adaptation](references/transformer-adaptation.md) — porting transformers from diffusers
+- [Custom Model Patterns](references/custom-model-patterns.md) — patterns for non-diffusers models
+- [Parallelism Patterns](references/parallelism-patterns.md) — TP, SP, CFG parallel
+- [Troubleshooting](references/troubleshooting.md) — common errors and fixes
diff --git a/skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md b/skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md
new file mode 100644
index 0000000..2434e0b
--- /dev/null
+++ b/skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md
@@ -0,0 +1,273 @@
+# Custom Model Patterns Reference
+
+Patterns for adding models that don't come from the standard diffusers pipeline format.
+
+## Directory Structure Comparison
+
+### Diffusers-based model (e.g., Wan2.2)
+
+```
+vllm_omni/diffusion/models/wan2_2/
+├── __init__.py                    # Exports pipeline + transformer + helpers
+├── pipeline_wan2_2.py             # Pipeline: loads components via from_pretrained()
+├── pipeline_wan2_2_i2v.py         # Variant pipeline for image-to-video
+└── wan2_2_transformer.py          # Transformer: ported from diffusers, uses Attention layer
+```
+
+The transformer is loaded separately via `weights_sources` + `load_weights()`. Non-transformer components (VAE, text encoder) are loaded in `__init__` via `from_pretrained()`.
+
+### Custom model with external deps (e.g., DreamID-Omni)
+
+```
+vllm_omni/diffusion/models/dreamid_omni/
+├── __init__.py                    # Exports pipeline only
+├── pipeline_dreamid_omni.py       # Pipeline: loads ALL weights in __init__ via custom helpers
+├── fusion.py                      # Custom fusion architecture (video + audio cross-attention)
+└── wan2_2.py                      # Re-implemented Wan backbone with split API
+
+examples/offline_inference/x_to_video_audio/
+└── download_dreamid_omni.py       # Downloads weights from 3 HF repos + clones code repo
+```
+
+All weights loaded eagerly in `__init__`. `load_weights()` is a no-op. External dependency (`dreamid_omni` package) imported with try/except.
+
+### Custom model with ported code (e.g., BAGEL)
+
+```
+vllm_omni/diffusion/models/bagel/
+├── __init__.py
+├── pipeline_bagel.py              # Pipeline: instantiates models, uses weights_sources
+├── bagel_transformer.py           # Full LLM backbone (Qwen2-MoT) ported into vllm-omni
+└── autoencoder.py                 # Custom VAE ported from original repo
+```
+
+Model code is fully ported (no external dependency). Uses `weights_sources` and `load_weights()` with custom name remapping to handle non-diffusers safetensors format.
+
+## Weight Loading Patterns
+
+### Pattern 1: Standard diffusers flow (Wan2.2, Z-Image, FLUX)
+
+```
+init → create transformer (empty) → set weights_sources → [loader calls load_weights()]
+```
+
+- `weights_sources` points to safetensors in HF subfolder (e.g., `transformer/`)
+- `load_weights()` receives `(name, tensor)` pairs from the loader
+- Name remapping handles diffusers→vllm-omni differences (QKV fusion, Sequential index removal)
+
+### Pattern 2: Custom safetensors at root (BAGEL)
+
+```
+init → create all models (empty) → set weights_sources(subfolder=None) → [loader calls load_weights()]
+```
+
+- `weights_sources` points to **root** of model directory, not a subfolder
+- Weights have non-diffusers names (e.g., `bagel.language_model.model.layers.0.self_attn.q_proj.weight`)
+- `load_weights()` does heavy name normalization
+
+```python
+self.weights_sources = [
+    DiffusersPipelineLoader.ComponentSource(
+        model_or_path=od_config.model,
+        subfolder=None,      # root directory
+        prefix="",           # no prefix stripping
+        fall_back_to_pt=False,
+    )
+]
+```
+
+### Pattern 3: Fully custom loading (DreamID-Omni)
+
+```
+init → load ALL weights eagerly via custom helpers → load_weights() = no-op
+```
+
+- No `weights_sources` attribute — standard loader finds nothing to iterate
+- Custom init functions (e.g., `init_wan_vae_2_2()`, `load_fusion_checkpoint()`) handle downloading and loading
+- `load_weights()` is `pass`
+- Weights may come from multiple HF repos in different formats (`.pth`, `.safetensors`)
+
+Use this when:
+- The original model has complex, well-tested loading code you don't want to rewrite
+- Weights span multiple HF repos
+- Weight format is non-standard (e.g., a single `.pth` file, not sharded safetensors)
+
+## model_index.json for Custom Models
+
+Standard diffusers `model_index.json`:
+```json
+{
+    "_class_name": "WanPipeline",
+    "_diffusers_version": "0.35.0.dev0",
+    "scheduler": ["diffusers", "UniPCMultistepScheduler"],
+    "transformer": ["diffusers", "WanTransformer3DModel"],
+    "vae": ["diffusers", "AutoencoderKLWan"]
+}
+```
+
+Custom model `model_index.json` (minimal):
+```json
+{
+    "_class_name": "DreamIDOmniPipeline",
+    "fusion": "DreamID-Omni/dreamid_omni.safetensors"
+}
+```
+
+The only **required** field is `_class_name` — it must match a key in `_DIFFUSION_MODELS` in `registry.py`. Other fields are model-specific and accessible via `od_config.model_config` dict.
+
+## External Dependency Management
+
+### Git clone + .pth injection (DreamID-Omni pattern)
+
+```python
+def download_dependency():
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    with open(LOCK_FILE, "w") as f:
+        fcntl.flock(f, fcntl.LOCK_EX)
+        if not DEPENDENCY_DIR.exists():
+            subprocess.run([
+                "git", "clone", "--depth", "1",
+                REPO_URL, "--branch", BRANCH,
+                str(DEPENDENCY_DIR)
+            ], check=True)
+        fcntl.flock(f, fcntl.LOCK_UN)
+
+    # Add to Python path via .pth file
+    site_packages = Path(site.getsitepackages()[0])
+    pth_file = site_packages / "vllm_omni_dependency.pth"
+    pth_file.write_text(str(DEPENDENCY_DIR))
+```
+
+### Direct port (BAGEL pattern)
+
+Copy essential files from the original repo into `vllm_omni/diffusion/models/<name>/`. Adapt imports to use vllm-omni utilities. Benefits: no external dependency, no git clone step. Drawback: must maintain the ported code.
+
+## Multi-Modal Input/Output Protocols
+
+Custom models that handle images, audio, or video I/O should implement protocol classes:
+
+```python
+from vllm_omni.diffusion.models.interface import (
+    SupportImageInput,    # Model accepts image input
+    SupportAudioInput,    # Model accepts audio input
+    SupportAudioOutput,   # Model produces audio output
+)
+
+class MyPipeline(nn.Module, SupportImageInput, SupportAudioInput, SupportAudioOutput):
+    pass  # Protocol markers enable proper engine routing
+```
+
+The engine checks `isinstance(pipeline, SupportImageInput)` at startup to configure input validation and warmup behavior.
+
+## Hardcoded Config vs Config Files
+
+Diffusers models use `config.json` in each subfolder. Custom models often use:
+
+**Module-level config dicts** (DreamID-Omni):
+```python
+VIDEO_CONFIG = {
+    "patch_size": [1, 2, 2], "model_type": "ti2v",
+    "dim": 3072, "ffn_dim": 14336, "num_heads": 24, "num_layers": 30, ...
+}
+```
+
+**Loaded from custom JSON** (BAGEL):
+```python
+cfg_path = os.path.join(model_path, "config.json")
+with open(cfg_path) as f:
+    bagel_cfg = json.load(f)
+vae_cfg = bagel_cfg.get("vae_config", {})
+```
+
+## Custom Architecture Patterns
+
+### Split forward API (DreamID-Omni)
+
+When a fusion model needs to interleave blocks from two backbones:
+
+```python
+class WanModel(nn.Module):
+    def prepare_transformer_block_kwargs(self, x, t, context, ...):
+        # Patch embed, time embed, text embed, RoPE
+        return x, e, kwargs
+
+    def post_transformer_block_out(self, x, grid_sizes, e):
+        # Output projection, unpatchify
+        return output
+
+    def forward(self, *args, **kwargs):
+        raise NotImplementedError  # Fusion model handles block iteration
+```
+
+The `FusionModel` then iterates blocks in lock-step:
+```python
+for video_block, audio_block in zip(self.video_model.blocks, self.audio_model.blocks):
+    video_hidden = video_block(video_hidden, ...)
+    audio_hidden = audio_block(audio_hidden, ...)
+    # Cross-attend between modalities; the results feed the next pair of blocks
+    video_hidden = cross_attention(video_hidden, audio_hidden)
+    audio_hidden = cross_attention(audio_hidden, video_hidden)
+```
+
+### LLM-as-denoiser (BAGEL)
+
+When the backbone is a language model that also does diffusion:
+
+```python
+class BagelModel(nn.Module):
+    def __init__(self):
+        self.language_model = Qwen2MoTForCausalLM(config)
+        self.vit_model = SiglipVisionModel(vit_config)
+```
+
+The LLM processes both text tokens and latent image tokens in a single forward pass, using KV caching for the text portion.
+
+## Pre/Post Processing for Custom Models
+
+Custom models typically handle pre/post processing **inside `forward()`** rather than via registered functions, because the logic is tightly coupled:
+
+```python
+def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput:
+    # Inline preprocessing
+    image = self._load_and_resize_image(req.prompts[0].get("multi_modal_data", {}).get("image"))
+    image_latent = self._vae_encode(image)
+
+    # ... denoising loop ...
+
+    # Inline postprocessing
+    pil_image = self._decode_to_pil(latents)
+    return DiffusionOutput(output=[pil_image])
+```
+
+If pre/post functions are not registered in `_DIFFUSION_PRE_PROCESS_FUNCS` / `_DIFFUSION_POST_PROCESS_FUNCS`, the engine simply skips those steps.
+
+## Download Script Template
+
+```python
+# examples/offline_inference/<name>/download_<name>.py
+from huggingface_hub import snapshot_download
+import json, os
+
+def main(output_dir):
+    # Download model weights from HF
+    snapshot_download(repo_id="org/model-weights", local_dir=os.path.join(output_dir, "weights"))
+
+    # Download additional components if from separate repos
+    snapshot_download(repo_id="org/vae-weights", local_dir=os.path.join(output_dir, "vae"),
+        allow_patterns=["*.safetensors"])
+
+    # Generate model_index.json
+    config = {"_class_name": "YourPipeline", "custom_key": "weights/model.safetensors"}
+    with open(os.path.join(output_dir, "model_index.json"), "w") as f:
+        json.dump(config, f, indent=2)
+
+    # Install external code dependency (if needed); download_dependency() is the helper from "Git clone + .pth injection"
+    download_dependency()
+
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--output-dir", default="./your_model")
+    args = parser.parse_args()
+    main(args.output_dir)
+```
diff --git a/skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md b/skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md
new file mode 100644
index 0000000..c6f32d1
--- /dev/null
+++ b/skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md
@@ -0,0 +1,114 @@
+# Parallelism Patterns Reference
+
+## Tensor Parallelism (TP)
+
+Replace standard `nn.Linear` with vLLM's parallel linear layers:
+
+| Pattern | vLLM Layer | When to Use |
+|---------|-----------|-------------|
+| Fan-out (first in FFN) | `ColumnParallelLinear` | Projection that splits output across ranks |
+| Fan-in (second in FFN) | `RowParallelLinear` | Projection that consumes rank-sharded input and all-reduces the partial outputs |
+| QKV projection | `QKVParallelLinear` | Fused Q/K/V for self-attention |
+| Single Q or K or V | `ColumnParallelLinear` | Separate projections (cross-attention) |
+| Attention output | `RowParallelLinear` | Output projection after attention |
+
+```python
+from vllm.model_executor.layers.linear import (
+    ColumnParallelLinear,
+    RowParallelLinear,
+    QKVParallelLinear,
+)
+
+class TPFeedForward(nn.Module):
+    def __init__(self, dim, ffn_dim):
+        super().__init__()
+        self.fc1 = ColumnParallelLinear(dim, ffn_dim)
+        self.fc2 = RowParallelLinear(ffn_dim, dim)
+
+    def forward(self, x):
+        x, _ = self.fc1(x)
+        x = torch.nn.functional.gelu(x)
+        x, _ = self.fc2(x)
+        return x
+```
+
+**TP constraints**: `hidden_dim`, `num_heads`, and `num_kv_heads` must be divisible by `tp_size`.
+
+### RMSNorm with TP
+
+When RMSNorm sits between TP-sharded dimensions, use `DistributedRMSNorm` from the Wan2.2 implementation pattern — it computes global RMS via all-reduce across TP ranks.
+
+## CFG Parallelism
+
+Inherit `CFGParallelMixin` in your pipeline and implement `predict_noise()`:
+
+```python
+from vllm_omni.diffusion.distributed.cfg_parallel.cfg_parallel import CFGParallelMixin
+
+class MyPipeline(nn.Module, CFGParallelMixin):
+    def predict_noise(self, model, latent_model_input, t, prompt_embeds, **kwargs):
+        return model(latent_model_input, t, prompt_embeds, **kwargs)
+
+    def forward(self, req):
+        # In the denoising loop:
+        noise_pred = self.predict_noise_maybe_with_cfg(
+            model=self.transformer,
+            sample=latents,
+            timestep=t,
+            prompt_embeds=prompt_embeds,
+            guidance_scale=guidance_scale,
+            do_cfg=guidance_scale > 1.0,
+        )
+        latents = self.scheduler_step_maybe_with_cfg(
+            self.scheduler, noise_pred, t, latents
+        )
+```
+
+## Sequence Parallelism (SP)
+
+SP is applied non-intrusively via the `_sp_plan` dict on the transformer class. The framework applies hooks at module boundaries to shard/gather sequences.
+
+```python
+from vllm_omni.diffusion.distributed.sp_plan import (
+    SequenceParallelInput,
+    SequenceParallelOutput,
+)
+
+class MyTransformer(nn.Module):
+    _sp_plan = {
+        # Split hidden_states input on dim=1 before first block
+        "blocks.0": SequenceParallelInput(split_dim=1),
+        # Gather output on dim=1 after final projection
+        "proj_out": SequenceParallelOutput(gather_dim=1),
+    }
+```
+
+For RoPE that needs splitting, add an entry for the RoPE module:
+
+```python
+_sp_plan = {
+    "rope": SequenceParallelInput(split_dim=1, split_output=True, auto_pad=True),
+    "blocks.0": SequenceParallelInput(split_dim=1),
+    "proj_out": SequenceParallelOutput(gather_dim=1),
+}
+```
+
+The `auto_pad=True` flag handles variable sequence lengths by padding the sequence to a multiple of the SP degree and creating attention masks accordingly.
+
+## VAE Patch Parallelism
+
+If using `DistributedAutoencoderKLWan` or similar distributed VAE, the framework handles spatial sharding automatically. Set `vae_patch_parallel_size` in the parallel config.
+
+## HSDP (Hybrid Sharded Data Parallel)
+
+HSDP uses PyTorch FSDP2 to shard transformer weights. No code changes needed in the model — the loader handles it. Set `use_hsdp=True` in `DiffusionParallelConfig`.
+
+## Adding Parallelism Incrementally
+
+Recommended order:
+1. **Basic single-GPU**: Get generation working first
+2. **Tensor Parallelism**: Replace Linear layers, update `load_weights` for QKV fusion
+3. **CFG Parallel**: Add `CFGParallelMixin`, implement `predict_noise`
+4. **Sequence Parallelism**: Add `_sp_plan` to transformer
+5. **HSDP**: Usually works out of the box after TP is done
+6. **VAE Patch Parallel**: Switch to distributed VAE class
diff --git a/skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md b/skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md
new file mode 100644
index 0000000..6e344b6
--- /dev/null
+++ b/skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md
@@ -0,0 +1,218 @@
+# Transformer Adaptation Reference
+
+## Adapting a Diffusers Transformer to vLLM-Omni
+
+### Step-by-step Checklist
+
+1. Copy the transformer class from diffusers source
+2. Remove all mixin classes — inherit only from `nn.Module`
+3. Replace attention dispatch with `vllm_omni.diffusion.attention.layer.Attention`
+4. Replace logger with `vllm.logger.init_logger`
+5. Add `od_config: OmniDiffusionConfig | None = None` to `__init__`
+6. Remove training-only code (gradient checkpointing, dropout)
+7. Add `load_weights()` method for weight loading from safetensors
+8. Add class-level attributes for acceleration features
+
+### Mixin Removal
+
+Remove these diffusers mixins (and their imports):
+
+```python
+# Remove all of these:
+from diffusers.models.modeling_utils import ModelMixin
+from diffusers.configuration_utils import ConfigMixin, register_to_config
+from diffusers.models.attention import AttentionModuleMixin
+from diffusers.loaders import PeftAdapterMixin, FromOriginalModelMixin
+
+# Replace:
+class MyTransformer(ModelMixin, ConfigMixin, AttentionModuleMixin):
+# With:
+class MyTransformer(nn.Module):
+```
+
+Also remove `@register_to_config` decorators from `__init__`.
+
+### Attention Replacement
+
+The vLLM-Omni `Attention` layer wraps backend selection (FlashAttention, SDPA, SageAttn, etc.) and supports sequence parallelism hooks.
+
+**QKV tensor shape must be `[batch, seq_len, num_heads, head_dim]`.**
+
+#### Self-Attention Pattern
+
+```python
+from vllm_omni.diffusion.attention.layer import Attention
+from vllm_omni.diffusion.attention.backends.abstract import AttentionMetadata
+
+class SelfAttentionBlock(nn.Module):
+    def __init__(self, dim, num_heads):
+        super().__init__()
+        self.num_heads = num_heads
+        self.head_dim = dim // num_heads
+
+        self.to_q = nn.Linear(dim, dim)
+        self.to_k = nn.Linear(dim, dim)
+        self.to_v = nn.Linear(dim, dim)
+        self.to_out = nn.Linear(dim, dim)
+
+        self.attn = Attention(
+            num_heads=num_heads,
+            head_size=self.head_dim,
+            softmax_scale=1.0 / (self.head_dim ** 0.5),
+            causal=False,
+            num_kv_heads=num_heads,
+        )
+
+    def forward(self, x, attn_mask=None):
+        B, S, _ = x.shape
+        q = self.to_q(x).view(B, S, self.num_heads, self.head_dim)
+        k = self.to_k(x).view(B, S, self.num_heads, self.head_dim)
+        v = self.to_v(x).view(B, S, self.num_heads, self.head_dim)
+
+        attn_metadata = AttentionMetadata(attn_mask=attn_mask)
+        out = self.attn(q, k, v, attn_metadata=attn_metadata)
+        out = out.reshape(B, S, -1)
+        return self.to_out(out)
+```
+
+#### Fused QKV with TP (Advanced)
+
+For tensor parallelism, use vLLM's parallel linear layers:
+
+```python
+from vllm.model_executor.layers.linear import (
+    QKVParallelLinear, RowParallelLinear
+)
+
+class TPSelfAttention(nn.Module):
+    def __init__(self, dim, num_heads):
+        super().__init__()
+        self.num_heads = num_heads
+        self.head_dim = dim // num_heads
+
+        self.to_qkv = QKVParallelLinear(
+            hidden_size=dim,
+            head_size=self.head_dim,
+            total_num_heads=num_heads,
+            total_num_kv_heads=num_heads,
+        )
+        self.to_out = RowParallelLinear(dim, dim)
+
+        self.attn = Attention(
+            num_heads=num_heads,
+            head_size=self.head_dim,
+            softmax_scale=1.0 / (self.head_dim ** 0.5),
+            causal=False,
+            num_kv_heads=num_heads,
+        )
+```
+
+### Logger Replacement
+
+```python
+# Replace:
+from diffusers.utils import logging
+logger = logging.get_logger(__name__)
+
+# With:
+from vllm.logger import init_logger
+logger = init_logger(__name__)
+```
+
+### Custom Layers from vLLM-Omni
+
+Available utility layers:
+
+```python
+from vllm.model_executor.layers.layernorm import RMSNorm
+from vllm_omni.diffusion.layers.rope import RotaryEmbedding
+from vllm_omni.diffusion.layers.adalayernorm import AdaLayerNorm
+```
+
+### Config Support
+
+```python
+from torch import nn
+from vllm_omni.diffusion.data import OmniDiffusionConfig
+
+class MyTransformer(nn.Module):
+    def __init__(self, *, od_config=None, num_layers=28, hidden_size=3072, **kwargs):
+        super().__init__()
+        self.od_config = od_config
+        self.parallel_config = od_config.parallel_config if od_config else None
+        # ... build layers
+```
+
+The transformer config values are read from `config.json` in the transformer subfolder referenced by `model_index.json`. The pipeline calls `get_transformer_config_kwargs(od_config.tf_model_config, TransformerClass)` to keep only the config keys that match the `__init__` signature.
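+
+The filtering idea can be sketched in plain Python. This is not the actual `get_transformer_config_kwargs` implementation, which may behave differently; `filter_kwargs_for_init` and `ToyTransformer` are hypothetical names used only to show why extra keys in `config.json` do not break construction.
+
+```python
+import inspect
+
+
+def filter_kwargs_for_init(config: dict, cls: type) -> dict:
+    """Keep only the config keys that cls.__init__ accepts by name (hypothetical helper)."""
+    params = inspect.signature(cls.__init__).parameters
+    accepted = {
+        name
+        for name, p in params.items()
+        if name != "self"
+        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
+    }
+    return {k: v for k, v in config.items() if k in accepted}
+
+
+class ToyTransformer:
+    # Hypothetical stand-in for a real transformer class.
+    def __init__(self, *, od_config=None, num_layers=28, hidden_size=3072, **kwargs):
+        self.num_layers = num_layers
+        self.hidden_size = hidden_size
+
+
+raw_config = {
+    "_class_name": "ToyTransformer",  # metadata key, not an __init__ argument
+    "num_layers": 4,
+    "hidden_size": 256,
+    "unused_field": True,
+}
+kwargs = filter_kwargs_for_init(raw_config, ToyTransformer)
+model = ToyTransformer(**kwargs)
+```
+
+Only named parameters count as accepted, so keys that would otherwise be swallowed by `**kwargs` are dropped before the constructor runs.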
+
+### Weight Loading
+
+The `load_weights` method receives an iterable of `(name, tensor)` from safetensors files, with the prefix (e.g., `"transformer."`) already stripped by the loader.
+
+```python
+from torch import nn
+from vllm.model_executor.model_loader.weight_utils import default_weight_loader
+
+class MyTransformer(nn.Module):
+    def load_weights(self, weights):
+        params = dict(self.named_parameters())
+        loaded = set()
+        for name, tensor in weights:
+            # Optional: remap names from diffusers to vllm-omni naming
+            # e.g., "ff.net.0.proj" -> "ff.net_0.proj"
+
+            if name in params:
+                param = params[name]
+                if hasattr(param, "weight_loader"):
+                    param.weight_loader(param, tensor)
+                else:
+                    default_weight_loader(param, tensor)
+                loaded.add(name)
+        return loaded
+```
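+
+The optional remapping step mentioned in the comment above can be kept in a small table. The sketch below assumes substring replacement is enough for your model; `NAME_REMAPPINGS`, `remap_name`, and `remap_weights` are hypothetical names, and the two table entries are illustrative only.
+
+```python
+# Hypothetical remapping table: (substring in checkpoint name, replacement).
+# Derive the real entries by comparing checkpoint and model parameter names.
+NAME_REMAPPINGS = [
+    ("ff.net.0.proj", "ff.net_0.proj"),
+    (".to_out.0.", ".to_out."),
+]
+
+
+def remap_name(name: str) -> str:
+    """Apply every matching substring replacement to a checkpoint name."""
+    for old, new in NAME_REMAPPINGS:
+        name = name.replace(old, new)
+    return name
+
+
+def remap_weights(weights):
+    """Yield (remapped_name, tensor) pairs; tensors pass through untouched."""
+    for name, tensor in weights:
+        yield remap_name(name), tensor
+
+
+# Placeholder strings stand in for tensors in this example.
+remapped = dict(remap_weights([
+    ("blocks.0.ff.net.0.proj.weight", "w0"),
+    ("blocks.0.attn.to_out.0.bias", "b0"),
+    ("blocks.0.norm1.weight", "n0"),
+]))
+```
+
+Inside `load_weights`, you could iterate over `remap_weights(weights)` instead of `weights` so the rest of the loop stays unchanged.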
+
+#### QKV Fusion in load_weights
+
+If you fused separate Q/K/V into a `QKVParallelLinear`, you need to map diffusers' separate weight names:
+
+```python
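+from vllm.model_executor.model_loader.weight_utils import default_weight_loader
+
+# (fused parameter name, checkpoint parameter name, shard id)
+stacked_params_mapping = [
+    ("to_qkv", "to_q", "q"),
+    ("to_qkv", "to_k", "k"),
+    ("to_qkv", "to_v", "v"),
+]
+
+def load_weights(self, weights):
+    params = dict(self.named_parameters())
+    loaded = set()
+    for name, tensor in weights:
+        for fused_name, orig_name, shard_id in stacked_params_mapping:
+            if orig_name in name:
+                # Route the separate Q/K/V tensor into its shard of the fused parameter
+                name = name.replace(orig_name, fused_name)
+                param = params[name]
+                param.weight_loader(param, tensor, shard_id)
+                loaded.add(name)
+                break
+        else:
+            # Normal loading for parameters that are not part of a fused layer
+            if name not in params:
+                continue
+            param = params[name]
+            weight_loader = getattr(param, "weight_loader", default_weight_loader)
+            weight_loader(param, tensor)
+            loaded.add(name)
+    return loaded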
+```
+
+### Class-Level Attributes for Features
+
+```python
+class MyTransformer(nn.Module):
+    # torch.compile: list block class names that repeat and can be compiled
+    _repeated_blocks = ["MyTransformerBlock"]
+
+    # CPU offload: attribute name of the nn.ModuleList containing blocks
+    _layerwise_offload_blocks_attr = "blocks"
+
+    # LoRA: mapping of fused param names to original param names
+    packed_modules_mapping = {"to_qkv": ["to_q", "to_k", "to_v"]}
+
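+    # Sequence parallelism plan (advanced; add after the basic implementation works).
+    # SequenceParallelInput / SequenceParallelOutput must be imported; the imports are omitted here.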
+    _sp_plan = {
+        "blocks.0": SequenceParallelInput(split_dim=1),
+        "proj_out": SequenceParallelOutput(gather_dim=1),
+    }
+```
diff --git a/skills/vllm-omni-add-diffusion-model/references/troubleshooting.md b/skills/vllm-omni-add-diffusion-model/references/troubleshooting.md
new file mode 100644
index 0000000..4c63cdf
--- /dev/null
+++ b/skills/vllm-omni-add-diffusion-model/references/troubleshooting.md
@@ -0,0 +1,103 @@
+# Troubleshooting Reference
+
+## Common Errors When Adding a Diffusion Model
+
+### ImportError / ModuleNotFoundError
+
+**Cause**: Missing or incorrect registration.
+
+**Fix checklist**:
+1. Model registered in `vllm_omni/diffusion/registry.py` `_DIFFUSION_MODELS` dict
+2. `__init__.py` exports the pipeline class
+3. Pipeline file exists at the correct path: `vllm_omni/diffusion/models/{folder}/{file}.py`
+4. Class name in registry matches the actual class name in the file
+
+### Shape Mismatch in Attention
+
+**Symptom**: `RuntimeError: shape mismatch` or `expected 4D tensor`
+
+**Cause**: QKV tensors not reshaped to `[batch, seq_len, num_heads, head_dim]`.
+
+**Fix**: Before calling `self.attn(q, k, v, ...)`, ensure:
+```python
+q = q.view(batch, seq_len, self.num_heads, self.head_dim)
+k = k.view(batch, kv_seq_len, self.num_kv_heads, self.head_dim)
+v = v.view(batch, kv_seq_len, self.num_kv_heads, self.head_dim)
+```
+
+After attention, reshape back:
+```python
+out = out.reshape(batch, seq_len, -1)
+```
+
+### Weight Loading Failures
+
+**Symptom**: `RuntimeError: size mismatch for parameter ...` or missing keys
+
+**Debugging**:
+1. Print diffusers weight names: `safetensors.safe_open(path, "pt").keys()`
+2. Print model parameter names: `dict(model.named_parameters()).keys()`
+3. Compare and add name remappings in `load_weights()`
+
+**Common remappings needed**:
+- `ff.net.0.proj` → `ff.net_0.proj` (PyTorch Sequential indexing)
+- `.to_out.0.` → `.to_out.` (Sequential unwrapping)
+- `scale_shift_table` → moved to a wrapper module
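+
+The comparison in step 3 can be automated with a set difference. This is a minimal sketch: `diff_names` is a hypothetical helper, and the name lists are made-up examples standing in for the safetensors keys and `dict(model.named_parameters()).keys()`.
+
+```python
+def diff_names(checkpoint_names, model_names):
+    """Return (names only in the checkpoint, names only in the model)."""
+    ckpt, model = set(checkpoint_names), set(model_names)
+    return sorted(ckpt - model), sorted(model - ckpt)
+
+
+# Made-up example names; substitute the real lists from steps 1 and 2.
+checkpoint_names = [
+    "blocks.0.ff.net.0.proj.weight",
+    "blocks.0.attn.to_out.0.weight",
+    "proj_out.weight",
+]
+model_names = [
+    "blocks.0.ff.net_0.proj.weight",
+    "blocks.0.attn.to_out.weight",
+    "proj_out.weight",
+]
+
+unexpected, missing = diff_names(checkpoint_names, model_names)
+for name in unexpected:
+    print("checkpoint only:", name)
+for name in missing:
+    print("model only:     ", name)
+```
+
+Names that appear in both lists need no remapping; each "checkpoint only" entry should pair up with a "model only" entry once the remappings are in place.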
+
+### Black/Blank/Noisy Output
+
+**Possible causes**:
+1. **Wrong latent normalization**: Check whether the VAE expects latents scaled by `vae.config.scaling_factor`
+2. **Wrong scheduler**: Using the wrong scheduler class or the wrong `flow_shift`
+3. **Missing CFG**: Some models require `guidance_scale > 1.0` with a negative prompt
+4. **Wrong timestep format**: Some schedulers expect float, others expect int/long
+5. **Missing post-processing**: Raw VAE output may need denormalization
+
+**Quick test**: Run with diffusers directly using the same seed and compare latents at each step.
+
+### OOM (Out of Memory)
+
+**Solutions** (in order of preference):
+1. `--enforce-eager` to disable torch.compile (saves compile memory)
+2. `--enable-cpu-offload` for model-level offload
+3. `--enable-layerwise-offload` for block-level offload (better for large models)
+4. `--vae-use-slicing --vae-use-tiling` for VAE memory reduction
+5. Reduce resolution: `--height 480 --width 832`
+6. Use TP: `--tensor-parallel-size 2`
+
+### Different Output vs Diffusers Reference
+
+**Common causes**:
+1. **Attention backend difference**: FlashAttention vs SDPA may produce slightly different results. Set `DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA` to match diffusers
+2. **Float precision**: vLLM-Omni may use bfloat16 where diffusers uses float32 for some operations
+3. **Missing normalization**: Check all LayerNorm/RMSNorm are preserved
+4. **Scheduler rounding**: Some schedulers have numerical sensitivity
+
+### Tensor Parallel Errors
+
+**Symptom**: `AssertionError: not divisible` or incorrect output with TP>1
+
+**Fix**:
+1. Verify `hidden_dim % tp_size == 0` and `num_heads % tp_size == 0`
+2. Ensure `ColumnParallelLinear` / `RowParallelLinear` are used correctly
+3. Check that norms between parallel layers use distributed norm if needed
+4. Verify `load_weights` handles TP sharding for norm weights
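+
+The divisibility conditions in item 1 can be checked before launching a TP run. A minimal sketch, where `check_tp_divisibility` is a hypothetical helper and the dimensions are example values:
+
+```python
+def check_tp_divisibility(hidden_dim: int, num_heads: int, tp_size: int) -> list[str]:
+    """Return a list of human-readable problems; an empty list means the sizes are compatible."""
+    if tp_size < 1:
+        return [f"tp_size must be >= 1, got {tp_size}"]
+    problems = []
+    if hidden_dim % tp_size != 0:
+        problems.append(f"hidden_dim={hidden_dim} is not divisible by tp_size={tp_size}")
+    if num_heads % tp_size != 0:
+        problems.append(f"num_heads={num_heads} is not divisible by tp_size={tp_size}")
+    return problems
+
+
+# Example values only; read the real ones from the transformer config.
+ok = check_tp_divisibility(hidden_dim=3072, num_heads=24, tp_size=2)
+bad_heads = check_tp_divisibility(hidden_dim=3072, num_heads=24, tp_size=16)
+bad_both = check_tp_divisibility(hidden_dim=3072, num_heads=24, tp_size=5)
+```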
+
+### Model Not Detected / Wrong Pipeline Class
+
+**Symptom**: `ValueError: Model class ... not found in diffusion model registry`
+
+**Cause**: The pipeline `_class_name` in the model's `model_index.json` doesn't match any registry key.
+
+**Fix**: The registry key must match the diffusers pipeline class name from `model_index.json`. If using a different name, map it in the registry:
+```python
+"DiffusersPipelineClassName": ("your_folder", "your_file", "YourVllmClassName"),
+```
+
+## Debugging Workflow
+
+1. **Add verbose logging**: Use `logger.info()` to print tensor shapes at each stage
+2. **Compare step-by-step**: Run diffusers and vllm-omni side by side, comparing tensors after each major operation
+3. **Use small configs**: Set `num_inference_steps=2` and use a small resolution for fast iteration
+4. **Test the transformer in isolation**: Feed the same input to both the diffusers and vllm-omni transformers, then compare outputs
+5. **Binary search for bugs**: Comment out blocks/layers to isolate where divergence starts
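+
+Steps 2 and 5 can be combined into a small helper that reports the first stage where the two implementations diverge. This is a sketch under stated assumptions: `first_divergence` is a hypothetical name, each stage is recorded as a `(name, flat list of floats)` pair (for example from `tensor.flatten().tolist()`), and the tolerance is an arbitrary example value.
+
+```python
+def first_divergence(ref_stages, test_stages, tol=1e-3):
+    """Return the name of the first stage whose values differ by more than tol, or None."""
+    for (ref_name, ref_vals), (test_name, test_vals) in zip(ref_stages, test_stages):
+        if ref_name != test_name:
+            raise ValueError(f"stage order differs: {ref_name!r} vs {test_name!r}")
+        if len(ref_vals) != len(test_vals):
+            return ref_name  # a shape change counts as divergence
+        max_diff = max((abs(a - b) for a, b in zip(ref_vals, test_vals)), default=0.0)
+        if max_diff > tol:
+            return ref_name
+    return None
+
+
+# Made-up recordings from a reference run and a ported run.
+reference = [
+    ("patch_embed", [0.10, 0.20]),
+    ("block_0", [0.50, -0.30]),
+    ("block_1", [1.20, 0.70]),
+]
+ported = [
+    ("patch_embed", [0.10, 0.20]),
+    ("block_0", [0.50, -0.30]),
+    ("block_1", [1.20, 0.95]),
+]
+culprit = first_divergence(reference, ported)
+print("first diverging stage:", culprit)
+```
+
+Because later stages inherit earlier errors, the first stage reported is usually the one to inspect.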