diff --git a/skills/vllm-omni-add-diffusion-model/SKILL.md b/skills/vllm-omni-add-diffusion-model/SKILL.md new file mode 100644 index 0000000..353422e --- /dev/null +++ b/skills/vllm-omni-add-diffusion-model/SKILL.md @@ -0,0 +1,308 @@ +--- +name: add-diffusion-model +description: Add a new diffusion model (text-to-image, text-to-video, image-to-video, text-to-audio, image editing) to vLLM-Omni. Use when integrating a new diffusion model, porting a diffusers pipeline or a custom model repo to vllm-omni, creating a new DiT transformer adapter, or adding diffusion model support. +--- + +# Adding a Diffusion Model to vLLM-Omni + +## Overview + +This skill guides you through adding a new diffusion model to vLLM-Omni. The model may come from HuggingFace Diffusers (structured pipeline) or from a private/custom repo. The workflow differs significantly depending on the source. + +## Prerequisites + +Before starting, determine: + +1. **Model category**: Text-to-Image, Text-to-Video, Image-to-Video, Image Editing, Text-to-Audio, or Omni +2. **Reference source**: Diffusers pipeline, custom repo, or a combination +3. **Model HuggingFace ID** or local checkpoint path +4. **Architecture**: Scheduler, text encoder, VAE, transformer/backbone + +## Step 0: Classify the Migration Path + +Check the model's HF repo for `model_index.json`. This determines your path: + +| Scenario | How to identify | Migration path | +|----------|----------------|----------------| +| **Already supported** | `_class_name` in `model_index.json` matches a key in `_DIFFUSION_MODELS` in `registry.py` | Skip to Step 5 (test) and Step 7 (docs) | +| **Diffusers-based** | Has standard `model_index.json` with `_diffusers_version`, subfolders for `transformer/`, `vae/`, etc. 
| Follow **Path A** below | +| **Custom/private repo** | No diffusers `model_index.json`, weights in non-standard format, custom model code in a separate git repo | Follow **Path B** below | +| **Hybrid** | Has some diffusers components (VAE) but custom transformer/fusion | Mix of Path A and Path B | + +## Path A: Diffusers-Based Model + +For models with a standard diffusers layout. See [references/transformer-adaptation.md](references/transformer-adaptation.md) for detailed code patterns. + +### A1. Analyze `model_index.json` + +Identify components: `transformer`, `scheduler`, `vae`, `text_encoder`, `tokenizer`. + +### A2. Create model directory + +``` +vllm_omni/diffusion/models/your_model_name/ +├── __init__.py +├── pipeline_your_model.py +└── your_model_transformer.py +``` + +### A3. Adapt transformer + +1. Copy from diffusers source. Remove mixins (`ModelMixin`, `ConfigMixin`, `AttentionModuleMixin`). +2. Replace attention with `vllm_omni.diffusion.attention.layer.Attention` (QKV shape: `[B, seq, heads, head_dim]`). +3. Add `od_config: OmniDiffusionConfig | None = None` to `__init__`. +4. Add `load_weights()` method mapping diffusers weight names to vllm-omni names. +5. Add class attributes: `_repeated_blocks`, `_layerwise_offload_blocks_attr`. + +### A4. Adapt pipeline + +Inherit from `nn.Module`. 
The key contract: + +```python +class YourPipeline(nn.Module): + def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""): + # Load VAE, text encoder, tokenizer via from_pretrained() + # Instantiate transformer (weights loaded later via weights_sources) + self.weights_sources = [ + DiffusersPipelineLoader.ComponentSource( + model_or_path=od_config.model, subfolder="transformer", + prefix="transformer.", fall_back_to_pt=True)] + + def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput: + # Encode prompt → prepare latents → denoise loop → VAE decode + return DiffusionOutput(output=output) + + def load_weights(self, weights): + return AutoWeightsLoader(self).load_weights(weights) +``` + +Add post/pre-process functions in the same pipeline file. Register them in `registry.py`. + +### A5. Register, test, docs → continue at Step 4 below. + +--- + +## Path B: Custom/Private Repo Model + +For models without a diffusers pipeline — weights in custom format, model code in a private repo. Real examples: DreamID-Omni, BAGEL, HunyuanImage3. + +### B1. Understand the reference repo + +Study the original model's code to identify: +- **Model architecture files** (transformers, fusion modules, embeddings) +- **Weight format** (safetensors, `.pth`, custom checkpoint structure) +- **Weight loading helpers** (custom init functions, checkpoint loaders) +- **Pre/post-processing** (image/audio transforms, tokenization, VAE encode/decode) +- **External dependencies** (packages not on PyPI) +- **Config format** (JSON config files, hardcoded dicts) + +### B2. Decide what lives WHERE + +This is the key design decision for custom models. 
Follow these placement rules: + +| Code type | Where to place | Example | +|-----------|---------------|---------| +| **Pipeline orchestration** (init, forward, denoise loop) | `vllm_omni/diffusion/models//pipeline_.py` | Always required | +| **Custom transformer/backbone** (ported and adapted to vllm-omni) | `vllm_omni/diffusion/models//_transformer.py` or similar | `wan2_2.py`, `fusion.py`, `bagel_transformer.py` | +| **Custom sub-models** (VAE, fusion, autoencoder) | `vllm_omni/diffusion/models//` as separate files | `autoencoder.py`, `fusion.py` | +| **External dependency code** (original repo utilities) | **External repo**, installed via download script or pip | `dreamid_omni` package via git clone | +| **Hardcoded model configs** | Module-level dicts in pipeline file | `VIDEO_CONFIG`, `AUDIO_CONFIG` dicts | +| **Download/setup script** | `examples/offline_inference//download_.py` | `download_dreamid_omni.py` | +| **Custom `model_index.json`** | Generated by download script, placed at model root | Minimal: `{"_class_name": "YourPipeline", ...}` | + +### B3. Handle external dependencies + +If the model's code lives in a separate git repo: + +**Option 1: Import with an actionable error** (recommended for models with external utils) + +```python +try: + from external_model.utils import init_vae, load_checkpoint +except ImportError as exc: + # Re-raise with a hint, keeping the original traceback attached + raise ImportError( + "Failed to import from dependency 'external_model'. " + "Please run the download script first." + ) from exc +``` + +**Option 2: Port the code directly** (preferred when feasible) + +Copy the essential model files into `vllm_omni/diffusion/models//` and adapt them. This avoids external dependencies. BAGEL does this — `autoencoder.py` and `bagel_transformer.py` are ported directly. + +**Decision criteria**: Port if the code is self-contained and won't diverge. Use external deps if the model repo is actively maintained and the code is complex. + +### B4. 
Handle custom weight loading + +Custom models have two patterns for weight loading: + +**Pattern 1: Bypass standard loader** (DreamID-Omni style) + +When the original model has complex custom init functions that load weights in `__init__`: + +```python +class CustomPipeline(nn.Module): + def __init__(self, *, od_config, prefix=""): + super().__init__() + model = od_config.model + # Load everything eagerly in __init__ using custom helpers + self.vae = custom_init_vae(model, device=self.device) + self.text_encoder = custom_init_text_encoder(model, device=self.device) + self.transformer = CustomFusionModel(CONFIG) + load_custom_checkpoint(self.transformer, + checkpoint_path=os.path.join(model, "model.safetensors")) + # NO weights_sources defined — bypasses standard loader + + def load_weights(self, weights): + pass # No-op — all weights loaded in __init__ +``` + +**Pattern 2: Use standard loader with custom `load_weights`** (BAGEL style) + +When weights are in safetensors format but need name remapping: + +```python +class CustomPipeline(nn.Module): + def __init__(self, *, od_config, prefix=""): + super().__init__() + # Instantiate model architecture without weights + self.bagel = BagelModel(config) + self.vae = AutoEncoder(ae_params) + + # Point loader at the safetensors in the model root + self.weights_sources = [ + DiffusersPipelineLoader.ComponentSource( + model_or_path=od_config.model, + subfolder=None, # weights at root, not in subfolder + prefix="", + fall_back_to_pt=False, + ) + ] + + def load_weights(self, weights): + # Custom name remapping for non-diffusers weight names + params = dict(self.named_parameters()) + loaded = set() + for name, tensor in weights: + # Remap original weight names to vllm-omni module names + name = self._remap_weight_name(name) + if name in params: + default_weight_loader(params[name], tensor) + loaded.add(name) + return loaded +``` + +### B5. 
Create the `model_index.json` + +Custom models need a `model_index.json` at the model root for vllm-omni to discover them. For custom models, this is minimal: + +```json +{ + "_class_name": "YourModelPipeline", + "custom_key": "path/to/custom_weights.safetensors" +} +``` + +The `_class_name` must match a key in `_DIFFUSION_MODELS` in `registry.py`. Additional keys are model-specific (accessed via `od_config.model_config`). + +If the model's weights come from multiple HF repos, write a **download script** that: +1. Downloads from each repo +2. Assembles into a single directory +3. Generates `model_index.json` +4. Installs any external dependencies (git clone + `.pth` file) + +Place at: `examples/offline_inference//download_.py` + +### B6. Handle multi-modal inputs + +If the model accepts images, audio, or other multi-modal inputs, implement the protocol classes from `vllm_omni/diffusion/models/interface.py`: + +```python +from vllm_omni.diffusion.models.interface import SupportImageInput, SupportAudioInput + +class MyPipeline(nn.Module, SupportImageInput, SupportAudioInput): + # Protocol markers — the engine uses these to enable proper input routing + pass +``` + +Preprocessing for custom models is typically done **inside `forward()`** rather than via registered pre-process functions, since the logic is often tightly coupled to the model. + +### B7. Continue at Step 4 below. + +--- + +## Common Steps (Both Paths) + +### Step 4: Register Model in registry.py + +Edit `vllm_omni/diffusion/registry.py`: + +```python +_DIFFUSION_MODELS = { + "YourModelPipeline": ("your_model_name", "pipeline_your_model", "YourModelPipeline"), +} +_DIFFUSION_POST_PROCESS_FUNCS = { + "YourModelPipeline": "get_your_model_post_process_func", # if applicable +} +_DIFFUSION_PRE_PROCESS_FUNCS = { + "YourModelPipeline": "get_your_model_pre_process_func", # if applicable +} +``` + +The registry key is the `_class_name` from `model_index.json`. 
The tuple is `(folder_name, module_file, class_name)`. + +Create `__init__.py` exporting the pipeline class and any factory functions. + +### Step 5: Run, Test, Debug + +Use the appropriate existing example script: + +| Category | Script | +|----------|--------| +| Text-to-Image | `examples/offline_inference/text_to_image/text_to_image.py` | +| Text-to-Video | `examples/offline_inference/text_to_video/text_to_video.py` | +| Image-to-Video | `examples/offline_inference/image_to_video/image_to_video.py` | +| Image-to-Image | `examples/offline_inference/image_to_image/image_edit.py` | +| Text-to-Audio | `examples/offline_inference/text_to_audio/text_to_audio.py` | + +For custom/Omni models that don't fit these categories, create a dedicated example script. + +**Validation**: No errors, output is meaningful, quality matches the reference implementation. + +See [references/troubleshooting.md](references/troubleshooting.md) for common errors. + +### Step 6: Add Example Scripts + +For Omni or custom models, create: +- `examples/offline_inference/your_model_name/` — offline script + README +- `examples/online_serving/your_model_name/` — server script + client +- Download script if weights require assembly from multiple sources + +### Step 7: Update Documentation + +Required updates: +1. `docs/user_guide/diffusion/parallelism_acceleration.md` — parallelism support table +2. `docs/user_guide/diffusion/teacache.md` — if TeaCache supported +3. `docs/user_guide/diffusion/cache_dit_acceleration.md` — if Cache-DiT supported +4. `examples/offline_inference/xxx/README.md` — offline example docs +5. `examples/online_serving/xxx/README.md` — online serving docs + +### Step 8: Add E2E Tests (Recommended) + +Create `tests/e2e/online_serving/test_your_model_expansion.py`. + +## Iterative Development Tips + +1. **Start minimal**: Basic generation first, no parallelism/caching +2. **Use `--enforce-eager`**: Disable torch.compile during debugging +3. 
**Use small models**: Test with smaller variants first +4. **Check tensor shapes**: Most errors are reshape mismatches in attention +5. **Add parallelism incrementally**: TP → SP → CFG parallel +6. **For custom models**: Get the model running with the original code first, then progressively replace components with vllm-omni equivalents + +## Reference Files + +- [Transformer Adaptation](references/transformer-adaptation.md) — porting transformers from diffusers +- [Custom Model Patterns](references/custom-model-patterns.md) — patterns for non-diffusers models +- [Parallelism Patterns](references/parallelism-patterns.md) — TP, SP, CFG parallel +- [Troubleshooting](references/troubleshooting.md) — common errors and fixes diff --git a/skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md b/skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md new file mode 100644 index 0000000..2434e0b --- /dev/null +++ b/skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md @@ -0,0 +1,273 @@ +# Custom Model Patterns Reference + +Patterns for adding models that don't come from the standard diffusers pipeline format. + +## Directory Structure Comparison + +### Diffusers-based model (e.g., Wan2.2) + +``` +vllm_omni/diffusion/models/wan2_2/ +├── __init__.py # Exports pipeline + transformer + helpers +├── pipeline_wan2_2.py # Pipeline: loads components via from_pretrained() +├── pipeline_wan2_2_i2v.py # Variant pipeline for image-to-video +└── wan2_2_transformer.py # Transformer: ported from diffusers, uses Attention layer +``` + +The transformer is loaded separately via `weights_sources` + `load_weights()`. Non-transformer components (VAE, text encoder) are loaded in `__init__` via `from_pretrained()`. 
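
The two-phase flow described above (declare `weights_sources` in `__init__`, receive weights later through `load_weights()`) can be illustrated with a framework-free toy. This is only a sketch: `ToySource`, `ToyPipeline`, and `toy_load` are hypothetical names, plain floats stand in for tensors, and the way the prefix is applied is an assumption of the toy rather than a description of the real `DiffusersPipelineLoader`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Set, Tuple


@dataclass
class ToySource:
    """Hypothetical stand-in for one weights_sources entry."""
    subfolder: str
    prefix: str


class ToyPipeline:
    def __init__(self) -> None:
        # Phase 1: parameters exist but only hold placeholder values.
        self.params: Dict[str, float] = {
            "transformer.blocks.0.weight": 0.0,
            "transformer.proj_out.weight": 0.0,
        }
        # Declare where the loader should look; nothing is read yet.
        self.weights_sources: List[ToySource] = [
            ToySource(subfolder="transformer", prefix="transformer.")
        ]

    def load_weights(self, weights: Iterable[Tuple[str, float]]) -> Set[str]:
        # Phase 2: consume (name, value) pairs and report what was loaded.
        loaded: Set[str] = set()
        for name, value in weights:
            if name in self.params:
                self.params[name] = value
                loaded.add(name)
        return loaded


def toy_load(pipeline: ToyPipeline, checkpoint: Dict[str, Dict[str, float]]) -> Set[str]:
    """Toy loader: stream every declared source into pipeline.load_weights()."""

    def stream() -> Iterator[Tuple[str, float]]:
        for source in pipeline.weights_sources:
            for name, value in checkpoint[source.subfolder].items():
                # Assumption of this toy: the source prefix is prepended here.
                yield source.prefix + name, value

    return pipeline.load_weights(stream())


pipeline = ToyPipeline()
checkpoint = {"transformer": {"blocks.0.weight": 1.5, "proj_out.weight": 2.5}}
loaded_names = toy_load(pipeline, checkpoint)
```

Keeping construction and weight loading separate is what lets a loader decide later how to read, shard, or place the weights without the pipeline's `__init__` needing to know.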
+ +### Custom model with external deps (e.g., DreamID-Omni) + +``` +vllm_omni/diffusion/models/dreamid_omni/ +├── __init__.py # Exports pipeline only +├── pipeline_dreamid_omni.py # Pipeline: loads ALL weights in __init__ via custom helpers +├── fusion.py # Custom fusion architecture (video + audio cross-attention) +└── wan2_2.py # Re-implemented Wan backbone with split API + +examples/offline_inference/x_to_video_audio/ +└── download_dreamid_omni.py # Downloads weights from 3 HF repos + clones code repo +``` + +All weights loaded eagerly in `__init__`. `load_weights()` is a no-op. External dependency (`dreamid_omni` package) imported with try/except. + +### Custom model with ported code (e.g., BAGEL) + +``` +vllm_omni/diffusion/models/bagel/ +├── __init__.py +├── pipeline_bagel.py # Pipeline: instantiates models, uses weights_sources +├── bagel_transformer.py # Full LLM backbone (Qwen2-MoT) ported into vllm-omni +└── autoencoder.py # Custom VAE ported from original repo +``` + +Model code is fully ported (no external dependency). Uses `weights_sources` and `load_weights()` with custom name remapping to handle non-diffusers safetensors format. 
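
The name remapping that a BAGEL-style `load_weights()` performs can be sketched as a small pure function. Everything specific here is invented for illustration: the `REMAP_RULES` table, the `SKIP_PREFIXES` tuple, the `remap_weight_name` helper, and the checkpoint names are hypothetical and are not BAGEL's actual mapping. Floats stand in for tensors so the sketch runs without any framework.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical (old_prefix, new_prefix) rules; a real model defines its own.
REMAP_RULES: List[Tuple[str, str]] = [
    ("model.language_model.", "bagel.language_model."),
    ("model.vae.", "vae."),
]
# Hypothetical prefixes for checkpoint entries the pipeline does not use.
SKIP_PREFIXES: Tuple[str, ...] = ("optimizer.",)


def remap_weight_name(name: str) -> Optional[str]:
    """Return the target parameter name, or None if the entry is skipped."""
    if name.startswith(SKIP_PREFIXES):
        return None
    for old, new in REMAP_RULES:
        if name.startswith(old):
            return new + name[len(old):]
    return name  # already uses the target naming


def load_weights(params: Dict[str, float], weights: Iterable[Tuple[str, float]]) -> Set[str]:
    """Copy remapped weights into params and return the names that were set."""
    loaded: Set[str] = set()
    for name, value in weights:
        target = remap_weight_name(name)
        if target is not None and target in params:
            params[target] = value
            loaded.add(target)
    return loaded
```

Returning the set of loaded names makes it easy to compare against the expected parameter names and spot weights that were silently dropped by a wrong rule.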
+ +## Weight Loading Patterns + +### Pattern 1: Standard diffusers flow (Wan2.2, Z-Image, FLUX) + +``` +init → create transformer (empty) → set weights_sources → [loader calls load_weights()] +``` + +- `weights_sources` points to safetensors in HF subfolder (e.g., `transformer/`) +- `load_weights()` receives `(name, tensor)` pairs from the loader +- Name remapping handles diffusers→vllm-omni differences (QKV fusion, Sequential index removal) + +### Pattern 2: Custom safetensors at root (BAGEL) + +``` +init → create all models (empty) → set weights_sources(subfolder=None) → [loader calls load_weights()] +``` + +- `weights_sources` points to **root** of model directory, not a subfolder +- Weights have non-diffusers names (e.g., `bagel.language_model.model.layers.0.self_attn.q_proj.weight`) +- `load_weights()` does heavy name normalization + +```python +self.weights_sources = [ + DiffusersPipelineLoader.ComponentSource( + model_or_path=od_config.model, + subfolder=None, # root directory + prefix="", # no prefix stripping + fall_back_to_pt=False, + ) +] +``` + +### Pattern 3: Fully custom loading (DreamID-Omni) + +``` +init → load ALL weights eagerly via custom helpers → load_weights() = no-op +``` + +- No `weights_sources` attribute — standard loader finds nothing to iterate +- Custom init functions (e.g., `init_wan_vae_2_2()`, `load_fusion_checkpoint()`) handle downloading and loading +- `load_weights()` is `pass` +- Weights may come from multiple HF repos in different formats (`.pth`, `.safetensors`) + +Use this when: +- The original model has complex, well-tested loading code you don't want to rewrite +- Weights span multiple HF repos +- Weight format is non-standard (e.g., a single `.pth` file, not sharded safetensors) + +## model_index.json for Custom Models + +Standard diffusers `model_index.json`: +```json +{ + "_class_name": "WanPipeline", + "_diffusers_version": "0.35.0.dev0", + "scheduler": ["diffusers", "UniPCMultistepScheduler"], + "transformer": 
["diffusers", "WanTransformer3DModel"], + "vae": ["diffusers", "AutoencoderKLWan"] +} +``` + +Custom model `model_index.json` (minimal): +```json +{ + "_class_name": "DreamIDOmniPipeline", + "fusion": "DreamID-Omni/dreamid_omni.safetensors" +} +``` + +The only **required** field is `_class_name` — it must match a key in `_DIFFUSION_MODELS` in `registry.py`. Other fields are model-specific and accessible via `od_config.model_config` dict. + +## External Dependency Management + +### Git clone + .pth injection (DreamID-Omni pattern) + +```python +def download_dependency(): + CACHE_DIR.mkdir(parents=True, exist_ok=True) + with open(LOCK_FILE, "w") as f: + fcntl.flock(f, fcntl.LOCK_EX) + if not DEPENDENCY_DIR.exists(): + subprocess.run([ + "git", "clone", "--depth", "1", + REPO_URL, "--branch", BRANCH, + str(DEPENDENCY_DIR) + ], check=True) + fcntl.flock(f, fcntl.LOCK_UN) + + # Add to Python path via .pth file + site_packages = Path(site.getsitepackages()[0]) + pth_file = site_packages / "vllm_omni_dependency.pth" + pth_file.write_text(str(DEPENDENCY_DIR)) +``` + +### Direct port (BAGEL pattern) + +Copy essential files from the original repo into `vllm_omni/diffusion/models//`. Adapt imports to use vllm-omni utilities. Benefits: no external dependency, no git clone step. Drawback: must maintain the ported code. + +## Multi-Modal Input/Output Protocols + +Custom models that handle images, audio, or video I/O should implement protocol classes: + +```python +from vllm_omni.diffusion.models.interface import ( + SupportImageInput, # Model accepts image input + SupportAudioInput, # Model accepts audio input + SupportAudioOutput, # Model produces audio output +) + +class MyPipeline(nn.Module, SupportImageInput, SupportAudioInput, SupportAudioOutput): + pass # Protocol markers enable proper engine routing +``` + +The engine checks `isinstance(pipeline, SupportImageInput)` at startup to configure input validation and warmup behavior. 
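
As a rough illustration of how marker protocols and an `isinstance` check can drive routing, here is a standalone sketch using `typing.Protocol`. It is an assumption that the real classes in `interface.py` are built this way; `ImageInputMarker`, `AudioOutputMarker`, the flag attributes, and `describe_capabilities` are hypothetical names used only for this example.

```python
from typing import ClassVar, List, Protocol, runtime_checkable


@runtime_checkable
class ImageInputMarker(Protocol):
    # A class-level flag gives isinstance() an attribute to look for.
    support_image_input: ClassVar[bool] = True


@runtime_checkable
class AudioOutputMarker(Protocol):
    support_audio_output: ClassVar[bool] = True


class TextOnlyPipeline:
    """Declares no markers, so no capability is detected."""


class ImageToAudioPipeline(ImageInputMarker, AudioOutputMarker):
    """Inherits both markers and therefore carries both flags."""


def describe_capabilities(pipeline: object) -> List[str]:
    """Hypothetical startup check: collect capabilities from marker protocols."""
    capabilities: List[str] = []
    if isinstance(pipeline, ImageInputMarker):
        capabilities.append("image_input")
    if isinstance(pipeline, AudioOutputMarker):
        capabilities.append("audio_output")
    return capabilities
```

The flag attribute matters in this sketch: a runtime-checkable protocol with no members at all would match every object, so an empty marker could not distinguish pipelines.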
+ +## Hardcoded Config vs Config Files + +Diffusers models use `config.json` in each subfolder. Custom models often use: + +**Module-level config dicts** (DreamID-Omni): +```python +VIDEO_CONFIG = { + "patch_size": [1, 2, 2], "model_type": "ti2v", + "dim": 3072, "ffn_dim": 14336, "num_heads": 24, "num_layers": 30, ... +} +``` + +**Loaded from custom JSON** (BAGEL): +```python +cfg_path = os.path.join(model_path, "config.json") +with open(cfg_path) as f: + bagel_cfg = json.load(f) +vae_cfg = bagel_cfg.get("vae_config", {}) +``` + +## Custom Architecture Patterns + +### Split forward API (DreamID-Omni) + +When a fusion model needs to interleave blocks from two backbones: + +```python +class WanModel(nn.Module): + def prepare_transformer_block_kwargs(self, x, t, context, ...): + # Patch embed, time embed, text embed, RoPE + return x, e, kwargs + + def post_transformer_block_out(self, x, grid_sizes, e): + # Output projection, unpatchify + return output + + def forward(self, *args, **kwargs): + raise NotImplementedError # Fusion model handles block iteration +``` + +The `FusionModel` then iterates blocks in lock-step: +```python +for video_block, audio_block in zip(self.video_model.blocks, self.audio_model.blocks): + video_out = video_block(video_hidden, ...) + audio_out = audio_block(audio_hidden, ...) + # Cross-attend between modalities + video_out = cross_attention(video_out, audio_out) + audio_out = cross_attention(audio_out, video_out) +``` + +### LLM-as-denoiser (BAGEL) + +When the backbone is a language model that also does diffusion: + +```python +class BagelModel(nn.Module): + def __init__(self): + self.language_model = Qwen2MoTForCausalLM(config) + self.vit_model = SiglipVisionModel(vit_config) +``` + +The LLM processes both text tokens and latent image tokens in a single forward pass, using KV caching for the text portion. 
+ +## Pre/Post Processing for Custom Models + +Custom models typically handle pre/post processing **inside `forward()`** rather than via registered functions, because the logic is tightly coupled: + +```python +def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput: + # Inline preprocessing + image = self._load_and_resize_image(req.prompts[0].get("multi_modal_data", {}).get("image")) + image_latent = self._vae_encode(image) + + # ... denoising loop ... + + # Inline postprocessing + pil_image = self._decode_to_pil(latents) + return DiffusionOutput(output=[pil_image]) +``` + +If pre/post functions are not registered in `_DIFFUSION_PRE_PROCESS_FUNCS` / `_DIFFUSION_POST_PROCESS_FUNCS`, the engine simply skips those steps. + +## Download Script Template + +```python +# examples/offline_inference//download_.py +from huggingface_hub import snapshot_download +import json, os + +def main(output_dir): + # Download model weights from HF + snapshot_download(repo_id="org/model-weights", local_dir=os.path.join(output_dir, "weights")) + + # Download additional components if from separate repos + snapshot_download(repo_id="org/vae-weights", local_dir=os.path.join(output_dir, "vae"), + allow_patterns=["*.safetensors"]) + + # Generate model_index.json + config = {"_class_name": "YourPipeline", "custom_key": "weights/model.safetensors"} + with open(os.path.join(output_dir, "model_index.json"), "w") as f: + json.dump(config, f, indent=2) + + # Install external code dependency (if needed) + download_dependency() + +if __name__ == "__main__": + import argparse + parser = argparse.ArgumentParser() + parser.add_argument("--output-dir", default="./your_model") + args = parser.parse_args() + main(args.output_dir) +``` diff --git a/skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md b/skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md new file mode 100644 index 0000000..c6f32d1 --- /dev/null +++ 
b/skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md @@ -0,0 +1,114 @@ +# Parallelism Patterns Reference + +## Tensor Parallelism (TP) + +Replace standard `nn.Linear` with vLLM's parallel linear layers: + +| Pattern | vLLM Layer | When to Use | +|---------|-----------|-------------| +| Fan-out (first in FFN) | `ColumnParallelLinear` | Projection that splits output across ranks | +| Fan-in (second in FFN) | `RowParallelLinear` | Projection that gathers across ranks | +| QKV projection | `QKVParallelLinear` | Fused Q/K/V for self-attention | +| Single Q or K or V | `ColumnParallelLinear` | Separate projections (cross-attention) | +| Attention output | `RowParallelLinear` | Output projection after attention | + +```python +from vllm.model_executor.layers.linear import ( + ColumnParallelLinear, + RowParallelLinear, + QKVParallelLinear, +) + +class TPFeedForward(nn.Module): + def __init__(self, dim, ffn_dim): + super().__init__() + self.fc1 = ColumnParallelLinear(dim, ffn_dim) + self.fc2 = RowParallelLinear(ffn_dim, dim) + + def forward(self, x): + x, _ = self.fc1(x) + x = torch.nn.functional.gelu(x) + x, _ = self.fc2(x) + return x +``` + +**TP constraints**: `hidden_dim`, `num_heads`, and `num_kv_heads` must be divisible by `tp_size`. + +### RMSNorm with TP + +When RMSNorm sits between TP-sharded dimensions, use `DistributedRMSNorm` from the Wan2.2 implementation pattern — it computes global RMS via all-reduce across TP ranks. 
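
To see why an all-reduce is needed, consider the arithmetic on its own. The sketch below simulates TP ranks with plain Python lists: each rank holds a slice of the hidden dimension, computes a local sum of squares, and the sums are combined before normalizing. The helper names (`rms_norm`, `simulated_all_reduce_sum`, `sharded_rms_norm`) are hypothetical, and whether the actual `DistributedRMSNorm` follows exactly these steps is an assumption.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Reference RMSNorm over the full (unsharded) hidden dimension."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def simulated_all_reduce_sum(per_rank_values: List[float]) -> float:
    """Stand-in for an all-reduce (sum): every rank would receive this total."""
    return sum(per_rank_values)


def sharded_rms_norm(
    x_shards: List[List[float]], w_shards: List[List[float]], eps: float = 1e-6
) -> List[List[float]]:
    """RMSNorm where each rank only holds a slice of the hidden dimension."""
    # Each rank can only compute the sum of squares over its own slice.
    local_sq = [sum(v * v for v in shard) for shard in x_shards]
    # Combining the partial sums recovers the statistic of the full vector.
    global_sq = simulated_all_reduce_sum(local_sq)
    global_dim = sum(len(shard) for shard in x_shards)
    scale = 1.0 / math.sqrt(global_sq / global_dim + eps)
    return [
        [v * scale * w for v, w in zip(xs, ws)]
        for xs, ws in zip(x_shards, w_shards)
    ]
```

Normalizing each shard with only its local mean would give a different scale per rank, which is why the partial sums have to be combined first.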
+ +## CFG Parallelism + +Inherit `CFGParallelMixin` in your pipeline and implement `predict_noise()`: + +```python +from vllm_omni.diffusion.distributed.cfg_parallel.cfg_parallel import CFGParallelMixin + +class MyPipeline(nn.Module, CFGParallelMixin): + def predict_noise(self, model, latent_model_input, t, prompt_embeds, **kwargs): + return model(latent_model_input, t, prompt_embeds, **kwargs) + + def forward(self, req): + # In the denoising loop: + noise_pred = self.predict_noise_maybe_with_cfg( + model=self.transformer, + sample=latents, + timestep=t, + prompt_embeds=prompt_embeds, + guidance_scale=guidance_scale, + do_cfg=guidance_scale > 1.0, + ) + latents = self.scheduler_step_maybe_with_cfg( + self.scheduler, noise_pred, t, latents + ) +``` + +## Sequence Parallelism (SP) + +SP is applied non-intrusively via the `_sp_plan` dict on the transformer class. The framework applies hooks at module boundaries to shard/gather sequences. + +```python +from vllm_omni.diffusion.distributed.sp_plan import ( + SequenceParallelInput, + SequenceParallelOutput, +) + +class MyTransformer(nn.Module): + _sp_plan = { + # Split hidden_states input on dim=1 before first block + "blocks.0": SequenceParallelInput(split_dim=1), + # Gather output on dim=1 after final projection + "proj_out": SequenceParallelOutput(gather_dim=1), + } +``` + +For RoPE that needs splitting, add an entry for the RoPE module: + +```python +_sp_plan = { + "rope": SequenceParallelInput(split_dim=1, split_output=True, auto_pad=True), + "blocks.0": SequenceParallelInput(split_dim=1), + "proj_out": SequenceParallelOutput(gather_dim=1), +} +``` + +The `auto_pad=True` flag handles variable sequence lengths by padding to be divisible by SP degree and creating attention masks accordingly. + +## VAE Patch Parallelism + +If using `DistributedAutoencoderKLWan` or similar distributed VAE, the framework handles spatial sharding automatically. Set `vae_patch_parallel_size` in the parallel config. 
+ +## HSDP (Hybrid Sharded Data Parallel) + +HSDP uses PyTorch FSDP2 to shard transformer weights. No code changes are needed in the model; the loader handles it. Set `use_hsdp=True` in `DiffusionParallelConfig`. + +## Adding Parallelism Incrementally + +Recommended order: +1. **Basic single-GPU**: Get generation working first +2. **Tensor Parallelism**: Replace Linear layers, update `load_weights` for QKV fusion +3. **CFG Parallel**: Add `CFGParallelMixin`, implement `predict_noise` +4. **Sequence Parallelism**: Add `_sp_plan` to transformer +5. **HSDP**: Usually works out of the box once TP is done +6. **VAE Patch Parallel**: Switch to distributed VAE class diff --git a/skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md b/skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md new file mode 100644 index 0000000..6e344b6 --- /dev/null +++ b/skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md @@ -0,0 +1,218 @@ +# Transformer Adaptation Reference + +## Adapting a Diffusers Transformer to vLLM-Omni + +### Step-by-step Checklist + +1. Copy the transformer class from diffusers source +2. Remove all mixin classes — inherit only from `nn.Module` +3. Replace attention dispatch with `vllm_omni.diffusion.attention.layer.Attention` +4. Replace logger with `vllm.logger.init_logger` +5. Add `od_config: OmniDiffusionConfig | None = None` to `__init__` +6. Remove training-only code (gradient checkpointing, dropout) +7. Add `load_weights()` method for weight loading from safetensors +8. 
Add class-level attributes for acceleration features + +### Mixin Removal + +Remove these diffusers mixins (and their imports): + +```python +# Remove all of these: +from diffusers.models.modeling_utils import ModelMixin +from diffusers.configuration_utils import ConfigMixin, register_to_config +from diffusers.models.attention_processor import AttentionModuleMixin +from diffusers.loaders import PeftAdapterMixin, FromOriginalModelMixin + +# Replace: +class MyTransformer(ModelMixin, ConfigMixin, AttentionModuleMixin): +# With: +class MyTransformer(nn.Module): +``` + +Also remove `@register_to_config` decorators from `__init__`. + +### Attention Replacement + +The vLLM-Omni `Attention` layer wraps backend selection (FlashAttention, SDPA, SageAttn, etc.) and supports sequence parallelism hooks. + +**QKV tensor shape must be `[batch, seq_len, num_heads, head_dim]`.** + +#### Self-Attention Pattern + +```python +from vllm_omni.diffusion.attention.layer import Attention +from vllm_omni.diffusion.attention.backends.abstract import AttentionMetadata + +class SelfAttentionBlock(nn.Module): + def __init__(self, dim, num_heads): + super().__init__() + self.num_heads = num_heads + self.head_dim = dim // num_heads + + self.to_q = nn.Linear(dim, dim) + self.to_k = nn.Linear(dim, dim) + self.to_v = nn.Linear(dim, dim) + self.to_out = nn.Linear(dim, dim) + + self.attn = Attention( + num_heads=num_heads, + head_size=self.head_dim, + softmax_scale=1.0 / (self.head_dim ** 0.5), + causal=False, + num_kv_heads=num_heads, + ) + + def forward(self, x, attn_mask=None): + B, S, _ = x.shape + q = self.to_q(x).view(B, S, self.num_heads, self.head_dim) + k = self.to_k(x).view(B, S, self.num_heads, self.head_dim) + v = self.to_v(x).view(B, S, self.num_heads, self.head_dim) + + attn_metadata = AttentionMetadata(attn_mask=attn_mask) + out = self.attn(q, k, v, attn_metadata=attn_metadata) + out = out.reshape(B, S, -1) + return self.to_out(out) +``` + +#### Fused QKV with TP (Advanced) + +For 
tensor parallelism, use vLLM's parallel linear layers: + +```python +from vllm.model_executor.layers.linear import ( + QKVParallelLinear, RowParallelLinear +) + +class TPSelfAttention(nn.Module): + def __init__(self, dim, num_heads): + super().__init__() + self.num_heads = num_heads + self.head_dim = dim // num_heads + + self.to_qkv = QKVParallelLinear( + hidden_size=dim, + head_size=self.head_dim, + total_num_heads=num_heads, + total_num_kv_heads=num_heads, + ) + self.to_out = RowParallelLinear(dim, dim) + + self.attn = Attention( + num_heads=num_heads, + head_size=self.head_dim, + softmax_scale=1.0 / (self.head_dim ** 0.5), + causal=False, + num_kv_heads=num_heads, + ) +``` + +### Logger Replacement + +```python +# Replace: +from diffusers.utils import logging +logger = logging.get_logger(__name__) + +# With: +from vllm.logger import init_logger +logger = init_logger(__name__) +``` + +### Custom Layers from vLLM-Omni + +Available utility layers: + +```python +from vllm.model_executor.layers.layernorm import RMSNorm +from vllm_omni.diffusion.layers.rope import RotaryEmbedding +from vllm_omni.diffusion.layers.adalayernorm import AdaLayerNorm +``` + +### Config Support + +```python +from vllm_omni.diffusion.data import OmniDiffusionConfig + +class MyTransformer(nn.Module): + def __init__(self, *, od_config=None, num_layers=28, hidden_size=3072, **kwargs): + super().__init__() + self.od_config = od_config + self.parallel_config = od_config.parallel_config if od_config else None + # ... build layers +``` + +The transformer config values come from `model_index.json` → `config.json` in the transformer subfolder. The pipeline uses `get_transformer_config_kwargs(od_config.tf_model_config, TransformerClass)` to filter config keys to match the `__init__` signature. + +### Weight Loading + +The `load_weights` method receives an iterable of `(name, tensor)` from safetensors files, with the prefix (e.g., `"transformer."`) already stripped by the loader. 
+ +```python +from vllm.model_executor.model_loader.weight_utils import default_weight_loader + +class MyTransformer(nn.Module): + def load_weights(self, weights): + params = dict(self.named_parameters()) + loaded = set() + for name, tensor in weights: + # Optional: remap names from diffusers to vllm-omni naming + # e.g., "ff.net.0.proj" -> "ff.net_0.proj" + + if name in params: + param = params[name] + if hasattr(param, "weight_loader"): + param.weight_loader(param, tensor) + else: + default_weight_loader(param, tensor) + loaded.add(name) + return loaded +``` + +#### QKV Fusion in load_weights + +If you fused separate Q/K/V into a `QKVParallelLinear`, you need to map diffusers' separate weight names: + +```python +stacked_params_mapping = [ + ("to_qkv", "to_q", "q"), + ("to_qkv", "to_k", "k"), + ("to_qkv", "to_v", "v"), +] + +def load_weights(self, weights): + params = dict(self.named_parameters()) + loaded = set() + for name, tensor in weights: + for fused_name, orig_name, shard_id in stacked_params_mapping: + if orig_name in name: + name = name.replace(orig_name, fused_name) + param = params[name] + param.weight_loader(param, tensor, shard_id) + loaded.add(name) + break + else: + # Normal loading + ... 
+ return loaded +``` + +### Class-Level Attributes for Features + +```python +class MyTransformer(nn.Module): + # torch.compile: list block class names that repeat and can be compiled + _repeated_blocks = ["MyTransformerBlock"] + + # CPU offload: attribute name of the nn.ModuleList containing blocks + _layerwise_offload_blocks_attr = "blocks" + + # LoRA: mapping of fused param names to original param names + packed_modules_mapping = {"to_qkv": ["to_q", "to_k", "to_v"]} + + # Sequence parallelism plan (advanced — add after basic impl works) + _sp_plan = { + "blocks.0": SequenceParallelInput(split_dim=1), + "proj_out": SequenceParallelOutput(gather_dim=1), + } +``` diff --git a/skills/vllm-omni-add-diffusion-model/references/troubleshooting.md b/skills/vllm-omni-add-diffusion-model/references/troubleshooting.md new file mode 100644 index 0000000..4c63cdf --- /dev/null +++ b/skills/vllm-omni-add-diffusion-model/references/troubleshooting.md @@ -0,0 +1,103 @@ +# Troubleshooting Reference + +## Common Errors When Adding a Diffusion Model + +### ImportError / ModuleNotFoundError + +**Cause**: Missing or incorrect registration. + +**Fix checklist**: +1. Model registered in `vllm_omni/diffusion/registry.py` `_DIFFUSION_MODELS` dict +2. `__init__.py` exports the pipeline class +3. Pipeline file exists at the correct path: `vllm_omni/diffusion/models/{folder}/{file}.py` +4. Class name in registry matches the actual class name in the file + +### Shape Mismatch in Attention + +**Symptom**: `RuntimeError: shape mismatch` or `expected 4D tensor` + +**Cause**: QKV tensors not reshaped to `[batch, seq_len, num_heads, head_dim]`. 
+ +**Fix**: Before calling `self.attn(q, k, v, ...)`, reshape Q, K, and V to 4D: +```python +q = q.view(batch, seq_len, self.num_heads, self.head_dim) +k = k.view(batch, kv_seq_len, self.num_kv_heads, self.head_dim) +v = v.view(batch, kv_seq_len, self.num_kv_heads, self.head_dim) +``` + +After attention, reshape back: +```python +out = out.reshape(batch, seq_len, -1) +``` + +### Weight Loading Failures + +**Symptom**: `RuntimeError: size mismatch for parameter ...` or missing keys + +**Debugging**: +1. Print diffusers weight names: `safetensors.safe_open(path, "pt").keys()` +2. Print model parameter names: `dict(model.named_parameters()).keys()` +3. Compare the two lists and add name remappings in `load_weights()` + +**Common remappings needed**: +- `ff.net.0.proj` → `ff.net_0.proj` (PyTorch Sequential indexing) +- `.to_out.0.` → `.to_out.` (Sequential unwrapping) +- `scale_shift_table` → moved to a wrapper module + +### Black/Blank/Noisy Output + +**Possible causes**: +1. **Wrong latent normalization**: Check whether the VAE expects latents scaled by `vae.config.scaling_factor` +2. **Wrong scheduler**: Using the wrong scheduler class or the wrong `flow_shift` +3. **Missing CFG**: Some models require `guidance_scale > 1.0` with a negative prompt +4. **Wrong timestep format**: Some schedulers expect float, others expect int/long +5. **Missing post-processing**: Raw VAE output may need denormalization + +**Quick test**: Run with diffusers directly using the same seed and compare latents at each step. + +### OOM (Out of Memory) + +**Solutions** (in order of preference): +1. `--enforce-eager` to disable torch.compile (saves compile memory) +2. `--enable-cpu-offload` for model-level offload +3. `--enable-layerwise-offload` for block-level offload (better for large models) +4. `--vae-use-slicing --vae-use-tiling` for VAE memory reduction +5. Reduce resolution: `--height 480 --width 832` +6. Use TP: `--tensor-parallel-size 2` + +### Different Output vs Diffusers Reference + +**Common causes**: +1. 
**Attention backend difference**: FlashAttention vs SDPA may produce slightly different results. Set `DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA` to match diffusers. +2. **Float precision**: vLLM-Omni may use bfloat16 where diffusers uses float32 for some operations +3. **Missing normalization**: Check that all LayerNorm/RMSNorm layers are preserved +4. **Scheduler rounding**: Some schedulers are numerically sensitive to rounding + +### Tensor Parallel Errors + +**Symptom**: `AssertionError: not divisible` or incorrect output with TP>1 + +**Fix**: +1. Verify `hidden_dim % tp_size == 0` and `num_heads % tp_size == 0` +2. Ensure `ColumnParallelLinear` / `RowParallelLinear` are used correctly +3. Check that norms between parallel layers use a distributed norm if needed +4. Verify `load_weights` handles TP sharding for norm weights + +### Model Not Detected / Wrong Pipeline Class + +**Symptom**: `ValueError: Model class ... not found in diffusion model registry` + +**Cause**: The model's `model_index.json` has a `_class_name` for the pipeline that doesn't match any registry key. + +**Fix**: The registry key must match the diffusers pipeline class name from `model_index.json`. If your vllm-omni class uses a different name, map it in the registry: +```python +"DiffusersPipelineClassName": ("your_folder", "your_file", "YourVllmClassName"), +``` + +## Debugging Workflow + +1. **Add verbose logging**: Use `logger.info()` to print tensor shapes at each stage +2. **Compare step-by-step**: Run diffusers and vllm-omni side by side, comparing tensors after each major operation +3. **Use small configs**: Set `num_inference_steps=2` and use a small resolution for fast iteration +4. **Test the transformer in isolation**: Feed the same input to both the diffusers and vllm-omni transformers and compare outputs +5. **Binary search for bugs**: Comment out blocks/layers to isolate where divergence starts
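
The step-by-step comparison in this workflow can be sketched as a small helper that reports the first stage where two runs diverge. This is a minimal illustration, not part of vLLM-Omni: `max_abs_diff`, `first_divergence`, the stage names, and the tolerance are all hypothetical, and it assumes each captured tensor has already been flattened to a list of floats (for example with `tensor.flatten().tolist()`). It scans stages in order rather than doing a true binary search.

```python
from typing import Dict, Optional, Sequence, Tuple


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergence(
    reference: Dict[str, Sequence[float]],
    candidate: Dict[str, Sequence[float]],
    atol: float = 1e-3,
) -> Optional[Tuple[str, float]]:
    """Return (stage_name, diff) for the first stage whose diff exceeds atol.

    Stages are visited in the insertion order of `reference`, so record them
    in execution order. Returns None when every stage is within tolerance.
    """
    for name, ref_values in reference.items():
        if name not in candidate:
            raise KeyError(f"stage {name!r} missing from candidate run")
        diff = max_abs_diff(ref_values, candidate[name])
        if diff > atol:
            return name, diff
    return None


# Hypothetical captured activations from a reference run and a ported run.
reference = {
    "patch_embed": [0.10, 0.20],
    "block_0": [0.30, 0.40],
    "block_1": [0.50, 0.60],
}
candidate = {
    "patch_embed": [0.10, 0.20],
    "block_0": [0.30, 0.45],
    "block_1": [0.90, 0.60],
}

result = first_divergence(reference, candidate)
if result is None:
    print("all stages match within tolerance")
else:
    stage, diff = result
    print(f"first divergence at {stage}: max abs diff {diff:.4f}")
```

Reporting only the earliest mismatch keeps attention on the root cause, since every later stage inherits the error. A looser `atol` is usually appropriate when the two runs use different precisions or attention backends.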