Merged
308 changes: 308 additions & 0 deletions skills/vllm-omni-add-diffusion-model/SKILL.md
@@ -0,0 +1,308 @@
---
name: add-diffusion-model
description: Add a new diffusion model (text-to-image, text-to-video, image-to-video, text-to-audio, image editing) to vLLM-Omni. Use when integrating a new diffusion model, porting a diffusers pipeline or a custom model repo to vllm-omni, creating a new DiT transformer adapter, or adding diffusion model support.
---

# Adding a Diffusion Model to vLLM-Omni

## Overview

This skill guides you through adding a new diffusion model to vLLM-Omni. The model may come from HuggingFace Diffusers (structured pipeline) or from a private/custom repo. The workflow differs significantly depending on the source.

## Prerequisites

Before starting, determine:

1. **Model category**: Text-to-Image, Text-to-Video, Image-to-Video, Image Editing, Text-to-Audio, or Omni
2. **Reference source**: Diffusers pipeline, custom repo, or a combination
3. **Model HuggingFace ID** or local checkpoint path
4. **Architecture**: Scheduler, text encoder, VAE, transformer/backbone

## Step 0: Classify the Migration Path

Check the model's HF repo for `model_index.json`. This determines your path:

| Scenario | How to identify | Migration path |
|----------|----------------|----------------|
| **Already supported** | `_class_name` in `model_index.json` matches a key in `_DIFFUSION_MODELS` in `registry.py` | Skip to Step 5 (test) and Step 7 (docs) |
| **Diffusers-based** | Has standard `model_index.json` with `_diffusers_version`, subfolders for `transformer/`, `vae/`, etc. | Follow **Path A** below |
| **Custom/private repo** | No diffusers `model_index.json`, weights in non-standard format, custom model code in a separate git repo | Follow **Path B** below |
| **Hybrid** | Has some diffusers components (VAE) but custom transformer/fusion | Mix of Path A and Path B |
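
The checks in the table above can be sketched for a local checkpoint directory as follows. This is a minimal illustration, not part of vLLM-Omni: `classify_migration_path` and its return labels are hypothetical, and `registered_names` stands in for the keys of `_DIFFUSION_MODELS`, passed in by the caller. The hybrid case is not detected automatically because it requires inspecting the individual components.

```python
import json
from pathlib import Path


def classify_migration_path(model_dir, registered_names):
    """Return 'already-supported', 'diffusers', or 'custom' for a local checkpoint.

    Hypothetical helper: `registered_names` is assumed to hold the keys of
    `_DIFFUSION_MODELS`; it is passed in so the sketch has no vllm-omni import.
    """
    index_file = Path(model_dir) / "model_index.json"
    if not index_file.is_file():
        return "custom"  # no model_index.json at all -> Path B

    index = json.loads(index_file.read_text())
    if index.get("_class_name") in registered_names:
        return "already-supported"  # skip ahead to testing and docs
    if "_diffusers_version" in index:
        return "diffusers"  # standard diffusers layout -> Path A
    return "custom"  # model_index.json exists but is not a diffusers one
```

A result of `"diffusers"` still deserves a manual look at the subfolders, since a hybrid model may ship a standard VAE next to a custom transformer.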

## Path A: Diffusers-Based Model

For models with a standard diffusers layout. See [references/transformer-adaptation.md](references/transformer-adaptation.md) for detailed code patterns.

### A1. Analyze `model_index.json`

Identify components: `transformer`, `scheduler`, `vae`, `text_encoder`, `tokenizer`.

### A2. Create model directory

```
vllm_omni/diffusion/models/your_model_name/
├── __init__.py
├── pipeline_your_model.py
└── your_model_transformer.py
```
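
The layout above could be scaffolded with a small script like the one below. It is only a convenience sketch: `scaffold_model_dir` is a hypothetical helper, and the two name arguments mirror the `your_model_name` directory and the shorter `your_model` file stem used in the tree.

```python
from pathlib import Path


def scaffold_model_dir(repo_root, dir_name, short_name):
    """Create the empty files for a new model package and return their paths.

    Hypothetical helper: `dir_name` is the package directory and `short_name`
    is the stem used in the pipeline and transformer file names.
    """
    model_dir = Path(repo_root) / "vllm_omni" / "diffusion" / "models" / dir_name
    model_dir.mkdir(parents=True, exist_ok=True)

    filenames = [
        "__init__.py",
        f"pipeline_{short_name}.py",
        f"{short_name}_transformer.py",
    ]
    paths = []
    for filename in filenames:
        path = model_dir / filename
        path.touch(exist_ok=True)  # existing files are left untouched
        paths.append(path)
    return paths
```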

### A3. Adapt transformer

1. Copy from diffusers source. Remove mixins (`ModelMixin`, `ConfigMixin`, `AttentionModuleMixin`).
2. Replace attention with `vllm_omni.diffusion.attention.layer.Attention` (QKV shape: `[B, seq, heads, head_dim]`).
3. Add `od_config: OmniDiffusionConfig | None = None` to `__init__`.
4. Add `load_weights()` method mapping diffusers weight names to vllm-omni names.
5. Add class attributes: `_repeated_blocks`, `_layerwise_offload_blocks_attr`.
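
For step 2, the layout change is worth checking on paper before touching tensors. Many diffusers attention processors transpose projections to `[B, heads, seq, head_dim]` before scaled-dot-product attention, whereas the layer above expects `[B, seq, heads, head_dim]`. The shape-only sketch below assumes the ported model follows that diffusers convention; both helper names are hypothetical and operate on plain tuples, not tensors.

```python
def split_heads(hidden_shape, num_heads):
    """Shape after splitting a projection [B, seq, heads * head_dim] into heads."""
    batch, seq_len, inner_dim = hidden_shape
    if inner_dim % num_heads != 0:
        raise ValueError(
            f"inner dim {inner_dim} is not divisible by num_heads {num_heads}"
        )
    return (batch, seq_len, num_heads, inner_dim // num_heads)


def sdpa_to_bshd(shape):
    """Map a [B, heads, seq, head_dim] shape to [B, seq, heads, head_dim]."""
    batch, num_heads, seq_len, head_dim = shape
    return (batch, seq_len, num_heads, head_dim)
```

In practice this usually means dropping the `transpose(1, 2)` that followed the head split in the copied code, rather than adding a second transpose.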

### A4. Adapt pipeline

Inherit from `nn.Module`. The key contract:

```python
class YourPipeline(nn.Module):
    def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
        super().__init__()  # required before assigning submodules
        # Load VAE, text encoder, tokenizer via from_pretrained()
        # Instantiate transformer (weights loaded later via weights_sources)
        self.weights_sources = [
            DiffusersPipelineLoader.ComponentSource(
                model_or_path=od_config.model,
                subfolder="transformer",
                prefix="transformer.",
                fall_back_to_pt=True,
            )
        ]

    def forward(self, req: OmniDiffusionRequest) -> DiffusionOutput:
        # Encode prompt → prepare latents → denoise loop → VAE decode
        # `output` is the decoded result of the steps above
        return DiffusionOutput(output=output)

    def load_weights(self, weights):
        return AutoWeightsLoader(self).load_weights(weights)
```

Add post/pre-process functions in the same pipeline file. Register them in `registry.py`.

### A5. Register, test, docs → continue at Step 4 below.

---

## Path B: Custom/Private Repo Model

For models without a diffusers pipeline — weights in custom format, model code in a private repo. Real examples: DreamID-Omni, BAGEL, HunyuanImage3.

### B1. Understand the reference repo

Study the original model's code to identify:
- **Model architecture files** (transformers, fusion modules, embeddings)
- **Weight format** (safetensors, `.pth`, custom checkpoint structure)
- **Weight loading helpers** (custom init functions, checkpoint loaders)
- **Pre/post-processing** (image/audio transforms, tokenization, VAE encode/decode)
- **External dependencies** (packages not on PyPI)
- **Config format** (JSON config files, hardcoded dicts)

### B2. Decide what lives WHERE

This is the key design decision for custom models. Follow these placement rules:

| Code type | Where to place | Example |
|-----------|---------------|---------|
| **Pipeline orchestration** (init, forward, denoise loop) | `vllm_omni/diffusion/models/<name>/pipeline_<name>.py` | Always required |
| **Custom transformer/backbone** (ported and adapted to vllm-omni) | `vllm_omni/diffusion/models/<name>/<name>_transformer.py` or similar | `wan2_2.py`, `fusion.py`, `bagel_transformer.py` |
| **Custom sub-models** (VAE, fusion, autoencoder) | `vllm_omni/diffusion/models/<name>/` as separate files | `autoencoder.py`, `fusion.py` |
| **External dependency code** (original repo utilities) | **External repo**, installed via download script or pip | `dreamid_omni` package via git clone |
| **Hardcoded model configs** | Module-level dicts in pipeline file | `VIDEO_CONFIG`, `AUDIO_CONFIG` dicts |
| **Download/setup script** | `examples/offline_inference/<name>/download_<name>.py` | `download_dreamid_omni.py` |
| **Custom `model_index.json`** | Generated by download script, placed at model root | Minimal: `{"_class_name": "YourPipeline", ...}` |

### B3. Handle external dependencies

If the model's code lives in a separate git repo:

**Option 1: Import with graceful fallback** (recommended for models with external utils)

```python
try:
    from external_model.utils import init_vae, load_checkpoint
except ImportError as exc:
    # Re-raise with an actionable message, keeping the original traceback
    raise ImportError(
        "Failed to import from dependency 'external_model'. "
        "Please run the download script first."
    ) from exc
```

**Option 2: Port the code directly** (preferred when feasible)

Copy the essential model files into `vllm_omni/diffusion/models/<name>/` and adapt them. This avoids external dependencies. BAGEL does this — `autoencoder.py` and `bagel_transformer.py` are ported directly.

**Decision criteria**: Port if the code is self-contained and won't diverge. Use external deps if the model repo is actively maintained and the code is complex.

### B4. Handle custom weight loading

Custom models have two patterns for weight loading:

**Pattern 1: Bypass standard loader** (DreamID-Omni style)

When the original model has complex custom init functions that load weights in `__init__`:

```python
class CustomPipeline(nn.Module):
    def __init__(self, *, od_config, prefix=""):
        super().__init__()
        model = od_config.model
        # Load everything eagerly in __init__ using custom helpers
        self.vae = custom_init_vae(model, device=self.device)
        self.text_encoder = custom_init_text_encoder(model, device=self.device)
        # Review comment (Contributor, medium, on lines +149 to +150):
        # In this code snippet, self.device is used without being defined, which
        # could be confusing for developers following the guide. Please clarify
        # how the device should be obtained. It might be retrieved from od_config
        # or passed as an argument to __init__.
        self.transformer = CustomFusionModel(CONFIG)
        load_custom_checkpoint(
            self.transformer,
            checkpoint_path=os.path.join(model, "model.safetensors"),
        )
        # NO weights_sources defined, so the standard loader is bypassed

    def load_weights(self, weights):
        pass  # No-op: all weights are loaded in __init__
```

**Pattern 2: Use standard loader with custom `load_weights`** (BAGEL style)

When weights are in safetensors format but need name remapping:

```python
class CustomPipeline(nn.Module):
    def __init__(self, *, od_config, prefix=""):
        super().__init__()
        # Instantiate model architecture without weights
        self.bagel = BagelModel(config)
        self.vae = AutoEncoder(ae_params)

        # Point loader at the safetensors in the model root
        self.weights_sources = [
            DiffusersPipelineLoader.ComponentSource(
                model_or_path=od_config.model,
                subfolder=None,  # weights at root, not in subfolder
                prefix="",
                fall_back_to_pt=False,
            )
        ]

    def load_weights(self, weights):
        # Custom name remapping for non-diffusers weight names
        params = dict(self.named_parameters())
        loaded = set()
        for name, tensor in weights:
            # Remap original weight names to vllm-omni module names
            name = self._remap_weight_name(name)
            if name in params:
                default_weight_loader(params[name], tensor)
                # Review comment (Contributor, medium):
                # The function default_weight_loader is used here, but its import
                # is not shown in the example. For clarity and consistency with
                # other code snippets in this guide, please mention that it should
                # be imported from vllm.model_executor.model_loader.weight_utils.
                loaded.add(name)
        return loaded
```
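
The `_remap_weight_name` call in Pattern 2 is not spelled out in this guide. One plausible shape for it is an ordered prefix table, sketched below as a standalone function. Everything here is illustrative: the function name, the table, and the checkpoint prefixes are hypothetical and must be replaced with the names found in the real checkpoint.

```python
# Hypothetical prefix table: (prefix in the original checkpoint, prefix in the
# vllm-omni module tree). List longer, more specific prefixes first.
PREFIX_MAP = [
    ("model.diffusion_model.", "transformer."),
    ("first_stage_model.", "vae."),
]


def remap_weight_name(name, prefix_map=PREFIX_MAP):
    """Rewrite the first matching prefix; return the name unchanged otherwise."""
    for old_prefix, new_prefix in prefix_map:
        if name.startswith(old_prefix):
            return new_prefix + name[len(old_prefix):]
    return name
```

Returning unmatched names unchanged lets the `if name in params` check in `load_weights` decide whether a tensor is used, so unexpected keys are skipped instead of raising.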

### B5. Create the `model_index.json`

Custom models need a `model_index.json` at the model root so that vllm-omni can discover them. The file can be minimal:

```json
{
  "_class_name": "YourModelPipeline",
  "custom_key": "path/to/custom_weights.safetensors"
}
```

The `_class_name` must match a key in `_DIFFUSION_MODELS` in `registry.py`. Additional keys are model-specific (accessed via `od_config.model_config`).

If the model's weights come from multiple HF repos, write a **download script** that:
1. Downloads from each repo
2. Assembles into a single directory
3. Generates `model_index.json`
4. Installs any external dependencies (git clone + `.pth` file)

Place at: `examples/offline_inference/<name>/download_<name>.py`
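
Step 3 of such a download script could look like the sketch below. It covers only the generation of `model_index.json`; downloading, assembling, and dependency installation are left out. `write_model_index` is a hypothetical helper, and the extra keys are whatever the pipeline expects to read back.

```python
import json
from pathlib import Path


def write_model_index(model_root, class_name, extra=None):
    """Write a minimal model_index.json into `model_root` and return its path.

    Hypothetical helper: `class_name` must be the registry key of the pipeline,
    and `extra` carries any model-specific keys.
    """
    index = {"_class_name": class_name}
    index.update(extra or {})
    path = Path(model_root) / "model_index.json"
    path.write_text(json.dumps(index, indent=2) + "\n")
    return path
```

For example, `write_model_index(out_dir, "YourModelPipeline", {"custom_key": "path/to/custom_weights.safetensors"})` reproduces the JSON shown in B5.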

### B6. Handle multi-modal inputs

If the model accepts images, audio, or other multi-modal inputs, implement the protocol classes from `vllm_omni/diffusion/models/interface.py`:

```python
from vllm_omni.diffusion.models.interface import SupportImageInput, SupportAudioInput

class MyPipeline(nn.Module, SupportImageInput, SupportAudioInput):
    # Protocol markers: the engine uses these to enable proper input routing
    pass
```

Preprocessing for custom models is typically done **inside `forward()`** rather than via registered pre-process functions, since the logic is often tightly coupled to the model.
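
To make the idea of marker-based routing concrete, here is a self-contained sketch. It does not reproduce how vLLM-Omni inspects pipelines: the marker classes are stand-ins for `SupportImageInput` and `SupportAudioInput`, and `accepted_modalities` is a hypothetical function that only shows how inheriting a marker can be turned into a routing decision.

```python
class ImageInputMarker:
    """Stand-in for SupportImageInput (illustration only)."""


class AudioInputMarker:
    """Stand-in for SupportAudioInput (illustration only)."""


# Hypothetical mapping from marker class to the modality it enables
MARKER_TO_MODALITY = {ImageInputMarker: "image", AudioInputMarker: "audio"}


def accepted_modalities(pipeline_cls):
    """List the extra input modalities a pipeline class declares via markers."""
    return sorted(
        modality
        for marker, modality in MARKER_TO_MODALITY.items()
        if issubclass(pipeline_cls, marker)
    )


class TextOnlyPipeline:
    pass


class OmniPipeline(ImageInputMarker, AudioInputMarker):
    pass
```

The point of the pattern is that the pipeline declares its capabilities once, at class level, and no per-request flag is needed.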

### B7. Continue at Step 4 below.

---

## Common Steps (Both Paths)

### Step 4: Register Model in registry.py

Edit `vllm_omni/diffusion/registry.py`:

```python
_DIFFUSION_MODELS = {
    "YourModelPipeline": ("your_model_name", "pipeline_your_model", "YourModelPipeline"),
}
_DIFFUSION_POST_PROCESS_FUNCS = {
    "YourModelPipeline": "get_your_model_post_process_func",  # if applicable
}
_DIFFUSION_PRE_PROCESS_FUNCS = {
    "YourModelPipeline": "get_your_model_pre_process_func",  # if applicable
}
```

The registry key is the `_class_name` from `model_index.json`. The tuple is `(folder_name, module_file, class_name)`.
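
One way to sanity-check a new entry is to work out which module the tuple points at. The sketch below assumes the registry resolves entries by importing `<package>.<folder_name>.<module_file>` and reading `class_name` from it. That mechanism is an assumption made for illustration, as are both helper names; the package path is taken from the directory layout used in this guide.

```python
import importlib

# Taken from the directory layout in this guide
MODELS_PACKAGE = "vllm_omni.diffusion.models"


def module_path_for(entry, package=MODELS_PACKAGE):
    """Dotted module path implied by a (folder_name, module_file, class_name) tuple."""
    folder_name, module_file, _class_name = entry
    return f"{package}.{folder_name}.{module_file}"


def resolve_pipeline_class(entry, package=MODELS_PACKAGE):
    """Import the module for `entry` and return the named class (assumed mechanism)."""
    module = importlib.import_module(module_path_for(entry, package))
    return getattr(module, entry[2])
```

A typo in any of the three tuple fields shows up here as an `ImportError` or `AttributeError`, which is easier to read than a failure deep inside engine startup.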

Create `__init__.py` exporting the pipeline class and any factory functions.

### Step 5: Run, Test, Debug

Use the appropriate existing example script:

| Category | Script |
|----------|--------|
| Text-to-Image | `examples/offline_inference/text_to_image/text_to_image.py` |
| Text-to-Video | `examples/offline_inference/text_to_video/text_to_video.py` |
| Image-to-Video | `examples/offline_inference/image_to_video/image_to_video.py` |
| Image-to-Image | `examples/offline_inference/image_to_image/image_edit.py` |
| Text-to-Audio | `examples/offline_inference/text_to_audio/text_to_audio.py` |

For custom/Omni models that don't fit these categories, create a dedicated example script.

**Validation**: The run completes without errors, the output is meaningful, and its quality matches the reference implementation.

See [references/troubleshooting.md](references/troubleshooting.md) for common errors.

### Step 6: Add Example Scripts

For Omni or custom models, create:
- `examples/offline_inference/your_model_name/` — offline script + README
- `examples/online_serving/your_model_name/` — server script + client
- Download script if weights require assembly from multiple sources

### Step 7: Update Documentation

Required updates:
1. `docs/user_guide/diffusion/parallelism_acceleration.md` — parallelism support table
2. `docs/user_guide/diffusion/teacache.md` — if TeaCache supported
3. `docs/user_guide/diffusion/cache_dit_acceleration.md` — if Cache-DiT supported
4. `examples/offline_inference/xxx/README.md` — offline example docs
5. `examples/online_serving/xxx/README.md` — online serving docs

### Step 8: Add E2E Tests (Recommended)

Create `tests/e2e/online_serving/test_your_model_expansion.py`.

Review comment from fhfuih (Mar 19, 2026):

Maybe add a short description of how the tests should be written:

- Pick a common or suggested combination of diffusion features (parallelism, quantization, caching, CPU offloading, etc.) and write one test case with these feature(s) turned on.
- The test case should be named `def test_{your_model_name}`
- Refer to `tests/e2e/online_serving/test_qwen_image_edit_expansion.py` for the available helper functions, constants, and fixtures to reuse in your test. (There is no need to add multiple test cases covering the complete diffusion feature set in this file; add only the one test case described above.)
- Set `num_inference_steps` to 2 and the image dimensions to 512×512. Set any other inputs and parameters the same way as `tests/e2e/online_serving/test_qwen_image_edit_expansion.py` does.


Reply from a contributor, with a suggested change:

- For diffusion models e2e test, take `tests/e2e/online_serving/test_qwen3_omni_expansion.py` as reference.
- All the features (acceleration, quantization) supported for this model should be tested.

Reply:

`test_qwen3_omni_expansion` tests Qwen 3 Omni, which is not a diffusion model.

For diffusion, Qwen Image Edit is okay for now, but once the Qwen Image PR (vllm-project/vllm-omni#1869) is merged, that can serve as a better reference because it covers more features.

Alternatively, a canonical ruleset is as follows:

- 1 GPU: TeaCache & GGUF (or fall back to FP8, or disable it) & Layer-wise CPU offloading (or fall back to Module-wise)
- 2 GPUs: Cache-DiT & FP8 (or fall back to GGUF, or disable it) & Ulysses = 2
- 2 GPUs: Cache-DiT & GGUF (or fall back to FP8, or disable it) & Ring = 2
- 2 GPUs: TeaCache & FP8 (or fall back to GGUF, or disable it) & CFG Parallel = 2
- 2 GPUs: Cache-DiT & FP8 (or fall back to GGUF, or disable it) & Tensor Parallel = 2 & VAE Patch Parallel = 2
- 2 GPUs: Cache-DiT & GGUF (or fall back to FP8, or disable it) & HSDP = 2 & VAE Patch Parallel = 2

## Iterative Development Tips

1. **Start minimal**: Basic generation first, no parallelism/caching
2. **Use `--enforce-eager`**: Disable torch.compile during debugging
3. **Use small models**: Test with smaller variants first
4. **Check tensor shapes**: Most errors are reshape mismatches in attention
5. **Add parallelism incrementally**: TP → SP → CFG parallel
6. **For custom models**: Get the model running with the original code first, then progressively replace components with vllm-omni equivalents

## Reference Files

- [Transformer Adaptation](references/transformer-adaptation.md) — porting transformers from diffusers
- [Custom Model Patterns](references/custom-model-patterns.md) — patterns for non-diffusers models
- [Parallelism Patterns](references/parallelism-patterns.md) — TP, SP, CFG parallel
- [Troubleshooting](references/troubleshooting.md) — common errors and fixes