[SKILL] add-diffusion-model #15

Merged: hsliuustc0106 merged 2 commits into hsliuustc0106:main from SamitHuang:add-diffusion-model-skill on Mar 21, 2026

@SamitHuang SamitHuang commented Mar 16, 2026

Contributor

Summary

Add a new skill vllm-omni-add-diffusion-model that guides developers through adding a new diffusion model to vLLM-Omni.

What it covers

  • Two migration paths: diffusers-based models (Path A) and custom/private repo models (Path B)
  • Step-by-step workflow with checklist: analyze → adapt transformer → adapt pipeline → register → test → docs
  • Custom model patterns (Path B) based on real implementations:
    • Weight loading bypass (DreamID-Omni style: eager load in __init__, no-op load_weights)
    • Standard loader with custom name remapping (BAGEL style: weights_sources + custom load_weights)
    • Code placement rules (what goes in vllm_omni/diffusion/models/<name>/ vs external deps vs download scripts)
    • model_index.json generation for multi-repo weights
    • Multi-modal I/O protocol classes (SupportImageInput, SupportAudioInput)
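
As an illustration of the second weight-loading pattern listed above (standard loader plus custom name remapping), here is a minimal, framework-free sketch. The prefix table, `remap_weight_name`, and the dict-of-lists `params` are hypothetical stand-ins and not the actual BAGEL or vLLM-Omni code; a real `load_weights` would operate on model parameters and delegate the copy to a weight loader.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical prefix table: checkpoint prefix -> module prefix in the ported model.
PREFIX_MAP = {
    "model.diffusion_model.": "transformer.",
    "first_stage_model.": "vae.",
}


def remap_weight_name(name: str) -> str:
    """Rewrite a checkpoint weight name using the first matching prefix."""
    for old, new in PREFIX_MAP.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name


def load_weights(
    params: Dict[str, List[float]],
    weights: Iterable[Tuple[str, List[float]]],
) -> Set[str]:
    """Copy each checkpoint entry into the matching parameter; return loaded names."""
    loaded: Set[str] = set()
    for name, tensor in weights:
        name = remap_weight_name(name)
        if name in params:
            # Stand-in for a real loader call such as default_weight_loader(param, tensor).
            params[name][:] = tensor
            loaded.add(name)
    return loaded


params = {"transformer.blocks.0.weight": [0.0, 0.0], "vae.conv.weight": [0.0]}
checkpoint = [
    ("model.diffusion_model.blocks.0.weight", [1.0, 2.0]),
    ("first_stage_model.conv.weight", [3.0]),
    ("unused.bias", [9.0]),  # no matching parameter, so it is skipped
]
loaded = load_weights(params, checkpoint)
```

Returning the set of loaded names makes it possible to compare it against the full parameter list and spot weights that were silently skipped.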

Reference files

| File | Content |
| --- | --- |
| SKILL.md (307 lines) | Main workflow: Path A (diffusers) and Path B (custom) |
| references/custom-model-patterns.md | Weight loading, directory layout, external deps, download scripts |
| references/transformer-adaptation.md | Attention replacement, mixin removal, QKV fusion, load_weights |
| references/parallelism-patterns.md | TP, SP, CFG parallel, HSDP |
| references/troubleshooting.md | Common errors and debugging workflow |

Tested by

  • Applied the skill to add Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Stable Diffusion 1.5 support to vLLM-Omni; see the companion PR at vllm-project/vllm-omni.

Covers both diffusers-based and custom/private repo models with
step-by-step workflow, code patterns, and troubleshooting guide.

Reference files:
- transformer-adaptation.md: porting diffusers transformers
- custom-model-patterns.md: patterns for non-diffusers models
  (DreamID-Omni, BAGEL, HunyuanImage3)
- parallelism-patterns.md: TP, SP, CFG parallel
- troubleshooting.md: common errors and fixes

Made-with: Cursor
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new 'skill' designed to standardize and streamline the integration of diffusion models into the vLLM-Omni framework. It provides a structured, step-by-step guide for developers, covering both standard Diffusers-based models and custom implementations. The added documentation aims to simplify complex tasks such as model adaptation, weight management, and the application of various parallelism techniques, thereby enhancing the framework's capability to support a wider range of diffusion models efficiently.

Highlights

  • New Skill Introduction: A new skill, vllm-omni-add-diffusion-model, has been added to guide developers through the process of integrating new diffusion models into vLLM-Omni.
  • Dual Migration Paths: The skill provides detailed workflows for two distinct migration paths: Diffusers-based models (Path A) and custom/private repository models (Path B), accommodating various model sources.
  • Comprehensive Reference Documentation: Extensive reference files have been included, covering critical aspects such as custom model patterns, transformer adaptation, parallelism strategies (TP, SP, CFG parallel, HSDP), and a dedicated troubleshooting guide.
  • Practical Implementation Guidance: The documentation offers practical patterns for weight loading (including bypass and custom remapping), handling external dependencies, generating model_index.json for custom models, and implementing multi-modal input protocols.

Changelog
  • skills/vllm-omni-add-diffusion-model/SKILL.md
    • Added a comprehensive guide for integrating new diffusion models into vLLM-Omni, detailing two migration paths (Diffusers-based and Custom/Private Repo) and common steps.
  • skills/vllm-omni-add-diffusion-model/references/custom-model-patterns.md
    • Added a reference document outlining patterns for integrating custom diffusion models, including directory structure, weight loading, model_index.json creation, external dependencies, and multi-modal input protocols.
  • skills/vllm-omni-add-diffusion-model/references/parallelism-patterns.md
    • Added a reference document detailing various parallelism patterns (Tensor Parallelism, CFG Parallelism, Sequence Parallelism, VAE Patch Parallelism, HSDP) and a recommended incremental approach for adding them.
  • skills/vllm-omni-add-diffusion-model/references/transformer-adaptation.md
    • Added a reference document providing a step-by-step guide for adapting Diffusers transformers to vLLM-Omni, covering mixin removal, attention replacement, logger replacement, config support, and weight loading.
  • skills/vllm-omni-add-diffusion-model/references/troubleshooting.md
    • Added a reference document listing common errors encountered when adding diffusion models, along with their causes, fixes, and a general debugging workflow.
Activity
  • No specific human activity (comments, reviews, progress updates) has been recorded for this pull request yet.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive and well-structured guide (a "skill") for adding new diffusion models to vLLM-Omni. This is a valuable addition that will significantly help developers extend the library's capabilities. The documentation clearly outlines the process for both standard diffusers-based models and custom models, covering important topics like weight loading, parallelism, and troubleshooting. My review includes several suggestions to enhance the clarity and completeness of the code examples provided, which should make the guide even more user-friendly.

Comment on lines +149 to +150
    self.vae = custom_init_vae(model, device=self.device)
    self.text_encoder = custom_init_text_encoder(model, device=self.device)

Contributor

medium

In this code snippet, self.device is used without being defined, which could be confusing for developers following the guide. Please clarify how the device should be obtained. It might be retrieved from od_config or passed as an argument to __init__.

    # Remap original weight names to vllm-omni module names
    name = self._remap_weight_name(name)
    if name in params:
        default_weight_loader(params[name], tensor)

Contributor

medium

The function default_weight_loader is used here, but its import is not shown in the example. For clarity and consistency with other code snippets in this guide, please mention that it should be imported from vllm.model_executor.model_loader.weight_utils.

Comment on lines +123 to +138
    def download_dependency():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        with open(LOCK_FILE, "w") as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            if not DEPENDENCY_DIR.exists():
                subprocess.run([
                    "git", "clone", "--depth", "1",
                    REPO_URL, "--branch", BRANCH,
                    str(DEPENDENCY_DIR)
                ], check=True)
            fcntl.flock(f, fcntl.LOCK_UN)

        # Add to Python path via .pth file
        site_packages = Path(site.getsitepackages()[0])
        pth_file = site_packages / "vllm_omni_dependency.pth"
        pth_file.write_text(str(DEPENDENCY_DIR))

Contributor

medium

The download_dependency function example uses several undefined variables (e.g., CACHE_DIR, LOCK_FILE, DEPENDENCY_DIR, REPO_URL, BRANCH). To make this code snippet more practical and easier for a developer to adapt, consider adding placeholder definitions for these variables as module-level constants.
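
One way to act on this suggestion is sketched below. Every constant value here (cache location, repository URL, branch) is a hypothetical placeholder rather than something taken from the skill, and `build_clone_command` is a helper introduced only for this illustration. The sketch also releases the lock in a `finally` block, which the quoted snippet does not do.

```python
import fcntl  # POSIX-only file locking
import site
import subprocess
from pathlib import Path

# Hypothetical placeholders: replace with the real values for your dependency.
CACHE_DIR = Path.home() / ".cache" / "vllm_omni_deps"
DEPENDENCY_DIR = CACHE_DIR / "example_dependency"
LOCK_FILE = CACHE_DIR / "example_dependency.lock"
REPO_URL = "https://example.com/your-org/example-dependency.git"
BRANCH = "main"


def build_clone_command(repo_url: str, branch: str, target: Path) -> list:
    """Build the shallow-clone command for one branch of the dependency."""
    return ["git", "clone", "--depth", "1", "--branch", branch, repo_url, str(target)]


def download_dependency() -> None:
    """Clone the dependency once, guarded by an exclusive file lock."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    with open(LOCK_FILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            if not DEPENDENCY_DIR.exists():
                subprocess.run(
                    build_clone_command(REPO_URL, BRANCH, DEPENDENCY_DIR),
                    check=True,
                )
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

    # Add the checkout to the Python path via a .pth file.
    site_packages = Path(site.getsitepackages()[0])
    (site_packages / "vllm_omni_dependency.pth").write_text(str(DEPENDENCY_DIR))
```

Keeping the placeholders at module level means a developer adapting the template only has to edit one block at the top of the file.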

```python
# examples/offline_inference/<name>/download_<name>.py
from huggingface_hub import snapshot_download
import json, os
```

Contributor

medium

For better readability and to follow common Python style guides (like PEP 8), it's recommended to have one import per line.

Suggested change
    - import json, os
    + import json
    + import os

    json.dump(config, f, indent=2)

    # Install external code dependency (if needed)
    download_dependency()

Contributor

medium

The Download Script Template includes a call to download_dependency(), but its definition is provided much earlier in this document. This might be confusing for someone who is just using the template. To improve clarity, you could add a comment indicating where to find or how to implement this function.

For example:

    # Install external code dependency (if needed)
    # See "External Dependency Management" section for implementation.
    download_dependency()

Comment on lines +194 to +196
    else:
        # Normal loading
        ...

Contributor

medium

The ... placeholder under the # Normal loading comment is a bit vague. To make this advanced example for QKV fusion more complete and easier to understand, please replace the placeholder with the actual code for handling non-fused parameters. Also, ensure default_weight_loader is imported if it's not in scope.

Suggested change
    - else:
    -     # Normal loading
    -     ...
    + else:
    +     # Normal loading
    +     if name in params:
    +         param = params[name]
    +         if hasattr(param, "weight_loader"):
    +             param.weight_loader(param, tensor)
    +         else:
    +             default_weight_loader(param, tensor)
    +         loaded.add(name)
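
For context, the QKV-fusion loading that this `else` branch belongs to can be pictured with the following framework-free sketch. It assumes that separate q, k, and v checkpoint weights of equal size are written into consecutive row slices of one fused parameter; the suffix table and the list-of-rows representation are hypothetical and are not the code from the skill.

```python
from typing import Dict, List, Set, Tuple

Rows = List[List[float]]

# Hypothetical mapping: checkpoint suffix -> (fused parameter suffix, shard index).
FUSED_SHARDS = {
    ".to_q.weight": (".to_qkv.weight", 0),
    ".to_k.weight": (".to_qkv.weight", 1),
    ".to_v.weight": (".to_qkv.weight", 2),
}


def load_weights(params: Dict[str, Rows], weights: List[Tuple[str, Rows]]) -> Set[str]:
    """Load fused q/k/v shards by slice and everything else by direct copy."""
    loaded: Set[str] = set()
    for name, tensor in weights:
        for suffix, (fused_suffix, shard) in FUSED_SHARDS.items():
            if name.endswith(suffix):
                fused_name = name[: -len(suffix)] + fused_suffix
                rows = len(tensor)
                # Write this shard into its row slice of the fused parameter.
                params[fused_name][shard * rows:(shard + 1) * rows] = tensor
                loaded.add(fused_name)
                break
        else:
            # Normal loading for parameters that are not fused.
            if name in params:
                params[name][:] = tensor
                loaded.add(name)
    return loaded


params = {
    "attn.to_qkv.weight": [[0.0], [0.0], [0.0]],  # three shards of one row each
    "attn.to_out.weight": [[0.0]],
}
weights = [
    ("attn.to_q.weight", [[1.0]]),
    ("attn.to_k.weight", [[2.0]]),
    ("attn.to_v.weight", [[3.0]]),
    ("attn.to_out.weight", [[4.0]]),
]
loaded = load_weights(params, weights)
```

The `for ... else` form keeps the non-fused path in one place: it runs only when no fused suffix matched.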

SamitHuang changed the title from "Add skill: add-diffusion-model" to "[SKILL] add-diffusion-model" on Mar 16, 2026
@SamitHuang

Contributor Author

@wtomin PTAL

Comment thread skills/vllm-omni-add-diffusion-model/SKILL.md Outdated
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
### Step 8: Add E2E Tests (Recommended)

Create `tests/e2e/online_serving/test_your_model_expansion.py`.

Contributor

Suggested change
- For diffusion models e2e test, take `tests/e2e/online_serving/test_qwen3_omni_expansion.py` as reference.
- All the features (acceleration, quantization) supported for this model should be tested.

`test_qwen3_omni_expansion` tests Qwen 3 Omni, which is not a diffusion model.

For diffusion, Qwen Image Edit is acceptable for now. Once the Qwen Image PR (vllm-project/vllm-omni#1869) is merged, it will be a better reference because it covers more features.

Alternatively, a canonical ruleset is as follows:

  • 1 GPU: TeaCache & GGUF (or fallback to FP8, or disable it) & Layer-wise CPU offloading (or fallback to Module-wise)
  • 2 GPUs: Cache-DiT & FP8 (or fallback to GGUF, or disable it) & Ulysses = 2
  • 2 GPUs: Cache-DiT & GGUF (or fallback to FP8, or disable it) & Ring = 2
  • 2 GPUs: TeaCache & FP8 (or fallback to GGUF, or disable it) & CFG Parallel = 2
  • 2 GPUs: Cache-DiT & FP8 (or fallback to GGUF, or disable it) & Tensor Parallel = 2 & VAE Patch Parallel = 2
  • 2 GPUs: Cache-DiT & GGUF (or fallback to FP8, or disable it) & HSDP = 2 & VAE Patch Parallel = 2
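
The ruleset above could be encoded as data so that a test file can iterate over it, for example as the argument list of a parametrized test. This is only a sketch: the `FeatureCombo` class and its field names are hypothetical labels chosen for illustration, not vLLM-Omni configuration keys, and the fallback options from the list are left as a comment rather than modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FeatureCombo:
    num_gpus: int
    cache: str
    quantization: str  # preferred scheme; fall back to the other one or disable it
    parallelism: Tuple[Tuple[str, int], ...] = ()
    cpu_offload: Optional[str] = None  # fall back to module-wise if layer-wise is unsupported


CANONICAL_COMBOS = [
    FeatureCombo(1, "TeaCache", "GGUF", cpu_offload="layer-wise"),
    FeatureCombo(2, "Cache-DiT", "FP8", (("ulysses", 2),)),
    FeatureCombo(2, "Cache-DiT", "GGUF", (("ring", 2),)),
    FeatureCombo(2, "TeaCache", "FP8", (("cfg_parallel", 2),)),
    FeatureCombo(2, "Cache-DiT", "FP8", (("tensor_parallel", 2), ("vae_patch_parallel", 2))),
    FeatureCombo(2, "Cache-DiT", "GGUF", (("hsdp", 2), ("vae_patch_parallel", 2))),
]


def combo_id(combo: FeatureCombo) -> str:
    """Build a readable identifier, usable as a test-case id."""
    parts = [f"{combo.num_gpus}gpu", combo.cache, combo.quantization]
    parts += [f"{name}{degree}" for name, degree in combo.parallelism]
    if combo.cpu_offload:
        parts.append(f"offload-{combo.cpu_offload}")
    return "-".join(parts)


ids = [combo_id(c) for c in CANONICAL_COMBOS]
```

Keeping the combinations in one list makes it easy to see at a glance which cache, quantization, and parallelism features are exercised together.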

wtomin commented Mar 17, 2026

Contributor

@fhfuih, @david6666666 PTAL

@hsliuustc0106 hsliuustc0106 left a comment

Owner

I think we may also add an option for benchmark serving for the new model and comparing it with diffusers

@hsliuustc0106

Owner

> I think we may also add an option for benchmark serving for the new model and comparing it with diffusers

also, specify the hardware

@SamitHuang

Contributor Author

> I think we may also add an option for benchmark serving for the new model and comparing it with diffusers

should we add another benchmark serving skill?

@hsliuustc0106

Owner

🔍 PR #15 Critical Review

Changes: +1016 lines (new skill)


🔴 Critical Issues

| # | Issue | Severity |
| --- | --- | --- |
| 1 | Missing ARCHITECTURE.md update - New skill not added to project structure docs | HIGH |
| 2 | Missing README.md update - Skills index doesn't list new skill | HIGH |
| 3 | Inconsistent skill name - Frontmatter uses add-diffusion-model but project convention is vllm-omni-<topic> | MEDIUM |

⚠️ Technical Concerns

| Issue | Details |
| --- | --- |
| Unverifiable import paths | Many imports like vllm_omni.diffusion.attention.layer.Attention, CFGParallelMixin cannot be verified against this skills repo |
| Test file references unverified | References examples/offline_inference/ paths that don't exist in this repo |
| Duplicate coverage? | vllm-omni-contrib skill already exists; scope overlap unclear |

✅ Content Quality

Strengths:

  • Clear two-path structure (diffusers vs custom) matches real-world scenarios
  • Concrete code examples with proper patterns
  • Troubleshooting section addresses real pain points
  • Weight loading patterns well-documented

Weaknesses:

  • SKILL.md at 308 lines — pushing soft limit; could move more to references
  • No version indicator for which vLLM-Omni version this targets

📋 Structure Compliance

| Check | Status |
| --- | --- |
| Skill directory naming (vllm-omni-*) | |
| SKILL.md with YAML frontmatter | |
| References in subfolder | |
| ARCHITECTURE.md updated | ❌ MISSING |
| README.md skills index updated | ❌ MISSING |

🚦 Recommendation

Block until:

  1. Add entry to docs/ARCHITECTURE.md under skills tree
  2. Add row to README.md skills index table

Consider:

  3. Change the frontmatter `name:` to vllm-omni-add-diffusion-model for consistency
  4. Clarify relationship with existing vllm-omni-contrib skill (is this a specialized subset?)
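
The naming point in item 3 could be checked mechanically. The sketch below assumes the frontmatter is a leading block delimited by `---` lines with a simple `name:` key, and it reads the `vllm-omni-<topic>` convention as lowercase words joined by hyphens; both the parsing approach and that regular expression are assumptions made for illustration, not an existing check in the repository.

```python
import re
from typing import Optional

# Assumed reading of the `vllm-omni-<topic>` convention.
NAME_PATTERN = re.compile(r"^vllm-omni-[a-z0-9]+(?:-[a-z0-9]+)*$")


def frontmatter_name(skill_md: str) -> Optional[str]:
    """Return the `name:` value from a leading frontmatter block, if present."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "name":
            return value.strip().strip("'\"")
    return None


def follows_convention(skill_md: str) -> bool:
    """True when the frontmatter name matches the assumed naming pattern."""
    name = frontmatter_name(skill_md)
    return name is not None and NAME_PATTERN.match(name) is not None


before = "---\nname: add-diffusion-model\ndescription: example\n---\n# Body\n"
after = "---\nname: vllm-omni-add-diffusion-model\ndescription: example\n---\n# Body\n"
results = (follows_convention(before), follows_convention(after))
```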


Review generated with Claude Code

@hsliuustc0106

Owner

> > I think we may also add an option for benchmark serving for the new model and comparing it with diffusers
>
> should we add another benchmark serving skill?

sounds good. we can trigger that benchmark serving skill in this skill.
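
As a rough picture of what such a comparison might record, here is a small timing sketch. It is not the proposed benchmark skill: the two workloads are trivial stand-ins for a request to the served model and a diffusers pipeline call, and the report layout, including the `hardware` field, is a hypothetical choice prompted by the request to specify the hardware.

```python
import statistics
import time
from typing import Callable, Dict


def time_calls(fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> float:
    """Return the median wall-clock seconds per call after a warmup phase."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(hardware: str, candidates: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Time every candidate and attach the hardware description to the report."""
    return {
        "hardware": hardware,
        "median_seconds": {name: time_calls(fn) for name, fn in candidates.items()},
    }


# Stand-in workloads: replace with a real serving request and a diffusers call.
report = compare(
    hardware="<fill in: GPU model and count>",
    candidates={
        "vllm_omni_serving": lambda: sum(range(1000)),
        "diffusers_baseline": lambda: sum(range(1000)),
    },
)
```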

@hsliuustc0106

Owner

Please update the repo README.md.


### Step 8: Add E2E Tests (Recommended)

Create `tests/e2e/online_serving/test_your_model_expansion.py`.

@fhfuih fhfuih Mar 19, 2026

Maybe add a short description of how the tests should be written:

- Pick a common or suggested combination of diffusion features (parallelism, quantization, caching, CPU offloading, etc.) and write one test case with these feature(s) turned on.
- Name the test case `def test_{your_model_name}`.
- Refer to `tests/e2e/online_serving/test_qwen_image_edit_expansion.py` for the available helper functions, constants, and fixtures to reuse in your test. (You do not need to add multiple test cases covering the complete diffusion feature set in this file; add only the one test case described above.)
- Set num_inference_steps to 2 and the image dimensions to 512×512. Set any other inputs and parameters the same way as in `tests/e2e/online_serving/test_qwen_image_edit_expansion.py`.
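
A skeleton following these points might look like the sketch below. The model name, the parameter keys, and `build_request_params` are hypothetical; a real test would reuse the helpers and fixtures from the referenced test file and send the request to a running server instead of only checking the parameters.

```python
from typing import Dict

# Hypothetical model identifier for illustration only.
MODEL_NAME = "your-org/your-model"


def build_request_params(prompt: str) -> Dict[str, object]:
    """Small, fast settings: 2 inference steps and a 512x512 image."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "num_inference_steps": 2,
        "width": 512,
        "height": 512,
    }


def test_your_model_name() -> None:
    params = build_request_params("a photo of a cat")
    # A real test would send `params` to the server started by the shared
    # fixture and assert on the returned image; here only the inputs are checked.
    assert params["num_inference_steps"] == 2
    assert (params["width"], params["height"]) == (512, 512)


test_your_model_name()
```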

@hsliuustc0106 hsliuustc0106 merged commit ab59462 into hsliuustc0106:main Mar 21, 2026
4 participants