[SKILL] add-diffusion-model#15
Conversation
Covers both diffusers-based and custom/private repo models with step-by-step workflow, code patterns, and troubleshooting guide. Reference files: - transformer-adaptation.md: porting diffusers transformers - custom-model-patterns.md: patterns for non-diffusers models (DreamID-Omni, BAGEL, HunyuanImage3) - parallelism-patterns.md: TP, SP, CFG parallel - troubleshooting.md: common errors and fixes Made-with: Cursor
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new 'skill' designed to standardize and streamline the integration of diffusion models into the vLLM-Omni framework. It provides a structured, step-by-step guide for developers, covering both standard Diffusers-based models and custom implementations. The added documentation aims to simplify complex tasks such as model adaptation, weight management, and the application of various parallelism techniques, thereby enhancing the framework's capability to support a wider range of diffusion models efficiently. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive and well-structured guide (a "skill") for adding new diffusion models to vLLM-Omni. This is a valuable addition that will significantly help developers extend the library's capabilities. The documentation clearly outlines the process for both standard diffusers-based models and custom models, covering important topics like weight loading, parallelism, and troubleshooting. My review includes several suggestions to enhance the clarity and completeness of the code examples provided, which should make the guide even more user-friendly.
| self.vae = custom_init_vae(model, device=self.device) | ||
| self.text_encoder = custom_init_text_encoder(model, device=self.device) |
| # Remap original weight names to vllm-omni module names | ||
| name = self._remap_weight_name(name) | ||
| if name in params: | ||
| default_weight_loader(params[name], tensor) |
| def download_dependency(): | ||
| CACHE_DIR.mkdir(parents=True, exist_ok=True) | ||
| with open(LOCK_FILE, "w") as f: | ||
| fcntl.flock(f, fcntl.LOCK_EX) | ||
| if not DEPENDENCY_DIR.exists(): | ||
| subprocess.run([ | ||
| "git", "clone", "--depth", "1", | ||
| REPO_URL, "--branch", BRANCH, | ||
| str(DEPENDENCY_DIR) | ||
| ], check=True) | ||
| fcntl.flock(f, fcntl.LOCK_UN) | ||
|
|
||
| # Add to Python path via .pth file | ||
| site_packages = Path(site.getsitepackages()[0]) | ||
| pth_file = site_packages / "vllm_omni_dependency.pth" | ||
| pth_file.write_text(str(DEPENDENCY_DIR)) |
There was a problem hiding this comment.
The download_dependency function example uses several undefined variables (e.g., CACHE_DIR, LOCK_FILE, DEPENDENCY_DIR, REPO_URL, BRANCH). To make this code snippet more practical and easier for a developer to adapt, consider adding placeholder definitions for these variables as module-level constants.
| ```python | ||
| # examples/offline_inference/<name>/download_<name>.py | ||
| from huggingface_hub import snapshot_download | ||
| import json, os |
| json.dump(config, f, indent=2) | ||
|
|
||
| # Install external code dependency (if needed) | ||
| download_dependency() |
There was a problem hiding this comment.
The Download Script Template includes a call to download_dependency(), but its definition is provided much earlier in this document. This might be confusing for someone who is just using the template. To improve clarity, you could add a comment indicating where to find or how to implement this function.
For example:
# Install external code dependency (if needed)
# See "External Dependency Management" section for implementation.
download_dependency()| else: | ||
| # Normal loading | ||
| ... |
There was a problem hiding this comment.
The ... placeholder under the # Normal loading comment is a bit vague. To make this advanced example for QKV fusion more complete and easier to understand, please replace the placeholder with the actual code for handling non-fused parameters. Also, ensure default_weight_loader is imported if it's not in scope.
| else: | |
| # Normal loading | |
| ... | |
| else: | |
| # Normal loading | |
| if name in params: | |
| param = params[name] | |
| if hasattr(param, "weight_loader"): | |
| param.weight_loader(param, tensor) | |
| else: | |
| default_weight_loader(param, tensor) | |
| loaded.add(name) |
|
@wtomin PTAL |
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
| ### Step 8: Add E2E Tests (Recommended) | ||
|
|
||
| Create `tests/e2e/online_serving/test_your_model_expansion.py`. | ||
|
|
There was a problem hiding this comment.
| - For diffusion models e2e test, take `tests/e2e/online_serving/test_qwen3_omni_expansion.py` as reference. | |
| - All the features (acceleration, quantization) supported for this model should be tested. |
There was a problem hiding this comment.
test_qwen3_omni_expansion is Qwen 3 Omni, not diffusion
For diffusion, Qwen Image Edit is okay for now, but if the Qwen Image PR is merged (vllm-project/vllm-omni#1869), can use that as a better reference because more features are included.
Or a canonical ruleset is as follow:
- 1 GPU: TeaCache & GGUF (or fallback to FP8, or disable it) & Layer-wise CPU offloading (or fallback to Module-wise)
- 2 GPUs: Cache-DiT & FP8 (or fallback to GGUF, or disable it) & Ulysses = 2
- 2 GPUs: Cache-DiT & GGUF (or fallback to FP8, or disable it) & Ring = 2
- 2 GPUs: TeaCache & FP8 (or fallback to GGUF, or disable it) & CFG Parallel = 2
- 2 GPUs: Cache-DiT & FP8 (or fallback to GGUF, or disable it) & Tensor Parallel = 2 & VAE Patch Parallel = 2
- 2 GPUs: Cache-DiT & GGUF (or fallback to FP8, or disable it) & HSDP = 2 & VAE Patch Parallel = 2
|
@fhfuih, @david6666666 PTAL |
hsliuustc0106
left a comment
There was a problem hiding this comment.
I think we may also add a option for benchmark serving for the new model and comparing it with diffusers
also, specify the hardware |
should we add another benchmark serving skill? |
🔍 PR #15 Critical ReviewChanges: +1016 lines (new skill) 🔴 Critical Issues
|
| Issue | Details |
|---|---|
| Unverifiable import paths | Many imports like vllm_omni.diffusion.attention.layer.Attention, CFGParallelMixin cannot be verified against this skills repo |
| Test file references unverified | References examples/offline_inference/ paths that don't exist in this repo |
| Duplicate coverage? | vllm-omni-contrib skill already exists — scope overlap unclear |
✅ Content Quality
Strengths:
- Clear two-path structure (diffusers vs custom) matches real-world scenarios
- Concrete code examples with proper patterns
- Troubleshooting section addresses real pain points
- Weight loading patterns well-documented
Weaknesses:
- SKILL.md at 308 lines — pushing soft limit; could move more to references
- No version indicator for which vLLM-Omni version this targets
📋 Structure Compliance
| Check | Status |
|---|---|
Skill directory naming (vllm-omni-*) |
✅ |
| SKILL.md with YAML frontmatter | ✅ |
| References in subfolder | ✅ |
| ARCHITECTURE.md updated | ❌ MISSING |
| README.md skills index updated | ❌ MISSING |
🚦 Recommendation
Block until:
- Add entry to
docs/ARCHITECTURE.mdunder skills tree - Add row to
README.mdskills index table
Consider:
3. Change frontmatter name: to vllm-omni-add-diffusion-model for consistency
4. Clarify relationship with existing vllm-omni-contrib skill (is this a specialized subset?)
Review generated with Claude Code
sounds good. we can trigger that benchmark serving skill in this skill. |
|
update the repo readme.md please |
|
|
||
| ### Step 8: Add E2E Tests (Recommended) | ||
|
|
||
| Create `tests/e2e/online_serving/test_your_model_expansion.py`. |
There was a problem hiding this comment.
Maybe add a short description of how the tests should be written:
- Pick a common or suggested combination of diffusion features (parallelism, quantization, caching, CPU offloading, etc.) and write one test case with these feautre(s) turned on.
- The test case should be named `def test_{your_model_name}`
- Refer to `tests/e2e/online_serving/test_qwen_image_edit_expansion.py` for the available helper functions, constants, and fixtures to reuse in your test. (Do not need to apply multiple test cases of complete diffusion feature set in this file. Only add one test case as instructed above)
- Set num_inference_steps to 2, set image dimension to 512*512. For any other input & params, also do it similarly to `tests/e2e/online_serving/test_qwen_image_edit_expansion.py`
Summary
Add a new skill
vllm-omni-add-diffusion-modelthat guides developers through adding a new diffusion model to vLLM-Omni.What it covers
__init__, no-opload_weights)weights_sources+ customload_weights)vllm_omni/diffusion/models/<name>/vs external deps vs download scripts)model_index.jsongeneration for multi-repo weightsSupportImageInput,SupportAudioInput)Reference files
SKILL.md(307 lines)references/custom-model-patterns.mdreferences/transformer-adaptation.mdload_weightsreferences/parallelism-patterns.mdreferences/troubleshooting.mdTested by
Wan-AI/Wan2.1-T2V-1.3B-Diffusersand stable diffusion 1.5 support to vLLM-Omni — see companion PR at vllm-project/vllm-omni.