Add diffusion-perf-optim skill for diffusion inference optimization #19
Comprehensive guide covering all lossless (torch.compile, parallelism, CPU offload, quantization) and lossy (TeaCache, Cache-DiT) optimization methods for vLLM-Omni diffusion models. Includes per-model support tables, ready-to-use recipes, and a decision flowchart.

Made-with: Cursor
Summary of Changes

This pull request introduces a new skill guide designed to help users significantly enhance the performance of diffusion model inference within vLLM-Omni. It consolidates a wide array of optimization strategies, including quality-preserving techniques.
Code Review
This pull request introduces a comprehensive and well-structured guide for optimizing diffusion model inference in vLLM-Omni. The document effectively covers both lossless and lossy optimization methods, provides clear explanations, and includes practical recipes and support tables. The addition of a decision flowchart is particularly helpful for users. My feedback focuses on minor improvements to terminology consistency and mathematical notation for enhanced clarity.
**How**: `--ring-degree N` (offline) or `--ring N` (online serving)

**Note**: Can combine with Ulysses: `ulysses_degree × ring_degree = total SP GPUs`.
For better clarity and consistency with programming contexts, consider using * instead of × for multiplication. Additionally, ensure consistent terminology throughout the document; consider using 'Ulysses-SP' instead of 'Ulysses' when referring to the parallelism method, as defined in the heading, to avoid potential ambiguity.
Suggested change:

**Note**: Can combine with Ulysses: `ulysses_degree × ring_degree = total SP GPUs`.

`ulysses_degree * ring_degree = total SP GPUs`
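The constraint discussed in this thread can be checked with a few lines of Python. This is a minimal sketch, not code from vLLM-Omni: the function name `sp_degree_pairs` is hypothetical, and it only enumerates the integer pairs that satisfy `ulysses_degree * ring_degree = total SP GPUs`.

```python
from typing import List, Tuple


def sp_degree_pairs(total_sp_gpus: int) -> List[Tuple[int, int]]:
    """Return every (ulysses_degree, ring_degree) pair whose product is total_sp_gpus.

    Hypothetical helper for illustration; not part of vLLM-Omni.
    """
    if total_sp_gpus < 1:
        raise ValueError("total_sp_gpus must be a positive integer")
    return [
        (ulysses_degree, total_sp_gpus // ulysses_degree)
        for ulysses_degree in range(1, total_sp_gpus + 1)
        if total_sp_gpus % ulysses_degree == 0
    ]


if __name__ == "__main__":
    # List the ways 8 sequence-parallel GPUs could be split between the two degrees.
    for ulysses_degree, ring_degree in sp_degree_pairs(8):
        print(f"ulysses_degree={ulysses_degree}, ring_degree={ring_degree}")
```

Which of these pairs a given model actually supports is not something this sketch checks.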
…ructions

- Replace hardcoded model support tables with "Discovering Current Capabilities" section that points to source-of-truth files and grep commands for checking model support
- Add "Extending This Skill" section with step-by-step instructions for adding new lossless methods, lossy methods, quantization, and parallelism strategies
- Prefer "check the code" over static tables so the skill stays current as new optimization methods are added to vllm-omni

Made-with: Cursor
hsliuustc0106 left a comment
I found one blocking repository-consistency issue and one usability problem in the new skill. The frontmatter name does not match the directory, so this will fail the repo's validator once the branch is checked out, and the early parallelism guidance points readers to a support table that the skill never actually includes.
@@ -0,0 +1,434 @@
---
name: optimize-diffusion-perf
The frontmatter name should match the skill directory in this repo. With skills/vllm-omni-diffusion-perf-optim/ but name: optimize-diffusion-perf, scripts/validate_all.py will flag a consistency error (Directory name != frontmatter name) as soon as this branch is validated.
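The consistency check described above can be sketched as follows. This is an illustration only, not the contents of `scripts/validate_all.py`: the helper names `frontmatter_name` and `check_skill_dir` are hypothetical, and the parser assumes a plain `name: value` line inside a `---`-delimited frontmatter block at the top of `SKILL.md`.

```python
from pathlib import Path
from typing import Optional


def frontmatter_name(skill_md_text: str) -> Optional[str]:
    """Extract the `name` field from a leading `---` frontmatter block, if any."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter block at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter reached without finding a name
        key, sep, value = line.partition(":")
        if sep and key.strip() == "name":
            return value.strip().strip("'\"")
    return None


def check_skill_dir(skill_dir: Path) -> Optional[str]:
    """Return an error message if directory and frontmatter names differ, else None.

    Hypothetical stand-in for the repository's validator.
    """
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    name = frontmatter_name(text)
    if name != skill_dir.name:
        return f"Directory name != frontmatter name: {skill_dir.name!r} vs {name!r}"
    return None
```

Under these assumptions, a directory named `vllm-omni-diffusion-perf-optim` whose frontmatter says `name: optimize-diffusion-perf` would produce an error message, and renaming either side to match would clear it.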
hsliuustc0106 left a comment
One structural comment after reading the whole skill: this SKILL.md is unusually long for this repo and a fair amount of the back half reads more like on-demand reference material than trigger-time guidance. I think the core optimization workflow belongs in the skill, but the capability-discovery details, recipe catalog, and maintenance instructions would be easier to consume if they were split into references/ files.
- Video models: 20–40 steps
- Distilled models: 4–8 steps

## Discovering Current Capabilities
From here onward the document shifts from a trigger-time workflow into deep reference material. That is useful content, but it does not need to live in SKILL.md to make the skill effective. This repo's architecture explicitly uses references/ to keep the main skill lean, so I would strongly consider moving Discovering Current Capabilities into a dedicated reference and leaving just a short pointer here.
- Update baseline/optimization methodology to prioritize online serving measurements over offline benchmarks
- Clarify torch.compile behavior: first-request warmup penalty, recommend --enforce-eager for latency-sensitive deployments
- Update all Quick Recipes to show online serving (vllm serve) commands
- Add note about offline vs online measurement differences
- Update tips to emphasize online serving benchmarking

Made-with: Cursor
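The measurement approach in this commit (account for the first-request warmup penalty separately from steady-state latency) can be sketched generically. This is a hypothetical harness, not part of the skill: `measure_latency` and its `send_request` parameter are placeholder names, and `send_request` stands for whatever callable issues one request to an already running server.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(send_request: Callable[[], object], steady_runs: int = 5) -> Dict[str, float]:
    """Time the first request on its own, then report the median of later requests.

    `send_request` is an assumed, caller-supplied function that performs one request.
    """
    if steady_runs < 1:
        raise ValueError("steady_runs must be at least 1")

    # First request: may include one-time costs such as compilation warmup.
    start = time.perf_counter()
    send_request()
    first_request_s = time.perf_counter() - start

    # Later requests: closer to what a warmed-up server delivers.
    steady = []
    for _ in range(steady_runs):
        start = time.perf_counter()
        send_request()
        steady.append(time.perf_counter() - start)

    return {
        "first_request_s": first_request_s,
        "steady_median_s": statistics.median(steady),
    }
```

Reporting the two numbers separately keeps a one-time warmup cost from being averaged into the steady-state figure.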
- Fix frontmatter name to match directory: vllm-omni-diffusion-perf-optim
- Replace "support table below" with link to parallelism_acceleration.md

Signed-off-by: samithuang <285365963@qq.com>
`vllm-omni-diffusion-perf-optim` skill that provides a comprehensive guide for optimizing diffusion model inference in vLLM-Omni.

Test Result: LTX-2 Online Serving Optimization (480×768, 41 frames, 20 steps, H800)

The skill was tested by systematically optimizing `Lightricks/LTX-2` text-to-video inference with online serving (`vllm serve`):

Key findings:
Recipe created with this skill: vllm-project/vllm-omni#1971

NOTE: This skill should be updated whenever a new optimization feature is added.