Add diffusion-perf-optim skill for diffusion inference optimization #19

Merged: hsliuustc0106 merged 4 commits into hsliuustc0106:main from SamitHuang:diffusion-perf-optim on Mar 21, 2026

@SamitHuang (Contributor) commented Mar 18, 2026

  • Add a new vllm-omni-diffusion-perf-optim skill that provides a comprehensive guide for optimizing diffusion model inference in vLLM-Omni.
  • Covers all lossless methods (torch.compile, Ulysses/Ring/CFG/TP parallelism, CPU offload, VAE optimizations, FP8/GGUF quantization) and lossy methods (TeaCache, Cache-DiT with DBCache/TaylorSeer/SCM).
  • Includes per-model support tables, ready-to-use CLI recipes, and a decision flowchart.

Test Result — LTX-2 Online Serving Optimization (480×768, 41 frames, 20 steps, H800)

The skill was tested by systematically optimizing Lightricks/LTX-2 text-to-video inference with online serving (vllm serve):

| Configuration | Inference (s) | Speedup | Type |
|---|---|---|---|
| Baseline (eager, 1 GPU) | 10.3 | 1.00× | |
| torch.compile (1 GPU, warm request) | ~10.3 | ~1.00× | Lossless |
| FP8 quantization (eager, 1 GPU) | ~10.3 | ~1.00× | Lossless (VRAM savings) |
| 4-GPU Ulysses SP (eager) | ~10.3 | ~1.00× | Lossless |
| Cache-DiT (eager, 1 GPU) | 7.4 avg | ~1.4× | Lossy |
| 4-GPU Ulysses SP + Cache-DiT (eager) | 4.7 avg | ~2.2× | Lossless + Lossy |
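
The Speedup column can be reproduced from the latency column. The snippet below is a minimal sketch that assumes speedup is defined as baseline latency divided by optimized latency; the helper name `speedup` is illustrative and not part of the skill.

```python
BASELINE_S = 10.3  # eager, 1 GPU, taken from the results table


def speedup(latency_s: float, baseline_s: float = BASELINE_S) -> float:
    """Return baseline latency divided by the optimized latency."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return baseline_s / latency_s


# Average latencies reported in the table (seconds).
measured = {
    "Cache-DiT (eager, 1 GPU)": 7.4,
    "4-GPU Ulysses SP + Cache-DiT (eager)": 4.7,
}

# Round to one decimal place, matching the precision used in the table.
speedups = {name: round(speedup(s), 1) for name, s in measured.items()}

for name, value in speedups.items():
    print(f"{name}: ~{value}x")
```

Rounded to one decimal place, these ratios agree with the ~1.4× and ~2.2× entries.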

Key findings:

  • Best combo: 4-GPU Ulysses SP + Cache-DiT reaches a ~2.2× speedup. The two methods synergize: Cache-DiT reduces per-step computation, making the Ulysses SP communication overhead worthwhile.
  • Cache-DiT alone provides a ~1.4× speedup in online serving (single GPU).
  • torch.compile warm-request latency matches the eager baseline on H800; the first request pays ~6 s of compilation overhead.
  • Ulysses SP (4 GPUs) alone shows no measurable speedup for 41-frame generation: communication overhead outweighs compute savings at this sequence length. However, it combines effectively with Cache-DiT.
  • FP8 does not speed up LTX-2 on H800 (compute-bound) but reduces VRAM usage.

Note: Online serving numbers differ from offline benchmarks due to server overhead, IPC, and measurement methodology. Always benchmark in online serving mode for deployment decisions.
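
To separate the first-request compilation cost from steady-state latency, the first request can be timed on its own and the following requests averaged. The sketch below assumes a caller-supplied `send_request` callable that issues one generation request to the running server. `FakeServer` is a hypothetical stand-in that only simulates a one-time warmup delay and does not call any vLLM-Omni API.

```python
import statistics
import time
from typing import Callable, List, Tuple


def measure_cold_and_warm(
    send_request: Callable[[], object], warm_runs: int = 3
) -> Tuple[float, float]:
    """Time the first call separately, then average the following calls."""
    start = time.perf_counter()
    send_request()
    cold_s = time.perf_counter() - start

    warm: List[float] = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        send_request()
        warm.append(time.perf_counter() - start)
    return cold_s, statistics.mean(warm)


class FakeServer:
    """Hypothetical stand-in: the first call pays a one-time warmup delay."""

    def __init__(self) -> None:
        self.warmed_up = False

    def __call__(self) -> None:
        if not self.warmed_up:
            time.sleep(0.05)  # simulated one-time compilation cost
            self.warmed_up = True
        time.sleep(0.01)  # simulated steady-state inference


cold_s, warm_s = measure_cold_and_warm(FakeServer())
print(f"cold: {cold_s:.3f}s, warm avg: {warm_s:.3f}s")
```

In a real measurement, `send_request` would wrap whatever client call the deployment already uses, so that the timing includes the same server overhead that production requests see.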

Recipe created with this skill: vllm-project/vllm-omni#1971

NOTE: This skill should be updated whenever a new optimization feature is added.

Comprehensive guide covering all lossless (torch.compile, parallelism,
CPU offload, quantization) and lossy (TeaCache, Cache-DiT) optimization
methods for vLLM-Omni diffusion models. Includes per-model support
tables, ready-to-use recipes, and a decision flowchart.

Made-with: Cursor
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new skill guide designed to help users significantly enhance the performance of diffusion model inference within vLLM-Omni. It consolidates a wide array of optimization strategies, ranging from quality-preserving techniques like torch.compile and various parallelism methods to quality-trading approaches such as caching and reduced inference steps. The guide aims to empower users with practical knowledge, including model-specific support, command-line recipes, and a clear decision-making flowchart, to effectively tune their diffusion pipelines for speed, latency, and VRAM efficiency.

Highlights

  • New Skill Added: A new vllm-omni-diffusion-perf-optim skill has been added to provide a comprehensive guide for optimizing diffusion model inference in vLLM-Omni.
  • Comprehensive Optimization Guide: The guide covers a wide range of optimization methods, including both lossless techniques (torch.compile, Ulysses/Ring/CFG/TP parallelism, CPU offload, VAE optimizations, FP8/GGUF quantization) and lossy methods (TeaCache, Cache-DiT with DBCache/TaylorSeer/SCM).
  • Practical Resources: It includes per-model support tables, ready-to-use CLI recipes, and a decision flowchart to help users effectively apply optimizations.
  • Performance Validation: Test results for LTX-2 text-to-video inference are provided, demonstrating significant speedups (up to 1.77x) with various optimization combinations, highlighting the effectiveness of Cache-DiT and Ulysses SP.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a comprehensive and well-structured guide for optimizing diffusion model inference in vLLM-Omni. The document effectively covers both lossless and lossy optimization methods, provides clear explanations, and includes practical recipes and support tables. The addition of a decision flowchart is particularly helpful for users. My feedback focuses on minor improvements to terminology consistency and mathematical notation for enhanced clarity.


**How**: `--ring-degree N` (offline) or `--ring N` (online serving)

**Note**: Can combine with Ulysses: `ulysses_degree × ring_degree = total SP GPUs`.

medium

For better clarity and consistency with programming contexts, consider using * instead of × for multiplication. Additionally, ensure consistent terminology throughout the document; consider using 'Ulysses-SP' instead of 'Ulysses' when referring to the parallelism method, as defined in the heading, to avoid potential ambiguity.

Suggested change
**Note**: Can combine with Ulysses: `ulysses_degree × ring_degree = total SP GPUs`.
ulysses_degree * ring_degree = total SP GPUs
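
The relation `ulysses_degree * ring_degree = total SP GPUs` means the two degrees must form a factor pair of the sequence-parallel GPU count. The sketch below enumerates those pairs; the function name `sp_degree_pairs` is illustrative, and whether a given pair is actually supported for a particular model is not checked here.

```python
from typing import List, Tuple


def sp_degree_pairs(total_sp_gpus: int) -> List[Tuple[int, int]]:
    """List (ulysses_degree, ring_degree) pairs whose product is total_sp_gpus."""
    if total_sp_gpus < 1:
        raise ValueError("total_sp_gpus must be at least 1")
    return [
        (ulysses, total_sp_gpus // ulysses)
        for ulysses in range(1, total_sp_gpus + 1)
        if total_sp_gpus % ulysses == 0
    ]


for ulysses_degree, ring_degree in sp_degree_pairs(4):
    print(f"ulysses_degree={ulysses_degree}, ring_degree={ring_degree}")
```

For 4 GPUs this yields three candidate splits: pure Ring, a 2 by 2 hybrid, and pure Ulysses.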

…ructions

- Replace hardcoded model support tables with "Discovering Current
  Capabilities" section that points to source-of-truth files and
  grep commands for checking model support
- Add "Extending This Skill" section with step-by-step instructions
  for adding new lossless methods, lossy methods, quantization, and
  parallelism strategies
- Prefer "check the code" over static tables so the skill stays
  current as new optimization methods are added to vllm-omni

Made-with: Cursor

@hsliuustc0106 (Owner) left a comment

I found one blocking repository-consistency issue and one usability problem in the new skill. The frontmatter name does not match the directory, so this will fail the repo's validator once the branch is checked out, and the early parallelism guidance points readers to a support table that the skill never actually includes.

@@ -0,0 +1,434 @@
---
name: optimize-diffusion-perf

The frontmatter name should match the skill directory in this repo. With skills/vllm-omni-diffusion-perf-optim/ but name: optimize-diffusion-perf, scripts/validate_all.py will flag a consistency error (Directory name != frontmatter name) as soon as this branch is validated.
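
The consistency rule described here can be illustrated with a small check. This is a hypothetical sketch and not the repository's `scripts/validate_all.py`; it assumes the frontmatter is a leading block delimited by `---` lines that contains a plain `name:` key, and the helper names are made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional


def frontmatter_name(skill_md_text: str) -> Optional[str]:
    """Extract the `name:` value from a leading `---` delimited block."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", skill_md_text, re.DOTALL)
    if match is None:
        return None
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "name":
            return value.strip()
    return None


def name_matches_directory(skill_md_path: str, skill_md_text: str) -> bool:
    """Return True when the frontmatter name equals the parent directory name."""
    return frontmatter_name(skill_md_text) == Path(skill_md_path).parent.name


path = "skills/vllm-omni-diffusion-perf-optim/SKILL.md"
before = "---\nname: optimize-diffusion-perf\n---\n# Skill body\n"
after = "---\nname: vllm-omni-diffusion-perf-optim\n---\n# Skill body\n"

print(name_matches_directory(path, before))  # mismatch flagged in this review
print(name_matches_directory(path, after))   # name aligned with the directory
```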

Comment thread skills/vllm-omni-diffusion-perf-optim/SKILL.md Outdated

@hsliuustc0106 (Owner) left a comment

One structural comment after reading the whole skill: this SKILL.md is unusually long for this repo and a fair amount of the back half reads more like on-demand reference material than trigger-time guidance. I think the core optimization workflow belongs in the skill, but the capability-discovery details, recipe catalog, and maintenance instructions would be easier to consume if they were split into references/ files.

- Video models: 20–40 steps
- Distilled models: 4–8 steps

## Discovering Current Capabilities

From here onward the document shifts from a trigger-time workflow into deep reference material. That is useful content, but it does not need to live in SKILL.md to make the skill effective. This repo's architecture explicitly uses references/ to keep the main skill lean, so I would strongly consider moving Discovering Current Capabilities into a dedicated reference and leaving just a short pointer here.

Comment thread skills/vllm-omni-diffusion-perf-optim/SKILL.md
- Update baseline/optimization methodology to prioritize online serving
  measurements over offline benchmarks
- Clarify torch.compile behavior: first-request warmup penalty, recommend
  --enforce-eager for latency-sensitive deployments
- Update all Quick Recipes to show online serving (vllm serve) commands
- Add note about offline vs online measurement differences
- Update tips to emphasize online serving benchmarking

Made-with: Cursor
- Fix frontmatter name to match directory: vllm-omni-diffusion-perf-optim
- Replace "support table below" with link to parallelism_acceleration.md

Signed-off-by: samithuang <285365963@qq.com>