Add diffusion-perf-optim skill for diffusion inference optimization by SamitHuang · Pull Request #19 · hsliuustc0106/vllm-omni-skills

SamitHuang · 2026-03-18T03:20:38Z

Add a new vllm-omni-diffusion-perf-optim skill that provides a comprehensive guide for optimizing diffusion model inference in vLLM-Omni.
Covers all lossless methods (torch.compile, Ulysses/Ring/CFG/TP parallelism, CPU offload, VAE optimizations, FP8/GGUF quantization) and lossy methods (TeaCache, Cache-DiT with DBCache/TaylorSeer/SCM).
Includes per-model support tables, ready-to-use CLI recipes, and a decision flowchart.

Test Result — LTX-2 Online Serving Optimization (480×768, 41 frames, 20 steps, H800)

The skill was tested by systematically optimizing Lightricks/LTX-2 text-to-video inference with online serving (vllm serve):

| Configuration | Inference time (s) | Speedup | Type |
|---|---|---|---|
| Baseline (eager, 1 GPU) | 10.3 | 1.00× | n/a |
| torch.compile (1 GPU, warm request) | ~10.3 | ~1.00× | Lossless |
| FP8 quantization (eager, 1 GPU) | ~10.3 | ~1.00× | Lossless (VRAM savings) |
| 4-GPU Ulysses SP (eager) | ~10.3 | ~1.00× | Lossless |
| Cache-DiT (eager, 1 GPU) | 7.4 avg | ~1.4× | Lossy |
| 4-GPU Ulysses SP + Cache-DiT (eager) | 4.7 avg | ~2.2× | Lossless + Lossy |
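
The Speedup column is the baseline latency divided by each configuration's latency. A minimal sketch that reproduces the two non-trivial rows from the reported numbers (the dictionary and function names are illustrative only, not part of the skill):

```python
# Latencies in seconds, copied from the table above.
baseline_s = 10.3
latencies_s = {
    "Cache-DiT (eager, 1 GPU)": 7.4,
    "4-GPU Ulysses SP + Cache-DiT (eager)": 4.7,
}


def speedup(baseline: float, latency: float) -> float:
    """Speedup relative to the baseline; values above 1.0 mean faster."""
    return baseline / latency


# Round to one decimal place, matching the precision used in the table.
speedups = {
    name: round(speedup(baseline_s, latency), 1)
    for name, latency in latencies_s.items()
}

for name, value in speedups.items():
    print(f"{name}: {value}x")
```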

Key findings:

- Best combo: 4-GPU Ulysses SP + Cache-DiT = 2.2× speedup. The two methods synergize: Cache-DiT reduces per-step computation, making Ulysses SP communication overhead worthwhile.
- Cache-DiT alone provides ~1.4× speedup in online serving (single GPU).
- torch.compile warm-request latency matches the eager baseline on H800; the first request pays ~6 s of compilation overhead.
- Ulysses SP (4 GPUs) alone shows no measurable speedup for 41-frame generation, because communication overhead outweighs compute savings at this sequence length. However, it combines effectively with Cache-DiT.
- FP8 does not speed up LTX-2 on H800 (compute-bound) but reduces VRAM usage.

Note: Online serving numbers differ from offline benchmarks due to server overhead, IPC, and measurement methodology. Always benchmark in online serving mode for deployment decisions.
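
One way to separate the first-request penalty from steady-state latency when benchmarking a running server is to time the cold request on its own and average the following warm requests. The sketch below is a generic timing harness, not code from the skill: `send_request` is a hypothetical callable that you would replace with a real client call against your `vllm serve` endpoint, and `fake_request` is only a stand-in so the example runs by itself.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(send_request: Callable[[], object], warm_runs: int = 3) -> Dict[str, float]:
    """Time one cold request, then average `warm_runs` warm requests."""

    def timed() -> float:
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    cold_s = timed()  # first request: may include one-off costs such as compilation
    warm_s = [timed() for _ in range(warm_runs)]  # steady-state requests
    return {"cold_s": cold_s, "warm_avg_s": statistics.mean(warm_s)}


def fake_request() -> None:
    """Hypothetical stand-in for a real HTTP call to the server."""
    time.sleep(0.01)


result = measure_latency(fake_request)
print(f"cold: {result['cold_s']:.3f}s, warm avg: {result['warm_avg_s']:.3f}s")
```

Reporting the two numbers separately keeps a one-off warmup cost from being averaged into the latency that most requests will actually see.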

Recipe created with this skill: vllm-project/vllm-omni#1971

NOTE: this skill should be updated when a new optimization feature is added

Comprehensive guide covering all lossless (torch.compile, parallelism, CPU offload, quantization) and lossy (TeaCache, Cache-DiT) optimization methods for vLLM-Omni diffusion models. Includes per-model support tables, ready-to-use recipes, and a decision flowchart. Made-with: Cursor

gemini-code-assist · 2026-03-18T03:20:51Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new skill guide designed to help users significantly enhance the performance of diffusion model inference within vLLM-Omni. It consolidates a wide array of optimization strategies, ranging from quality-preserving techniques like torch.compile and various parallelism methods to quality-trading approaches such as caching and reduced inference steps. The guide aims to empower users with practical knowledge, including model-specific support, command-line recipes, and a clear decision-making flowchart, to effectively tune their diffusion pipelines for speed, latency, and VRAM efficiency.

Highlights

- New Skill Added: A new vllm-omni-diffusion-perf-optim skill has been added to provide a comprehensive guide for optimizing diffusion model inference in vLLM-Omni.
- Comprehensive Optimization Guide: The guide covers a wide range of optimization methods, including both lossless techniques (torch.compile, Ulysses/Ring/CFG/TP parallelism, CPU offload, VAE optimizations, FP8/GGUF quantization) and lossy methods (TeaCache, Cache-DiT with DBCache/TaylorSeer/SCM).
- Practical Resources: It includes per-model support tables, ready-to-use CLI recipes, and a decision flowchart to help users effectively apply optimizations.
- Performance Validation: Test results for LTX-2 text-to-video inference are provided, demonstrating significant speedups (up to 1.77x) with various optimization combinations, highlighting the effectiveness of Cache-DiT and Ulysses SP.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a comprehensive and well-structured guide for optimizing diffusion model inference in vLLM-Omni. The document effectively covers both lossless and lossy optimization methods, provides clear explanations, and includes practical recipes and support tables. The addition of a decision flowchart is particularly helpful for users. My feedback focuses on minor improvements to terminology consistency and mathematical notation for enhanced clarity.

gemini-code-assist · 2026-03-18T03:22:18Z

+
+**How**: `--ring-degree N` (offline) or `--ring N` (online serving)
+
+**Note**: Can combine with Ulysses: `ulysses_degree × ring_degree = total SP GPUs`.


For better clarity and consistency with programming contexts, consider using * instead of × for multiplication. Additionally, ensure consistent terminology throughout the document; consider using 'Ulysses-SP' instead of 'Ulysses' when referring to the parallelism method, as defined in the heading, to avoid potential ambiguity.

Suggested change

**Note**: Can combine with Ulysses: `ulysses_degree × ring_degree = total SP GPUs`.

ulysses_degree * ring_degree = total SP GPUs
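
The constraint quoted in this suggestion can be checked with a few lines of arithmetic. The sketch below simply enumerates every `(ulysses_degree, ring_degree)` pair whose product equals a given GPU count; the function name is illustrative, and whether a particular pair is actually supported for a given model is a separate question to verify against the vLLM-Omni documentation.

```python
from typing import List, Tuple


def sp_configs(total_gpus: int) -> List[Tuple[int, int]]:
    """Return every (ulysses_degree, ring_degree) pair whose product is total_gpus."""
    if total_gpus < 1:
        raise ValueError("total_gpus must be a positive integer")
    return [
        (ulysses, total_gpus // ulysses)
        for ulysses in range(1, total_gpus + 1)
        if total_gpus % ulysses == 0  # keep only exact factorizations
    ]


configs_4 = sp_configs(4)
print(configs_4)  # [(1, 4), (2, 2), (4, 1)]
```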

…ructions

- Replace hardcoded model support tables with "Discovering Current Capabilities" section that points to source-of-truth files and grep commands for checking model support
- Add "Extending This Skill" section with step-by-step instructions for adding new lossless methods, lossy methods, quantization, and parallelism strategies
- Prefer "check the code" over static tables so the skill stays current as new optimization methods are added to vllm-omni

Made-with: Cursor

hsliuustc0106

I found one blocking repository-consistency issue and one usability problem in the new skill. The frontmatter name does not match the directory, so this will fail the repo's validator once the branch is checked out, and the early parallelism guidance points readers to a support table that the skill never actually includes.

hsliuustc0106 · 2026-03-18T05:56:24Z

@@ -0,0 +1,434 @@
+---
+name: optimize-diffusion-perf


The frontmatter name should match the skill directory in this repo. With skills/vllm-omni-diffusion-perf-optim/ but name: optimize-diffusion-perf, scripts/validate_all.py will flag a consistency error (Directory name != frontmatter name) as soon as this branch is validated.
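
The kind of check described here can be sketched as follows. This is an assumed, simplified illustration of a directory-versus-frontmatter comparison, not the actual logic of `scripts/validate_all.py`; the helper names and the minimal frontmatter parsing are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def frontmatter_name(skill_md: Path) -> Optional[str]:
    """Return the `name:` value from a leading `---` frontmatter block, if any."""
    text = skill_md.read_text(encoding="utf-8")
    match = re.match(r"^---\s*\n(.*?)\n---\s*(\n|$)", text, flags=re.DOTALL)
    if not match:
        return None
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "name":
            return value.strip().strip("'\"")
    return None


def name_matches_directory(skill_dir: Path) -> bool:
    """True when SKILL.md's frontmatter name equals the directory name."""
    return frontmatter_name(skill_dir / "SKILL.md") == skill_dir.name


# Demonstration in a temporary directory: first the mismatch, then the fix.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "vllm-omni-diffusion-perf-optim"
    skill_dir.mkdir()
    skill_md = skill_dir / "SKILL.md"

    skill_md.write_text("---\nname: optimize-diffusion-perf\n---\n# Skill\n", encoding="utf-8")
    before_fix = name_matches_directory(skill_dir)

    skill_md.write_text("---\nname: vllm-omni-diffusion-perf-optim\n---\n# Skill\n", encoding="utf-8")
    after_fix = name_matches_directory(skill_dir)

print(before_fix, after_fix)
```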

hsliuustc0106

One structural comment after reading the whole skill: this SKILL.md is unusually long for this repo and a fair amount of the back half reads more like on-demand reference material than trigger-time guidance. I think the core optimization workflow belongs in the skill, but the capability-discovery details, recipe catalog, and maintenance instructions would be easier to consume if they were split into references/ files.

hsliuustc0106 · 2026-03-18T06:06:59Z

+- Video models: 20–40 steps
+- Distilled models: 4–8 steps
+
+## Discovering Current Capabilities


From here onward the document shifts from a trigger-time workflow into deep reference material. That is useful content, but it does not need to live in SKILL.md to make the skill effective. This repo's architecture explicitly uses references/ to keep the main skill lean, so I would strongly consider moving Discovering Current Capabilities into a dedicated reference and leaving just a short pointer here.

- Update baseline/optimization methodology to prioritize online serving measurements over offline benchmarks
- Clarify torch.compile behavior: first-request warmup penalty, recommend --enforce-eager for latency-sensitive deployments
- Update all Quick Recipes to show online serving (vllm serve) commands
- Add note about offline vs online measurement differences
- Update tips to emphasize online serving benchmarking

Made-with: Cursor

- Fix frontmatter name to match directory: vllm-omni-diffusion-perf-optim
- Replace "support table below" with link to parallelism_acceleration.md

Signed-off-by: samithuang <285365963@qq.com>

gemini-code-assist Bot reviewed Mar 18, 2026


SamitHuang mentioned this pull request Mar 18, 2026

[Doc] Add LTX-2 online serving deployment recipes with optimization benchmarks vllm-project/vllm-omni#1971

Merged

4 tasks

hsliuustc0106 reviewed Mar 18, 2026


SamitHuang added 2 commits March 18, 2026 07:54

Fix frontmatter name and support table reference

5776377


This was referenced Mar 20, 2026

[CI] Add Flux2 Klein Tests vllm-project/vllm-omni#2027

Merged

[Test] Add L4 diffusion feature test for LongCat-Image vllm-project/vllm-omni#1970

Merged

hsliuustc0106 merged commit 0e03d3a into hsliuustc0106:main Mar 21, 2026

wtomin mentioned this pull request Mar 26, 2026

[RFC]: vLLM-Omni Diffusion Module — Q2 2026 Roadmap vllm-project/vllm-omni#2226

Open

25 tasks

hsliuustc0106 merged 4 commits into hsliuustc0106:main from SamitHuang:diffusion-perf-optim
