[Model] Add two stages inference for model LTX-2 distilled. by Songrui625 · Pull Request #2260 · vllm-project/vllm-omni

Songrui625 · 2026-03-27T08:37:02Z

Purpose

This PR adds two-stage inference for the LTX-2 model, following huggingface/diffusers#12934.

Two-stage inference is the recommended approach for production-quality generation. This PR provides LTX2TwoStagesPipeline (for T2V) and LTX2ImageToVideoTwoStagesPipeline (for I2V) for this purpose. The pipeline is composed of two stages:

Stage 1: Generate a video at the target resolution using diffusion sampling with classifier-free guidance (CFG). This stage produces a coherent low-noise video sequence that respects the text/image conditioning.
Stage 2: Upsample the Stage 1 output by 2 and refine details using a distilled LoRA model to improve fidelity and visual quality. Stage 2 may apply lighter CFG to preserve the structure from Stage 1 while enhancing texture and sharpness.
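To make the resolution bookkeeping concrete, here is a minimal sketch of the shapes each stage works on. It assumes that "upsample by 2" means doubling both width and height while keeping the frame count; `VideoShape` and `two_stage_shapes` are illustrative names, not part of this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    width: int
    height: int
    num_frames: int


def two_stage_shapes(width: int, height: int, num_frames: int, upscale: int = 2):
    """Return the (stage1, stage2) shapes for a two-stage run.

    Assumption: Stage 2 scales width and height by `upscale` and leaves
    the number of frames unchanged.
    """
    stage1 = VideoShape(width, height, num_frames)
    stage2 = VideoShape(width * upscale, height * upscale, num_frames)
    return stage1, stage2


# Values taken from the sampling parameters used in this PR.
stage1, stage2 = two_stage_shapes(768, 512, 121)
print(stage1)
print(stage2)
```

Under that assumption, a 768x512 request would be refined at 1536x1024 in Stage 2.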

For now, only the distilled model rootonchair/LTX-2-19b-distilled is supported. The distilled LoRA weights ltx-2-19b-distilled-lora-384.safetensors are in the main repo, but the repo lacks the corresponding adapter config, so we cannot load them directly.

How to use

model: rootonchair/LTX-2-19b-distilled
Set --model-class-name to LTX2TwoStagesPipeline (for T2V) or LTX2ImageToVideoTwoStagesPipeline (for I2V)
Set --guidance-scale to 1.0
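The options above can be assembled programmatically, which avoids hand-escaping quotes inside long prompts. This is a small sketch that uses only flags appearing in this PR's commands; `build_t2v_command` is a hypothetical helper, and passing the hub id directly to --model (rather than a local path) is an assumption.

```python
import shlex


def build_t2v_command(model: str, prompt: str, output: str, two_stages: bool = True):
    """Build the argv list for text_to_video.py with the options listed above."""
    argv = ["python3", "text_to_video.py", "--model", model]
    if two_stages:
        # Select the two-stage T2V pipeline added by this PR.
        argv += ["--model-class-name", "LTX2TwoStagesPipeline"]
    argv += ["--prompt", prompt, "--guidance-scale", "1.0", "--output", output]
    return argv


argv = build_t2v_command(
    "rootonchair/LTX-2-19b-distilled",
    'She says: "Welcome to Rosie\'s."',
    "ltx2_t2v_2st_sample.mp4",
)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

The argv list can also be passed straight to `subprocess.run(argv)`, in which case no shell quoting is involved at all.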

Test Plan

All regression tests pass, confirming there are no breaking changes. We also need to benchmark the acceleration methods for the two-stage pipeline.

Text-To-Video

LTX-2 distilled T2V single stage sample

python3 text_to_video.py --model /data00/models/LTX-2-19b-distilled --prompt 'A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red "Diner" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: "Welcome to Rosie'\''s. What can I get for you today?" The mood is inviting, timeless, and full of classic American diner charm.' --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." --guidance-scale 1.0 --width 768 --height 512 --num-inference-steps 8 --num-frames 121 --frame-rate 24 --seed 42 --enforce-eager --output ltx2_t2v_sample.mp4

LTX-2 distilled T2V two stages sample

python3 text_to_video.py --model /data00/models/LTX-2-19b-distilled --model-class-name LTX2TwoStagesPipeline --prompt 'A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red "Diner" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: "Welcome to Rosie'\''s. What can I get for you today?" The mood is inviting, timeless, and full of classic American diner charm.' --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." --guidance-scale 1.0 --width 768 --height 512 --num-inference-steps 8 --num-frames 121 --frame-rate 24 --seed 42 --enforce-eager --output ltx2_t2v_2st_sample.mp4

Image-To-Video

LTX-2 distilled I2V two stages sample

python3 image_to_video.py --model /data00/models/LTX-2-19b-distilled --image /data00/ltx2_i2v_input.png --model-class-name LTX2ImageToVideoTwoStagesPipeline --prompt "A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red \"Diner\" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: \"Welcome to Rosie's. What can I get for you today?\" The mood is inviting, timeless, and full of classic American diner charm." --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." --guidance-scale 1.0 --width 768 --height 512 --num-inference-steps 8 --num-frames 121 --fps 24 --seed 45 --enforce-eager --output ltx2_i2v_2st_sample.mp4

LTX-2 distilled I2V single stage sample

python3 image_to_video.py --model /data00/models/LTX-2-19b-distilled --image /data00/ltx2_i2v_input.png --model-class-name LTX2ImageToVideoPipeline --prompt "A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red \"Diner\" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: \"Welcome to Rosie's. What can I get for you today?\" The mood is inviting, timeless, and full of classic American diner charm." --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." --guidance-scale 1.0 --width 768 --height 512 --num-inference-steps 8 --num-frames 121 --fps 24 --seed 45 --enforce-eager --output ltx2_i2v_sample.mp4

Test Result

Text-To-Video

single stage inference:

ltx2_t2v_sample.mp4

two stages inference:

ltx2_t2v_2st_sample.mp4

Image-To-Video

single stage inference:

ltx2_i2v_sample.mp4

two stages inference:

ltx2_i2v_2st_sample.mp4

Benchmark

We benchmarked with text_to_video.py and image_to_video.py, both with the option '--enforce-eager', on an NVIDIA H20.

Sampling Parameters:

model: rootonchair/LTX-2-19b-distilled
prompt: A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red "Diner" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: "Welcome to Rosie's. What can I get for you today?" The mood is inviting, timeless, and full of classic American diner charm.
width: 768
height: 512
guidance_scale: 1.0
num_frames: 121
frame_rate: 24
seed: 45
image(for I2V):

Pipeline	Cache-DiT	Ulysses-SP 2	Ring Attention 2	CFG Parallel 2
Text-To-Video (LTX2TwoStagesPipeline)	49.5590 s	31.7546 s	31.4847 s	51.3056 s
Image-To-Video (LTX2ImageToVideoTwoStagesPipeline)	52.7701 s	32.0508 s	34.0860 s	52.6714 s
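For a quick comparison between columns, the timings can be expressed as ratios. This is only a sketch: the table has no unaccelerated baseline, so using the Cache-DiT column as the reference is an arbitrary choice made here, and `relative_to` is a hypothetical helper.

```python
# Wall-clock timings in seconds, copied from the benchmark table above.
timings_s = {
    "Text-To-Video": {
        "Cache-DiT": 49.5590,
        "Ulysses-SP 2": 31.7546,
        "Ring Attention 2": 31.4847,
        "CFG Parallel 2": 51.3056,
    },
    "Image-To-Video": {
        "Cache-DiT": 52.7701,
        "Ulysses-SP 2": 32.0508,
        "Ring Attention 2": 34.0860,
        "CFG Parallel 2": 52.6714,
    },
}


def relative_to(row: dict, reference: str) -> dict:
    """Return reference_time / time for each method (values above 1.0 are faster)."""
    ref = row[reference]
    return {name: ref / seconds for name, seconds in row.items()}


for task, row in timings_s.items():
    ratios = relative_to(row, "Cache-DiT")
    print(task, {name: round(value, 2) for name, value in ratios.items()})
```

These ratios compare end-to-end times only; they say nothing about output quality or about the number of devices each method used.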

Future Work

L4 E2E test for LTX-2
Distilled LoRA support, so that Lightricks/LTX-2 can use two-stage inference

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9551a1b24d

Songrui625 · 2026-03-27T08:52:53Z

@david6666666 PTAL! Thanks!

lishunyang12 · 2026-03-27T08:59:03Z

please open a new RFC and attach your design doc using this template in your RFC

Songrui625 · 2026-03-27T09:04:10Z

please open a new RFC and attach your design doc using this template in your RFC

Sorry to bother you. This PR is more of a model-related implementation than a new feature. I have already changed both the PR title and commit title.

david6666666

I believe we need to test all acceleration methods of LTX-2 T2V and I2V to ensure functionality is unaffected. Alternatively, we could add an E2E L4 level test for monitoring, similar to #2087.

Songrui625 · 2026-03-31T08:51:39Z

I believe we need to test all acceleration methods of LTX-2 T2V and I2V to ensure functionality is unaffected. Alternatively, we could add an E2E L4 level test for monitoring, similar to #2087.

Thanks, David. All regression tests for the acceleration methods (Ulysses-SP, Ring Attention, Cache-DiT and CFG-Parallel) pass, and I have attached the benchmark results for the two-stage pipeline to the PR description. PTAL again.

david6666666 · 2026-03-31T09:04:33Z

@SamitHuang help take a look, thx

Songrui625 · 2026-04-01T08:03:49Z

@SamitHuang Hi, this PR is ready to go on. Please take a look, thanks!

Songrui625 · 2026-04-03T06:56:17Z

@lishunyang12 @SamitHuang @wtomin This PR is ready. PTAL, thanks! 😊

lishunyang12

LGTM

Songrui625 · 2026-04-15T07:53:18Z

LGTM. Can you help create an L4 e2e test in a follow-up PR, covering the existing supported diffusion features (see #1217)? For how to create an L4 e2e test, please refer to #1832.

The PR adding the L4 e2e tests has been submitted: #2815. PTAL, thanks.

…s API

PR vllm-project#2309 renamed DiffusionLoRAManager.set_active_adapter (singular) to set_active_adapters (plural) with list signatures. The LTX2 distilled stage was added upstream in vllm-project#2260 after vllm-project#2309 branched, so its two call sites were written against the old singular API and broke when this branch merged upstream/main. Wrap the single LoRARequest / scale in one-element lists to match the new signature; behavior is identical.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ultranationalism <www913363043@gmail.com>
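The call-site change described in that commit message can be illustrated with stand-in classes. Everything below is hypothetical: `LoRARequestStub` and `LoRAManagerStub` are not the vllm-omni classes, and the parameter names and order of `set_active_adapters` are assumptions; only the "wrap in one-element lists" pattern comes from the commit message.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LoRARequestStub:
    """Hypothetical stand-in for a LoRA request; not the vllm-omni class."""
    name: str


class LoRAManagerStub:
    """Hypothetical stand-in exposing only a plural, list-based method."""

    def __init__(self) -> None:
        self.active: List[Tuple[LoRARequestStub, float]] = []

    def set_active_adapters(
        self, requests: List[LoRARequestStub], scales: List[float]
    ) -> None:
        # Assumed contract: one scale per request.
        if len(requests) != len(scales):
            raise ValueError("requests and scales must have the same length")
        self.active = list(zip(requests, scales))


manager = LoRAManagerStub()
request = LoRARequestStub(name="distilled-lora")
scale = 1.0

# Old call shape (singular API): manager.set_active_adapter(request, scale)
# New call shape: wrap the single request and scale in one-element lists.
manager.set_active_adapters([request], [scale])
print(manager.active)
```

With a single adapter the result is the same as before the rename; the lists only become meaningful once several adapters are active at once.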

Songrui625 requested a review from hsliuustc0106 as a code owner March 27, 2026 08:37

Songrui625 force-pushed the ltx2-2-stage branch from 9551a1b to 32b1d20 Compare March 27, 2026 08:39

chatgpt-codex-connector Bot reviewed Mar 27, 2026

Comment thread vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py Outdated

Comment thread vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py

Comment thread vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py

Songrui625 force-pushed the ltx2-2-stage branch 2 times, most recently from 28df8a8 to 8a21b38 Compare March 27, 2026 08:50

Songrui625 changed the title ~~[Feature] Add two stages inference for model LTX-2 distilled.~~ [Model] Add two stages inference for model LTX-2 distilled. Mar 27, 2026

Songrui625 force-pushed the ltx2-2-stage branch from 8a21b38 to bdec48c Compare March 27, 2026 09:25

hsliuustc0106 requested review from ZJY0516 and david6666666 March 27, 2026 09:37

david6666666 requested changes Mar 30, 2026

Comment thread vllm_omni/diffusion/models/ltx2/pipeline_ltx2_image2video.py Outdated

Comment thread vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py Outdated

Songrui625 force-pushed the ltx2-2-stage branch 6 times, most recently from 429876c to 793d43c Compare March 31, 2026 08:48

Songrui625 requested a review from david6666666 March 31, 2026 08:52

Songrui625 force-pushed the ltx2-2-stage branch 4 times, most recently from 1a9de72 to 564075f Compare April 1, 2026 07:58

Songrui625 force-pushed the ltx2-2-stage branch from 564075f to 0d6640b Compare April 2, 2026 06:22

hsliuustc0106 requested review from SamitHuang and wtomin April 2, 2026 06:38

Songrui625 added 14 commits April 3, 2026 13:03

Apply linter and formatter

1849e14

Signed-off-by: Songrui625 <songrui625@gmail.com>

Update supported models document

02ac2ef

Signed-off-by: Songrui625 <songrui625@gmail.com>

Fix missing declaration

79375be

Signed-off-by: Songrui625 <songrui625@gmail.com>

Fix audio latents lack of padding in SP

57705ea

Signed-off-by: Songrui625 <songrui625@gmail.com>

Do not duplicate RoPE coodinate in CFG parallel path

b9ddec5

Signed-off-by: Songrui625 <songrui625@gmail.com>

Use self-contained progress bar rather than tqdm

917e9f2

Signed-off-by: Songrui625 <songrui625@gmail.com>

Fix cache_dit acceleration in two stages pipeline of LTX-2

e6c5d12

Signed-off-by: Songrui625 <songrui625@gmail.com>

Fix Ulysses-SP inference of pipeline LTX2TwoStagesPipeline and `LTX2ImageToVideoTwoStagesPipeline`

e7424e0

Signed-off-by: Songrui625 <songrui625@gmail.com>

Apply formatter

38df79e

Signed-off-by: Songrui625 <songrui625@gmail.com>

Register two-stages pipeline of LTX-2 to cache dit dedicated adapter

7469f48

Signed-off-by: Songrui625 <songrui625@gmail.com>

Apply formatter

b11022a

Signed-off-by: Songrui625 <songrui625@gmail.com>

Fix typo

7e9691c

Signed-off-by: Songrui625 <songrui625@gmail.com>

Use logger helper function to init logger and apply formatter

6285eda

Signed-off-by: Songrui625 <songrui625@gmail.com>

Use dedicate request object in the second stage inference

746cb6a

Signed-off-by: Songrui625 <songrui625@gmail.com>

Songrui625 force-pushed the ltx2-2-stage branch from 2e031a1 to 746cb6a Compare April 3, 2026 05:30

Songrui625 requested a review from lishunyang12 April 3, 2026 05:35

Fix typo

3315be7

Signed-off-by: Songrui625 <songrui625@gmail.com>

lishunyang12 approved these changes Apr 3, 2026

david6666666 merged commit cd71567 into vllm-project:main Apr 3, 2026
8 checks passed

linyueqian pushed a commit to JuanPZuluaga/vllm-omni that referenced this pull request Apr 3, 2026

[Model] Add two stages inference for model LTX-2 distilled. (vllm-project#2260)

1d3cfa4

Signed-off-by: Songrui625 <songrui625@gmail.com> Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

TKONIY mentioned this pull request Apr 4, 2026

[Diffusion] Refactor LTX2 to use unified CFG parallel framework #2160

Merged

skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026

[Model] Add two stages inference for model LTX-2 distilled. (vllm-project#2260)

9ee34ec

Signed-off-by: Songrui625 <songrui625@gmail.com>

vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026

[Model] Add two stages inference for model LTX-2 distilled. (vllm-project#2260)

9df002b

Signed-off-by: Songrui625 <songrui625@gmail.com>

This was referenced Apr 14, 2026

[Feature]: Distilled LoRA support for Diffusion Models #2782

Open

[Test] Add L4 complete diffusion feature test for LTX-2 model #2815

Open

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Model] Add two stages inference for model LTX-2 distilled. (vllm-project#2260)

f0af189

Signed-off-by: Songrui625 <songrui625@gmail.com>

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Model] Add two stages inference for model LTX-2 distilled. (vllm-project#2260)

5872e16

Signed-off-by: Songrui625 <songrui625@gmail.com>
