[Diffusion] Support USP and VAE patch parallel for HunyuanVideo 1.5 by david6666666 · Pull Request #3979 · vllm-project/vllm-omni

david6666666 · 2026-05-29T08:36:24Z

Summary

- Add a distributed HunyuanVideo 1.5 VAE wrapper with VAE patch parallel support for both tiled encode and tiled decode.
- Wire the wrapper into the HunyuanVideo 1.5 T2V and I2V pipelines.
- Make SP trim encoder padding tokens before attention so USP outputs match the single-GPU fast path.
- Fix VAE patch executor rank assignment when VAEPP uses fewer ranks than the DiT world size.
- Update feature docs and add tile split/merge unit coverage for HunyuanVideo 1.5 VAE encode/decode.
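To illustrate why trimming padding tokens can matter, here is a minimal sketch in plain Python. It assumes (the thread does not confirm this) that the parallel attention path cannot apply the encoder padding mask, so padded keys would otherwise receive attention weight. The helper names `softmax` and `attend` and the toy tensors are hypothetical and are not taken from the PR.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, mask=None):
    """Single-query dot-product attention; mask[i] == False hides key i."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

query = [1.0, 0.5]
# Toy encoder sequence: the last two tokens are padding.
keys = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [True, True, False, False]
n_valid = sum(mask)

masked = attend(query, keys, values, mask)                   # padded, mask applied
trimmed = attend(query, keys[:n_valid], values[:n_valid])    # padding removed up front
unmasked = attend(query, keys, values)                       # padded, mask ignored
```

In this toy case `masked` and `trimmed` agree, while `unmasked` drifts toward the padding values, which is the kind of mismatch that trimming before attention avoids.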

Accuracy and Performance

All runs used B300 GPUs 4-7, eager mode, 480p, 33 frames, 50 inference steps, seed 42.
Videos are GitHub-hosted release assets from the fork and are not committed to the codebase.

Case	Config	Time	Accuracy	Video
T2V baseline	1 GPU, no tiling	21.6034s	reference	mp4
T2V baseline	1 GPU, VAE tiling	21.7741s	reference	mp4
T2V	USP2 + VAEPP2	15.5280s	SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline	mp4
T2V	CFG2 + USP2 + VAEPP4	8.0243s	SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline	mp4
I2V baseline	1 GPU, VAE tiling	22.8928s	reference	mp4
I2V	CFG2 + USP2 + VAEPP4	9.0800s	SSIM 1.000000 / PSNR inf vs tiling baseline	mp4
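For readers unfamiliar with the "PSNR inf" entries, the following is a minimal sketch of how PSNR is commonly computed. The thread does not say which tool produced the numbers, so the function name `psnr`, the flat-list input, and the 8-bit peak value of 255 are assumptions for illustration only.

```python
import math

def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0.0:
        # Identical inputs: zero error, so PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

identical = psnr([10, 20, 30], [10, 20, 30])
different = psnr([10, 20, 30], [11, 19, 30])
```

A PSNR of infinity therefore means the compared outputs have zero mean squared error, which is consistent with the SSIM of 1.000000 reported against the tiling baseline.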

Compatibility smoke tests:

Case	Config	Time	Result	Video
T2V smoke	TP2 + USP2 + VAEPP4, 5 frames / 2 steps	0.6139s	passed	mp4
T2V smoke	TP2 + CFG2 + VAEPP4, 5 frames / 2 steps	0.4062s	passed	mp4

Tests

python -m compileall -q on changed Python files
pytest tests/diffusion/distributed/test_autoencoder_kl_hunyuan.py -q
pre-commit run --files docs/user_guide/diffusion_features.md tests/diffusion/distributed/test_autoencoder_kl_hunyuan.py vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl_hunyuan_video_15.py vllm_omni/diffusion/distributed/autoencoders/distributed_vae_executor.py vllm_omni/diffusion/models/hunyuan_video/hunyuan_video_15_transformer.py vllm_omni/diffusion/models/hunyuan_video/pipeline_hunyuan_video_1_5.py vllm_omni/diffusion/models/hunyuan_video/pipeline_hunyuan_video_1_5_i2v.py

Note: local validation emitted the existing vLLM/vLLM-Omni major/minor mismatch warning (vLLM-Omni 0.20.1.dev139, vLLM 0.21.0), but the checks and offline runs completed successfully.

Signed-off-by: david6666666 <530634352@qq.com>

chatgpt-codex-connector · 2026-05-29T08:36:32Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

david6666666 · 2026-05-29T08:44:53Z

Added GitHub-hosted validation videos for each generated case. These files are release assets in the fork and are not committed to the codebase.

Case	Config	Time	Accuracy / Result	Video
T2V baseline	1 GPU, no tiling	21.6034s	reference	mp4
T2V baseline	1 GPU, VAE tiling	21.7741s	reference	mp4
T2V	USP2 + VAEPP2	15.5280s	SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline	mp4
T2V	CFG2 + USP2 + VAEPP4	8.0243s	SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline	mp4
I2V baseline	1 GPU, VAE tiling	22.8928s	reference	mp4
I2V	CFG2 + USP2 + VAEPP4	9.0800s	SSIM 1.000000 / PSNR inf vs tiling baseline	mp4
T2V smoke	TP2 + USP2 + VAEPP4, 5 frames / 2 steps	0.6139s	passed	mp4
T2V smoke	TP2 + CFG2 + VAEPP4, 5 frames / 2 steps	0.4062s	passed	mp4

david6666666 · 2026-05-29T09:02:26Z

        # 2. local decode
        assigned = self._balance_tasks(tiletask_list, pp_size)
-        local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+        local_tasks = assigned[self.rank] if self.rank < pp_size else []


fixed by #3928
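A small sketch of why the guard changed, assuming the balancer returns exactly one bucket per VAEPP rank. The internals of `_balance_tasks` are not shown in the thread, so the round-robin `balance_tasks` below is a hypothetical stand-in, as are the other helper names.

```python
def balance_tasks(tasks, pp_size):
    """Hypothetical round-robin stand-in: one bucket per VAEPP rank."""
    buckets = [[] for _ in range(pp_size)]
    for i, task in enumerate(tasks):
        buckets[i % pp_size].append(task)
    return buckets

def local_tasks_old(assigned, rank, pp_size, world_size):
    # Old guard: true for every rank whenever pp_size <= world_size.
    return assigned[rank] if pp_size <= world_size else []

def local_tasks_new(assigned, rank, pp_size):
    # New guard: only ranks inside the VAEPP group index into the buckets.
    return assigned[rank] if rank < pp_size else []

tasks = list(range(6))
pp_size, world_size = 2, 4
assigned = balance_tasks(tasks, pp_size)

new_plan = [local_tasks_new(assigned, r, pp_size) for r in range(world_size)]

old_failures = []
for rank in range(world_size):
    try:
        local_tasks_old(assigned, rank, pp_size, world_size)
    except IndexError:
        old_failures.append(rank)
```

Under these assumptions, the old guard lets ranks 2 and 3 index past the end of `assigned`, while the new guard gives them an empty task list.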

david6666666 · 2026-05-30T01:58:45Z

@gcanlin @lishunyang12 ptal thx

david6666666 · 2026-05-30T02:07:53Z

VAE Patch Parallel Flow

This diagram shows where VAE encode/decode happens in HunyuanVideo 1.5 and how VAE patch parallelism distributes tiled VAE work across ranks.

flowchart TD
    A["Pipeline calls VAE"] --> B{"T2V or I2V?"}

    B -->|"T2V"| C["Denoised latents"]
    C --> D["VAE decode"]

    B -->|"I2V"| E["Input image"]
    E --> F["VAE encode: image -> image_latents"]
    F --> G["Use image_latents as first-frame condition"]
    G --> H["Denoised latents"]
    H --> D

    D --> I{"vae_patch_parallel_size > 1 and use_tiling?"}
    F --> I

    I -->|"No"| J["Use original diffusers tiled_encode / tiled_decode"]
    I -->|"Yes"| K["Distributed VAE Executor"]

    K --> L["1. Split tensor into H/W spatial tiles"]
    L --> M["2. Balance tile workload across VAEPP ranks"]
    M --> N0["Rank 0 executes assigned tiles"]
    M --> N1["Rank 1 executes assigned tiles"]
    M --> N2["Rank 2 executes assigned tiles"]
    M --> N3["Rank 3 executes assigned tiles"]

    N0 --> O["all_gather tile outputs and metadata"]
    N1 --> O
    N2 --> O
    N3 --> O

    O --> P["Rank 0 reconstructs tile grid by coordinates"]
    P --> Q["Blend overlap regions with blend_v / blend_h"]
    Q --> R["Crop row_limit regions and concatenate"]
    R --> S["Broadcast full result back to all ranks"]

    S --> T{"Current VAE operation"}
    T -->|"encode"| U["Return full image_latents"]
    T -->|"decode"| V["Return full video pixels"]
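The flow above can be simulated in a single process. This is a simplified sketch, not the executor from the PR: tiles do not overlap (so there is no blend step), the per-tile function `run_tile` is a stand-in for the VAE, the round-robin split is an assumed balancing policy, and all function names are hypothetical.

```python
def split_tiles(grid, tile):
    """Cut a 2D list into (row, col, block) tiles without overlap."""
    tiles = []
    for r in range(0, len(grid), tile):
        for c in range(0, len(grid[0]), tile):
            block = [row[c:c + tile] for row in grid[r:r + tile]]
            tiles.append((r, c, block))
    return tiles

def run_tile(block):
    """Stand-in for running the VAE on one tile."""
    return [[2 * x + 1 for x in row] for row in block]

def merge_tiles(results, height, width):
    """Place each tile output back at its recorded coordinates."""
    out = [[None] * width for _ in range(height)]
    for r, c, block in results:
        for dr, row in enumerate(block):
            for dc, value in enumerate(row):
                out[r + dr][c + dc] = value
    return out

def distributed_run(grid, tile, num_ranks):
    tiles = split_tiles(grid, tile)
    # Round-robin balancing: rank k takes tiles k, k + num_ranks, ...
    per_rank = [tiles[rank::num_ranks] for rank in range(num_ranks)]
    # Each simulated rank processes only its own tiles.
    rank_outputs = [[(r, c, run_tile(b)) for r, c, b in mine] for mine in per_rank]
    # Simulated all_gather: every tile output arrives with its coordinates.
    gathered = [item for outputs in rank_outputs for item in outputs]
    return merge_tiles(gathered, len(grid), len(grid[0]))

grid = [[r * 6 + c for c in range(6)] for r in range(4)]
single = merge_tiles(
    [(r, c, run_tile(b)) for r, c, b in split_tiles(grid, 2)], 4, 6
)
parallel = distributed_run(grid, 2, num_ranks=4)
```

Because the merge is keyed on tile coordinates rather than arrival order, the simulated parallel result equals the single-process one.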

Precision Impact

VAEPP does not change the VAE math relative to the single-GPU tiled VAE path. It only changes execution placement: tiles are computed on different ranks, then gathered and merged in the same grid order.

flowchart LR
    A["Single-GPU tiled VAE"] --> A1["Same tile split"]
    A1 --> A2["Same encoder / decoder"]
    A2 --> A3["Same overlap blend"]
    A3 --> A4["Same concat order"]

    B["VAE patch parallel"] --> B1["Same tile split"]
    B1 --> B2["Different ranks execute different tiles"]
    B2 --> B3["all_gather tile outputs; rank 0 merges"]
    B3 --> B4["Same overlap blend"]
    B4 --> B5["Same concat order"]

    A4 --> C["Output"]
    B5 --> C

    C --> D["Matches single-GPU tiling baseline"]
    D --> E["SSIM 1.0 / PSNR inf in validation"]

For T2V, the VAEPP coverage is decode-only because there is no input image to encode. For I2V, the coverage includes both encode and decode: the input image is encoded into first-frame condition latents, and the final denoised latents are decoded back to video frames.
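The overlap blend named in the diagrams can be pictured as a linear crossfade. The sketch below is a 1D version modeled on the crossfade commonly used by tiled VAE helpers. It is an assumption that HunyuanVideo 1.5 blends this way, and `blend_1d` is a hypothetical name.

```python
def blend_1d(left, right, blend_extent):
    """Crossfade the tail of `left` into the head of `right`."""
    blend_extent = min(len(left), len(right), blend_extent)
    blended = list(right)
    for x in range(blend_extent):
        weight = x / blend_extent  # 0 at the seam start, approaching 1
        blended[x] = (
            left[len(left) - blend_extent + x] * (1 - weight)
            + right[x] * weight
        )
    return blended

left_tile = [1.0, 1.0, 1.0, 1.0]
right_tile = [3.0, 3.0, 3.0, 3.0]

# The blend depends only on tile values and order, not on which rank made them.
from_single_gpu = blend_1d(left_tile, right_tile, 2)
from_gathered_ranks = blend_1d(list(left_tile), list(right_tile), 2)
```

Since the blend runs after the gather, on the same values and in the same grid order, it yields the same result as the single-GPU tiled path in this sketch.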

gcanlin

LGTM.

gcanlin

Can we add usp + vae pp to CI?

david6666666 · 2026-05-30T06:55:55Z

> Can we add usp + vae pp to CI?

Will add an x2v functional test as a follow-up.

…llm-project#3979) Signed-off-by: david6666666 <530634352@qq.com>

Support HunyuanVideo 1.5 USP and VAE patch parallel

9c36d65

Signed-off-by: david6666666 <530634352@qq.com>

david6666666 requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, hsliuustc0106, princepride, wtomin and yenuo26 as code owners May 29, 2026 08:36

david6666666 marked this pull request as draft May 29, 2026 08:39

david6666666 commented May 29, 2026

Merge branch 'main' into codex/hv15-usp-vae-pp

dbb746c

david6666666 marked this pull request as ready for review May 30, 2026 01:42

gcanlin approved these changes May 30, 2026

gcanlin added the ready label to trigger buildkite CI label May 30, 2026

gcanlin reviewed May 30, 2026

Merge branch 'main' into codex/hv15-usp-vae-pp

313dc8e

david6666666 merged commit a345675 into vllm-project:main May 30, 2026
7 of 8 checks passed

david6666666 mentioned this pull request Jun 1, 2026

[Rebase] Rebase to vllm releases/v0.22.0 #3891

Merged

5 tasks

86MaxCao pushed a commit to 86MaxCao/vllm-omni that referenced this pull request Jun 4, 2026

[Diffusion] Support USP and VAE patch parallel for HunyuanVideo 1.5 (v…

1842579

…llm-project#3979) Signed-off-by: david6666666 <530634352@qq.com>

david6666666 merged 3 commits into vllm-project:main from david6666666:codex/hv15-usp-vae-pp
