[Diffusion] Support USP and VAE patch parallel for HunyuanVideo 1.5 #3979

Merged: david6666666 merged 3 commits into vllm-project:main from david6666666:codex/hv15-usp-vae-pp on May 30, 2026


@david6666666 david6666666 commented May 29, 2026

Summary

  • add a distributed HunyuanVideo 1.5 VAE wrapper with VAE patch parallel support for both tiled encode and tiled decode
  • wire the wrapper into HunyuanVideo 1.5 T2V and I2V pipelines
  • make SP trim encoder padding tokens before attention so USP outputs match the single-GPU fast path
  • fix VAE patch executor rank assignment when VAEPP uses fewer ranks than the DiT world size
  • update feature docs and add tile split/merge unit coverage for HunyuanVideo 1.5 VAE encode/decode
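
The third bullet (trimming encoder padding tokens before attention) can be illustrated with a toy example. This is a minimal sketch, not the PR's code: `attention` and `trim_padding` are hypothetical helpers, and the tokens are tiny hand-written vectors. It only shows why unmasked padding tokens left in the key/value set change the attention output, which is the kind of mismatch the trim is meant to avoid.

```python
import math


def attention(query, keys, values):
    """Single-query softmax attention over scalar-valued tokens (toy sizes)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


def trim_padding(tokens, mask):
    """Keep only tokens whose mask entry is 1 (hypothetical helper)."""
    return [t for t, m in zip(tokens, mask) if m]


query = [1.0, 0.5]
# The last two encoder tokens are padding (all zeros), flagged by the mask.
keys = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.0], [0.0, 0.0]]
values = [1.0, 2.0, 0.0, 0.0]
mask = [1, 1, 0, 0]

# Attention over real tokens only.
trimmed = attention(query, trim_padding(keys, mask), trim_padding(values, mask))
# Attention that still sees the padding tokens (no mask applied).
padded = attention(query, keys, values)
```

Here `trimmed` and `padded` differ because the zero-valued padding keys still receive softmax weight, so a path that keeps padding and a path that drops it cannot be expected to agree.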

Accuracy and Performance

All runs used B300 GPUs 4-7, eager mode, 480p, 33 frames, 50 inference steps, seed 42.
Videos are GitHub-hosted release assets from the fork and are not committed to the codebase.

| Case | Config | Time | Accuracy | Video |
| --- | --- | --- | --- | --- |
| T2V baseline | 1 GPU, no tiling | 21.6034s | reference | mp4 |
| T2V baseline | 1 GPU, VAE tiling | 21.7741s | reference | mp4 |
| T2V | USP2 + VAEPP2 | 15.5280s | SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline | mp4 |
| T2V | CFG2 + USP2 + VAEPP4 | 8.0243s | SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline | mp4 |
| I2V baseline | 1 GPU, VAE tiling | 22.8928s | reference | mp4 |
| I2V | CFG2 + USP2 + VAEPP4 | 9.0800s | SSIM 1.000000 / PSNR inf vs tiling baseline | mp4 |
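
For readers unfamiliar with the "PSNR inf" entries: PSNR is defined as 10 * log10(MAX^2 / MSE), so it is infinite exactly when the two outputs are identical (MSE of zero). The sketch below is illustrative only. The PR does not show its evaluation script, and the 8-bit peak value of 255 is an assumption.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value=255.0 assumes 8-bit pixels; the PR does not state its setting.
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs: no noise to measure
    return 10 * math.log10(max_value ** 2 / mse)


identical = psnr([10, 20, 30], [10, 20, 30])
different = psnr([10, 20, 30], [11, 20, 30])
```

`identical` is infinite, while any pixel difference, as in `different`, yields a finite value.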

Compatibility smoke tests:

| Case | Config | Time | Result | Video |
| --- | --- | --- | --- | --- |
| T2V smoke | TP2 + USP2 + VAEPP4, 5 frames / 2 steps | 0.6139s | passed | mp4 |
| T2V smoke | TP2 + CFG2 + VAEPP4, 5 frames / 2 steps | 0.4062s | passed | mp4 |

Tests

  • python -m compileall -q on changed Python files
  • pytest tests/diffusion/distributed/test_autoencoder_kl_hunyuan.py -q
  • pre-commit run --files docs/user_guide/diffusion_features.md tests/diffusion/distributed/test_autoencoder_kl_hunyuan.py vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl_hunyuan_video_15.py vllm_omni/diffusion/distributed/autoencoders/distributed_vae_executor.py vllm_omni/diffusion/models/hunyuan_video/hunyuan_video_15_transformer.py vllm_omni/diffusion/models/hunyuan_video/pipeline_hunyuan_video_1_5.py vllm_omni/diffusion/models/hunyuan_video/pipeline_hunyuan_video_1_5_i2v.py

Note: local validation emitted the existing vLLM/vLLM-Omni major/minor mismatch warning (vLLM-Omni 0.20.1.dev139, vLLM 0.21.0), but the checks and offline runs completed successfully.

Signed-off-by: david6666666 <530634352@qq.com>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@david6666666 david6666666 marked this pull request as draft May 29, 2026 08:39
@david6666666
Collaborator Author

Added GitHub-hosted validation videos for each generated case. These files are release assets in the fork and are not committed to the codebase.

| Case | Config | Time | Accuracy / Result | Video |
| --- | --- | --- | --- | --- |
| T2V baseline | 1 GPU, no tiling | 21.6034s | reference | mp4 |
| T2V baseline | 1 GPU, VAE tiling | 21.7741s | reference | mp4 |
| T2V | USP2 + VAEPP2 | 15.5280s | SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline | mp4 |
| T2V | CFG2 + USP2 + VAEPP4 | 8.0243s | SSIM 1.000000 / PSNR inf vs tiling baseline; SSIM 0.959790 / PSNR 38.925356 vs no-tiling baseline | mp4 |
| I2V baseline | 1 GPU, VAE tiling | 22.8928s | reference | mp4 |
| I2V | CFG2 + USP2 + VAEPP4 | 9.0800s | SSIM 1.000000 / PSNR inf vs tiling baseline | mp4 |
| T2V smoke | TP2 + USP2 + VAEPP4, 5 frames / 2 steps | 0.6139s | passed | mp4 |
| T2V smoke | TP2 + CFG2 + VAEPP4, 5 frames / 2 steps | 0.4062s | passed | mp4 |
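
The reported timings can be turned into speedups over the matching single-GPU VAE-tiling baselines. The short sketch below uses only the numbers from the table. The dictionary keys and the `speedup` helper are illustrative names, not part of the PR.

```python
# End-to-end times in seconds, copied from the table above.
baseline_s = {"t2v": 21.7741, "i2v": 22.8928}  # 1 GPU, VAE tiling
parallel_s = {
    ("t2v", "USP2 + VAEPP2"): 15.5280,
    ("t2v", "CFG2 + USP2 + VAEPP4"): 8.0243,
    ("i2v", "CFG2 + USP2 + VAEPP4"): 9.0800,
}


def speedup(baseline_seconds, parallel_seconds):
    """Ratio of baseline time to parallel time (higher is better)."""
    return baseline_seconds / parallel_seconds


speedups = {
    key: speedup(baseline_s[key[0]], seconds)
    for key, seconds in parallel_s.items()
}

for (task, config), ratio in speedups.items():
    print(f"{task} {config}: {ratio:.2f}x")
```

This works out to roughly 1.4x for T2V with USP2 + VAEPP2, about 2.7x for T2V with CFG2 + USP2 + VAEPP4, and about 2.5x for I2V with CFG2 + USP2 + VAEPP4.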

# 2. local decode
assigned = self._balance_tasks(tiletask_list, pp_size)
-local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+local_tasks = assigned[self.rank] if self.rank < pp_size else []
Collaborator Author

fixed by #3928
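
The difference between the two `local_tasks` conditions in the snippet above can be reproduced in isolation. This is a hedged sketch: `balance_tasks` is a round-robin stand-in for the executor's `_balance_tasks`, whose real policy is not shown in this thread, and the two `local_tasks_*` functions are hypothetical wrappers around the two conditions.

```python
def balance_tasks(tasks, pp_size):
    """Round-robin split into pp_size buckets (stand-in for _balance_tasks)."""
    return [tasks[i::pp_size] for i in range(pp_size)]


def local_tasks_world_size_guard(assigned, rank, pp_size, world_size):
    """Guard on pp_size vs. world_size: still indexes with any rank."""
    return assigned[rank] if pp_size <= world_size else []


def local_tasks_rank_guard(assigned, rank, pp_size):
    """Guard on the rank itself: ranks outside the VAEPP group get no tiles."""
    return assigned[rank] if rank < pp_size else []


world_size, pp_size = 4, 2  # VAEPP uses fewer ranks than the DiT world size
assigned = balance_tasks(["tile0", "tile1", "tile2", "tile3"], pp_size)

rank_guard_results = [
    local_tasks_rank_guard(assigned, rank, pp_size) for rank in range(world_size)
]

try:
    local_tasks_world_size_guard(assigned, 3, pp_size, world_size)
    world_size_guard_failed = False
except IndexError:
    world_size_guard_failed = True  # rank 3 has no bucket to index
```

With four ranks and a VAEPP size of two, the rank-based guard hands tiles to ranks 0 and 1 and an empty list to ranks 2 and 3, while the world-size guard raises `IndexError` for the extra ranks.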

@david6666666 david6666666 marked this pull request as ready for review May 30, 2026 01:42
@david6666666
Collaborator Author

@gcanlin @lishunyang12 ptal thx

@david6666666
Collaborator Author

VAE Patch Parallel Flow

This diagram shows where VAE encode/decode happens in HunyuanVideo 1.5 and how VAE patch parallelism distributes tiled VAE work across ranks.

flowchart TD
    A["Pipeline calls VAE"] --> B{"T2V or I2V?"}

    B -->|"T2V"| C["Denoised latents"]
    C --> D["VAE decode"]

    B -->|"I2V"| E["Input image"]
    E --> F["VAE encode: image -> image_latents"]
    F --> G["Use image_latents as first-frame condition"]
    G --> H["Denoised latents"]
    H --> D

    D --> I{"vae_patch_parallel_size > 1 and use_tiling?"}
    F --> I

    I -->|"No"| J["Use original diffusers tiled_encode / tiled_decode"]
    I -->|"Yes"| K["Distributed VAE Executor"]

    K --> L["1. Split tensor into H/W spatial tiles"]
    L --> M["2. Balance tile workload across VAEPP ranks"]
    M --> N0["Rank 0 executes assigned tiles"]
    M --> N1["Rank 1 executes assigned tiles"]
    M --> N2["Rank 2 executes assigned tiles"]
    M --> N3["Rank 3 executes assigned tiles"]

    N0 --> O["all_gather tile outputs and metadata"]
    N1 --> O
    N2 --> O
    N3 --> O

    O --> P["Rank 0 reconstructs tile grid by coordinates"]
    P --> Q["Blend overlap regions with blend_v / blend_h"]
    Q --> R["Crop row_limit regions and concatenate"]
    R --> S["Broadcast full result back to all ranks"]

    S --> T{"Current VAE operation"}
    T -->|"encode"| U["Return full image_latents"]
    T -->|"decode"| V["Return full video pixels"]
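
The split, balance, gather and merge steps in the diagram can be sketched without any GPU or distributed runtime. The example below is a simplification under stated assumptions: tiles do not overlap (so the `blend_v` / `blend_h` step is omitted), `fake_decode` is a per-element stand-in for the real VAE, round-robin is an assumed balancing policy, and ranks are simulated by a loop instead of a real `all_gather`. All function names are hypothetical.

```python
def split_tiles(grid, tile):
    """Return [((row, col), patch)] for non-overlapping tile x tile patches."""
    tiles = []
    for r in range(0, len(grid), tile):
        for c in range(0, len(grid[0]), tile):
            patch = [row[c:c + tile] for row in grid[r:r + tile]]
            tiles.append(((r, c), patch))
    return tiles


def fake_decode(patch):
    """Stand-in for the VAE: a pure per-element function."""
    return [[2 * v + 1 for v in row] for row in patch]


def merge_tiles(results, height, width):
    """Place each decoded patch back at its recorded coordinates."""
    out = [[None] * width for _ in range(height)]
    for (r, c), patch in results:
        for dr, row in enumerate(patch):
            for dc, value in enumerate(row):
                out[r + dr][c + dc] = value
    return out


def run_single(grid, tile):
    """Single-device tiled path: decode every tile in order, then merge."""
    results = [(coord, fake_decode(p)) for coord, p in split_tiles(grid, tile)]
    return merge_tiles(results, len(grid), len(grid[0]))


def run_patch_parallel(grid, tile, num_ranks):
    """Simulated patch-parallel path: split, balance, 'gather', merge."""
    tasks = split_tiles(grid, tile)
    per_rank = [tasks[rank::num_ranks] for rank in range(num_ranks)]
    gathered = []
    for rank_tasks in per_rank:  # each iteration stands in for one rank
        gathered.extend((coord, fake_decode(p)) for coord, p in rank_tasks)
    return merge_tiles(gathered, len(grid), len(grid[0]))


grid = [[r * 6 + c for c in range(6)] for r in range(4)]
single = run_single(grid, 2)
parallel = run_patch_parallel(grid, 2, 4)
```

Because every tile carries its coordinates, the merge does not depend on which rank produced it or in what order results arrive, so `single` and `parallel` are identical in this toy setup.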

Precision Impact

VAEPP does not change the VAE math relative to the single-GPU tiled VAE path. It only changes execution placement: tiles are computed on different ranks, then gathered and merged in the same grid order.

flowchart LR
    A["Single-GPU tiled VAE"] --> A1["Same tile split"]
    A1 --> A2["Same encoder / decoder"]
    A2 --> A3["Same overlap blend"]
    A3 --> A4["Same concat order"]

    B["VAE patch parallel"] --> B1["Same tile split"]
    B1 --> B2["Different ranks execute different tiles"]
    B2 --> B3["all_gather, then merge on rank 0"]
    B3 --> B4["Same overlap blend"]
    B4 --> B5["Same concat order"]

    A4 --> C["Output"]
    B5 --> C

    C --> D["Matches single-GPU tiling baseline"]
    D --> E["SSIM 1.0 / PSNR inf in validation"]

For T2V, the VAEPP coverage is decode-only because there is no input image to encode. For I2V, the coverage includes both encode and decode: the input image is encoded into first-frame condition latents, and the final denoised latents are decoded back to video frames.
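
The coverage rule and the dispatch condition from the first diagram can be written down as two small predicates. This is only a sketch: the function names are hypothetical, and `vae_patch_parallel_size` and `use_tiling` are taken from the diagram's condition, not from the actual pipeline signature.

```python
def vae_operations(pipeline):
    """VAE operations a pipeline performs, per the description above."""
    if pipeline == "t2v":
        return ["decode"]  # no input image, so nothing to encode
    if pipeline == "i2v":
        return ["encode", "decode"]  # image -> latents, then latents -> video
    raise ValueError(f"unknown pipeline: {pipeline!r}")


def uses_distributed_executor(vae_patch_parallel_size, use_tiling):
    """Mirror of the diagram's check: both conditions must hold."""
    return vae_patch_parallel_size > 1 and use_tiling


# Which operations would go through the distributed executor in each case.
covered = {
    pipeline: vae_operations(pipeline) if uses_distributed_executor(4, True) else []
    for pipeline in ("t2v", "i2v")
}
```

If either condition fails, for example `vae_patch_parallel_size == 1` or tiling disabled, the diagram's "No" branch applies and the original diffusers tiled path is used.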

Collaborator

@gcanlin gcanlin left a comment

LGTM.

@gcanlin gcanlin added the ready label to trigger buildkite CI label May 30, 2026
Collaborator

@gcanlin gcanlin left a comment

Can we add usp + vae pp to CI?

@david6666666 david6666666 merged commit a345675 into vllm-project:main May 30, 2026
7 of 8 checks passed
@david6666666
Collaborator Author

> Can we add usp + vae pp to CI?

Will add an x2v function test in a follow-up.

86MaxCao pushed a commit to 86MaxCao/vllm-omni that referenced this pull request Jun 4, 2026