[Diffusion] Support USP and VAE patch parallel for HunyuanVideo 1.5 #3979
Signed-off-by: david6666666 <530634352@qq.com>
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Added GitHub-hosted validation videos for each generated case. These files are release assets in the fork and are not committed to the codebase.
  # 2. local decode
  assigned = self._balance_tasks(tiletask_list, pp_size)
- local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+ local_tasks = assigned[self.rank] if self.rank < pp_size else []
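A minimal standalone sketch of why the guard compares the rank against `pp_size`, assuming the last `local_tasks` line in the snippet above is the proposed replacement for the one before it. The round-robin `balance_tasks` helper is a hypothetical stand-in for `_balance_tasks`, whose real implementation is not shown in this thread; the only assumption is that it returns one task list per VAE patch parallel rank.

```python
def balance_tasks(tasks, pp_size):
    """Hypothetical stand-in for _balance_tasks: round-robin into pp_size buckets."""
    buckets = [[] for _ in range(pp_size)]
    for i, task in enumerate(tasks):
        buckets[i % pp_size].append(task)
    return buckets


def local_tasks_old(assigned, rank, pp_size, world_size):
    # Earlier guard: true for every rank whenever pp_size <= world_size,
    # so ranks outside the patch parallel group still index `assigned`.
    return assigned[rank] if pp_size <= world_size else []


def local_tasks_new(assigned, rank, pp_size):
    # Replacement guard: only ranks inside the patch parallel group get tasks.
    return assigned[rank] if rank < pp_size else []


world_size, pp_size = 4, 2
assigned = balance_tasks(list(range(6)), pp_size)

for rank in range(world_size):
    print(rank, local_tasks_new(assigned, rank, pp_size))

try:
    local_tasks_old(assigned, 3, pp_size, world_size)
except IndexError:
    print("old guard: rank 3 indexes past the end of `assigned`")
```

With `world_size = 4` and `pp_size = 2`, ranks 2 and 3 receive an empty task list under the replacement guard, while the earlier guard would try to read `assigned[3]` from a two-element list.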
@gcanlin @lishunyang12 PTAL, thanks.
VAE Patch Parallel Flow
This diagram shows where VAE encode/decode happens in HunyuanVideo 1.5 and how VAE patch parallelism distributes tiled VAE work across ranks.
flowchart TD
A["Pipeline calls VAE"] --> B{"T2V or I2V?"}
B -->|"T2V"| C["Denoised latents"]
C --> D["VAE decode"]
B -->|"I2V"| E["Input image"]
E --> F["VAE encode: image -> image_latents"]
F --> G["Use image_latents as first-frame condition"]
G --> H["Denoised latents"]
H --> D
D --> I{"vae_patch_parallel_size > 1 and use_tiling?"}
F --> I
I -->|"No"| J["Use original diffusers tiled_encode / tiled_decode"]
I -->|"Yes"| K["Distributed VAE Executor"]
K --> L["1. Split tensor into H/W spatial tiles"]
L --> M["2. Balance tile workload across VAEPP ranks"]
M --> N0["Rank 0 executes assigned tiles"]
M --> N1["Rank 1 executes assigned tiles"]
M --> N2["Rank 2 executes assigned tiles"]
M --> N3["Rank 3 executes assigned tiles"]
N0 --> O["all_gather tile outputs and metadata"]
N1 --> O
N2 --> O
N3 --> O
O --> P["Rank 0 reconstructs tile grid by coordinates"]
P --> Q["Blend overlap regions with blend_v / blend_h"]
Q --> R["Crop row_limit regions and concatenate"]
R --> S["Broadcast full result back to all ranks"]
S --> T{"Current VAE operation"}
T -->|"encode"| U["Return full image_latents"]
T -->|"decode"| V["Return full video pixels"]
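The split, balance, gather, and reconstruct steps in the diagram can be sketched in a single process as follows. This is an illustration under stated assumptions, not the executor from this PR: ranks are simulated with plain lists instead of `torch.distributed`, the per-tile VAE call is replaced by a pixel copy, tiles do not overlap (so the `blend_v` / `blend_h` and `row_limit` cropping steps are omitted), and every function name is hypothetical.

```python
from itertools import product


def split_tiles(height, width, tile):
    """Step 1: enumerate (row, col, y0, x0, y1, x1) spatial tiles."""
    ys = range(0, height, tile)
    xs = range(0, width, tile)
    return [
        (r, c, y0, x0, min(y0 + tile, height), min(x0 + tile, width))
        for (r, y0), (c, x0) in product(enumerate(ys), enumerate(xs))
    ]


def tile_area(t):
    _, _, y0, x0, y1, x1 = t
    return (y1 - y0) * (x1 - x0)


def balance(tiles, num_ranks):
    """Step 2: greedy balancing, largest tile first onto the least-loaded rank."""
    loads = [0] * num_ranks
    assigned = [[] for _ in range(num_ranks)]
    for t in sorted(tiles, key=tile_area, reverse=True):
        rank = loads.index(min(loads))
        assigned[rank].append(t)
        loads[rank] += tile_area(t)
    return assigned


def run_tile(image, t):
    """Stand-in for the per-tile encode/decode: copies the tile's pixels."""
    _, _, y0, x0, y1, x1 = t
    return [row[x0:x1] for row in image[y0:y1]]


def merge(gathered, height, width):
    """Rebuild the full grid from (tile, output) pairs using tile coordinates."""
    out = [[None] * width for _ in range(height)]
    for (_, _, y0, x0, _, x1), data in gathered:
        for dy, row in enumerate(data):
            out[y0 + dy][x0:x1] = row
    return out


H, W, TILE, RANKS = 6, 8, 4, 4
image = [[y * W + x for x in range(W)] for y in range(H)]
tiles = split_tiles(H, W, TILE)
assigned = balance(tiles, RANKS)
# Simulated all_gather: every rank's (tile, output) pairs end up in one list.
gathered = [(t, run_tile(image, t)) for rank_tiles in assigned for t in rank_tiles]
result = merge(gathered, H, W)
print(result == image)
```

Because each tile carries its own coordinates, the merge does not depend on which rank computed it or in what order the results arrive.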
Precision Impact
VAEPP does not change the VAE math relative to the single-GPU tiled VAE path. It only changes execution placement: tiles are computed on different ranks, then gathered and merged in the same grid order.
flowchart LR
A["Single-GPU tiled VAE"] --> A1["Same tile split"]
A1 --> A2["Same encoder / decoder"]
A2 --> A3["Same overlap blend"]
A3 --> A4["Same concat order"]
B["VAE patch parallel"] --> B1["Same tile split"]
B1 --> B2["Different ranks execute different tiles"]
B2 --> B3["all_gather to rank 0"]
B3 --> B4["Same overlap blend"]
B4 --> B5["Same concat order"]
A4 --> C["Output"]
B5 --> C
C --> D["Matches single-GPU tiling baseline"]
D --> E["SSIM 1.0 / PSNR inf in validation"]
For T2V, the VAEPP coverage is decode-only because there is no input image to encode. For I2V, the coverage includes both encode and decode: the input image is encoded into first-frame condition latents, and the final denoised latents are decoded back to video frames.
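The "PSNR inf" entry in the precision diagram follows from the standard definition PSNR = 10 · log10(MAX² / MSE): bit-identical outputs give MSE = 0, so the ratio is unbounded. The sketch below is a plain-Python illustration of that definition, not the validation script used for this PR (which is not shown here); the `psnr` helper and its `max_value` default are assumptions.

```python
import math


def psnr(a, b, max_value=255.0):
    """PSNR in dB between two equal-length pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        # Identical inputs: the error term vanishes, so PSNR is infinite.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


baseline = [0, 128, 255, 64]
print(psnr(baseline, list(baseline)))      # identical frames
print(psnr(baseline, [0, 128, 255, 65]))   # one pixel off by one
```

Any finite PSNR would indicate that at least one output value differs from the single-GPU tiled baseline.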
gcanlin left a comment
Can we add USP + VAE PP to CI?
Will add an x2v function test as a follow-up.
…llm-project#3979) Signed-off-by: david6666666 <530634352@qq.com>
Summary
Accuracy and Performance
All runs used B300 GPUs 4-7, eager mode, 480p, 33 frames, 50 inference steps, seed 42.
Videos are GitHub-hosted release assets from the fork and are not committed to the codebase.
Compatibility smoke tests:
Tests
Note: local validation emitted the existing vLLM/vLLM-Omni major/minor mismatch warning (vLLM-Omni 0.20.1.dev139, vLLM 0.21.0), but the checks and offline runs completed successfully.