[Feat][HunyuanVideo-1.5]Support vae-patch-parallel#2418
Conversation
f0edc77 to
6307185
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0edc77b59
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
4805239 to
521fa53
Compare
JiwaniZakir
left a comment
There was a problem hiding this comment.
The bug fix in distributed_vae_executor.py (changing pp_size <= self.world_size to self.rank < pp_size) is the right correction — the old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.
In tile_merge, the slicing tile[:, :, :, :row_limit_h, :row_limit_w] is applied uniformly to every tile including the last row and column. When height % overlap_h != 0 or width % overlap_w != 0, the final tile in tile_split will be a partial tile (Python slice semantics return whatever fits), but after decoding it may have spatial dimensions smaller than row_limit_h / row_limit_w. Applying the same crop unconditionally in tile_merge would then silently truncate the output, producing a smaller-than-expected spatial result. The last row/column tiles should use min(row_limit_h, tile.shape[-2]) and min(row_limit_w, tile.shape[-1]) respectively.
Additionally, in tile_merge, the vertical blend self.blend_v(coord_tensor_map[(i - 1, j)], tile, blend_h) always reads the raw decoded output of the previous row's tile from coord_tensor_map, not the already-blended result. This means horizontal blending previously applied to tile (i-1, j) is discarded when computing the vertical blend, which can leave visible seams at interior corners of the tile grid.
|
any acc comparison |
27decbc to
9914d11
Compare
Fixed, thanks |
lishunyang12
left a comment
There was a problem hiding this comment.
left a couple comments
|
|
||
| @classmethod | ||
| def from_pretrained(cls, *args: Any, **kwargs: Any): | ||
| model = super().from_pretrained(*args, **kwargs) |
There was a problem hiding this comment.
DistributedAutoencoderKL_base.from_pretrained already does exactly this (calls super().from_pretrained then init_distributed). Drop the override.
There was a problem hiding this comment.
Dropped the override — the base class DistributedAutoencoderKL_base.from_pretrained already calls super().from_pretrained then init_distributed.
| result = self.decoder(z) | ||
| else: | ||
| logger.info("HunyuanVideo VAE: distributed tiled decode with overlap blending") | ||
| result = self.distributed_decoder.execute( |
There was a problem hiding this comment.
This logs on every decode call. Use logger.debug or log once.
There was a problem hiding this comment.
Fixed, changed to logger.debug.
| overlap_w = self.tile_latent_stride_width | ||
| blend_h = self.tile_sample_min_height - self.tile_sample_stride_height | ||
| blend_w = self.tile_sample_min_width - self.tile_sample_stride_width | ||
| row_limit_h = self.tile_sample_stride_height |
There was a problem hiding this comment.
Nit: overlap_h/overlap_w are strides, not overlaps — rename to stride_h/stride_w.
There was a problem hiding this comment.
Fixed, renamed to stride_h/stride_w.
…unyuanvideo.py - Remove redundant from_pretrained override; base class already calls super().from_pretrained then init_distributed - Rename overlap_h/overlap_w to stride_h/stride_w to better reflect their semantics as tile strides - Change logger.info to logger.debug to avoid logging on every decode call Signed-off-by: daixinning <daixinning@163.com>
hsliuustc0106
left a comment
There was a problem hiding this comment.
Still need the accuracy comparison previously requested.
| blend_w = grid_spec.tile_spec["blend_w"] | ||
| row_limit_h = grid_spec.tile_spec["row_limit_h"] | ||
| row_limit_w = grid_spec.tile_spec["row_limit_w"] | ||
|
|
There was a problem hiding this comment.
Parent tile_exec applies post_quant_conv before decoding. Here and in the decode() fallback (line 124: result = self.decoder(z)) it's skipped. Is this intentional for HunyuanVideo's VAE? If use_post_quant_conv is False in the config it's fine, but worth confirming since it's a silent difference from the base class behavior.
There was a problem hiding this comment.
Yes, this is intentional. AutoencoderKLHunyuanVideo15 does not use post_quant_conv in its decode path -- both _decode() and tiled_decode() call self.decoder(z) directly without post_quant_conv. Our tile_exec mirrors that behavior exactly.
|
Can you show visual outputs and please do LPLPs test, you can refer to #1470 |
725d231 to
bc460aa
Compare
…unyuanvideo.py - Remove redundant from_pretrained override; base class already calls super().from_pretrained then init_distributed - Rename overlap_h/overlap_w to stride_h/stride_w to better reflect their semantics as tile strides - Change logger.info to logger.debug to avoid logging on every decode call Signed-off-by: daixinning <daixinning@163.com>
…anVideo VAE AutoencoderKLHunyuanVideo15 always applies post_quant_conv before decoding (both in _decode and tiled_decode). The distributed tile_exec and the non-tiled decode fallback were skipping this step, causing incorrect output. Addresses review comment from hsliuustc0106 on PR vllm-project#2418.
…anVideo VAE AutoencoderKLHunyuanVideo15 always applies post_quant_conv before decoding (both in _decode and tiled_decode). The distributed tile_exec and the non-tiled decode fallback were skipping this step, causing incorrect output. Addresses review comment from hsliuustc0106 on PR vllm-project#2418. Signed-off-by: daixinning <daixinning@163.com>
932cf5f to
72eef1f
Compare
Visual Output & LPIPS Quality BenchmarkTest setup: HunyuanVideo-1.5 480p T2V, 33 frames, 50 steps, seed=42, 4x Ascend NPU
t2v_vae_pp1.mp4
t2v_vae_pp4.mp4Reproducibility: Both runs use identical random seeds fixed via LPIPS (vae-pp=1 vs vae-pp=4, AlexNet backbone)
All 33 frames are bit-for-bit identical -- distributed tiled decode introduces zero quality loss. Timing Comparison
VAE decode is accelerated 3.24x with vae-pp=4, saving ~8.5s end-to-end. |
|
The accuracy comparison is now available — please see the LPIPS benchmark above. In short: vae-pp=1 and vae-pp=4 produce bit-for-bit identical outputs (Mean LPIPS = 0.000000 across all 33 frames), so there is zero quality degradation from distributed tiled decode. |
| # 2. local decode | ||
| assigned = self._balance_tasks(tiletask_list, pp_size) | ||
| local_tasks = assigned[self.rank] if pp_size <= self.world_size else [] | ||
| local_tasks = assigned[self.rank] if self.rank < pp_size else [] |
There was a problem hiding this comment.
Why need to change this line?
There was a problem hiding this comment.
As @JiwaniZakir noted in the earlier review:
The old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.
To be more specific: pp_size is computed as min(self.parallel_size, self.world_size), so pp_size <= self.world_size is always True — meaning every rank would unconditionally receive tasks, making vae_patch_parallel_size effectively a no-op. The fix self.rank < pp_size correctly gates task assignment on whether the current rank is within the active VAE parallel group, so ranks outside the group get an empty task list instead.
| return self.tile_split, self.tile_exec, self.tile_merge | ||
| return None, None, None | ||
|
|
||
| def decode(self, z: torch.Tensor, return_dict: bool = True, *args: Any, **kwargs: Any): |
There was a problem hiding this comment.
Could you refer to #2368 to implement vae encode parallel for HunyuanVideo?
There was a problem hiding this comment.
Hi @gcanlin, thanks for the suggestion. We looked into this carefully but concluded that encode parallel is not needed for HunyuanVideo-1.5.
In Wan I2V (#2368), the VAE encode input is a full-length video condition tensor [B, C, num_frames, H, W] — the same spatial resolution and temporal length as the generated video — so encode is a genuine bottleneck worth parallelizing.
In HunyuanVideo-1.5 I2V, the VAE encode input is a single reference frame [B, C, 1, H, W]. The compute and memory cost is negligible compared to decode, so tiled encode parallel would add code complexity with no practical benefit.
For T2V there is no encode at all. So encode parallel has no meaningful use case for HunyuanVideo-1.5, and we have kept the implementation focused on decode parallel only.
…unyuanvideo.py - Remove redundant from_pretrained override; base class already calls super().from_pretrained then init_distributed - Rename overlap_h/overlap_w to stride_h/stride_w to better reflect their semantics as tile strides - Change logger.info to logger.debug to avoid logging on every decode call Signed-off-by: daixinning <daixinning@163.com>
b87292d to
272e08d
Compare
f932687 to
4a2a6d1
Compare
hsliuustc0106
left a comment
There was a problem hiding this comment.
The executor fix is correct — the old pp_size <= self.world_size was always true since pp_size = min(parallel_size, world_size), so the else branch was dead code.
Two things:
- The encode methods (encode_tile_split/exec/merge, lines 115–189) appear unused — are they intended for a follow-up? If so, consider adding them in that PR so they can be tested alongside the code that calls them.
- No unit tests for the tile split/merge/blend logic. The overlapping tile blending in both H and W is non-trivial and worth testing independently of the e2e run.
| # 2. local decode | ||
| assigned = self._balance_tasks(tiletask_list, pp_size) | ||
| local_tasks = assigned[self.rank] if pp_size <= self.world_size else [] | ||
| local_tasks = assigned[self.rank] if self.rank < pp_size else [] |
There was a problem hiding this comment.
The fix itself is good, but the PR description says this removes "unnecessary pp_size = min(parallel_size, world_size) indirection" — the min() is fine, the old condition was the bug.
There was a problem hiding this comment.
Good catch, thanks. The PR description has been updated to clarify: the itself is correct and necessary, the bug was the guard condition which is always by construction.
| def _strategy_select(self, z: torch.Tensor): | ||
| """Use tile strategy when latent exceeds tile size.""" | ||
| need_spatial = z.shape[3] > self.tile_latent_min_height or z.shape[4] > self.tile_latent_min_width | ||
| if need_spatial: |
There was a problem hiding this comment.
The base class returns patch split as fallback here. This returns None, which means small latents skip distributed execution entirely and just call self.decoder(z) directly. Intentional?
There was a problem hiding this comment.
Yes, intentional. Two reasons:
-
In practice this path is never reached. The tile threshold is 32×32 latent (derived from tile_sample_min=256, spatial_ratio=8). Real inference resolutions always exceed this — 480p latent is 60×104, 720p is 90×160 — so _strategy_select always returns the tile strategy and the None branch is dead code under normal usage.
-
The base class patch_split is not compatible with HunyuanVideo latents. patch_split operates on 4D tensors [B, C, H, W], while HunyuanVideo latents are 5D [B, C, T, H, W]. Falling back to patch strategy would cause a shape mismatch. For the rare case of a latent smaller than the tile threshold, falling back to single-GPU decode via self.decoder(z) is both correct and cheap.
There was a problem hiding this comment.
Good point. Updated: small latents now also go through tile_split/exec/merge instead of falling back to single-GPU decode, which is consistent with the base class behavior of always using distributed execution when enabled. (The base class patch_split is not reused here since it operates on 4D tensors and is incompatible with HunyuanVideo's 5D latents [B, C, T, H, W].)
4a2a6d1 to
22a89eb
Compare
|
@hsliuustc0106 Thanks for the review. Regarding point 1 (encode methods): the encode methods have been removed in the latest push. HunyuanVideo-1.5 I2V only encodes a single reference frame , so the compute cost is negligible and tiled encode parallel adds no practical benefit. T2V has no encode at all. The implementation is now focused on decode parallel only. |
22a89eb to
d424a9e
Compare
|
@hsliuustc0106 Unit tests for tile_split, tile_merge, and blend logic have been added in tests/diffusion/distributed/test_autoencoder_kl_hunyuanvideo.py (12 tests, all passing). Coverage includes: tile grid shape and coord uniqueness, tile size bounds, merge output shape, uniform-latent round-trip, and blend boundary/extent correctness for both H and W directions. |
d424a9e to
7ef11cf
Compare
- Add DistributedAutoencoderKLHunyuanVideo with overlapping tile split/exec/merge for distributed decode - decode: tile_split/exec/merge always used (small latents included, consistent with base class behavior; base class patch_split is incompatible with 5D latents) - Fix distributed_vae_executor: use self.rank < pp_size instead of pp_size <= self.world_size - Wire DistributedAutoencoderKLHunyuanVideo into T2V and I2V pipelines - Add unit tests for tile_split, tile_merge, and blend logic Decode parallel follows the pattern from vllm-project#2368 (Wan VAE encode parallel). Signed-off-by: daixinning <daixinning@163.com>
7ef11cf to
42e90ab
Compare
[Feat][HunyuanVideo-1.5] Support vae-patch-parallel for distributed VAE decode
Purpose
Add distributed tiled VAE decode support for HunyuanVideo 1.5 (T2V and I2V) to enable
vae_patch_parallel_size > 1.The existing
DistributedVaeExecutorframework supports splitting VAE decode across multiple workers, but HunyuanVideo 1.5's VAE (AutoencoderKLHunyuanVideo15) lacked the model-specific tile split/merge logic needed to participate in distributed decode. This meant all VAE decode work ran on a single GPU, which is a bottleneck for high-resolution or long-duration video generation.This PR adds
DistributedAutoencoderKLHunyuanVideo, a distributed-aware subclass that implements diffusers-style overlapping spatial tile splitting with linear blending, enabling the VAE decode workload to be distributed acrossvae_patch_parallel_sizeworkers.Changes
New file
autoencoder_kl_hunyuanvideo.py: ImplementsDistributedAutoencoderKLHunyuanVideowith:tile_split(): Splits latent tensors into overlapping spatial tiles (H, W) using configurable tile/stride sizestile_exec(): Decodes a single tile via the decodertile_merge(): Reassembles decoded tiles with linear overlap blending (both horizontal and vertical)_strategy_select(): Automatically enables tiled decode when latent dimensions exceed tile sizedecode(): Dispatches to distributed tiled decode when distributed mode is enabledPipeline integration: Both
pipeline_hunyuan_video_1_5.py(T2V) andpipeline_hunyuan_video_1_5_i2v.py(I2V) now useDistributedAutoencoderKLHunyuanVideoinstead of the upstreamAutoencoderKLHunyuanVideo15.Executor fix (
distributed_vae_executor.py): The old guard conditionpp_size <= self.world_sizewas alwaysTrue(sincepp_size = min(parallel_size, world_size)is by definition ≤world_size), makingvae_patch_parallel_sizeeffectively ignored and causing all ranks to receive decode tasks regardless of the configured parallel size. Fixed by replacing the condition withself.rank < pp_size, so only the intended ranks participate in VAE decode work.Test Plan
End-to-end text-to-video inference on Ascend NPU with HunyuanVideo-1.5 (480p, 33 frames), using the repo's built-in example script with
--vae-patch-parallel-size 4and--vae-use-tiling:python examples/offline_inference/text_to_video/text_to_video.py \ --model hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v \ --prompt "A little girl wearing a straw hat runs through a summer meadow full of wildflowers." \ --num-frames 33 --num-inference-steps 50 --seed 42 \ --tensor-parallel-size 2 \ --cfg-parallel-size 2 \ --vae-patch-parallel-size 4 \ --vae-use-tilingEnvironment: 4x Ascend NPU.
Test Result
Timing Comparison
VAE decode is accelerated 3.24x with vae-pp=4, saving ~8.5s end-to-end.
Quality (LPIPS)
Test setup: HunyuanVideo-1.5 480p T2V, 33 frames, 50 steps, seed=42, 4x Ascend NPU. Reproducibility ensured via
msprobe.pytorch.seed_all(seed=42, mode=True)+torch.use_deterministic_algorithms(True).All 33 frames are bit-for-bit identical between vae-pp=1 and vae-pp=4 — distributed tiled decode introduces zero quality loss.