[Feat][HunyuanVideo-1.5] Support vae-patch-parallel #2418

Open
daixinning wants to merge 1 commit into vllm-project:main from daixinning:hy-video-vae-pp

@daixinning
Contributor

@daixinning daixinning commented Apr 1, 2026

[Feat][HunyuanVideo-1.5] Support vae-patch-parallel for distributed VAE decode

Purpose

Add distributed tiled VAE decode support for HunyuanVideo 1.5 (T2V and I2V) to enable vae_patch_parallel_size > 1.

The existing DistributedVaeExecutor framework supports splitting VAE decode across multiple workers, but HunyuanVideo 1.5's VAE (AutoencoderKLHunyuanVideo15) lacked the model-specific tile split/merge logic needed to participate in distributed decode. This meant all VAE decode work ran on a single GPU, which is a bottleneck for high-resolution or long-duration video generation.

This PR adds DistributedAutoencoderKLHunyuanVideo, a distributed-aware subclass that implements diffusers-style overlapping spatial tile splitting with linear blending, enabling the VAE decode workload to be distributed across vae_patch_parallel_size workers.
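For readers unfamiliar with the linear blending mentioned above, here is a minimal sketch of the idea. It uses plain Python lists for 2D tiles rather than the 5D tensors the real VAE works on, and the function bodies are an assumption modeled on the diffusers-style ramp, not code taken from this PR.

```python
# Sketch only: linear overlap blending on 2D tiles stored as lists of rows.
# The real implementation operates on tensors shaped [B, C, T, H, W].

def blend_v(a, b, blend_extent):
    """Blend the bottom rows of tile `a` into the top rows of tile `b`."""
    extent = min(len(a), len(b), blend_extent)
    out = [row[:] for row in b]  # copy so the caller's tile is not mutated
    for y in range(extent):
        w = y / extent  # weight ramps from 0 (all `a`) towards 1 (all `b`)
        a_row = a[len(a) - extent + y]
        out[y] = [av * (1 - w) + bv * w for av, bv in zip(a_row, b[y])]
    return out


def blend_h(a, b, blend_extent):
    """Blend the right columns of tile `a` into the left columns of tile `b`."""
    extent = min(len(a[0]), len(b[0]), blend_extent)
    out = [row[:] for row in b]
    for x in range(extent):
        w = x / extent
        for r in range(len(b)):
            out[r][x] = a[r][len(a[0]) - extent + x] * (1 - w) + b[r][x] * w
    return out


top = [[0.0] * 4 for _ in range(4)]
bottom = [[1.0] * 4 for _ in range(4)]
blended = blend_v(top, bottom, blend_extent=2)
print([row[0] for row in blended])  # [0.0, 0.5, 1.0, 1.0]
```

The first `blend_extent` rows of the lower tile ramp from the upper tile's values to its own, which is what hides the seam between neighbouring tiles.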

Changes

  1. New file autoencoder_kl_hunyuanvideo.py: Implements DistributedAutoencoderKLHunyuanVideo with:

    • tile_split(): Splits latent tensors into overlapping spatial tiles (H, W) using configurable tile/stride sizes
    • tile_exec(): Decodes a single tile via the decoder
    • tile_merge(): Reassembles decoded tiles with linear overlap blending (both horizontal and vertical)
    • _strategy_select(): Automatically enables tiled decode when latent dimensions exceed tile size
    • decode(): Dispatches to distributed tiled decode when distributed mode is enabled
  2. Pipeline integration: Both pipeline_hunyuan_video_1_5.py (T2V) and pipeline_hunyuan_video_1_5_i2v.py (I2V) now use DistributedAutoencoderKLHunyuanVideo instead of the upstream AutoencoderKLHunyuanVideo15.

  3. Executor fix (distributed_vae_executor.py): The old guard condition pp_size <= self.world_size was always True (since pp_size = min(parallel_size, world_size) is by definition ≤ world_size), making vae_patch_parallel_size effectively ignored and causing all ranks to receive decode tasks regardless of the configured parallel size. Fixed by replacing the condition with self.rank < pp_size, so only the intended ranks participate in VAE decode work.
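To make the tile_split() step in item 1 more concrete, the following sketch enumerates an overlapping tile grid over a latent's spatial dimensions. The helper names `tile_origins` and `tile_bounds` are hypothetical, and the stride of 24 is an assumed value for illustration; only the 60×104 latent size and the 32-latent tile size come from this thread.

```python
# Sketch only: enumerate overlapping spatial tiles over an H x W latent.
# `tile_origins` and `tile_bounds` are hypothetical helpers, not PR code.

def tile_origins(height, width, stride_h, stride_w):
    """Map grid position (i, j) to the tile's top-left latent coordinate."""
    return {
        (i, j): (y, x)
        for i, y in enumerate(range(0, height, stride_h))
        for j, x in enumerate(range(0, width, stride_w))
    }


def tile_bounds(origin, tile_h, tile_w, height, width):
    """Return (y0, y1, x0, x1), clipped to the latent like a Python slice."""
    y, x = origin
    return (y, min(y + tile_h, height), x, min(x + tile_w, width))


# 480p latent is 60 x 104 and the tile size is 32; stride 24 is an assumption.
origins = tile_origins(60, 104, stride_h=24, stride_w=24)
print(len(origins))  # 15 tiles: 3 rows x 5 columns
print(tile_bounds(origins[(2, 4)], 32, 32, 60, 104))  # (48, 60, 96, 104)
```

Because the tile size exceeds the stride, neighbouring tiles overlap by `tile - stride` latent pixels, and the tiles in the last row and column are partial.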

Test Plan

End-to-end text-to-video inference on Ascend NPU with HunyuanVideo-1.5 (480p, 33 frames), using the repo's built-in example script with --vae-patch-parallel-size 4 and --vae-use-tiling:

python examples/offline_inference/text_to_video/text_to_video.py \
    --model hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v \
    --prompt "A little girl wearing a straw hat runs through a summer meadow full of wildflowers." \
    --num-frames 33 --num-inference-steps 50 --seed 42 \
    --tensor-parallel-size 2 \
    --cfg-parallel-size 2 \
    --vae-patch-parallel-size 4 \
    --vae-use-tiling

Environment: 4x Ascend NPU.

Test Result

Timing Comparison

| Config | TextEncoding | Denoising | Decoding | Total |
|---|---|---|---|---|
| vae-pp=1 | 214.71 ms | 77461.30 ms | 11788.63 ms | 90592 ms |
| vae-pp=4 | 214.91 ms | 77140.65 ms | 3639.62 ms | 82085 ms |
| Speedup | -- | -- | 3.24x | 1.10x |

VAE decode is accelerated 3.24x with vae-pp=4, saving ~8.5s end-to-end.
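The headline numbers can be re-derived from the reported timings with a few lines of arithmetic. This is only a sanity check on the figures quoted above; the variable names are chosen for illustration.

```python
# Re-derive the reported speedups from the timing figures (milliseconds).
decode_pp1_ms, decode_pp4_ms = 11788.63, 3639.62
total_pp1_ms, total_pp4_ms = 90592.0, 82085.0

decode_speedup = decode_pp1_ms / decode_pp4_ms
total_speedup = total_pp1_ms / total_pp4_ms
saved_s = (total_pp1_ms - total_pp4_ms) / 1000.0

print(f"decode speedup: {decode_speedup:.2f}x")  # 3.24x
print(f"total speedup:  {total_speedup:.2f}x")   # 1.10x
print(f"time saved:     {saved_s:.1f} s")        # 8.5 s
```

The end-to-end gain is modest because denoising, which is unaffected by vae-pp, accounts for most of the total time.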

Quality (LPIPS)

Test setup: HunyuanVideo-1.5 480p T2V, 33 frames, 50 steps, seed=42, 4x Ascend NPU. Reproducibility ensured via msprobe.pytorch.seed_all(seed=42, mode=True) + torch.use_deterministic_algorithms(True).

| Metric | Value |
|---|---|
| Mean LPIPS | 0.000000 |
| Max LPIPS | 0.000000 |
| Min LPIPS | 0.000000 |

All 33 frames are bit-for-bit identical between vae-pp=1 and vae-pp=4 — distributed tiled decode introduces zero quality loss.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0edc77b59

Comment thread vllm_omni/diffusion/distributed/autoencoders/distributed_vae_executor.py Outdated
Comment thread vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl_hunyuanvideo.py Outdated
@daixinning daixinning force-pushed the hy-video-vae-pp branch 5 times, most recently from 4805239 to 521fa53 Compare April 1, 2026 16:12
@JiwaniZakir JiwaniZakir left a comment

The bug fix in distributed_vae_executor.py (changing pp_size <= self.world_size to self.rank < pp_size) is the right correction — the old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.

In tile_merge, the slicing tile[:, :, :, :row_limit_h, :row_limit_w] is applied uniformly to every tile including the last row and column. When height % overlap_h != 0 or width % overlap_w != 0, the final tile in tile_split will be a partial tile (Python slice semantics return whatever fits), but after decoding it may have spatial dimensions smaller than row_limit_h / row_limit_w. Applying the same crop unconditionally in tile_merge would then silently truncate the output, producing a smaller-than-expected spatial result. The last row/column tiles should use min(row_limit_h, tile.shape[-2]) and min(row_limit_w, tile.shape[-1]) respectively.

Additionally, in tile_merge, the vertical blend self.blend_v(coord_tensor_map[(i - 1, j)], tile, blend_h) always reads the raw decoded output of the previous row's tile from coord_tensor_map, not the already-blended result. This means horizontal blending previously applied to tile (i-1, j) is discarded when computing the vertical blend, which can leave visible seams at interior corners of the tile grid.

@hsliuustc0106
Collaborator

Any accuracy comparison?

@daixinning daixinning force-pushed the hy-video-vae-pp branch 2 times, most recently from 27decbc to 9914d11 Compare April 2, 2026 02:29
@daixinning
Contributor Author

The bug fix in distributed_vae_executor.py (changing pp_size <= self.world_size to self.rank < pp_size) is the right correction — the old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.

In tile_merge, the slicing tile[:, :, :, :row_limit_h, :row_limit_w] is applied uniformly to every tile including the last row and column. When height % overlap_h != 0 or width % overlap_w != 0, the final tile in tile_split will be a partial tile (Python slice semantics return whatever fits), but after decoding it may have spatial dimensions smaller than row_limit_h / row_limit_w. Applying the same crop unconditionally in tile_merge would then silently truncate the output, producing a smaller-than-expected spatial result. The last row/column tiles should use min(row_limit_h, tile.shape[-2]) and min(row_limit_w, tile.shape[-1]) respectively.

Additionally, in tile_merge, the vertical blend self.blend_v(coord_tensor_map[(i - 1, j)], tile, blend_h) always reads the raw decoded output of the previous row's tile from coord_tensor_map, not the already-blended result. This means horizontal blending previously applied to tile (i-1, j) is discarded when computing the vertical blend, which can leave visible seams at interior corners of the tile grid.

Fixed, thanks

Collaborator

@lishunyang12 lishunyang12 left a comment

left a couple comments


@classmethod
def from_pretrained(cls, *args: Any, **kwargs: Any):
    model = super().from_pretrained(*args, **kwargs)
Collaborator

DistributedAutoencoderKL_base.from_pretrained already does exactly this (calls super().from_pretrained then init_distributed). Drop the override.

Contributor Author

Dropped the override — the base class DistributedAutoencoderKL_base.from_pretrained already calls super().from_pretrained then init_distributed.

    result = self.decoder(z)
else:
    logger.info("HunyuanVideo VAE: distributed tiled decode with overlap blending")
    result = self.distributed_decoder.execute(
Collaborator

This logs on every decode call. Use logger.debug or log once.

Contributor Author

Fixed, changed to logger.debug.

overlap_w = self.tile_latent_stride_width
blend_h = self.tile_sample_min_height - self.tile_sample_stride_height
blend_w = self.tile_sample_min_width - self.tile_sample_stride_width
row_limit_h = self.tile_sample_stride_height
Collaborator

Nit: overlap_h/overlap_w are strides, not overlaps — rename to stride_h/stride_w.

Contributor Author

Fixed, renamed to stride_h/stride_w.

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 5, 2026
…unyuanvideo.py

- Remove redundant from_pretrained override; base class already calls
  super().from_pretrained then init_distributed
- Rename overlap_h/overlap_w to stride_h/stride_w to better reflect
  their semantics as tile strides
- Change logger.info to logger.debug to avoid logging on every decode call

Signed-off-by: daixinning <daixinning@163.com>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Still need the accuracy comparison previously requested.

blend_w = grid_spec.tile_spec["blend_w"]
row_limit_h = grid_spec.tile_spec["row_limit_h"]
row_limit_w = grid_spec.tile_spec["row_limit_w"]

Collaborator

Parent tile_exec applies post_quant_conv before decoding. Here and in the decode() fallback (line 124: result = self.decoder(z)) it's skipped. Is this intentional for HunyuanVideo's VAE? If use_post_quant_conv is False in the config it's fine, but worth confirming since it's a silent difference from the base class behavior.

Contributor Author

Yes, this is intentional. AutoencoderKLHunyuanVideo15 does not use post_quant_conv in its decode path -- both _decode() and tiled_decode() call self.decoder(z) directly without post_quant_conv. Our tile_exec mirrors that behavior exactly.

@lishunyang12
Collaborator

Can you show visual outputs and run an LPIPS test? You can refer to #1470.

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 8, 2026
…anVideo VAE

AutoencoderKLHunyuanVideo15 always applies post_quant_conv before decoding
(both in _decode and tiled_decode). The distributed tile_exec and the
non-tiled decode fallback were skipping this step, causing incorrect output.

Addresses review comment from hsliuustc0106 on PR vllm-project#2418.

Signed-off-by: daixinning <daixinning@163.com>
@daixinning daixinning force-pushed the hy-video-vae-pp branch 2 times, most recently from 932cf5f to 72eef1f Compare April 9, 2026 07:30
@daixinning
Contributor Author

daixinning commented Apr 9, 2026

Visual Output & LPIPS Quality Benchmark

Test setup: HunyuanVideo-1.5 480p T2V, 33 frames, 50 steps, seed=42, 4x Ascend NPU

  • Baseline: vae_patch_parallel_size=1
t2v_vae_pp1.mp4
  • Distributed: vae_patch_parallel_size=4
t2v_vae_pp4.mp4

Reproducibility: Both runs use identical random seeds fixed via msprobe.pytorch.seed_all(seed=42, mode=True), which on Ascend NPU sets torch.npu.manual_seed_all, HCCL_DETERMINISTIC=True, CLOSE_MATMUL_K_SHIFT=1, and torch.use_deterministic_algorithms(True), plus torch.Generator(...).manual_seed(42) for latent initialization.

LPIPS (vae-pp=1 vs vae-pp=4, AlexNet backbone)

| Metric | Value |
|---|---|
| Mean LPIPS | 0.000000 |
| Max LPIPS | 0.000000 |
| Min LPIPS | 0.000000 |

All 33 frames are bit-for-bit identical -- distributed tiled decode introduces zero quality loss.

Timing Comparison

| Config | TextEncoding | Denoising | Decoding | Total |
|---|---|---|---|---|
| vae-pp=1 | 214.71 ms | 77461.30 ms | 11788.63 ms | 90592 ms |
| vae-pp=4 | 214.91 ms | 77140.65 ms | 3639.62 ms | 82085 ms |
| Speedup | -- | -- | 3.24x | 1.10x |

VAE decode is accelerated 3.24x with vae-pp=4, saving ~8.5s end-to-end.

@lishunyang12 @hsliuustc0106

@daixinning
Contributor Author

The accuracy comparison is now available — please see the LPIPS benchmark above. In short: vae-pp=1 and vae-pp=4 produce bit-for-bit identical outputs (Mean LPIPS = 0.000000 across all 33 frames), so there is zero quality degradation from distributed tiled decode.

  # 2. local decode
  assigned = self._balance_tasks(tiletask_list, pp_size)
- local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+ local_tasks = assigned[self.rank] if self.rank < pp_size else []
Collaborator

Why does this line need to change?

Contributor Author

As @JiwaniZakir noted in the earlier review:

The old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.

To be more specific: pp_size is computed as min(self.parallel_size, self.world_size), so pp_size <= self.world_size is always True — meaning every rank would unconditionally receive tasks, making vae_patch_parallel_size effectively a no-op. The fix self.rank < pp_size correctly gates task assignment on whether the current rank is within the active VAE parallel group, so ranks outside the group get an empty task list instead.
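The difference between the two guards can be seen in a small standalone sketch. The functions `old_guard` and `new_guard` are hypothetical stand-ins that only reproduce the boolean conditions quoted above; they are not the executor code.

```python
# Sketch only: compare the old and new guard conditions for each rank.

def old_guard(rank, parallel_size, world_size):
    pp_size = min(parallel_size, world_size)
    # Always True, because min(a, b) <= b for any a; `rank` is never consulted.
    return pp_size <= world_size


def new_guard(rank, parallel_size, world_size):
    pp_size = min(parallel_size, world_size)
    # True only for ranks inside the active VAE parallel group.
    return rank < pp_size


world_size = 4
for parallel_size in (1, 2, 4, 8):
    old = [old_guard(r, parallel_size, world_size) for r in range(world_size)]
    new = [new_guard(r, parallel_size, world_size) for r in range(world_size)]
    print(f"parallel_size={parallel_size} old={old} new={new}")
```

With `parallel_size=2` on four ranks, the old guard admits all four ranks while the new one admits only ranks 0 and 1.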

        return self.tile_split, self.tile_exec, self.tile_merge
    return None, None, None

def decode(self, z: torch.Tensor, return_dict: bool = True, *args: Any, **kwargs: Any):
Collaborator

Could you refer to #2368 to implement vae encode parallel for HunyuanVideo?

Contributor Author

Hi @gcanlin, thanks for the suggestion. We looked into this carefully but concluded that encode parallel is not needed for HunyuanVideo-1.5.

In Wan I2V (#2368), the VAE encode input is a full-length video condition tensor [B, C, num_frames, H, W] — the same spatial resolution and temporal length as the generated video — so encode is a genuine bottleneck worth parallelizing.

In HunyuanVideo-1.5 I2V, the VAE encode input is a single reference frame [B, C, 1, H, W]. The compute and memory cost is negligible compared to decode, so tiled encode parallel would add code complexity with no practical benefit.

For T2V there is no encode at all. So encode parallel has no meaningful use case for HunyuanVideo-1.5, and we have kept the implementation focused on decode parallel only.

@daixinning daixinning changed the title [Feat][HunyuanVideo-1.5]Support vae-patch-parallel [WIP][Feat][HunyuanVideo-1.5]Support vae-patch-parallel Apr 13, 2026
@daixinning daixinning force-pushed the hy-video-vae-pp branch 2 times, most recently from f932687 to 4a2a6d1 Compare April 13, 2026 12:10
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

The executor fix is correct — the old pp_size <= self.world_size was always true since pp_size = min(parallel_size, world_size), so the else branch was dead code.

Two things:

  1. The encode methods (encode_tile_split/exec/merge, lines 115–189) appear unused — are they intended for a follow-up? If so, consider adding them in that PR so they can be tested alongside the code that calls them.
  2. No unit tests for the tile split/merge/blend logic. The overlapping tile blending in both H and W is non-trivial and worth testing independently of the e2e run.

  # 2. local decode
  assigned = self._balance_tasks(tiletask_list, pp_size)
- local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+ local_tasks = assigned[self.rank] if self.rank < pp_size else []
Collaborator

The fix itself is good, but the PR description says this removes "unnecessary pp_size = min(parallel_size, world_size) indirection" — the min() is fine, the old condition was the bug.

Contributor Author

Good catch, thanks. The PR description has been updated to clarify: the min() itself is correct and necessary; the bug was the guard condition pp_size <= self.world_size, which is always True by construction.

def _strategy_select(self, z: torch.Tensor):
    """Use tile strategy when latent exceeds tile size."""
    need_spatial = z.shape[3] > self.tile_latent_min_height or z.shape[4] > self.tile_latent_min_width
    if need_spatial:
Collaborator

The base class returns patch split as fallback here. This returns None, which means small latents skip distributed execution entirely and just call self.decoder(z) directly. Intentional?

Contributor Author

Yes, intentional. Two reasons:

  1. In practice this path is never reached. The tile threshold is 32×32 latent (derived from tile_sample_min=256, spatial_ratio=8). Real inference resolutions always exceed this — 480p latent is 60×104, 720p is 90×160 — so _strategy_select always returns the tile strategy and the None branch is dead code under normal usage.

  2. The base class patch_split is not compatible with HunyuanVideo latents. patch_split operates on 4D tensors [B, C, H, W], while HunyuanVideo latents are 5D [B, C, T, H, W]. Falling back to patch strategy would cause a shape mismatch. For the rare case of a latent smaller than the tile threshold, falling back to single-GPU decode via self.decoder(z) is both correct and cheap.
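The threshold arithmetic in point 1 can be checked with a short sketch. The constant names and the `needs_tiling` helper are hypothetical; the values (tile_sample_min=256, spatial_ratio=8, and the 480p/720p latent sizes) are the ones quoted in this reply, and the comparison mirrors the `_strategy_select` condition shown in the diff.

```python
# Sketch only: derive the latent tile threshold and test common resolutions.
SPATIAL_RATIO = 8       # pixel-to-latent downsampling factor from this reply
TILE_SAMPLE_MIN = 256   # tile size in pixel space from this reply

tile_latent_min = TILE_SAMPLE_MIN // SPATIAL_RATIO  # 32


def needs_tiling(latent_h, latent_w, tile_min=tile_latent_min):
    """Same comparison as the need_spatial check: either dimension too large."""
    return latent_h > tile_min or latent_w > tile_min


latents = {"480p": (60, 104), "720p": (90, 160), "at threshold": (32, 32)}
for name, (h, w) in latents.items():
    print(f"{name}: latent {h}x{w}, tiling needed: {needs_tiling(h, w)}")
```

Only a latent of at most 32×32 (a 256×256 pixel frame) would skip the tile strategy under this check.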

Contributor Author

Good point. Updated: small latents now also go through tile_split/exec/merge instead of falling back to single-GPU decode, which is consistent with the base class behavior of always using distributed execution when enabled. (The base class patch_split is not reused here since it operates on 4D tensors and is incompatible with HunyuanVideo's 5D latents [B, C, T, H, W].)

@daixinning daixinning changed the title [WIP][Feat][HunyuanVideo-1.5]Support vae-patch-parallel [Feat][HunyuanVideo-1.5]Support vae-patch-parallel Apr 13, 2026
@daixinning
Contributor Author

@hsliuustc0106 Thanks for the review. Regarding point 1 (encode methods): the encode methods have been removed in the latest push. HunyuanVideo-1.5 I2V only encodes a single reference frame [B, C, 1, H, W], so the compute cost is negligible and tiled encode parallel adds no practical benefit. T2V has no encode at all. The implementation is now focused on decode parallel only.

@daixinning
Contributor Author

@hsliuustc0106 Unit tests for tile_split, tile_merge, and blend logic have been added in tests/diffusion/distributed/test_autoencoder_kl_hunyuanvideo.py (12 tests, all passing). Coverage includes: tile grid shape and coord uniqueness, tile size bounds, merge output shape, uniform-latent round-trip, and blend boundary/extent correctness for both H and W directions.
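For illustration, a uniform-latent round-trip check of the kind listed above might look roughly like the sketch below. It is a simplified 1D analogue with an identity "decode" and hypothetical helper names (`split_1d`, `blend_1d`, `merge_1d`); it is not the test file added in this PR, and the stride of 24 is an assumed value.

```python
# Sketch only: 1D split -> identity decode -> blend/merge round trip.

def split_1d(values, tile, stride):
    """Cut overlapping tiles; the last tile may be shorter than `tile`."""
    return [values[i:i + tile] for i in range(0, len(values), stride)]


def blend_1d(prev, cur, extent):
    """Linearly ramp from the tail of `prev` into the head of `cur`."""
    extent = min(len(prev), len(cur), extent)
    out = list(cur)
    for k in range(extent):
        w = k / extent
        out[k] = prev[len(prev) - extent + k] * (1 - w) + cur[k] * w
    return out


def merge_1d(tiles, tile, stride):
    """Blend each tile with its predecessor, then keep `stride` samples."""
    blend_extent = tile - stride
    merged, prev = [], None
    for cur in tiles:
        if prev is not None:
            cur = blend_1d(prev, cur, blend_extent)
        merged.extend(cur[:stride])
        prev = cur
    return merged


def test_uniform_round_trip():
    latent = [0.5] * 60  # uniform signal, length matches the 480p latent height
    tiles = split_1d(latent, tile=32, stride=24)
    out = merge_1d(tiles, tile=32, stride=24)
    assert len(out) == len(latent)
    assert all(abs(o - v) < 1e-12 for o, v in zip(out, latent))


test_uniform_round_trip()
```

A uniform input is a convenient first check because any blend of equal values must return the same value, so a failure points at the indexing or cropping logic rather than the decoder.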

- Add DistributedAutoencoderKLHunyuanVideo with overlapping tile split/exec/merge
  for distributed decode
- decode: tile_split/exec/merge always used (small latents included, consistent
  with base class behavior; base class patch_split is incompatible with 5D latents)
- Fix distributed_vae_executor: use self.rank < pp_size instead of pp_size <= self.world_size
- Wire DistributedAutoencoderKLHunyuanVideo into T2V and I2V pipelines
- Add unit tests for tile_split, tile_merge, and blend logic

Decode parallel follows the pattern from vllm-project#2368 (Wan VAE encode parallel).

Signed-off-by: daixinning <daixinning@163.com>