[Feat][HunyuanVideo-1.5]Support vae-patch-parallel by daixinning · Pull Request #2418 · vllm-project/vllm-omni

daixinning · 2026-04-01T11:19:01Z

[Feat][HunyuanVideo-1.5] Support vae-patch-parallel for distributed VAE decode

Purpose

Add distributed tiled VAE decode support for HunyuanVideo 1.5 (T2V and I2V) to enable vae_patch_parallel_size > 1.

The existing DistributedVaeExecutor framework supports splitting VAE decode across multiple workers, but HunyuanVideo 1.5's VAE (AutoencoderKLHunyuanVideo15) lacked the model-specific tile split/merge logic needed to participate in distributed decode. This meant all VAE decode work ran on a single GPU, which is a bottleneck for high-resolution or long-duration video generation.

This PR adds DistributedAutoencoderKLHunyuanVideo, a distributed-aware subclass that implements diffusers-style overlapping spatial tile splitting with linear blending, enabling the VAE decode workload to be distributed across vae_patch_parallel_size workers.

Changes

New file autoencoder_kl_hunyuanvideo.py: Implements DistributedAutoencoderKLHunyuanVideo with:
- tile_split(): Splits latent tensors into overlapping spatial tiles (H, W) using configurable tile/stride sizes
- tile_exec(): Decodes a single tile via the decoder
- tile_merge(): Reassembles decoded tiles with linear overlap blending (both horizontal and vertical)
- _strategy_select(): Automatically enables tiled decode when latent dimensions exceed tile size
- decode(): Dispatches to distributed tiled decode when distributed mode is enabled
Pipeline integration: Both pipeline_hunyuan_video_1_5.py (T2V) and pipeline_hunyuan_video_1_5_i2v.py (I2V) now use DistributedAutoencoderKLHunyuanVideo instead of the upstream AutoencoderKLHunyuanVideo15.
Executor fix (distributed_vae_executor.py): The old guard condition pp_size <= self.world_size was always True (since pp_size = min(parallel_size, world_size) is by definition ≤ world_size), making vae_patch_parallel_size effectively ignored and causing all ranks to receive decode tasks regardless of the configured parallel size. Fixed by replacing the condition with self.rank < pp_size, so only the intended ranks participate in VAE decode work.

Test Plan

End-to-end text-to-video inference on Ascend NPU with HunyuanVideo-1.5 (480p, 33 frames), using the repo's built-in example script with --vae-patch-parallel-size 4 and --vae-use-tiling:

python examples/offline_inference/text_to_video/text_to_video.py \
    --model hunyuanvideo-community/HunyuanVideo-1.5-480p_t2v \
    --prompt "A little girl wearing a straw hat runs through a summer meadow full of wildflowers." \
    --num-frames 33 --num-inference-steps 50 --seed 42 \
    --tensor-parallel-size 2 \
    --cfg-parallel-size 2 \
    --vae-patch-parallel-size 4 \
    --vae-use-tiling

Environment: 4x Ascend NPU.

Test Result

Timing Comparison

Config	TextEncoding	Denoising	Decoding	Total
vae-pp=1	214.71 ms	77461.30 ms	11788.63 ms	90592 ms
vae-pp=4	214.91 ms	77140.65 ms	3639.62 ms	82085 ms
Speedup	--	--	3.24x	1.10x

VAE decode is accelerated 3.24x with vae-pp=4, saving ~8.5s end-to-end.

Quality (LPIPS)

Test setup: HunyuanVideo-1.5 480p T2V, 33 frames, 50 steps, seed=42, 4x Ascend NPU. Reproducibility ensured via msprobe.pytorch.seed_all(seed=42, mode=True) + torch.use_deterministic_algorithms(True).

Metric	Value
Mean LPIPS	0.000000
Max LPIPS	0.000000
Min LPIPS	0.000000

All 33 frames are bit-for-bit identical between vae-pp=1 and vae-pp=4 — distributed tiled decode introduces zero quality loss.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0edc77b59

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

JiwaniZakir

The bug fix in distributed_vae_executor.py (changing pp_size <= self.world_size to self.rank < pp_size) is the right correction — the old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.

In tile_merge, the slicing tile[:, :, :, :row_limit_h, :row_limit_w] is applied uniformly to every tile including the last row and column. When height % overlap_h != 0 or width % overlap_w != 0, the final tile in tile_split will be a partial tile (Python slice semantics return whatever fits), but after decoding it may have spatial dimensions smaller than row_limit_h / row_limit_w. Applying the same crop unconditionally in tile_merge would then silently truncate the output, producing a smaller-than-expected spatial result. The last row/column tiles should use min(row_limit_h, tile.shape[-2]) and min(row_limit_w, tile.shape[-1]) respectively.

Additionally, in tile_merge, the vertical blend self.blend_v(coord_tensor_map[(i - 1, j)], tile, blend_h) always reads the raw decoded output of the previous row's tile from coord_tensor_map, not the already-blended result. This means horizontal blending previously applied to tile (i-1, j) is discarded when computing the vertical blend, which can leave visible seams at interior corners of the tile grid.

hsliuustc0106 · 2026-04-01T22:06:15Z

any acc comparison

daixinning · 2026-04-02T02:58:33Z

The bug fix in distributed_vae_executor.py (changing pp_size <= self.world_size to self.rank < pp_size) is the right correction — the old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.

In tile_merge, the slicing tile[:, :, :, :row_limit_h, :row_limit_w] is applied uniformly to every tile including the last row and column. When height % overlap_h != 0 or width % overlap_w != 0, the final tile in tile_split will be a partial tile (Python slice semantics return whatever fits), but after decoding it may have spatial dimensions smaller than row_limit_h / row_limit_w. Applying the same crop unconditionally in tile_merge would then silently truncate the output, producing a smaller-than-expected spatial result. The last row/column tiles should use min(row_limit_h, tile.shape[-2]) and min(row_limit_w, tile.shape[-1]) respectively.

Additionally, in tile_merge, the vertical blend self.blend_v(coord_tensor_map[(i - 1, j)], tile, blend_h) always reads the raw decoded output of the previous row's tile from coord_tensor_map, not the already-blended result. This means horizontal blending previously applied to tile (i-1, j) is discarded when computing the vertical blend, which can leave visible seams at interior corners of the tile grid.

Fixed, thanks

lishunyang12

left a couple comments

lishunyang12 · 2026-04-02T15:34:00Z

+
+    @classmethod
+    def from_pretrained(cls, *args: Any, **kwargs: Any):
+        model = super().from_pretrained(*args, **kwargs)


DistributedAutoencoderKL_base.from_pretrained already does exactly this (calls super().from_pretrained then init_distributed). Drop the override.

Dropped the override — the base class DistributedAutoencoderKL_base.from_pretrained already calls super().from_pretrained then init_distributed.

lishunyang12 · 2026-04-02T15:34:00Z

+            result = self.decoder(z)
+        else:
+            logger.info("HunyuanVideo VAE: distributed tiled decode with overlap blending")
+            result = self.distributed_decoder.execute(


This logs on every decode call. Use logger.debug or log once.

Fixed, changed to logger.debug.

lishunyang12 · 2026-04-02T15:34:00Z

+        overlap_w = self.tile_latent_stride_width
+        blend_h = self.tile_sample_min_height - self.tile_sample_stride_height
+        blend_w = self.tile_sample_min_width - self.tile_sample_stride_width
+        row_limit_h = self.tile_sample_stride_height


Nit: overlap_h/overlap_w are strides, not overlaps — rename to stride_h/stride_w.

Fixed, renamed to stride_h/stride_w.

…unyuanvideo.py - Remove redundant from_pretrained override; base class already calls super().from_pretrained then init_distributed - Rename overlap_h/overlap_w to stride_h/stride_w to better reflect their semantics as tile strides - Change logger.info to logger.debug to avoid logging on every decode call Signed-off-by: daixinning <daixinning@163.com>

hsliuustc0106

Still need the accuracy comparison previously requested.

hsliuustc0106 · 2026-04-05T14:45:49Z

+        blend_w = grid_spec.tile_spec["blend_w"]
+        row_limit_h = grid_spec.tile_spec["row_limit_h"]
+        row_limit_w = grid_spec.tile_spec["row_limit_w"]
+


Parent tile_exec applies post_quant_conv before decoding. Here and in the decode() fallback (line 124: result = self.decoder(z)) it's skipped. Is this intentional for HunyuanVideo's VAE? If use_post_quant_conv is False in the config it's fine, but worth confirming since it's a silent difference from the base class behavior.

Yes, this is intentional. AutoencoderKLHunyuanVideo15 does not use post_quant_conv in its decode path -- both _decode() and tiled_decode() call self.decoder(z) directly without post_quant_conv. Our tile_exec mirrors that behavior exactly.

lishunyang12 · 2026-04-06T01:14:08Z

Can you show visual outputs and please do LPLPs test, you can refer to #1470

…unyuanvideo.py - Remove redundant from_pretrained override; base class already calls super().from_pretrained then init_distributed - Rename overlap_h/overlap_w to stride_h/stride_w to better reflect their semantics as tile strides - Change logger.info to logger.debug to avoid logging on every decode call Signed-off-by: daixinning <daixinning@163.com>

…anVideo VAE AutoencoderKLHunyuanVideo15 always applies post_quant_conv before decoding (both in _decode and tiled_decode). The distributed tile_exec and the non-tiled decode fallback were skipping this step, causing incorrect output. Addresses review comment from hsliuustc0106 on PR vllm-project#2418.

…anVideo VAE AutoencoderKLHunyuanVideo15 always applies post_quant_conv before decoding (both in _decode and tiled_decode). The distributed tile_exec and the non-tiled decode fallback were skipping this step, causing incorrect output. Addresses review comment from hsliuustc0106 on PR vllm-project#2418. Signed-off-by: daixinning <daixinning@163.com>

daixinning · 2026-04-09T08:56:05Z

Visual Output & LPIPS Quality Benchmark

Test setup: HunyuanVideo-1.5 480p T2V, 33 frames, 50 steps, seed=42, 4x Ascend NPU

Baseline: vae_patch_parallel_size=1

t2v_vae_pp1.mp4

Distributed: vae_patch_parallel_size=4

t2v_vae_pp4.mp4

Reproducibility: Both runs use identical random seeds fixed via msprobe.pytorch.seed_all(seed=42, mode=True), which on Ascend NPU sets torch.npu.manual_seed_all, HCCL_DETERMINISTIC=True, CLOSE_MATMUL_K_SHIFT=1, and torch.use_deterministic_algorithms(True), plus torch.Generator(...).manual_seed(42) for latent initialization.

LPIPS (vae-pp=1 vs vae-pp=4, AlexNet backbone)

Metric	Value
Mean LPIPS	0.000000
Max LPIPS	0.000000
Min LPIPS	0.000000

All 33 frames are bit-for-bit identical -- distributed tiled decode introduces zero quality loss.

Timing Comparison

Config	TextEncoding	Denoising	Decoding	Total
vae-pp=1	214.71 ms	77461.30 ms	11788.63 ms	90592 ms
vae-pp=4	214.91 ms	77140.65 ms	3639.62 ms	82085 ms
Speedup	--	--	3.24x	1.10x

VAE decode is accelerated 3.24x with vae-pp=4, saving ~8.5s end-to-end.

@lishunyang12 @hsliuustc0106

daixinning · 2026-04-09T08:56:49Z

The accuracy comparison is now available — please see the LPIPS benchmark above. In short: vae-pp=1 and vae-pp=4 produce bit-for-bit identical outputs (Mean LPIPS = 0.000000 across all 33 frames), so there is zero quality degradation from distributed tiled decode.

gcanlin · 2026-04-11T13:08:39Z

        # 2. local decode
        assigned = self._balance_tasks(tiletask_list, pp_size)
-        local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+        local_tasks = assigned[self.rank] if self.rank < pp_size else []


Why need to change this line?

As @JiwaniZakir noted in the earlier review:

The old condition was checking a global property rather than whether the current rank falls within the active pipeline stage range.

To be more specific: pp_size is computed as min(self.parallel_size, self.world_size), so pp_size <= self.world_size is always True — meaning every rank would unconditionally receive tasks, making vae_patch_parallel_size effectively a no-op. The fix self.rank < pp_size correctly gates task assignment on whether the current rank is within the active VAE parallel group, so ranks outside the group get an empty task list instead.

gcanlin · 2026-04-11T13:35:01Z

+            return self.tile_split, self.tile_exec, self.tile_merge
+        return None, None, None
+
+    def decode(self, z: torch.Tensor, return_dict: bool = True, *args: Any, **kwargs: Any):


Could you refer to #2368 to implement vae encode parallel for HunyuanVideo?

Hi @gcanlin, thanks for the suggestion. We looked into this carefully but concluded that encode parallel is not needed for HunyuanVideo-1.5.

In Wan I2V (#2368), the VAE encode input is a full-length video condition tensor [B, C, num_frames, H, W] — the same spatial resolution and temporal length as the generated video — so encode is a genuine bottleneck worth parallelizing.

In HunyuanVideo-1.5 I2V, the VAE encode input is a single reference frame [B, C, 1, H, W]. The compute and memory cost is negligible compared to decode, so tiled encode parallel would add code complexity with no practical benefit.

For T2V there is no encode at all. So encode parallel has no meaningful use case for HunyuanVideo-1.5, and we have kept the implementation focused on decode parallel only.

…unyuanvideo.py - Remove redundant from_pretrained override; base class already calls super().from_pretrained then init_distributed - Rename overlap_h/overlap_w to stride_h/stride_w to better reflect their semantics as tile strides - Change logger.info to logger.debug to avoid logging on every decode call Signed-off-by: daixinning <daixinning@163.com>

hsliuustc0106

The executor fix is correct — the old pp_size <= self.world_size was always true since pp_size = min(parallel_size, world_size), so the else branch was dead code.

Two things:

The encode methods (encode_tile_split/exec/merge, lines 115–189) appear unused — are they intended for a follow-up? If so, consider adding them in that PR so they can be tested alongside the code that calls them.
No unit tests for the tile split/merge/blend logic. The overlapping tile blending in both H and W is non-trivial and worth testing independently of the e2e run.

hsliuustc0106 · 2026-04-13T12:11:44Z

        # 2. local decode
        assigned = self._balance_tasks(tiletask_list, pp_size)
-        local_tasks = assigned[self.rank] if pp_size <= self.world_size else []
+        local_tasks = assigned[self.rank] if self.rank < pp_size else []


The fix itself is good, but the PR description says this removes "unnecessary pp_size = min(parallel_size, world_size) indirection" — the min() is fine, the old condition was the bug.

Good catch, thanks. The PR description has been updated to clarify: the itself is correct and necessary, the bug was the guard condition which is always by construction.

hsliuustc0106 · 2026-04-13T12:11:44Z

+    def _strategy_select(self, z: torch.Tensor):
+        """Use tile strategy when latent exceeds tile size."""
+        need_spatial = z.shape[3] > self.tile_latent_min_height or z.shape[4] > self.tile_latent_min_width
+        if need_spatial:


The base class returns patch split as fallback here. This returns None, which means small latents skip distributed execution entirely and just call self.decoder(z) directly. Intentional?

Yes, intentional. Two reasons:

In practice this path is never reached. The tile threshold is 32×32 latent (derived from tile_sample_min=256, spatial_ratio=8). Real inference resolutions always exceed this — 480p latent is 60×104, 720p is 90×160 — so _strategy_select always returns the tile strategy and the None branch is dead code under normal usage.

The base class patch_split is not compatible with HunyuanVideo latents. patch_split operates on 4D tensors [B, C, H, W], while HunyuanVideo latents are 5D [B, C, T, H, W]. Falling back to patch strategy would cause a shape mismatch. For the rare case of a latent smaller than the tile threshold, falling back to single-GPU decode via self.decoder(z) is both correct and cheap.

Good point. Updated: small latents now also go through tile_split/exec/merge instead of falling back to single-GPU decode, which is consistent with the base class behavior of always using distributed execution when enabled. (The base class patch_split is not reused here since it operates on 4D tensors and is incompatible with HunyuanVideo's 5D latents [B, C, T, H, W].)

daixinning · 2026-04-13T13:48:34Z

@hsliuustc0106 Thanks for the review. Regarding point 1 (encode methods): the encode methods have been removed in the latest push. HunyuanVideo-1.5 I2V only encodes a single reference frame , so the compute cost is negligible and tiled encode parallel adds no practical benefit. T2V has no encode at all. The implementation is now focused on decode parallel only.

daixinning · 2026-04-13T14:16:50Z

@hsliuustc0106 Unit tests for tile_split, tile_merge, and blend logic have been added in tests/diffusion/distributed/test_autoencoder_kl_hunyuanvideo.py (12 tests, all passing). Coverage includes: tile grid shape and coord uniqueness, tile size bounds, merge output shape, uniform-latent round-trip, and blend boundary/extent correctness for both H and W directions.

- Add DistributedAutoencoderKLHunyuanVideo with overlapping tile split/exec/merge for distributed decode - decode: tile_split/exec/merge always used (small latents included, consistent with base class behavior; base class patch_split is incompatible with 5D latents) - Fix distributed_vae_executor: use self.rank < pp_size instead of pp_size <= self.world_size - Wire DistributedAutoencoderKLHunyuanVideo into T2V and I2V pipelines - Add unit tests for tile_split, tile_merge, and blend logic Decode parallel follows the pattern from vllm-project#2368 (Wan VAE encode parallel). Signed-off-by: daixinning <daixinning@163.com>

daixinning requested a review from hsliuustc0106 as a code owner April 1, 2026 11:19

daixinning force-pushed the hy-video-vae-pp branch from f0edc77 to 6307185 Compare April 1, 2026 11:20

chatgpt-codex-connector Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread vllm_omni/diffusion/distributed/autoencoders/distributed_vae_executor.py Outdated

Comment thread vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl_hunyuanvideo.py Outdated

daixinning force-pushed the hy-video-vae-pp branch 5 times, most recently from 4805239 to 521fa53 Compare April 1, 2026 16:12

JiwaniZakir reviewed Apr 1, 2026

View reviewed changes

daixinning force-pushed the hy-video-vae-pp branch 2 times, most recently from 27decbc to 9914d11 Compare April 2, 2026 02:29

lishunyang12 reviewed Apr 2, 2026

View reviewed changes

hsliuustc0106 reviewed Apr 5, 2026

View reviewed changes

daixinning force-pushed the hy-video-vae-pp branch from 725d231 to bc460aa Compare April 7, 2026 05:55

daixinning force-pushed the hy-video-vae-pp branch 2 times, most recently from 932cf5f to 72eef1f Compare April 9, 2026 07:30

gcanlin reviewed Apr 11, 2026

View reviewed changes

daixinning changed the title ~~[Feat][HunyuanVideo-1.5]Support vae-patch-parallel~~ [WIP][Feat][HunyuanVideo-1.5]Support vae-patch-parallel Apr 13, 2026

daixinning force-pushed the hy-video-vae-pp branch from b87292d to 272e08d Compare April 13, 2026 11:20

daixinning force-pushed the hy-video-vae-pp branch 2 times, most recently from f932687 to 4a2a6d1 Compare April 13, 2026 12:10

hsliuustc0106 reviewed Apr 13, 2026

View reviewed changes

daixinning force-pushed the hy-video-vae-pp branch from 4a2a6d1 to 22a89eb Compare April 13, 2026 13:27

daixinning changed the title ~~[WIP][Feat][HunyuanVideo-1.5]Support vae-patch-parallel~~ [Feat][HunyuanVideo-1.5]Support vae-patch-parallel Apr 13, 2026

daixinning force-pushed the hy-video-vae-pp branch from 22a89eb to d424a9e Compare April 13, 2026 14:15

daixinning force-pushed the hy-video-vae-pp branch from d424a9e to 7ef11cf Compare April 13, 2026 14:26

daixinning force-pushed the hy-video-vae-pp branch from 7ef11cf to 42e90ab Compare April 13, 2026 14:31

wtomin mentioned this pull request Apr 15, 2026

[RFC]: Continuous Diffusion Model Acceleration Support #1217

Open

1 task

Conversation

daixinning commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Test Plan

Test Result

Timing Comparison

Quality (LPIPS)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

JiwaniZakir left a comment

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 commented Apr 1, 2026

Uh oh!

daixinning commented Apr 2, 2026

Uh oh!

lishunyang12 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lishunyang12 commented Apr 6, 2026

Uh oh!

daixinning commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Visual Output & LPIPS Quality Benchmark

LPIPS (vae-pp=1 vs vae-pp=4, AlexNet backbone)

Timing Comparison

Uh oh!

daixinning commented Apr 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

daixinning commented Apr 13, 2026

Uh oh!

daixinning commented Apr 13, 2026

Uh oh!

daixinning commented Apr 1, 2026 •

edited

Loading

daixinning commented Apr 9, 2026 •

edited

Loading