[Feature] Support VAE tiling parallel encode #2368
Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5694af1067
for i in range(0, height, self.tile_sample_stride_height):
    for j in range(0, width, self.tile_sample_stride_width):
Scale tiled-encode split coordinates after patchify
When config.patch_size is set, _encode_distributed patchifies x before dispatching tile work, but encode_tile_split still iterates with self.tile_sample_stride_* in pre-patch coordinates. On patchified inputs this makes tiles much larger than intended (and far fewer), so patch-parallel encode loses most of its parallelism and uses incorrect overlap geometry for seam blending on patchified Wan checkpoints. The decode path already applies patch-size-aware scaling, so encode should mirror that coordinate system.
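The suggested fix can be sketched as follows. This is a minimal, hedged illustration: `encode_tile_coords` and its parameters are invented for this example and are not the actual vllm-mindspore/diffusers API; the point is only that after patchify the spatial dims shrink by `patch_size`, so the tile stride must be scaled into the same coordinate system.

```python
def encode_tile_coords(height, width, stride_h, stride_w, patch_size=None):
    """Yield top-left (i, j) coordinates for each encode tile, expressed in
    the same coordinate system as the (possibly patchified) input.

    Hypothetical helper: mirrors the patch-size-aware scaling the decode
    path already applies, per the review comment above.
    """
    if patch_size is not None:
        # After patchify, H and W are divided by patch_size, so the stride
        # must shrink by the same factor to keep the intended tile count
        # and overlap geometry.
        stride_h //= patch_size
        stride_w //= patch_size
    for i in range(0, height, stride_h):
        for j in range(0, width, stride_w):
            yield i, j
```

With `patch_size=2` and a pre-patch stride of 4, an 8x8 patchified input is split with stride 2, preserving the tile count the pre-patch coordinates intended.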
Any accuracy test?
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@Bounty-hunter @wtomin Please help review. Thanks!
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
For 1, I have added the doc for VAE encode parallel. For 2, I think it is better to implement it in a follow-up PR; I will request help from the community.
I will enable the nightly test to cover the GPU accuracy test. For NPU, I will run the same test locally and paste the result.
This PR currently leads to an accuracy regression. I will try to fix it.
I think it is expected. Previously the VAE did not use
lishunyang12 left a comment
left a few comments, mostly minor
Has this accuracy issue been resolved?
I found that the current nightly accuracy test only compares offline and online inference, not main versus this PR. So it's weird. It looks more like I didn't align the offline and online parallel configs in the test.
Maybe we can add `buildkite-agent artifact upload` so the generated videos can be inspected and judged by a human.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
I think yes. But other models also need some minor adaptation.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
(cherry picked from commit e771842)
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
- Add DistributedAutoencoderKLHunyuanVideo with overlapping tile split/exec/merge for both encode and decode paths
- encode: encode_tile_split/exec/merge + _encode override with broadcast_result=True
- decode: tile_split/exec/merge + decode override with broadcast_result=False
- Fix distributed_vae_executor: use self.rank < pp_size instead of pp_size <= self.world_size
- Wire DistributedAutoencoderKLHunyuanVideo into T2V and I2V pipelines

Encode parallel follows the pattern from vllm-project#2368 (Wan VAE encode parallel).

Signed-off-by: daixinning <daixinning@163.com>
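The split/exec/merge pattern the commit message describes can be sketched in a single process. This is illustrative only: `distributed_tiled_apply` and the injected `all_gather` callable are invented names, standing in for the real executor's round-robin tile assignment and `torch.distributed` collective.

```python
def distributed_tiled_apply(tiles, fn, rank, world_size, all_gather):
    """Apply fn to the tiles this rank owns (round-robin assignment),
    then gather every rank's results and reassemble them in tile order.

    all_gather(local) stands in for the collective communication; in the
    real code it would be an all-gather across the VAE parallel group.
    """
    # Each rank encodes only its own slice of the tile list.
    local = [(i, fn(t)) for i, t in enumerate(tiles) if i % world_size == rank]
    # Flatten the gathered (index, result) pairs and restore tile order.
    merged = dict(kv for part in all_gather(local) for kv in part)
    return [merged[i] for i in range(len(tiles))]
```

The merge step then blends tile overlaps; the key property is that the gathered results are re-ordered by tile index so the output is rank-count invariant.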
- Add DistributedAutoencoderKLHunyuanVideo with overlapping tile split/exec/merge for distributed decode
- decode: tile_split/exec/merge + decode override with broadcast_result=False
- Fix distributed_vae_executor: use self.rank < pp_size instead of pp_size <= self.world_size
- Wire DistributedAutoencoderKLHunyuanVideo into T2V and I2V pipelines

Decode parallel follows the pattern from vllm-project#2368 (Wan VAE encode parallel).

Signed-off-by: daixinning <daixinning@163.com>
- Add DistributedAutoencoderKLHunyuanVideo with overlapping tile split/exec/merge for distributed decode
- decode: tile_split/exec/merge always used (small latents included, consistent with base class behavior; base class patch_split is incompatible with 5D latents)
- Fix distributed_vae_executor: use self.rank < pp_size instead of pp_size <= self.world_size
- Wire DistributedAutoencoderKLHunyuanVideo into T2V and I2V pipelines
- Add unit tests for tile_split, tile_merge, and blend logic

Decode parallel follows the pattern from vllm-project#2368 (Wan VAE encode parallel).

Signed-off-by: daixinning <daixinning@163.com>
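The blend logic the unit tests cover can be illustrated with a 1-D sketch. `blend_edge` is a hypothetical helper (the real implementation blends 5-D video tensors along height and width): where two tiles overlap, a linear ramp fades the trailing samples of one tile into the leading samples of the next so no visible seam remains.

```python
def blend_edge(a, b, overlap):
    """Blend the trailing `overlap` samples of tile `a` into the leading
    `overlap` samples of tile `b` with a linear ramp, then concatenate.

    1-D illustration of seam blending between adjacent decoded tiles.
    """
    blended = []
    for k in range(overlap):
        w = k / overlap  # 0 at the a-side of the seam, approaching 1 at the b-side
        blended.append(a[len(a) - overlap + k] * (1 - w) + b[k] * w)
    return a[:len(a) - overlap] + blended + b[overlap:]
```

The merged length is `len(a) + len(b) - overlap`, matching what a non-tiled decode of the full input would produce.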
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: bob-021206 <binyan_github@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
Add support for VAE encode parallel (tiled encode split across ranks).
Test Plan
Run Wan2.2 I2V end-to-end on 8 NPUs with Ulysses enabled:
Test Results
Main:
wan22_main.mp4
This PR:
wan22_vae_encode_parallel.mp4