Skip to content

[Perf] Optimize Wan2.2 device free on image preprocess #2852

Merged
gcanlin merged 4 commits into vllm-project:main from fan2956:main_fix_free
Apr 20, 2026

Conversation

@fan2956 (Contributor) commented Apr 16, 2026


Purpose

This PR fixes two issues in pipeline_wan2_2_i2v.py that caused CPU fallback and device mismatches:

  1. Moves the input tensor to the target device before video_processor.preprocess, eliminating CPU-side preprocessing.
  2. Optimizes mask_lat_size generation by using native PyTorch tensor slicing instead of list indexing, ensuring full execution on the target device (NPU/GPU).

No functional changes are made to the original logic. The changes improve inference stability across hardware platforms.
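A minimal CPU-side sketch of the two patterns described above; the shapes are illustrative and the surrounding variable names (image, mask_lat_size) follow the PR description rather than the actual pipeline_wan2_2_i2v.py source, so treat this as an approximation:

```python
import torch

device = torch.device("cpu")  # would be "cuda" / "npu" in the real pipeline
num_frames = 81

# Change 1: move the input tensor to the target device *before* calling
# video_processor.preprocess, so preprocessing runs on the accelerator.
image = torch.rand(3, 480, 832)   # stand-in for the output of TF.to_tensor
image = image.to(device)          # device move happens up front

# Change 2: build mask_lat_size with native slicing instead of a Python
# list of indices, keeping the whole operation on the target device.
mask_lat_size = torch.ones(1, 1, num_frames, 4, 4, device=device)
by_list = mask_lat_size[:, :, list(range(1, num_frames))]  # before
by_slice = mask_lat_size[:, :, 1:]                         # after

assert torch.equal(by_list, by_slice)
print(by_slice.shape)  # torch.Size([1, 1, 80, 4, 4])
```

Both indexing forms select frames 1..80, but the slice avoids materializing an index list and the host-side index handling that comes with it.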

Test Plan

export MULTI_STREAM_MEMORY_REUSE=2
vllm serve /home/y00958577/Wan2.2-I2V-A14B-Diffusers/ \
  --omni \
  --port 8099 \
  --usp 8 \
  --use-hsdp \
  --enforce-eager \
  --log-stats \
  --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": "False"}'\
  --vae-patch-parallel-size 8 \
  --vae-use-tiling
curl -X POST http://localhost:8099/v1/videos \
    -F "prompt='A frontal close-up of a brown hare, shot from a low angle to create an intimate yet stately visual impact. Its round, jet-black eyes gaze straight into the lens, mixing wild-animal alertness with an ineffable gentle curiosity, as if holding a silent cross-species dialogue with the viewer. Its coat shows a richly layered brown gradient, from a light cream belly to a deep brown back, every strand of fur distinct and glossy as silk under side lighting. Three pairs of slender white whiskers tremble slightly with its breathing, occasionally swaying as they sample the air.
Both signature long ears stand fully upright, the outer surface covered in short dense brown fur, the inner side revealing a pink network of blood vessels faintly visible beneath paper-thin skin; the ears swivel in small increments to pinpoint sounds. The background is a clear azure sky where fluffy white cumulus clouds drift slowly sideways, their shadows alternating over the head of the hare as the light shifts. Bright sunlight pours down from the upper left at a 45-degree angle, casting a soft Rembrandt-style shadow on the right side of the face and heightening its depth and fur texture.
The moist black nose twitches rapidly three to four times per second, an instinctive way of sensing chemical signals; the pink three-part mouth opens slightly to reveal white incisors as the lower jaw grinds steadily from side to side. The shot uses a wide aperture and shallow depth of field, with focus locked on the focal plane of the eyes; the sky and distant green vegetation blur into round colored bokeh, while a few tender green blades of grass intrude at the edge of the frame, swaying in slow arcs in a gentle spring breeze.
The whole scene is suffused with serene pastoral poetry, warm and saturated in color, full of life. After five seconds of eye contact, the hare completes a full nictitating-membrane blink typical of lagomorphs, the third eyelid sliding laterally across the eyeball from the inner corner, then slowly tilts its head fifteen degrees to the right, a gesture that in ethology signals cognitive processing and curiosity; the ears tilt in the same direction before it returns to facing forward, whiskers relaxed, completing this brief and precious natural record.'" \
    -F "input_reference=@/home/zf/vllm-omni/rabbit.jpeg" \
    -F "size=832x480" \
    -F "seconds=5" \
    -F "fps=16" \
    -F "num_frames=81" \
    -F "guidance_scale=1" \
    -F "guidance_scale_2=1" \
    -F "flow_shift=5.0" \
    -F "num_inference_steps=4" \
    -F "seed=42"

Test Result

before: e2e_total_ms = 9,837.386
after:  e2e_total_ms = 9,512.703


Essential Elements of an Effective PR Description Checklist
  • [√] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [√] The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • [√] The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: fan2956 <zhoufan53@huawei.com>
@fan2956 fan2956 requested a review from hsliuustc0106 as a code owner April 16, 2026 11:44
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Signed-off-by: fan2956 <zhoufan53@huawei.com>
@david6666666 (Collaborator)
@bjf-frz @gcanlin ptal thx

video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)

if isinstance(image, PIL.Image.Image):
    image = TF.to_tensor(image).to(device)
Collaborator:
Can't we just move tensor to device instead of introducing TF?

Contributor Author:

> Can't we just move tensor to device instead of introducing TF?
If we do not use TF.to_tensor, we would need to implement an equivalent conversion function ourselves.
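For illustration, a minimal sketch of what that manual replacement could look like, assuming an RGB PIL input; TF.to_tensor also handles grayscale and other modes, which is why reusing it is simpler. The helper name pil_to_tensor01 is hypothetical:

```python
import numpy as np
import torch
from PIL import Image

def pil_to_tensor01(img: Image.Image) -> torch.Tensor:
    """PIL HWC uint8 image -> float32 CHW tensor scaled to [0, 1],
    mirroring the core of torchvision's TF.to_tensor for RGB inputs."""
    arr = np.asarray(img, dtype=np.float32) / 255.0  # HWC, in [0, 1]
    return torch.from_numpy(arr).permute(2, 0, 1)    # CHW

img = Image.new("RGB", (8, 6), color=(255, 128, 0))  # width 8, height 6
t = pil_to_tensor01(img)
print(t.shape)  # torch.Size([3, 6, 8])
```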

@gcanlin (Collaborator) left a comment:
I don't have other concerns. Just confirm one question.

# Handle last_image if provided
if last_image is not None:
    if isinstance(last_image, PIL.Image.Image):
        image = TF.to_tensor(image).to(device)
Contributor:
This line is redundant with the branch above: image here is already a tensor on the device.

Contributor Author:
done

# Handle last_image if provided
if last_image is not None:
    if isinstance(last_image, PIL.Image.Image):
        image = TF.to_tensor(image).to(device)
Collaborator:
This should convert last_image, not image. At this point the first branch may already have reassigned image = TF.to_tensor(image), so when both image and last_image are PIL inputs this line will try to run to_tensor() on an existing tensor and fail before preprocessing the last frame. Can you switch this to last_image = TF.to_tensor(last_image).to(device)?

Contributor Author:
done

@lishunyang12 (Collaborator) left a comment:

Review: [Perf] Optimize Wan2.2 device free on image preprocess

Summary

The mask_lat_size optimization (change 3) is correct and clean — creating the tensor directly on device and using native slicing instead of list(range(...)) is a straightforward improvement.

However, there is a copy-paste bug in the image preprocessing changes that needs to be fixed before merge.

Bug: Wrong variable in last_image branch (line ~491)

if last_image is not None:
    if isinstance(last_image, PIL.Image.Image):
        image = TF.to_tensor(image).to(device)  # BUG: should be `last_image`
        last_image_tensor = video_processor.preprocess(last_image, height=height, width=width)

This line converts image (not last_image) to a tensor. At this point in the code, image has already been processed in the block above, so this:

  1. Does not achieve the intended optimization: last_image is still a PIL Image when passed to video_processor.preprocess, so it will still be preprocessed on the CPU.
  2. Overwrites image with a raw TF.to_tensor result (a [C,H,W] tensor with no resize/normalize), clobbering the value that was set earlier. While image may not be used after this point, it is still incorrect and fragile.

The fix should be:

last_image = TF.to_tensor(last_image).to(device)
last_image_tensor = video_processor.preprocess(last_image, height=height, width=width)

Minor concern: double conversion

For the first image path, TF.to_tensor(image) converts the PIL Image to a float [0,1] tensor. Then video_processor.preprocess(image, ...) receives this tensor. Diffusers' VideoProcessor.preprocess has its own PIL-to-tensor path and normalization logic. Please verify that passing a pre-converted tensor doesn't cause unexpected behavior (e.g., double normalization from [0,1] to [-1,1] twice, or shape mismatches since TF.to_tensor produces [C,H,W] while VideoProcessor may expect [B,C,H,W] or PIL input).

If the goal is simply to avoid CPU-side work in VideoProcessor.preprocess, an alternative would be to just do image_tensor = video_processor.preprocess(image, ...).to(device) and move the result to device immediately, rather than introducing a separate TF.to_tensor call before it.
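The double-normalization risk can be demonstrated in isolation. The normalize function below stands in for the [0, 1] to [-1, 1] mapping that diffusers processors typically apply; this is an assumption about VideoProcessor.preprocess behavior, so verify against the diffusers version actually in use:

```python
import torch

def normalize(x: torch.Tensor) -> torch.Tensor:
    # The [0, 1] -> [-1, 1] map commonly applied by diffusers processors.
    return 2.0 * x - 1.0

x = torch.tensor([0.0, 0.5, 1.0])  # a [0, 1] tensor, as TF.to_tensor yields
once = normalize(x)                # tensor([-1., 0., 1.]) - intended range
twice = normalize(once)            # tensor([-3., -1., 1.]) - out of range

assert float(once.min()) == -1.0
assert float(twice.min()) == -3.0  # double normalization corrupts the data
```

If preprocess detects an already-float input and skips re-normalization, the concern is moot; the point is only that this needs to be checked, not guessed.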

mask_lat_size changes (LGTM)

The changes in prepare_latents are correct:

  • Adding device=latent_condition.device avoids creating on CPU then moving.
  • Replacing list(range(1, num_frames)) with 1: is more idiomatic and avoids materializing a Python list.

Both are good micro-optimizations consistent with the ~3% e2e improvement reported.

Verdict

Please fix the copy-paste bug on the last_image branch and verify the TF.to_tensor + VideoProcessor.preprocess interaction doesn't cause double normalization.

@hsliuustc0106 (Collaborator):

BLOCKER scan:

  • Correctness: ISSUES (line 491: copy-paste error - uses image instead of last_image)
  • Reliability/Safety: PASS
  • Breaking Changes: PASS
  • Test Coverage: PASS (e2e latency provided: 9,837ms -> 9,513ms)
  • Documentation: PASS
  • Security: PASS

OVERALL: 1 BLOCKER FOUND

VERDICT: REQUEST_CHANGES

There is a copy-paste error on line 491:

if isinstance(last_image, PIL.Image.Image):
    image = TF.to_tensor(image).to(device)  # BUG: should be `last_image`, not `image`
    last_image_tensor = video_processor.preprocess(last_image, ...)

This means last_image is never converted to a tensor or moved to the target device, which defeats the purpose of this optimization.

Please fix line 491 to:

last_image = TF.to_tensor(last_image).to(device)

The rest of the changes look good - using tensor slicing instead of list(range(...)) is the right approach to avoid CPU fallback.

Signed-off-by: fan2956 <zhoufan53@huawei.com>
@fan2956 (Contributor Author) commented Apr 17, 2026

> (quoting the reviews from @lishunyang12 and @hsliuustc0106 above)
done

@fan2956 fan2956 requested a review from lishunyang12 April 17, 2026 09:35
@gcanlin gcanlin added ready label to trigger buildkite CI and removed nightly-test label to trigger buildkite nightly test CI labels Apr 19, 2026
@gcanlin gcanlin enabled auto-merge (squash) April 19, 2026 11:15
@gcanlin gcanlin merged commit 78f237e into vllm-project:main Apr 20, 2026
8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
