Skip to content

[Perf] Optimize Wan2.2 device free on image preprocess #2852

Merged
gcanlin merged 4 commits into vllm-project:main from fan2956:main_fix_free
Apr 20, 2026

Conversation

@fan2956 (Contributor) commented Apr 16, 2026


Purpose

This PR fixes two issues in pipeline_wan2_2_i2v.py that caused CPU fallback and device mismatches:

  1. Moves the input tensor to the target device before video_processor.preprocess, eliminating CPU-side preprocessing.
  2. Optimizes mask_lat_size generation by using native PyTorch tensor slicing instead of list indexing, ensuring full execution on the target device (NPU/GPU).

No functional changes are made to the original logic. The changes improve inference stability across hardware platforms.
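A minimal CPU-side sketch of the two patterns described above; the shapes are illustrative and the surrounding variable names (image, mask_lat_size) follow the PR description rather than the actual pipeline_wan2_2_i2v.py source, so treat this as an approximation:

```python
import torch

device = torch.device("cpu")  # would be "cuda" / "npu" in the real pipeline
num_frames = 81

# Change 1: move the input tensor to the target device *before* calling
# video_processor.preprocess, so preprocessing runs on the accelerator.
image = torch.rand(3, 480, 832)   # stand-in for the output of TF.to_tensor
image = image.to(device)          # device move happens up front

# Change 2: build mask_lat_size with native slicing instead of a Python
# list of indices, keeping the whole operation on the target device.
mask_lat_size = torch.ones(1, 1, num_frames, 4, 4, device=device)
by_list = mask_lat_size[:, :, list(range(1, num_frames))]  # before
by_slice = mask_lat_size[:, :, 1:]                         # after

assert torch.equal(by_list, by_slice)
print(by_slice.shape)  # torch.Size([1, 1, 80, 4, 4])
```

Both indexing forms select frames 1..80, but the slice avoids materializing an index list and the host-side index handling that comes with it.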

Test Plan

export MULTI_STREAM_MEMORY_REUSE=2
vllm serve /home/y00958577/Wan2.2-I2V-A14B-Diffusers/ \
  --omni \
  --port 8099 \
  --usp 8 \
  --use-hsdp \
  --enforce-eager \
  --log-stats \
  --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile", "torch_profiler_with_stack": "False"}'\
  --vae-patch-parallel-size 8 \
  --vae-use-tiling
curl -X POST http://localhost:8099/v1/videos \
    -F "prompt='A frontal close-up of a brown hare, shot from a low angle to create an intimate yet stately visual impact. Its round, jet-black eyes gaze straight into the lens, mixing wild-animal alertness with an ineffable gentle curiosity, as if holding a silent cross-species dialogue with the viewer. Its coat shows a richly layered brown gradient, from a light cream belly to a deep brown back, every strand of fur distinct and glossy as silk under side lighting. Three pairs of slender white whiskers tremble slightly with its breathing, occasionally swaying as they sample the air.
Both signature long ears stand fully upright, the outer surface covered in short dense brown fur, the inner side revealing a pink network of blood vessels faintly visible beneath paper-thin skin; the ears swivel in small increments to pinpoint sounds. The background is a clear azure sky where fluffy white cumulus clouds drift slowly sideways, their shadows alternating over the head of the hare as the light shifts. Bright sunlight pours down from the upper left at a 45-degree angle, casting a soft Rembrandt-style shadow on the right side of the face and heightening its depth and fur texture.
The moist black nose twitches rapidly three to four times per second, an instinctive way of sensing chemical signals; the pink three-part mouth opens slightly to reveal white incisors as the lower jaw grinds steadily from side to side. The shot uses a wide aperture and shallow depth of field, with focus locked on the focal plane of the eyes; the sky and distant green vegetation blur into round colored bokeh, while a few tender green blades of grass intrude at the edge of the frame, swaying in slow arcs in a gentle spring breeze.
The whole scene is suffused with serene pastoral poetry, warm and saturated in color, full of life. After five seconds of eye contact, the hare completes a full nictitating-membrane blink typical of lagomorphs, the third eyelid sliding laterally across the eyeball from the inner corner, then slowly tilts its head fifteen degrees to the right, a gesture that in ethology signals cognitive processing and curiosity; the ears tilt in the same direction before it returns to facing forward, whiskers relaxed, completing this brief and precious natural record.'" \
    -F "input_reference=@/home/zf/vllm-omni/rabbit.jpeg" \
    -F "size=832x480" \
    -F "seconds=5" \
    -F "fps=16" \
    -F "num_frames=81" \
    -F "guidance_scale=1" \
    -F "guidance_scale_2=1" \
    -F "flow_shift=5.0" \
    -F "num_inference_steps=4" \
    -F "seed=42"

Test Result

before: e2e_total_ms = 9,837.386
after:  e2e_total_ms = 9,512.703


Essential Elements of an Effective PR Description Checklist
  • [√] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [√] The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • [√] The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: fan2956 <zhoufan53@huawei.com>
@fan2956 fan2956 requested a review from hsliuustc0106 as a code owner April 16, 2026 11:44
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Signed-off-by: fan2956 <zhoufan53@huawei.com>
@david6666666 (Collaborator)
@bjf-frz @gcanlin ptal thx

video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)

if isinstance(image, PIL.Image.Image):
    image = TF.to_tensor(image).to(device)
Collaborator:
Can't we just move tensor to device instead of introducing TF?

Contributor Author:

> Can't we just move tensor to device instead of introducing TF?
If we do not use TF.to_tensor, we would need to implement an equivalent conversion function ourselves.
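For illustration, a minimal sketch of what that manual replacement could look like, assuming an RGB PIL input; TF.to_tensor also handles grayscale and other modes, which is why reusing it is simpler. The helper name pil_to_tensor01 is hypothetical:

```python
import numpy as np
import torch
from PIL import Image

def pil_to_tensor01(img: Image.Image) -> torch.Tensor:
    """PIL HWC uint8 image -> float32 CHW tensor scaled to [0, 1],
    mirroring the core of torchvision's TF.to_tensor for RGB inputs."""
    arr = np.asarray(img, dtype=np.float32) / 255.0  # HWC, in [0, 1]
    return torch.from_numpy(arr).permute(2, 0, 1)    # CHW

img = Image.new("RGB", (8, 6), color=(255, 128, 0))  # width 8, height 6
t = pil_to_tensor01(img)
print(t.shape)  # torch.Size([3, 6, 8])
```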

@gcanlin (Collaborator) left a comment:
I don't have other concerns. Just confirm one question.

# Handle last_image if provided
if last_image is not None:
    if isinstance(last_image, PIL.Image.Image):
        image = TF.to_tensor(image).to(device)
Contributor:
This line is redundant with the branch above: image here is already a tensor on the device.

Contributor Author:
done

# Handle last_image if provided
if last_image is not None:
    if isinstance(last_image, PIL.Image.Image):
        image = TF.to_tensor(image).to(device)
Collaborator:
This should convert last_image, not image. At this point the first branch may already have reassigned image = TF.to_tensor(image), so when both image and last_image are PIL inputs this line will try to run to_tensor() on an existing tensor and fail before preprocessing the last frame. Can you switch this to last_image = TF.to_tensor(last_image).to(device)?

Contributor Author:
done

@lishunyang12 (Collaborator) left a comment:

Review: [Perf] Optimize Wan2.2 device free on image preprocess

Summary

The mask_lat_size optimization (change 3) is correct and clean — creating the tensor directly on device and using native slicing instead of list(range(...)) is a straightforward improvement.

However, there is a copy-paste bug in the image preprocessing changes that needs to be fixed before merge.

Bug: Wrong variable in last_image branch (line ~491)

if last_image is not None:
    if isinstance(last_image, PIL.Image.Image):
        image = TF.to_tensor(image).to(device)  # BUG: should be `last_image`
        last_image_tensor = video_processor.preprocess(last_image, height=height, width=width)

This line converts image (not last_image) to a tensor. At this point in the code, image has already been processed in the block above, so this:

  1. Does not achieve the intended optimization: last_image is still a PIL Image when passed to video_processor.preprocess, so it will still be preprocessed on the CPU.
  2. Overwrites image with a raw TF.to_tensor result (a [C,H,W] tensor with no resize/normalize), clobbering the value that was set earlier. While image may not be used after this point, it is still incorrect and fragile.

The fix should be:

last_image = TF.to_tensor(last_image).to(device)
last_image_tensor = video_processor.preprocess(last_image, height=height, width=width)

Minor concern: double conversion

For the first image path, TF.to_tensor(image) converts the PIL Image to a float [0,1] tensor. Then video_processor.preprocess(image, ...) receives this tensor. Diffusers' VideoProcessor.preprocess has its own PIL-to-tensor path and normalization logic. Please verify that passing a pre-converted tensor doesn't cause unexpected behavior (e.g., double normalization from [0,1] to [-1,1] twice, or shape mismatches since TF.to_tensor produces [C,H,W] while VideoProcessor may expect [B,C,H,W] or PIL input).

If the goal is simply to avoid CPU-side work in VideoProcessor.preprocess, an alternative would be to just do image_tensor = video_processor.preprocess(image, ...).to(device) and move the result to device immediately, rather than introducing a separate TF.to_tensor call before it.
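The double-normalization risk can be demonstrated in isolation. The normalize function below stands in for the [0, 1] to [-1, 1] mapping that diffusers processors typically apply; this is an assumption about VideoProcessor.preprocess behavior, so verify against the diffusers version actually in use:

```python
import torch

def normalize(x: torch.Tensor) -> torch.Tensor:
    # The [0, 1] -> [-1, 1] map commonly applied by diffusers processors.
    return 2.0 * x - 1.0

x = torch.tensor([0.0, 0.5, 1.0])  # a [0, 1] tensor, as TF.to_tensor yields
once = normalize(x)                # tensor([-1., 0., 1.]) - intended range
twice = normalize(once)            # tensor([-3., -1., 1.]) - out of range

assert float(once.min()) == -1.0
assert float(twice.min()) == -3.0  # double normalization corrupts the data
```

If preprocess detects an already-float input and skips re-normalization, the concern is moot; the point is only that this needs to be checked, not guessed.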

mask_lat_size changes (LGTM)

The changes in prepare_latents are correct:

  • Adding device=latent_condition.device avoids creating on CPU then moving.
  • Replacing list(range(1, num_frames)) with 1: is more idiomatic and avoids materializing a Python list.

Both are good micro-optimizations consistent with the ~3% e2e improvement reported.

Verdict

Please fix the copy-paste bug on the last_image branch and verify the TF.to_tensor + VideoProcessor.preprocess interaction doesn't cause double normalization.

@hsliuustc0106 (Collaborator):

BLOCKER scan:

  • Correctness: ISSUES (line 491: copy-paste error - uses image instead of last_image)
  • Reliability/Safety: PASS
  • Breaking Changes: PASS
  • Test Coverage: PASS (e2e latency provided: 9,837ms -> 9,513ms)
  • Documentation: PASS
  • Security: PASS

OVERALL: 1 BLOCKER FOUND

VERDICT: REQUEST_CHANGES

There is a copy-paste error on line 491:

if isinstance(last_image, PIL.Image.Image):
    image = TF.to_tensor(image).to(device)  # BUG: should be `last_image`, not `image`
    last_image_tensor = video_processor.preprocess(last_image, ...)

This means last_image is never converted to a tensor or moved to the target device, which defeats the purpose of this optimization.

Please fix line 491 to:

last_image = TF.to_tensor(last_image).to(device)

The rest of the changes look good - using tensor slicing instead of list(range(...)) is the right approach to avoid CPU fallback.

Signed-off-by: fan2956 <zhoufan53@huawei.com>
@fan2956 (Contributor Author) commented Apr 17, 2026

> (quoting the reviews from @lishunyang12 and @hsliuustc0106 above)
done

@fan2956 fan2956 requested a review from lishunyang12 April 17, 2026 09:35
@gcanlin gcanlin added ready label to trigger buildkite CI and removed nightly-test label to trigger buildkite nightly test CI labels Apr 19, 2026
@gcanlin gcanlin enabled auto-merge (squash) April 19, 2026 11:15
@gcanlin gcanlin merged commit 78f237e into vllm-project:main Apr 20, 2026
8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
