
[cherry-pick][release/v0.18.0.post1] cherry-pick #2847 #2780 #2840 #2876 #2877 #2878

Merged

david6666666 merged 35 commits into vllm-project:release/v0.18.0.post1 from david6666666:codex/release-v0.18.0.post1-pr2847-2780-2840-2876 on Apr 20, 2026

Conversation

@david6666666
Collaborator

Summary

Validation

  • python -m py_compile vllm_omni/engine/async_omni_engine.py tests/entrypoints/test_async_omni_diffusion_config.py tests/entrypoints/openai_api/test_image_server.py
  • python -m pytest -q tests/diffusion/models/qwen_image/test_qwen_image_max_sequence_length.py tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py
  • python -m pytest -q tests/diffusion/models/qwen_image/test_qwen_image_edit_plus.py
  • python -m pytest -q tests/entrypoints/openai_api/test_video_api_utils.py
  • python -m pytest -q tests/entrypoints/test_async_omni_diffusion_config.py tests/entrypoints/openai_api/test_image_server.py
  • pre-commit run --all-files
  • E2E validation for the cherry-picked image/video paths; detailed request/response evidence is posted in the PR comments

Notes

The pushed commits carry Signed-off-by: david6666666 <530634352@qq.com> trailers (one entry is signed off as David Chen <530634352@qq.com>) and were cherry-picked from the following upstream commits:

adda9a6, 281e14a, 66151f0, 1e8fa70, 0a6d618, bd9bfaf, 21851d6, 896b0b8, 0e2f009, eec0785, 72af603, f1900fe, c95d20c, 731c536, 8c857c3, 0c25a06, 826c74a, 297d06b, 4ea2271, f414061, 05a7a5d, f3e7ce9, 3015646, ecbb6d4
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@david6666666
Collaborator Author

Supplemental validation for the ordered backport onto release/v0.18.0.post1.

Cherry-pick order used:

  1. #2847
  2. #2780
  3. #2840
  4. #2876

Local test environment note:

  • PYTHONPATH=/mnt/data4/yipeng/vllm was used so the local vllm API matches this release branch during pytest and serving.

Static / unit validation:

  • python -m py_compile vllm_omni/engine/async_omni_engine.py tests/entrypoints/test_async_omni_diffusion_config.py tests/entrypoints/openai_api/test_image_server.py
    • Passed.
  • python -m pytest -q tests/diffusion/models/qwen_image/test_qwen_image_max_sequence_length.py tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py
    • 28 passed.
  • python -m pytest -q tests/diffusion/models/qwen_image/test_qwen_image_edit_plus.py
    • 1 passed.
  • python -m pytest -q tests/entrypoints/openai_api/test_video_api_utils.py
    • 6 passed.
  • python -m pytest -q tests/entrypoints/test_async_omni_diffusion_config.py tests/entrypoints/openai_api/test_image_server.py
    • 52 passed.
  • pre-commit run --all-files
    • Passed after hook-applied formatting cleanup.
    • Formatting-only follow-up commit on this branch: beeb333a ([Chore] run pre-commit formatting).

E2E for #2847:

  • Qwen-Image serve on local snapshot /mnt/data1/huggingface/hub/models--Qwen--Qwen-Image/snapshots/75e0b4be04f60ec59a75f475837eced720f823b6
    • short prompt request to /v1/images/generations: 200, decoded output size 512x512, decoded PNG bytes 842.
    • long prompt request (5000 x rabbit, i.e. the word "rabbit" repeated 5000 times): 500 with message:
      • `prompt` is too long after applying the Qwen prompt template: got 5000 tokens, but `max_sequence_length` is 1024
  • Qwen-Image-Edit serve on local snapshot /mnt/data1/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de
    • short prompt request to /v1/images/edits: 200, decoded output size 512x512, decoded PNG bytes 787271.
    • long prompt request (5000 x rabbit): 500 with message:
      • `prompt` is too long after applying the Qwen prompt template: got 5000 tokens, but `max_sequence_length` is 1024
  • Wan2.2-T2V-A14B-Diffusers serve on local snapshot /mnt/data1/huggingface/hub/models--Wan-AI--Wan2.2-T2V-A14B-Diffusers/snapshots/5be7df9619b54f4e2667b2755bc6a756675b5cd7
    • short /v1/videos request (num_inference_steps=1, num_frames=5): create 200, final status completed, inference_time_s=0.43019302003085613, output bytes 36362.
    • long prompt request (5000 x rabbit): create 200, final status failed, error includes:
      • `prompt` is too long for Wan2.2 text encoding: got 10001 tokens, but `max_sequence_length` is 512
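For reference, the image requests above were exercised with a small client along these lines. This is a minimal sketch, not the exact validation script: the base URL and the OpenAI-style request/response field names (`prompt`, `size`, `n`, `data[0].b64_json`) are assumptions.

```python
# Hypothetical client sketch for the /v1/images/generations checks above.
# Assumptions: server reachable at BASE, OpenAI-style request/response fields
# ("prompt", "size", "n", data[0].b64_json); adjust to the real server.
import base64
import io

import requests
from PIL import Image

BASE = "http://localhost:8000"


def generate(prompt: str, size: str = "512x512") -> requests.Response:
    return requests.post(
        f"{BASE}/v1/images/generations",
        json={"prompt": prompt, "size": size, "n": 1},
        timeout=600,
    )


# Short prompt: expect 200 and a decodable image of the requested size.
ok = generate("a white rabbit on a wooden table")
assert ok.status_code == 200
raw = base64.b64decode(ok.json()["data"][0]["b64_json"])  # assumed field name
print(ok.status_code, Image.open(io.BytesIO(raw)).size, len(raw))

# Long prompt (5000 x rabbit): expect 500 with the max_sequence_length error
# introduced by #2847.
too_long = generate("rabbit " * 5000)
print(too_long.status_code, too_long.text[:200])
```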

E2E for #2840:

  • Qwen-Image-Edit-2511 serve on local snapshot /mnt/data1/huggingface/hub/models--Qwen--Qwen-Image-Edit-2511/snapshots/6f3ccc0b56e431dc6a0c2b2039706d7d26f22cb9
  • /v1/images/edits with 5 input images:
    • 400 with message:
      • Received 5 input images. At most 4 images are supported by this model.
  • /v1/images/edits with 4 input images:
    • 200, decoded output size 512x512, mode RGB.
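The image-count check can be reproduced with a multipart request roughly like the following. This is a sketch only; the field name used for the multiple input images and the base URL are assumptions, since the exact edits-endpoint contract is not spelled out here.

```python
# Hypothetical sketch of the /v1/images/edits image-count check above.
# Assumption: multiple input images are sent as repeated "image" form parts;
# the real multipart field name may differ.
import requests

BASE = "http://localhost:8000"


def edit_with_images(paths, prompt):
    files = [("image", open(p, "rb")) for p in paths]
    try:
        return requests.post(
            f"{BASE}/v1/images/edits",
            files=files,
            data={"prompt": prompt, "size": "512x512"},
            timeout=600,
        )
    finally:
        for _, fh in files:
            fh.close()


# 5 input images: expect 400 ("At most 4 images are supported by this model.").
print(edit_with_images([f"input_{i}.png" for i in range(5)], "combine subjects").status_code)

# 4 input images: expect 200 with a decodable 512x512 RGB output.
print(edit_with_images([f"input_{i}.png" for i in range(4)], "combine subjects").status_code)
```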

E2E for #2876:

  • Serve command used for the main validation path kept the requested runtime settings:
    • CUDA_VISIBLE_DEVICES=4,5,6,7
    • model /mnt/data1/huggingface/hub/models--Wan-AI--Wan2.2-I2V-A14B-Diffusers/snapshots/596658fd9ca6b7b71d5057529bbf319ecbc61d74
    • --omni --port 8099 --enable-diffusion-pipeline-profiler --ulysses-degree 4
  • First run with the exact provided long Chinese prompt:
    • create 200, final status failed
    • failure is expected on the combined backport branch because #2847 now enforces Wan2.2 prompt length before encoding
    • final error includes:
      • `prompt` is too long for Wan2.2 text encoding: got 654 tokens, but `max_sequence_length` is 512
  • Second run to isolate and verify the #2876 RIFE device-selection behavior used the same serve command, same image, same interpolation settings, and a shortened prompt that stays within the new Wan2.2 limit:
    • final status completed
    • artifact_ready_wall_s=210.576
    • server_inference_time_s=209.5431856457144
    • output file bytes 1092405
  • Relevant server log evidence from the successful rerun:
    • Loaded RIFE weights from /mnt/data1/huggingface/hub/models--elfgum--RIFE-4.22.lite/snapshots/99d6892a9f4c039cb37ff21c9530e79b13f0b30b/flownet.pkl
    • RIFE model loaded on device: cuda
    • GET /v1/videos/video_gen_6650cb9180b14ff68ac85d4d79e87039/content HTTP/1.1 200 OK
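For completeness, the create / poll / download flow behind these video runs looks roughly like the following. This is a minimal sketch: the create-request field names and the `status`/`error` fields in the poll response are assumptions, while the `/v1/videos/{id}/content` path and port 8099 match the access log and serve command above.

```python
# Hypothetical sketch of the /v1/videos flow used for the #2876 runs.
# Assumptions: the create request takes the generation parameters as top-level
# JSON fields and returns an "id"; GET /v1/videos/{id} exposes "status"/"error".
import time

import requests

BASE = "http://localhost:8099"

create = requests.post(
    f"{BASE}/v1/videos",
    json={
        "prompt": "A white rabbit hopping forward with smooth motion.",
        "size": "1280x720",
        # ...plus the generation / interpolation parameters listed above
    },
    timeout=600,
)
assert create.status_code == 200
video_id = create.json()["id"]  # assumed response field

# Poll until the job reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/v1/videos/{video_id}", timeout=60).json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

if status["status"] == "completed":
    content = requests.get(f"{BASE}/v1/videos/{video_id}/content", timeout=600)
    with open("output.mp4", "wb") as f:
        f.write(content.content)
    print("output bytes:", len(content.content))
else:
    print("failed:", status.get("error"))
```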

Backport note for #2847 on this release branch:

  • release/v0.18.0.post1 does not contain the Wan2.2 VACE implementation from main, so the backport keeps the prompt-length validation for the Wan2.2 pipelines that exist on this branch (T2V, I2V, TI2V) and drops the VACE-only touchpoints.

david6666666 changed the title from "[Backport][release/v0.18.0.post1] cherry-pick #2847 #2780 #2840 #2876" to "[cherry-pick][release/v0.18.0.post1] cherry-pick #2847 #2780 #2840 #2876" on Apr 17, 2026
david6666666 changed the title from "[cherry-pick][release/v0.18.0.post1] cherry-pick #2847 #2780 #2840 #2876" to "[cherry-pick][release/v0.18.0.post1] cherry-pick #2847 #2780 #2840 #2876 #2877" on Apr 17, 2026
Two further commits were pushed with Signed-off-by: david6666666 <530634352@qq.com> trailers, cherry-picked from upstream commits 072bfa2 and ea6ce23.
@david6666666
Collaborator Author

Update: I cherry-picked #2877 onto this backport branch as well and re-validated the previously blocked #2876 case.

Additional commits on this branch:

  • d7233cbd [Fix] align Wan2.2 max_sequence_length with model config
  • 25ac7cd8 [Fix] raise Wan2.2 max_sequence_length to 2048
  • 67e52e86 [Chore] run pre-commit after PR2877 backport

Additional validation after adding #2877:

  • python -m compileall vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_i2v.py vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_ti2v.py tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py
    • Passed.
  • python -m pytest -q tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py tests/entrypoints/openai_api/test_video_api_utils.py
    • 15 passed.
  • pre-commit run --all-files
    • Passed.

Re-validation of the original #2876 I2V + RIFE case using the exact long Chinese prompt from the earlier run:

  • Serve command remained the same:
    • CUDA_VISIBLE_DEVICES=4,5,6,7
    • model /mnt/data1/huggingface/hub/models--Wan-AI--Wan2.2-I2V-A14B-Diffusers/snapshots/596658fd9ca6b7b71d5057529bbf319ecbc61d74
    • --omni --port 8099 --enable-diffusion-pipeline-profiler --ulysses-degree 4
  • Request parameters remained the same as the previously supplied #2876 validation script, including:
    • size=1280x720
    • seconds=5
    • fps=16
    • num_inference_steps=8
    • guidance_scale=3.5
    • guidance_scale_2=3.5
    • boundary_ratio=0.875
    • num_frames=81
    • flow_shift=5.0
    • seed=42
    • enable_frame_interpolation=true
    • frame_interpolation_exp=1
    • frame_interpolation_scale=1.0
    • frame_interpolation_model_path=/mnt/data1/huggingface/hub/models--elfgum--RIFE-4.22.lite/snapshots/99d6892a9f4c039cb37ff21c9530e79b13f0b30b
  • Result with the original long prompt after adding #2877:
    • final_status=completed
    • artifact_ready_wall_s=215.108
    • server_inference_time_s=214.06226211227477
    • output file bytes 970842
    • video id video_gen_daf89c9c7953414387cfb486bacc5122
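Assembled into a request body, the parameters listed above correspond roughly to the following. This is a sketch only, assuming they are accepted as top-level keys by the /v1/videos create endpoint as in the earlier flow sketch; the original long Chinese prompt is elided.

```python
# Hypothetical /v1/videos request body for the re-validation run above
# (the original long Chinese prompt is elided, not reproduced here).
payload = {
    "prompt": "<original long Chinese prompt>",
    "size": "1280x720",
    "seconds": 5,
    "fps": 16,
    "num_inference_steps": 8,
    "guidance_scale": 3.5,
    "guidance_scale_2": 3.5,
    "boundary_ratio": 0.875,
    "num_frames": 81,
    "flow_shift": 5.0,
    "seed": 42,
    "enable_frame_interpolation": True,
    "frame_interpolation_exp": 1,
    "frame_interpolation_scale": 1.0,
    "frame_interpolation_model_path": "/mnt/data1/huggingface/hub/models--elfgum--RIFE-4.22.lite/snapshots/99d6892a9f4c039cb37ff21c9530e79b13f0b30b",
}
```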

Relevant server log evidence from this exact rerun:

  • Loaded RIFE weights from /mnt/data1/huggingface/hub/models--elfgum--RIFE-4.22.lite/snapshots/99d6892a9f4c039cb37ff21c9530e79b13f0b30b/flownet.pkl
  • RIFE model loaded on device: cuda
  • GET /v1/videos/video_gen_daf89c9c7953414387cfb486bacc5122/content HTTP/1.1 200 OK

So after adding #2877, the exact long-prompt #2876 validation path that previously failed at Wan2.2 prompt-length validation now completes successfully, while still keeping the RIFE device-selection fix validated on CUDA.

gcanlin added the "ready" label ("label to trigger buildkite CI") on Apr 17, 2026
@hsliuustc0106
Collaborator

BLOCKING ISSUE: This PR cherry-picks unmerged PRs (#2840, #2876) from main to the release branch.

Release branches should only receive changes that have been proven on main. Cherry-picking open PRs bypasses the normal review process and can introduce unverified code.

Please wait until #2840 and #2876 are reviewed and merged to main, then cherry-pick from there.

@hsliuustc0106
Collaborator

Cherry-pick validation looks comprehensive.

One concern: cherry-picking multiple PRs together can make conflict resolution fragile. When this lands, verify the backport doesn't create divergence from main-branch behavior, especially the Wan2.2 max_sequence_length changes (the #2847 + #2877 interaction) called out in the notes.

Suggestion for future release branch work: Consider landing PRs individually when possible to reduce merge conflict surface area.

@gcanlin
Collaborator

gcanlin commented Apr 17, 2026

Unit tests are broken in v0.18.0.post1. For the quality of the release, it would be better to fix them.

@FrosterHan
Contributor

#2847 and #2840 passed verification.

@david6666666
Collaborator Author

Follow-up for the Wan2.2 short-prompt performance regression observed on this backport branch.

Root cause

  • After the Wan2.2 max_sequence_length backport, the runtime correctly allowed prompts up to 2048, but encode_prompt() still used padding="max_length" for the text encoder path.
  • That meant short prompts were still encoded at the full configured max_sequence_length, so the extra latency showed up in text_encoder.forward, not in DiT denoising.
  • The regression was specific to Wan2.2 because Qwen-Image uses actual-length padding / processor inputs rather than padding short prompts to the configured ceiling.

Fix

  • Commit: 5be6ff56 ([Fix] avoid padding short Wan2.2 prompts to max_sequence_length)
  • Updated Wan2.2 T2V, I2V, and TI2V so they:
    1. keep validating prompt / negative prompt length against max_sequence_length
    2. compute the actual max prompt length needed by the current batch
    3. only pad the text-encoder inputs to that actual batch max instead of the configured ceiling
  • This keeps the 2048-token support from the earlier backport while removing the short-prompt text-encoding slowdown.
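For readers following the fix, the padding change is roughly the following. This is an illustrative sketch assuming an HF-style tokenizer call; the real encode_prompt() code in the Wan2.2 pipelines differs in detail.

```python
# Illustration only: not the actual vllm_omni pipeline code. Shows the
# before/after tokenization behavior, assuming an HF-style tokenizer.


def encode_padded_to_ceiling(tokenizer, prompts, max_sequence_length):
    # Old behavior: every batch is padded to the configured ceiling, so the
    # text encoder always runs at max_sequence_length even for short prompts.
    return tokenizer(
        prompts,
        padding="max_length",
        max_length=max_sequence_length,
        truncation=True,
        return_tensors="pt",
    )


def encode_padded_to_batch_max(tokenizer, prompts, max_sequence_length):
    # New behavior: still validate against max_sequence_length, then pad only
    # to the longest prompt actually present in the batch.
    lengths = [len(tokenizer(p).input_ids) for p in prompts]
    if max(lengths) > max_sequence_length:
        raise ValueError(
            f"`prompt` is too long for Wan2.2 text encoding: got "
            f"{max(lengths)} tokens, but `max_sequence_length` is "
            f"{max_sequence_length}"
        )
    return tokenizer(
        prompts,
        padding="longest",
        truncation=True,
        max_length=max_sequence_length,
        return_tensors="pt",
    )
```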

Added regression test

  • tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py
  • New coverage asserts that short prompts are encoded at their actual length rather than being padded to the supported max length.
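The shape of the new assertion is roughly this; it is a hypothetical sketch reusing encode_padded_to_batch_max() from the snippet above and assuming a tokenizer fixture is available, not the literal content of the test file.

```python
# Hypothetical shape of the assertion (not the literal test file content).
def test_short_prompt_not_padded_to_ceiling(tokenizer):
    max_sequence_length = 2048
    batch = encode_padded_to_batch_max(tokenizer, ["a rabbit"], max_sequence_length)
    actual_len = len(tokenizer("a rabbit").input_ids)
    # Short prompts are encoded at their own length, not at the ceiling.
    assert batch.input_ids.shape[-1] == actual_len
    assert batch.input_ids.shape[-1] < max_sequence_length
```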

Validation

  • python -m pytest -q tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py
    • 12 passed
  • python -m py_compile vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_i2v.py vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2_ti2v.py tests/diffusion/models/wan2_2/test_wan22_max_sequence_length.py
    • Passed.
  • pre-commit run --all-files
    • Passed.

E2E re-validation (same local environment style as the earlier PR comments)

  • Serve:
    • local snapshot /mnt/data1/huggingface/hub/models--Wan-AI--Wan2.2-I2V-A14B-Diffusers/snapshots/596658fd9ca6b7b71d5057529bbf319ecbc61d74
    • PYTHONPATH=/mnt/data4/yipeng/vllm:<worktree>
    • CUDA_VISIBLE_DEVICES=4,5,6,7
    • --omni --enable-diffusion-pipeline-profiler --ulysses-degree 4
  • Request:
    • /v1/videos
    • prompt: short English prompt (A white rabbit standing on a wooden table, then slowly turning its head and hopping forward with smooth motion.)
    • size=1280x720
    • seconds=5
    • fps=16
    • num_frames=81
    • num_inference_steps=4
    • guidance_scale=3.5
    • guidance_scale_2=3.5
    • boundary_ratio=0.875
    • flow_shift=5.0
    • seed=42
    • frame interpolation disabled

Measured result for the fixed branch

  • final_status=completed
  • inference_time_s=113.2784127406776
  • output file bytes 1300357

Comparison against the earlier measurements collected on April 19, 2026

  • baseline release/v0.18.0.post1: 113.77747260034084 s
  • this PR before the fix: 116.16866869293153 s
  • this PR after the fix: 113.2784127406776 s

Profiler evidence from the fixed run

  • Wan22I2VPipeline.text_encoder.forward returned to roughly 0.014s - 0.018s per call for the measured request
  • Wan22I2VPipeline.forward on the measured request: 111.188085s
  • DiffusionEngine.step breakdown: preprocess=16.90 ms, add_req_and_wait=111750.35 ms, postprocess=236.55 ms, total=112004.36 ms

Conclusion

  • The observed regression on short prompts was in Wan2.2 text encoding, not in DiT denoising.
  • After this fix, the measured Wan2.2 I2V runtime is back in line with the release/v0.18.0.post1 baseline while preserving the larger prompt-length support from the backported validation work.

…-pr2847-2780-2840-2876

Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
david6666666 merged commit 2116e88 into vllm-project:release/v0.18.0.post1 on Apr 20, 2026 (2 of 5 checks passed).
david6666666 added a commit that referenced this pull request on Apr 20, 2026: (#2937)
Signed-off-by: david6666666 <530634352@qq.com>
