[AutoRound] Support WAN2.2 W4A16 quantization model #3353

Open
lvliang-intel wants to merge 13 commits into vllm-project:main from lvliang-intel:feats/ar-w4a16-wan22
@lvliang-intel
Contributor

lvliang-intel commented May 5, 2026

Purpose

Add AutoRound W4A16 quantization support for Wan2.2 pipelines and transformer modules.
https://huggingface.co/Intel/Wan2.2-TI2V-5B-Diffusers-int4-AutoRound
https://huggingface.co/Intel/Wan2.2-I2V-A14B-Diffusers-int4-AutoRound
https://huggingface.co/Intel/Wan2.2-T2V-A14B-Diffusers-int4-AutoRound

Related: #1325, #1777, #2670

Test Plan

Run the unit tests (UT).
Run the VBench dataset accuracy test.

Test Result

Raw Scores

| Dimension | Wan2.2-I2V-A14B-Diffusers | Wan2.2-I2V-A14B-Diffusers-Int4-AutoRound | Wan2.2-T2V-A14B-Diffusers | Wan2.2-T2V-A14B-Diffusers-Int4-AutoRound |
| --- | --- | --- | --- | --- |
| Subject Consistency | 0.9752 | 0.9741 | 0.9508 | 0.9578 |
| Background Consistency | 0.9704 | 0.9691 | 0.9449 | 0.9465 |
| Aesthetic Quality | 0.6241 | 0.6089 | 0.5730 | 0.5980 |
| Imaging Quality | 0.6832 | 0.6679 | 0.6623 | 0.6591 |

Aggregate by Category

| Category | Wan2.2-I2V-A14B-Diffusers | Wan2.2-I2V-A14B-Diffusers-Int4-AutoRound | Wan2.2-T2V-A14B-Diffusers | Wan2.2-T2V-A14B-Diffusers-Int4-AutoRound |
| --- | --- | --- | --- | --- |
| Consistency | 0.9728 | 0.9716 | 0.9478 | 0.9522 |
| Quality | 0.6537 | 0.6384 | 0.6176 | 0.6286 |

Evaluated Dimension Average

| Model | Dimensions Evaluated | Avg Score |
| --- | --- | --- |
| Wan2.2-I2V-A14B-Diffusers | 4 | 0.8132 |
| Wan2.2-I2V-A14B-Diffusers-Int4-AutoRound | 4 | 0.8050 |
| Wan2.2-T2V-A14B-Diffusers | 4 | 0.7827 |
| Wan2.2-T2V-A14B-Diffusers-Int4-AutoRound | 4 | 0.7904 |
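
As a sanity check, the category and overall figures can be reproduced from the raw scores. The sketch below assumes that each category is the unweighted mean of its two dimensions (Consistency = subject + background, Quality = aesthetic + imaging) and that the overall figure is the unweighted mean of all four dimensions. That grouping is inferred from the numbers, not stated in the PR, and the dictionary keys are illustrative names only. Recomputed values can differ from the tables in the last digit because the raw scores shown here are already rounded.

```python
from statistics import mean

# Raw VBench scores copied from the "Raw Scores" table.
RAW = {
    "Wan2.2-I2V-A14B-Diffusers": {
        "subject_consistency": 0.9752, "background_consistency": 0.9704,
        "aesthetic_quality": 0.6241, "imaging_quality": 0.6832,
    },
    "Wan2.2-I2V-A14B-Diffusers-Int4-AutoRound": {
        "subject_consistency": 0.9741, "background_consistency": 0.9691,
        "aesthetic_quality": 0.6089, "imaging_quality": 0.6679,
    },
    "Wan2.2-T2V-A14B-Diffusers": {
        "subject_consistency": 0.9508, "background_consistency": 0.9449,
        "aesthetic_quality": 0.5730, "imaging_quality": 0.6623,
    },
    "Wan2.2-T2V-A14B-Diffusers-Int4-AutoRound": {
        "subject_consistency": 0.9578, "background_consistency": 0.9465,
        "aesthetic_quality": 0.5980, "imaging_quality": 0.6591,
    },
}

# Assumed grouping of dimensions into categories (inferred, not from the PR).
CATEGORIES = {
    "Consistency": ("subject_consistency", "background_consistency"),
    "Quality": ("aesthetic_quality", "imaging_quality"),
}


def category_scores(scores):
    """Unweighted mean of the dimensions in each category."""
    return {
        name: mean(scores[dim] for dim in dims)
        for name, dims in CATEGORIES.items()
    }


def dimension_average(scores):
    """Unweighted mean over all evaluated dimensions."""
    return mean(scores.values())


for model, scores in RAW.items():
    cats = category_scores(scores)
    print(
        f"{model}: consistency={cats['Consistency']:.4f} "
        f"quality={cats['Quality']:.4f} avg={dimension_average(scores):.4f}"
    )
```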

Generation Statistics

| Model | Success Rate | Avg Latency (s) | Avg Memory (MB) | Speedup vs Ref | Memory Ratio vs Ref |
| --- | --- | --- | --- | --- | --- |
| Wan2.2-I2V-A14B-Diffusers | 100.0 | 377.31 | 76309.33 | 1.00x | 1.00x |
| Wan2.2-I2V-A14B-Diffusers-Int4-AutoRound | 100.0 | 439.68 | 36893.0 | 0.86x | 0.48x |
| Wan2.2-T2V-A14B-Diffusers | 100.0 | 736.09 | 76298.33 | 1.00x | 1.00x |
| Wan2.2-T2V-A14B-Diffusers-Int4-AutoRound | 100.0 | 863.49 | 36891.33 | 0.85x | 0.48x |
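
The two ratio columns are consistent with the usual definitions: speedup is the reference latency divided by the quantized latency, and the memory ratio is the quantized memory divided by the reference memory. The snippet below is a minimal sketch under that assumption; the function names are illustrative and do not come from the PR's benchmark script.

```python
def speedup(ref_latency_s, latency_s):
    """Speedup relative to the reference; values below 1.0 mean slower."""
    return ref_latency_s / latency_s


def memory_ratio(ref_memory_mb, memory_mb):
    """Memory use relative to the reference; values below 1.0 mean less memory."""
    return memory_mb / ref_memory_mb


# (reference latency, quantized latency, reference memory, quantized memory)
# copied from the "Generation Statistics" table.
RUNS = {
    "I2V": (377.31, 439.68, 76309.33, 36893.0),
    "T2V": (736.09, 863.49, 76298.33, 36891.33),
}

for name, (ref_lat, lat, ref_mem, mem) in RUNS.items():
    print(
        f"{name}: speedup={speedup(ref_lat, lat):.2f}x "
        f"memory={memory_ratio(ref_mem, mem):.2f}x"
    )
```

Read this way, a 0.86x speedup corresponds to the quantized I2V run taking longer than the reference, not less time.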

This test mainly targets accuracy. For video generation at batch size 1, Int4 W4A16 primarily saves memory (0.48x of the reference, as shown above, which helps fit larger models or longer videos in VRAM). It does not necessarily improve latency, because the workload is compute-bound and the dequantization overhead is significant.
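
To illustrate where the memory saving and the extra work come from, here is a toy sketch of weight-only 4-bit quantization: weights are stored as 4-bit codes (two per byte) with a per-group scale and zero point, and must be expanded back to floating point before the 16-bit matmul. The asymmetric min/max scheme, the low-nibble-first packing, and all function names are assumptions made for illustration. This is not AutoRound's rounding algorithm, its checkpoint layout, or the kernel used in this PR.

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_int4(packed):
    """Inverse of pack_int4: return the list of 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def quantize_group(weights):
    """Asymmetric min/max quantization of one weight group to 4 bits."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15
    if scale == 0:
        scale = 1.0  # constant group: any scale works
    zero_point = round(-lo / scale)
    codes = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Expand 4-bit codes back to floats; this runs before every matmul."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.5, -0.2, 0.0, 0.3, 0.7, 1.0, 0.45, -0.35]
codes, scale, zero_point = quantize_group(weights)
packed = pack_int4(codes)
restored = dequantize_group(unpack_int4(packed), scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(weights)} weights stored in {len(packed)} bytes, max error {max_error:.4f}")
```

The packed codes take half a byte per weight plus the per-group scale and zero point, while the `dequantize_group` step is work that a 16-bit model does not perform.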


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts and test commands. If your code does not require additional test scripts, please state why. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

lvliang-intel force-pushed the feats/ar-w4a16-wan22 branch from 8ccefe8 to cbfbe97 on May 5, 2026 13:16
@hsliuustc0106
Collaborator

Comprehensive benchmarks and well-structured tests. The memory reduction to 0.48x is significant for VRAM-constrained deployments. Two notes:
1) The checklist items at the bottom are unchecked; please confirm that the documentation was updated if required.
2) The latency impact (0.86x speedup, i.e. a slowdown) is expected for compute-bound workloads at batch size 1, but consider documenting guidance on the batch sizes at which the dequantization overhead is amortized.

@david6666666
Collaborator

Merge conflicts need fixing before review. Thx.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@david6666666
Collaborator

LGTM now

david6666666 added the "ready" label (label to trigger buildkite CI) on May 19, 2026
@david6666666
Collaborator

@yenuo26 please check test

Comment thread tests/diffusion/models/wan2_2/test_wan22_quant_config_propagation.py Outdated
Comment thread tests/e2e/offline_inference/test_wan22_i2v_autoround_w4a16.py Outdated
Comment thread tests/e2e/offline_inference/test_wan22_i2v_autoround_w4a16.py Outdated
Comment thread tests/e2e/offline_inference/test_wan22_i2v_autoround_w4a16.py Outdated
Signed-off-by: lvliang-intel <liang1.lv@intel.com>