[NPU][Quant] Add W4A4 MXFP4 online & MXFP4 dual-scale online/offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU#3578

Merged: gcanlin merged 9 commits into vllm-project:main from hxhhhlalala:w4a4 on May 20, 2026.

Contributor

@hxhhhlalala hxhhhlalala commented May 13, 2026

Purpose

This PR adds W4A4 MXFP4 (Microscaling FP4) quantization support for Wan2.2 diffusion
transformers on Ascend NPU, building on the MXFPLinearMethodBase framework introduced in
the MXFP8 PR.

  • Add DiffusionMXFP4Config (single-scale online) and DiffusionMXFP4DualScaleMixedConfig
    (dual-scale online + offline with per-layer BF16 fallback), registered as mxfp4 and
    mxfp4_dualscale in factory.py
  • Add three NPU linear methods: online single-scale (NPUMxfp4OnlineLinearMethod), offline
    dual-scale (NPUMxfp4DualScaleLinearMethod, loads weight / weight_scale /
    weight_dual_scale / mul_scale per layer), and online dual-scale
    (NPUMxfp4DualScaleOnlineLinearMethod); dual-scale online uses
    npu_dynamic_dual_level_mx_quant with the leading 5 blocks kept in BF16 by default
  • mxfp4_dualscale supports BF16 fallback via two controls: num_bf16_fallback_layers
    (coarse leading-block rule, online only, default 5) and ignored_layers (explicit
    per-layer override, both online and offline)
  • Fix Wan22Pipeline._create_transformer to propagate quantization_config from each
    transformer's local config.json for cascade models with differing ignored_layers
  • Add vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py — converts
    msModelSlim DualScale output to diffusers format, overlays MXFP4 tensors onto the BF16
    base checkpoint, and injects quantization_config (including ignored_layers in
    vllm-omni parameter naming) into each transformer/config.json for auto-detection
  • Add docs/user_guide/quantization/mxfp4.md; update overview.md

Note: Wan2.2-TI2V-5B is explicitly excluded from W4A4 quantization. Its smaller
parameter count causes unacceptable accuracy loss at 4-bit precision. Use MXFP8 for TI2V-5B.
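
For readers new to the format, the 4-bit trade-off can be illustrated with a minimal pure-Python sketch of MXFP4 block quantization. It assumes the OCP Microscaling layout (32 elements per block, FP4 E2M1 elements, one power-of-two shared scale) and is not the NPU kernel used by this PR. `quantize_block` and `dequantize_block` are hypothetical helper names, ties are broken toward the smaller magnitude for simplicity, and the dual-scale variant is not modelled.

```python
import math

# Magnitudes representable by one FP4 E2M1 element (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2     # exponent of the largest E2M1 value: 6 = 1.5 * 2**2
BLOCK_SIZE = 32   # elements that share one scale (assumed MX block size)


def quantize_block(values):
    """Return (shared_exponent, elements) for one block of floats."""
    if not values or len(values) > BLOCK_SIZE:
        raise ValueError("a block must hold between 1 and 32 values")
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Power-of-two scale chosen so the largest value lands near the top of the grid.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        magnitude = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]
```

With only eight magnitudes per element, a value such as -0.3 in a block whose largest entry is 12.0 rounds to zero, which is the kind of error the BF16 fallback controls are meant to contain.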

Supported Models

| Model | Method | Mode | BF16 layers | Status |
|---|---|---|---|---|
| Wan2.2-T2V-A14B / I2V-A14B | mxfp4 | Online | None | ✅ Supported |
| Wan2.2-T2V-A14B / I2V-A14B | mxfp4_dualscale | Online | blocks.0–4 (num_bf16_fallback_layers=5) | ✅ Supported |
| Wan2.2-T2V-A14B / I2V-A14B | mxfp4_dualscale | Offline (Recommended) | Auto-detected from checkpoint → ignored_layers in config.json | ✅ Supported |
| Wan2.2-TI2V-5B | | | | ❌ Not supported |
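
The two BF16 fallback controls can be read as a simple per-layer predicate. The sketch below is an illustration only: the layer names (`blocks.N.…`), the prefix-matching rule for `ignored_layers`, and the helper names are assumptions, not the matching logic implemented in this PR.

```python
import re


def leading_block_index(layer_name):
    """Return N for names like 'blocks.3.attn1.to_q', or None outside a block."""
    match = re.match(r"blocks\.(\d+)(?:\.|$)", layer_name)
    return int(match.group(1)) if match else None


def keeps_bf16(layer_name, num_bf16_fallback_layers=5, ignored_layers=()):
    """Decide whether a linear layer stays in BF16 instead of MXFP4."""
    # Explicit per-layer override (assumed: exact name or dotted prefix).
    for ignored in ignored_layers:
        if layer_name == ignored or layer_name.startswith(ignored + "."):
            return True
    # Coarse rule: the first num_bf16_fallback_layers blocks stay in BF16.
    index = leading_block_index(layer_name)
    return index is not None and index < num_bf16_fallback_layers
```

Under these assumptions the default of 5 keeps blocks.0 through blocks.4 in BF16, matching the table row for online mxfp4_dualscale, while `ignored_layers` can exempt any later layer by name.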

Test Plan

  • vLLM version: 0.20.0
  • vLLM Ascend: main
  • vLLM Omni: this branch

Quantization tool: https://gitcode.com/Ascend/msmodelslim

Weight quantization script:

export ASCEND_RT_VISIBLE_DEVICES=0

msmodelslim quant \
    --model_path  /data/Wan2.2-T2V-A14B/ \
    --save_path   /data/Wan2.2-T2V-A14B-W4A4-MXFP4-raw/ \
    --device      npu \
    --model_type  Wan2.2 \
    --config_path configs/wan2_2_w4a4_mxfp4_dualscale.yaml

Checkpoint preprocessing:

python vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py \
  --model-type      Wan2.2-T2V-A14B \
  --original-model  /path/to/Wan2.2-T2V-A14B-Diffusers \
  --quant-path      /path/to/quant-output \
  --output-path     /path/to/merged-Wan2.2-T2V-A14B-MXFP4-DualScale
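
The config-injection step of the merge script can be pictured with the sketch below. It is not the script's code: `inject_quantization_config` is a hypothetical helper, and the only keys taken from this PR are `quantization_config`, `quant_method`, and `ignored_layers`; the real script may write additional fields.

```python
import json
from pathlib import Path


def inject_quantization_config(config_path, quant_method, ignored_layers):
    """Add a quantization_config entry to a transformer config.json in place."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["quantization_config"] = {
        "quant_method": quant_method,
        # Sorted so repeated runs produce an identical file.
        "ignored_layers": sorted(ignored_layers),
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Writing the entry into each transformer's own config.json is what lets the server detect the offline checkpoint without a `--quantization` flag.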

Server (offline / pre-quantized):

vllm serve /data/Wan2.2-T2V-A14B-MXFP4-DualScale/ --omni --port 8091 \
  --log-stats

Server (online / BF16 checkpoint):

vllm serve /data/Wan2.2-T2V-A14B/ --omni --port 8091 \
  --log-stats \
  --quantization mxfp4

vllm serve /data/Wan2.2-T2V-A14B/ --omni --port 8091 \
  --log-stats \
  --quantization mxfp4_dualscale

Client:

curl -X POST http://localhost:8091/v1/videos/generations \
-H "Content-Type: application/json" \
-d '{
  "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
  "num_inference_steps": 40,
  "guidance_scale": 5.0,
  "n": 1,
  "size": "720x1280",
  "num_frames": 41,
  "seed": 42
}'
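
The same request can be issued from Python using only the standard library. This is a sketch mirroring the curl call above; `build_video_request` and `send` are hypothetical helper names, and the response is returned as raw bytes because its schema is not described in this PR.

```python
import json
import urllib.request


def build_video_request(prompt, base_url="http://localhost:8091", **overrides):
    """Build a POST request with the same fields as the curl example."""
    payload = {
        "prompt": prompt,
        "num_inference_steps": 40,
        "guidance_scale": 5.0,
        "n": 1,
        "size": "720x1280",
        "num_frames": 41,
        "seed": 42,
    }
    payload.update(overrides)  # e.g. seed=7 to vary the sample
    return urllib.request.Request(
        f"{base_url}/v1/videos/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=3600):
    """Send the request to a running server and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `send(build_video_request("..."))` requires one of the servers above to be running; the generous timeout is a guess based on the multi-minute generation times reported below.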

Test Result

  • Wan2.2-T2V-A14B bf16 baseline
  • Wan2.2-T2V-A14B mxfp4 online
  • Wan2.2-T2V-A14B mxfp4_dualscale online
  • Wan2.2-T2V-A14B mxfp4_dualscale offline
  • Wan2.2-I2V-A14B bf16 baseline
  • Wan2.2-I2V-A14B mxfp4 online
  • Wan2.2-I2V-A14B mxfp4_dualscale online
  • Wan2.2-I2V-A14B mxfp4_dualscale offline

Quantization Quality Benchmark for NPU

  • Wan2.2-T2V-A14B
export ASCEND_RT_VISIBLE_DEVICES=0

python text_to_video.py \
--model /home/weights/Wan2.2-T2V-A14B-Diffusers \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 720 \
--width 1280 \
--num-frames 41 \
--num-inference-steps 40 \
--tensor-parallel-size 1 \
--quantization mxfp4 \
--output t2v_output_mxfp4.mp4

| Config | Avg Time (s) | Speedup | Memory (GB) | Mem Reduction |
|---|---|---|---|---|
| BF16, SP=1 | 489 | | 73.17 | |
| mxfp8 offline, SP=1 | 416.1 | 14.9% | 47.88 | 34.6% |
| mxfp8 online, SP=1 | 416.2 | 14.9% | 47.75 | 34.7% |
| mxfp4 online, SP=1 | 372.4 | 23.8% | 34.59 | 52.7% |
| mxfp4_dualscale online, SP=1 | 390 | 20.2% | 39.62 | 45.9% |
| mxfp4_dualscale offline, SP=1 | 389.7 | 20.3% | 39.6 | 45.9% |
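
The Speedup and Mem Reduction columns are consistent with a relative reduction against the BF16 row, (baseline - value) / baseline. A small check of that reading, using figures copied from the T2V table (`percent_reduction` is a hypothetical helper name):

```python
def percent_reduction(baseline, value):
    """Relative reduction versus the baseline, in percent, rounded to 0.1."""
    return round(100.0 * (baseline - value) / baseline, 1)


# T2V, SP=1: BF16 baseline is 489 s and 73.17 GB.
print(percent_reduction(489.0, 416.1))   # 14.9  (mxfp8 offline, time)
print(percent_reduction(489.0, 372.4))   # 23.8  (mxfp4 online, time)
print(percent_reduction(73.17, 34.59))   # 52.7  (mxfp4 online, memory)
```

Read this way, "Speedup" is the percentage of wall-clock time saved rather than a throughput ratio.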

Sample outputs (attachments not preserved): BF16, mxfp8 offline, mxfp8 online, mxfp4 online, mxfp4_dualscale online, mxfp4_dualscale offline.

  • Wan2.2-I2V-A14B
export ASCEND_RT_VISIBLE_DEVICES=0,1

python image_to_video.py \
--model /home/weights/Wan2.2-I2V-A14B-Diffusers \
--image cherry_blossom.jpg \
--prompt "Cherry blossoms swaying gently in the breeze, petals falling, smooth motion" \
--height 720 \
--width 1280 \
--num-frames 41 \
--num-inference-steps 40 \
--tensor-parallel-size 1 \
--ulysses-degree 2 \
--quantization mxfp4 \
--output i2v_output_mxfp4.mp4

| Config | Avg Time (s) | Speedup | Memory (GB) | Mem Reduction |
|---|---|---|---|---|
| BF16, SP=2 | 277.9 | | 73.99 | |
| mxfp8 offline, SP=2 | 239.6 | 13.8% | 48.58 | 34.3% |
| mxfp8 online, SP=2 | 239.9 | 13.7% | 48.71 | 34.2% |
| mxfp4 online, SP=2 | 218 | 21.5% | 35.42 | 52.1% |
| mxfp4_dualscale online, SP=2 | 226.8 | 18.4% | 40.44 | 45.3% |
| mxfp4_dualscale offline, SP=2 | 226.6 | 18.5% | 40.42 | 45.4% |

Sample outputs (attachments not preserved): BF16, mxfp8 offline, mxfp8 online, mxfp4 online, mxfp4_dualscale online, mxfp4_dualscale offline.

Memory Profiling

  • Wan2.2-I2V-A14B

    matmul shape:"19800,5120;15360,5120;15360"

    | Config | Quant (us) | Matmul (us) | Total (us) | Reduction |
    |---|---|---|---|---|
    | BF16, SP=2 | | 7251.6 | 7251.6 | |
    | mxfp8 offline, SP=2 | 251.2 | 3632 | 3883.2 | 46.5% |
    | mxfp8 online, SP=2 | 251.2 | 3632 | 3883.2 | 46.5% |
    | mxfp4_dualscale offline, SP=2 | 289 | 2033.9 | 2322.9 | 68% |
    | mxfp4 online, SP=2 | 225.6 | 1828.9 | 2054.5 | 71.7% |


Signed-off-by: hyh_hh <huyinghong1@huawei.com>
@hxhhhlalala hxhhhlalala marked this pull request as ready for review May 14, 2026 06:45
@hxhhhlalala hxhhhlalala changed the title from "[WIP][NPU][Quant] Add W4A4 MXFP4 online/ dual-scale offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU" to "[NPU][Quant] Add W4A4 MXFP4 online/ dual-scale offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU" on May 14, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c716351ce


Comment thread vllm_omni/quantization/factory.py Outdated
@hxhhhlalala
Contributor Author

@david6666666 @gcanlin PTAL, thx

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Collaborator

Architecture-wise, this mostly fits vLLM-Omni's existing quantization layering: method registration stays in vllm_omni/quantization/factory.py, Wan2.2 reads checkpoint-local quantization_config, and the NPU implementations reuse the existing QuantizationConfig / LinearBase / MXFPLinearMethodBase path instead of bypassing the model loader.

A few things should be fixed before merge:

  1. mxfp8_mxfp4_dualscale is exposed in the examples as a normal --quantization value, but the method is checkpoint-topology-dependent (num_mxfp8_blocks). With a BF16 checkpoint, selecting the string path builds the config with the default num_mxfp8_blocks=0, so the advertised mixed MXFP8+MXFP4 mode silently becomes all-MXFP4 dual-scale online. Please keep this as checkpoint auto-detection only, or add an explicit user-facing config/CLI path for num_mxfp8_blocks.

  2. The PR body and docs say Wan2.2-TI2V-5B is not supported for W4A4 MXFP4, but merge_mxfp4_dualscale_checkpoint.py still accepts Wan2.2-TI2V-5B and injects mxfp8_mxfp4_dualscale. Please remove it from SUPPORTED_MODEL_TYPES or add an explicit guard/error so users do not generate an unsupported checkpoint layout.

  3. This changes weight loading, checkpoint conversion, offline auto-detection, and NPU quantized matmul dispatch, but the PR does not add automated tests. Please add at least a lightweight regression test for the config/loader path and merge-script key remapping. The manual latency/VRAM evidence is useful, but it does not protect the architecture contracts this PR depends on.
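
Item 2 above asks for an explicit guard in the merge script. One way such a guard could look is sketched below; the tuple contents, the `UNSUPPORTED_MODEL_TYPES` mapping, and `validate_model_type` are assumptions for illustration, and only the name `SUPPORTED_MODEL_TYPES` and the model identifiers come from this thread.

```python
# Assumed contents; the real script defines its own SUPPORTED_MODEL_TYPES.
SUPPORTED_MODEL_TYPES = ("Wan2.2-T2V-A14B", "Wan2.2-I2V-A14B")

# Hypothetical mapping of known-unsupported models to a user-facing reason.
UNSUPPORTED_MODEL_TYPES = {
    "Wan2.2-TI2V-5B": "W4A4 MXFP4 is not supported for this model; use MXFP8.",
}


def validate_model_type(model_type):
    """Return model_type if it may be converted, otherwise raise ValueError."""
    if model_type in UNSUPPORTED_MODEL_TYPES:
        raise ValueError(f"{model_type}: {UNSUPPORTED_MODEL_TYPES[model_type]}")
    if model_type not in SUPPORTED_MODEL_TYPES:
        allowed = ", ".join(SUPPORTED_MODEL_TYPES)
        raise ValueError(f"unknown model type {model_type!r}; expected one of: {allowed}")
    return model_type
```

A check of this shape is also easy to cover with the lightweight regression test requested in item 3, since it needs no NPU or checkpoint.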

Collaborator

@david6666666 david6666666 left a comment

Please address these before merge.

Comment thread vllm_omni/quantization/mixed_mxfp_config.py Outdated
Comment thread vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py Outdated
Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated
@hxhhhlalala hxhhhlalala requested a review from yenuo26 as a code owner May 15, 2026 07:51
@hxhhhlalala hxhhhlalala force-pushed the w4a4 branch 3 times, most recently from f308f63 to 9d50330 Compare May 15, 2026 08:52
Signed-off-by: hyh_hh <huyinghong1@huawei.com>
@david6666666
Collaborator

I think using MXFP4 + BF16 directly is better; avoid MXFP8 for sensitive layers. NVFP4 + BF16 is also handled this way on GPUs.

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
@hxhhhlalala hxhhhlalala changed the title from "[NPU][Quant] Add W4A4 MXFP4 online/ dual-scale offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU" to "[NPU][Quant] Add W4A4 MXFP4 online & MXFP4 dual-scale online/offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU" on May 18, 2026
@david6666666
Collaborator

please resolve conflicts

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Collaborator

@david6666666 david6666666 left a comment

Please check the TP loading path.

Comment thread vllm_omni/quantization/mxfp4_config.py Outdated
Signed-off-by: hyh_hh <huyinghong1@huawei.com>
@david6666666 david6666666 added the ready label to trigger buildkite CI label May 19, 2026
@david6666666
Collaborator

@lishunyang12 @gcanlin please check thx

Collaborator

@david6666666 david6666666 left a comment

One remaining TP loading issue in the single-scale serialized path.

Comment thread vllm_omni/quantization/mxfp4_config.py Outdated
Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Comment on lines +461 to +490
# When od_config.quantization_config is None (no CLI --quantization flag), pre-build
# the quant_config from the transformer's own config.json and propagate it back to
# od_config. This has two effects:
#   1. The first transformer's auto-detected config is reused by the second transformer
#      in cascade models (e.g. Wan2.2-T2V-A14B); if the second transformer's config.json
#      has different ignored_layers, create_transformer_from_config rebuilds locally.
#   2. od_config.quantization_config becomes non-None so _check_unloaded_weights can
#      filter expected quantization suffixes instead of raising on every unloaded param.
if quant_config is None and "quantization_config" in config:
    from vllm_omni.quantization.factory import build_quant_config

    disk_qc = config["quantization_config"]
    if isinstance(disk_qc, dict) and "quant_method" in disk_qc:
        qc_method = disk_qc["quant_method"]
        qc_kwargs = {k: v for k, v in disk_qc.items() if k != "quant_method"}
        quant_config = build_quant_config(qc_method, **qc_kwargs)
        self.od_config.quantization_config = quant_config
        logger.info(
            "Auto-detected quantization from transformer config.json and propagated to od_config: "
            "method=%s kwargs=%s",
            qc_method,
            qc_kwargs,
        )
    elif isinstance(disk_qc, str):
        quant_config = build_quant_config(disk_qc)
        self.od_config.quantization_config = quant_config
        logger.info(
            "Auto-detected quantization from transformer config.json and propagated to od_config: method=%s",
            disk_qc,
        )
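
The dict-versus-string handling in the excerpt above can be isolated as a pure function, which makes it straightforward to unit-test without a model. This is a sketch under that assumption: `split_disk_quant_config` is a hypothetical name, and the call to `build_quant_config` is deliberately left out.

```python
def split_disk_quant_config(config):
    """Return (method, kwargs) from a transformer config dict, or None.

    Accepts both shapes handled in the excerpt: a dict carrying
    "quant_method" plus extra keys, or a bare method string.
    """
    disk_qc = config.get("quantization_config")
    if isinstance(disk_qc, dict) and "quant_method" in disk_qc:
        kwargs = {k: v for k, v in disk_qc.items() if k != "quant_method"}
        return disk_qc["quant_method"], kwargs
    if isinstance(disk_qc, str):
        return disk_qc, {}
    # Missing entry, or a dict without "quant_method": nothing to auto-detect.
    return None
```

The returned pair would then be passed on as `build_quant_config(method, **kwargs)`, as in the reviewed code.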
Collaborator

After talking to @hxhhhlalala offline, we may not need this code anymore because we now fall back to fp16 instead of mxfp8. Waiting for @hxhhhlalala to confirm.

Contributor Author

done

7. W4A4 carries higher quantization noise than W8A8 (16 vs 256 levels). The
DualScale offline method mitigates this with calibrated `mul_scale` smooth
quantization. Use `ignored_layers` and `num_bf16_fallback_layers` to trade
off compression vs. accuracy for precision-sensitive layers.
Collaborator

It would be better to add a section explaining how to adapt mxfp4 to a model, which would help other developers integrate mxfp4 into other models quickly.

Contributor Author

done

)


def _disk_marks_serialized(qc_kwargs: dict, quant_config: object) -> bool:
Collaborator

BTW, I think this method can be extracted to quantization/factory.py or quantization/utils.py. It should be common.

Contributor Author

moved

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Collaborator

@gcanlin gcanlin left a comment

LGTM, thanks!

@gcanlin gcanlin enabled auto-merge (squash) May 20, 2026 06:17
@gcanlin gcanlin merged commit 9dd36e3 into vllm-project:main May 20, 2026
7 of 9 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request May 20, 2026
…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578)

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Co-authored-by: hyh_hh <huyinghong1@huawei.com>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request May 21, 2026
…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578)

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Co-authored-by: hyh_hh <huyinghong1@huawei.com>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request May 21, 2026
…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578)

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Co-authored-by: hyh_hh <huyinghong1@huawei.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
zengchuang-hw pushed a commit to zengchuang-hw/vllm-omni that referenced this pull request Jun 1, 2026
…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578)

Signed-off-by: hyh_hh <huyinghong1@huawei.com>
Co-authored-by: hyh_hh <huyinghong1@huawei.com>