[NPU][Quant] Add W4A4 MXFP4 online & MXFP4 dual-scale online/offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU by hxhhhlalala · Pull Request #3578 · vllm-project/vllm-omni

hxhhhlalala · 2026-05-13T11:16:51Z

Purpose

This PR adds W4A4 MXFP4 (Microscaling FP4) quantization support for Wan2.2 diffusion
transformers on Ascend NPU, building on the MXFPLinearMethodBase framework introduced in
the MXFP8 PR.
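
For readers new to the format, here is a minimal sketch of MXFP4 block quantization as I understand it from the OCP Microscaling spec: a block of elements (normally 32) shares one power-of-two scale, and each element is stored as FP4 (E2M1). It is plain Python for illustration only and is not the NPU kernel used in this PR. `quantize_block` is a hypothetical name, and ties are broken toward the smaller magnitude, which is a simplification of hardware rounding.

```python
import math

# Magnitudes representable by one FP4 (E2M1) element; the sign is a separate bit.
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
# 8 magnitudes x 2 signs = 16 codes (including +0 and -0), versus 256 for 8 bits.
NUM_FP4_CODES = 2 * len(FP4_E2M1_MAGNITUDES)
# Exponent of the largest E2M1 value: 6 = 1.5 * 2**2.
E2M1_EMAX = 2


def quantize_block(block):
    """Quantize one block to MXFP4 and return (shared_exponent, dequantized values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)

    # Shared power-of-two scale chosen so the largest element lands near FP4's top range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp

    dequantized = []
    for x in block:
        # Saturate at the largest FP4 magnitude, then snap to the nearest level.
        magnitude = min(abs(x) / scale, FP4_E2M1_MAGNITUDES[-1])
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda level: abs(level - magnitude))
        dequantized.append(math.copysign(nearest, x) * scale)
    return shared_exp, dequantized
```

Calling `quantize_block([12.0, 1.0, -4.5])` picks a shared exponent of 1 (scale 2), so 12.0 and 1.0 are represented exactly while -4.5 snaps to -4.0.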

- Add DiffusionMXFP4Config (single-scale online) and DiffusionMXFP4DualScaleMixedConfig
  (dual-scale online + offline with per-layer BF16 fallback), registered as mxfp4 and
  mxfp4_dualscale in factory.py
- Add three NPU linear methods: online single-scale (NPUMxfp4OnlineLinearMethod), offline
  dual-scale (NPUMxfp4DualScaleLinearMethod, loads weight / weight_scale /
  weight_dual_scale / mul_scale per layer), and online dual-scale
  (NPUMxfp4DualScaleOnlineLinearMethod); dual-scale online uses
  npu_dynamic_dual_level_mx_quant with the leading 5 blocks kept in BF16 by default
- mxfp4_dualscale supports BF16 fallback via two controls: num_bf16_fallback_layers
  (coarse leading-block rule, online only, default 5) and ignored_layers (explicit
  per-layer override, both online and offline)
- Fix Wan22Pipeline._create_transformer to propagate quantization_config from each
  transformer's local config.json for cascade models with differing ignored_layers
- Add vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py, which converts
  msModelSlim DualScale output to diffusers format, overlays MXFP4 tensors onto the BF16
  base checkpoint, and injects quantization_config (including ignored_layers in
  vllm-omni parameter naming) into each transformer/config.json for auto-detection
- Add docs/user_guide/quantization/mxfp4.md; update overview.md
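
The two BF16 fallback controls described above can be pictured with a rough sketch like the one below. `keeps_bf16` is a hypothetical helper, not code from this PR; the layer names and the prefix-matching rule for `ignored_layers` are assumptions made for illustration.

```python
import re

# Matches a transformer block index in names such as "blocks.3.attn1.to_q".
_BLOCK_INDEX = re.compile(r"(?:^|\.)blocks\.(\d+)\.")


def keeps_bf16(layer_name, num_bf16_fallback_layers=5, ignored_layers=()):
    """Return True if the layer should stay in BF16 instead of being quantized."""
    # Explicit per-layer override (assumed here to match exact names or prefixes).
    for ignored in ignored_layers:
        if layer_name == ignored or layer_name.startswith(ignored + "."):
            return True
    # Coarse leading-block rule: keep the first N blocks in BF16.
    # A caller would pass 0 for offline checkpoints, where this rule does not apply.
    match = _BLOCK_INDEX.search(layer_name)
    return match is not None and int(match.group(1)) < num_bf16_fallback_layers
```

With the default of 5, `blocks.0` through `blocks.4` stay in BF16 and `blocks.5` onward are quantized unless they are listed in `ignored_layers`.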

Note: Wan2.2-TI2V-5B is explicitly excluded from W4A4 quantization. Its smaller
parameter count causes unacceptable accuracy loss at 4-bit precision. Use MXFP8 for TI2V-5B.

Supported Models

| Model | Method | Mode | BF16 layers | Status |
| --- | --- | --- | --- | --- |
| Wan2.2-T2V-A14B / I2V-A14B | `mxfp4` | Online | None | ✅ Supported |
| Wan2.2-T2V-A14B / I2V-A14B | `mxfp4_dualscale` | Online | `blocks.0–4` (`num_bf16_fallback_layers=5`) | ✅ Supported |
| Wan2.2-T2V-A14B / I2V-A14B | `mxfp4_dualscale` | Offline (Recommended) | Auto-detected from checkpoint → `ignored_layers` in `config.json` | ✅ Supported |
| Wan2.2-TI2V-5B | - | - | - | ❌ Not supported |

Test Plan

vLLM version: 0.20.0
vLLM Ascend: main
vLLM Omni: this branch

Quantization tool: https://gitcode.com/Ascend/msmodelslim

Weight quantization script:

export ASCEND_RT_VISIBLE_DEVICES=0

msmodelslim quant \
    --model_path  /data/Wan2.2-T2V-A14B/ \
    --save_path   /data/Wan2.2-T2V-A14B-W4A4-MXFP4-raw/ \
    --device      npu \
    --model_type  Wan2.2 \
    --config_path configs/wan2_2_w4a4_mxfp4_dualscale.yaml

Checkpoint preprocessing:

python vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py \
  --model-type      Wan2.2-T2V-A14B \
  --original-model  /path/to/Wan2.2-T2V-A14B-Diffusers \
  --quant-path      /path/to/quant-output \
  --output-path     /path/to/merged-Wan2.2-T2V-A14B-MXFP4-DualScale
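
As a rough illustration of the config-injection step the merge script performs, the sketch below writes a `quantization_config` entry into a transformer `config.json`. It is not the script's actual code: `inject_quantization_config` is a hypothetical name, the `quant_method` and `ignored_layers` keys follow the fields mentioned in this PR, and the `num_layers` field and layer names in the demo are placeholders.

```python
import json
import tempfile
from pathlib import Path


def inject_quantization_config(config_path, quant_method, ignored_layers):
    """Add a quantization_config entry to a transformer config.json in place."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["quantization_config"] = {
        "quant_method": quant_method,
        # Layers that remain in BF16, in vllm-omni parameter naming.
        "ignored_layers": sorted(ignored_layers),
    }
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


def demo():
    """Run the injection against a throwaway config.json and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.json"
        path.write_text(json.dumps({"num_layers": 40}), encoding="utf-8")
        inject_quantization_config(
            path, "mxfp4_dualscale", ["blocks.1.ffn", "blocks.0.attn1"]
        )
        return json.loads(path.read_text(encoding="utf-8"))
```

Existing keys in the file are preserved, so the loader can auto-detect the method without a `--quantization` flag.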

Server (offline / pre-quantized):

vllm serve /data/Wan2.2-T2V-A14B-MXFP4-DualScale/ --omni --port 8091 \
  --log-stats

Server (online / BF16 checkpoint):

vllm serve /data/Wan2.2-T2V-A14B/ --omni --port 8091 \
  --log-stats \
  --quantization mxfp4

vllm serve /data/Wan2.2-T2V-A14B/ --omni --port 8091 \
  --log-stats \
  --quantization mxfp4_dualscale

Client:

curl -X POST http://localhost:8091/v1/videos/generations \
-H "Content-Type: application/json" \
-d '{
  "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
  "num_inference_steps": 40,
  "guidance_scale": 5.0,
  "n": 1,
  "size": "720x1280",
  "num_frames": 41,
  "seed": 42
}'
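
For scripted testing, the same request can be built in Python with only the standard library. This is a sketch that mirrors the curl call above; `build_video_request` and `send` are hypothetical helper names, and treating the response body as JSON is an assumption.

```python
import json
import urllib.request


def build_video_request(base_url="http://localhost:8091"):
    """Build the POST request equivalent to the curl command."""
    payload = {
        "prompt": (
            "Two anthropomorphic cats in comfy boxing gear and bright gloves "
            "fight intensely on a spotlighted stage."
        ),
        "num_inference_steps": 40,
        "guidance_scale": 5.0,
        "n": 1,
        "size": "720x1280",
        "num_frames": 41,
        "seed": 42,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/videos/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=3600):
    """Send the request; assumes the server replies with a JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_video_request())` requires one of the servers above to be running on port 8091; the long default timeout reflects that generation takes minutes.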

Test Result

Wan2.2-T2V-A14B bf16 baseline
Wan2.2-T2V-A14B mxfp4 online
Wan2.2-T2V-A14B mxfp4_dualscale online
Wan2.2-T2V-A14B mxfp4_dualscale offline
Wan2.2-I2V-A14B bf16 baseline
Wan2.2-I2V-A14B mxfp4 online
Wan2.2-I2V-A14B mxfp4_dualscale online
Wan2.2-I2V-A14B mxfp4_dualscale offline

Quantization Quality Benchmark for NPU

Wan2.2-T2V-A14B

export ASCEND_RT_VISIBLE_DEVICES=0

python text_to_video.py \
--model /home/weights/Wan2.2-T2V-A14B-Diffusers \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 720 \
--width 1280 \
--num-frames 41 \
--num-inference-steps 40 \
--tensor-parallel-size 1 \
--quantization mxfp4 \
--output t2v_output_mxfp4.mp4

| Config | Avg Time (s) | Speedup | Memory (GB) | Mem Reduction |
| --- | --- | --- | --- | --- |
| BF16, SP=1 | 489 | - | 73.17 | - |
| mxfp8 offline, SP=1 | 416.1 | 14.9% | 47.88 | 34.6% |
| mxfp8 online, SP=1 | 416.2 | 14.9% | 47.75 | 34.7% |
| mxfp4 online, SP=1 | 372.4 | 23.8% | 34.59 | 52.7% |
| mxfp4_dualscale online, SP=1 | 390 | 20.2% | 39.62 | 45.9% |
| mxfp4_dualscale offline, SP=1 | 389.7 | 20.3% | 39.6 | 45.9% |
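
The Speedup and Mem Reduction columns are consistent with a relative reduction against the BF16 row, i.e. (baseline - value) / baseline. A small sketch to reproduce them from the raw numbers (the helper names are mine, not part of the PR):

```python
def reduction_pct(baseline, value):
    """Relative reduction versus the baseline, in percent, rounded to one decimal."""
    return round((baseline - value) / baseline * 100, 1)


def speedup_factor(baseline, value):
    """The same comparison expressed as a multiplicative factor."""
    return round(baseline / value, 2)


BF16_TIME_S, BF16_MEM_GB = 489, 73.17

mxfp4_time_reduction = reduction_pct(BF16_TIME_S, 372.4)   # time column, mxfp4 online
mxfp4_mem_reduction = reduction_pct(BF16_MEM_GB, 34.59)    # memory column, mxfp4 online
mxfp4_factor = speedup_factor(BF16_TIME_S, 372.4)
```

For the mxfp4 online row this gives 23.8% less time and 52.7% less memory, which corresponds to roughly a 1.31x end-to-end speedup.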

BF16

mxfp8 offline

mxfp8 online

mxfp4 online

mxfp4_dualscale online

mxfp4_dualscale offline

Wan2.2-I2V-A14B

export ASCEND_RT_VISIBLE_DEVICES=0,1

python image_to_video.py \
--model /home/weights/Wan2.2-I2V-A14B-Diffusers \
--image cherry_blossom.jpg \
--prompt "Cherry blossoms swaying gently in the breeze, petals falling, smooth motion" \
--height 720 \
--width 1280 \
--num-frames 41 \
--num-inference-steps 40 \
--tensor-parallel-size 1 \
--ulysses-degree 2 \
--quantization mxfp4 \
--output i2v_output_mxfp4.mp4

| Config | Avg Time (s) | Speedup | Memory (GB) | Mem Reduction |
| --- | --- | --- | --- | --- |
| BF16, SP=2 | 277.9 | - | 73.99 | - |
| mxfp8 offline, SP=2 | 239.6 | 13.8% | 48.58 | 34.3% |
| mxfp8 online, SP=2 | 239.9 | 13.7% | 48.71 | 34.2% |
| mxfp4 online, SP=2 | 218 | 21.5% | 35.42 | 52.1% |
| mxfp4_dualscale online, SP=2 | 226.8 | 18.4% | 40.44 | 45.3% |
| mxfp4_dualscale offline, SP=2 | 226.6 | 18.5% | 40.42 | 45.4% |

BF16

mxfp8 offline

mxfp8 online

mxfp4 online

mxfp4_dualscale online

mxfp4_dualscale offline

Memory Profiling

Wan2.2-I2V-A14B

matmul shape:"19800,5120;15360,5120;15360"

| Config | Quant (µs) | Matmul (µs) | Total (µs) | Reduction |
| --- | --- | --- | --- | --- |
| BF16, SP=2 | - | 7251.6 | 7251.6 | - |
| mxfp8 offline, SP=2 | 251.2 | 3632 | 3883.2 | 46.5% |
| mxfp8 online, SP=2 | 251.2 | 3632 | 3883.2 | 46.5% |
| mxfp4_dualscale offline, SP=2 | 289 | 2033.9 | 2322.9 | 68% |
| mxfp4 online, SP=2 | 225.6 | 1828.9 | 2054.5 | 71.7% |
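
In this table, Total is Quant + Matmul and Reduction is measured against the BF16 matmul time. The sketch below recomputes both and also shows how much of each quantized total is spent in the quantization step itself; the function name and the quant-share figure are my own additions, derived only from the numbers above.

```python
BF16_MATMUL_US = 7251.6


def summarize(quant_us, matmul_us, baseline_us=BF16_MATMUL_US):
    """Return (total_us, reduction_pct, quant_share_pct) for one table row."""
    total = quant_us + matmul_us
    reduction = (1 - total / baseline_us) * 100
    quant_share = quant_us / total * 100
    return round(total, 1), round(reduction, 1), round(quant_share, 1)


rows = {
    "mxfp8 offline": summarize(251.2, 3632),
    "mxfp4_dualscale offline": summarize(289, 2033.9),
    "mxfp4 online": summarize(225.6, 1828.9),
}
```

The dynamic activation quantization adds a fixed cost of a few hundred microseconds, so its share of the total grows as the matmul itself gets faster at lower precision.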

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c716351ce

hxhhhlalala · 2026-05-14T07:27:11Z

@david6666666 @gcanlin PTAL, thx

david6666666 · 2026-05-15T06:24:43Z

Architecture-wise, this mostly fits vLLM-Omni's existing quantization layering: method registration stays in vllm_omni/quantization/factory.py, Wan2.2 reads checkpoint-local quantization_config, and the NPU implementations reuse the existing QuantizationConfig / LinearBase / MXFPLinearMethodBase path instead of bypassing the model loader.

A few things should be fixed before merge:

1. mxfp8_mxfp4_dualscale is exposed in the examples as a normal --quantization value, but the method is checkpoint-topology-dependent (num_mxfp8_blocks). With a BF16 checkpoint, selecting the string path builds the config with the default num_mxfp8_blocks=0, so the advertised mixed MXFP8+MXFP4 mode silently becomes all-MXFP4 dual-scale online. Please keep this as checkpoint auto-detection only, or add an explicit user-facing config/CLI path for num_mxfp8_blocks.
2. The PR body and docs say Wan2.2-TI2V-5B is not supported for W4A4 MXFP4, but merge_mxfp4_dualscale_checkpoint.py still accepts Wan2.2-TI2V-5B and injects mxfp8_mxfp4_dualscale. Please remove it from SUPPORTED_MODEL_TYPES or add an explicit guard/error so users do not generate an unsupported checkpoint layout.
3. This changes weight loading, checkpoint conversion, offline auto-detection, and NPU quantized matmul dispatch, but the PR does not add automated tests. Please add at least a lightweight regression test for the config/loader path and merge-script key remapping. The manual latency/VRAM evidence is useful, but it does not protect the architecture contracts this PR depends on.
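
To make the second point concrete, something along these lines would do as a guard. This is only a sketch: `SUPPORTED_MODEL_TYPES` is the name used in the merge script, but its contents here, the `UNSUPPORTED_W4A4_MODEL_TYPES` tuple, and `check_model_type` are assumptions for illustration.

```python
# Assumed contents, based on the Supported Models table in the PR description.
SUPPORTED_MODEL_TYPES = ("Wan2.2-T2V-A14B", "Wan2.2-I2V-A14B")
# Hypothetical list of models that must be rejected with a targeted message.
UNSUPPORTED_W4A4_MODEL_TYPES = ("Wan2.2-TI2V-5B",)


def check_model_type(model_type):
    """Validate --model-type before any checkpoint is written."""
    if model_type in UNSUPPORTED_W4A4_MODEL_TYPES:
        raise ValueError(
            f"{model_type} is not supported for W4A4 MXFP4; use MXFP8 instead."
        )
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(
            f"Unknown model type {model_type!r}; expected one of {SUPPORTED_MODEL_TYPES}."
        )
    return model_type
```

A lightweight regression test can then assert that the supported types pass and that `Wan2.2-TI2V-5B` raises, without needing an NPU.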

david6666666

Please address these before merge.

david6666666 · 2026-05-15T09:30:02Z

I think using MXFP4 + BF16 directly is better; avoid MXFP8 for sensitive layers. NVFP4 + BF16 is also handled this way on GPUs.

david6666666 · 2026-05-19T02:22:38Z

please resolve conflicts

david6666666

Please check the TP loading path.

david6666666 · 2026-05-19T07:03:20Z

@lishunyang12 @gcanlin please check thx

david6666666

One remaining TP loading issue in the single-scale serialized path.

gcanlin · 2026-05-20T02:27:14Z

+        # When od_config.quantization_config is None (no CLI --quantization flag), pre-build
+        # the quant_config from the transformer's own config.json and propagate it back to
+        # od_config.  This has two effects:
+        #   1. The first transformer's auto-detected config is reused by the second transformer
+        #      in cascade models (e.g. Wan2.2-T2V-A14B); if the second transformer's config.json
+        #      has different ignored_layers, create_transformer_from_config rebuilds locally.
+        #   2. od_config.quantization_config becomes non-None so _check_unloaded_weights can
+        #      filter expected quantization suffixes instead of raising on every unloaded param.
+        if quant_config is None and "quantization_config" in config:
+            from vllm_omni.quantization.factory import build_quant_config
+
+            disk_qc = config["quantization_config"]
+            if isinstance(disk_qc, dict) and "quant_method" in disk_qc:
+                qc_method = disk_qc["quant_method"]
+                qc_kwargs = {k: v for k, v in disk_qc.items() if k != "quant_method"}
+                quant_config = build_quant_config(qc_method, **qc_kwargs)
+                self.od_config.quantization_config = quant_config
+                logger.info(
+                    "Auto-detected quantization from transformer config.json and propagated to od_config: "
+                    "method=%s kwargs=%s",
+                    qc_method,
+                    qc_kwargs,
+                )
+            elif isinstance(disk_qc, str):
+                quant_config = build_quant_config(disk_qc)
+                self.od_config.quantization_config = quant_config
+                logger.info(
+                    "Auto-detected quantization from transformer config.json and propagated to od_config: method=%s",
+                    disk_qc,
+                )


After talking to @hxhhhlalala offline, we may no longer need this code, because we now fall back to fp16 instead of mxfp8. Waiting for @hxhhhlalala to confirm.
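
For reference, the parsing part of the hunk above can be exercised on its own with a sketch like this. `split_quant_config` is a hypothetical stand-in that mirrors the dict and string branches without importing `vllm_omni`; its return value is what would be passed to `build_quant_config`.

```python
def split_quant_config(config):
    """Return (method, kwargs) from a transformer config dict, or None if absent."""
    disk_qc = config.get("quantization_config")
    if isinstance(disk_qc, dict) and "quant_method" in disk_qc:
        # Dict form: everything except quant_method becomes a keyword argument.
        method = disk_qc["quant_method"]
        kwargs = {k: v for k, v in disk_qc.items() if k != "quant_method"}
        return method, kwargs
    if isinstance(disk_qc, str):
        # String form: just the method name, no extra arguments.
        return disk_qc, {}
    return None


example = {
    "quantization_config": {
        "quant_method": "mxfp4_dualscale",
        "ignored_layers": ["blocks.0.attn1"],  # placeholder layer name
    }
}
method, kwargs = split_quant_config(example)
```

Keeping this step free of side effects makes it easy to unit-test separately from the propagation to `od_config`.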

gcanlin · 2026-05-20T02:28:28Z

+7. W4A4 carries higher quantization noise than W8A8 (16 vs 256 levels). The
+   DualScale offline method mitigates this with calibrated `mul_scale` smooth
+   quantization. Use `ignored_layers` and `num_bf16_fallback_layers` to trade
+   off compression vs. accuracy for precision-sensitive layers.


It would be better to add a section explaining how to adapt mxfp4 to other models, which would help other developers integrate it quickly.

gcanlin · 2026-05-20T02:57:40Z

+)
+
+
+def _disk_marks_serialized(qc_kwargs: dict, quant_config: object) -> bool:


BTW, I think this method can be extracted to quantization/factory.py or quantization/utils.py. It should be common.

gcanlin

LGTM, thanks!

…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578) Signed-off-by: hyh_hh <huyinghong1@huawei.com> Co-authored-by: hyh_hh <huyinghong1@huawei.com>

…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578) Signed-off-by: hyh_hh <huyinghong1@huawei.com> Co-authored-by: hyh_hh <huyinghong1@huawei.com> Signed-off-by: lvliang-intel <liang1.lv@intel.com>

w4a4 online & offline quant

6c71635

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

hxhhhlalala force-pushed the w4a4 branch from 94768ff to 6c71635 Compare May 14, 2026 06:43

hxhhhlalala marked this pull request as ready for review May 14, 2026 06:45

hxhhhlalala requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, hsliuustc0106, lishunyang12, princepride, wtomin and ywang96 as code owners May 14, 2026 06:45

hxhhhlalala changed the title ~~[WIP][NPU][Quant] Add W4A4 MXFP4 online/ dual-scale offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU~~ [NPU][Quant] Add W4A4 MXFP4 online/ dual-scale offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU May 14, 2026

chatgpt-codex-connector Bot reviewed May 14, 2026

Comment thread vllm_omni/quantization/factory.py Outdated

fix para init

a2b1a02

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

hxhhhlalala force-pushed the w4a4 branch from 0bab339 to a2b1a02 Compare May 14, 2026 08:15

david6666666 mentioned this pull request May 15, 2026

[RFC]: Continuous Quantization Support #1854

Open

david6666666 reviewed May 15, 2026

Comment thread vllm_omni/quantization/mixed_mxfp_config.py Outdated

Comment thread vllm_omni/quantization/tools/merge_mxfp4_dualscale_checkpoint.py Outdated

Comment thread vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py Outdated

hxhhhlalala requested a review from yenuo26 as a code owner May 15, 2026 07:51

hxhhhlalala force-pushed the w4a4 branch 3 times, most recently from f308f63 to 9d50330 Compare May 15, 2026 08:52

add ut

3c221e2

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

hxhhhlalala force-pushed the w4a4 branch from 9d50330 to 3c221e2 Compare May 15, 2026 08:54

hxhhhlalala force-pushed the w4a4 branch from a4846a3 to 18391c2 Compare May 16, 2026 02:14

david6666666 mentioned this pull request May 18, 2026

[RFC] [0.22.0]: Quantization Support JiusiServe/vllm-omni#182

Closed

6 tasks

hxhhhlalala force-pushed the w4a4 branch 2 times, most recently from 900a43c to 28dc1f5 Compare May 18, 2026 07:35

fix merge script

d75030a

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

hxhhhlalala force-pushed the w4a4 branch from 28dc1f5 to d75030a Compare May 18, 2026 07:36

resolve conflicts

a1d8af2

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

david6666666 reviewed May 19, 2026

Comment thread vllm_omni/quantization/mxfp4_config.py Outdated

fix input_dim for TP sharding

435bc0a

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

david6666666 added the ready label to trigger buildkite CI label May 19, 2026

david6666666 reviewed May 19, 2026

Comment thread vllm_omni/quantization/mxfp4_config.py Outdated

hxhhhlalala force-pushed the w4a4 branch from f8fe325 to d25851f Compare May 19, 2026 08:09

add tp=2 ut

acde528

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

hxhhhlalala force-pushed the w4a4 branch from d25851f to acde528 Compare May 19, 2026 08:10

gcanlin reviewed May 20, 2026

hxhhhlalala force-pushed the w4a4 branch from d76c8a4 to dd33040 Compare May 20, 2026 03:19

add user guide

0cb3e5c

Signed-off-by: hyh_hh <huyinghong1@huawei.com>

hxhhhlalala force-pushed the w4a4 branch from dd33040 to 0cb3e5c Compare May 20, 2026 03:58

gcanlin approved these changes May 20, 2026

gcanlin enabled auto-merge (squash) May 20, 2026 06:17

gcanlin merged commit 9dd36e3 into vllm-project:main May 20, 2026
7 of 9 checks passed
