[NPU][Quant] Add W4A4 MXFP4 online & MXFP4 dual-scale online/offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU #3578
Signed-off-by: hyh_hh <huyinghong1@huawei.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c716351ce
@david6666666 @gcanlin PTAL, thx
Architecture-wise, this mostly fits vLLM-Omni's existing quantization layering: method registration stays in A few things should be fixed before merge:
david6666666 left a comment
Please address these before merge.
f308f63 to 9d50330
I think using MXFP4 + BF16 directly is better; avoid MXFP8 for sensitive layers. NVFP4 + BF16 is also handled this way on GPUs.
900a43c to 28dc1f5
Please resolve conflicts.
david6666666 left a comment
Please check the TP loading path.
@lishunyang12 @gcanlin please check, thx
david6666666 left a comment
One remaining TP loading issue in the single-scale serialized path.
    # When od_config.quantization_config is None (no CLI --quantization flag), pre-build
    # the quant_config from the transformer's own config.json and propagate it back to
    # od_config. This has two effects:
    #   1. The first transformer's auto-detected config is reused by the second transformer
    #      in cascade models (e.g. Wan2.2-T2V-A14B); if the second transformer's config.json
    #      has different ignored_layers, create_transformer_from_config rebuilds locally.
    #   2. od_config.quantization_config becomes non-None so _check_unloaded_weights can
    #      filter expected quantization suffixes instead of raising on every unloaded param.
    if quant_config is None and "quantization_config" in config:
        from vllm_omni.quantization.factory import build_quant_config

        disk_qc = config["quantization_config"]
        if isinstance(disk_qc, dict) and "quant_method" in disk_qc:
            qc_method = disk_qc["quant_method"]
            qc_kwargs = {k: v for k, v in disk_qc.items() if k != "quant_method"}
            quant_config = build_quant_config(qc_method, **qc_kwargs)
            self.od_config.quantization_config = quant_config
            logger.info(
                "Auto-detected quantization from transformer config.json and propagated to od_config: "
                "method=%s kwargs=%s",
                qc_method,
                qc_kwargs,
            )
        elif isinstance(disk_qc, str):
            quant_config = build_quant_config(disk_qc)
            self.od_config.quantization_config = quant_config
            logger.info(
                "Auto-detected quantization from transformer config.json and propagated to od_config: method=%s",
                disk_qc,
            )
After talking to @hxhhhlalala offline, we may not need this code anymore because we now fall back to fp16 instead of mxfp8. Waiting for @hxhhhlalala to confirm.
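For readers following this thread, the branch structure of the auto-detection snippet can be exercised in isolation. The sketch below is only an illustration under stated assumptions: `build_quant_config` is a hypothetical stub that records its arguments (the real factory in `vllm_omni.quantization.factory` is not reproduced here), and `resolve_disk_quant_config` is an illustrative helper name that does not appear in the PR.

```python
from typing import Any, Optional


def build_quant_config(method: str, **kwargs: Any) -> dict:
    """Hypothetical stand-in for the real factory; it only records its inputs."""
    return {"method": method, "kwargs": kwargs}


def resolve_disk_quant_config(config: dict, quant_config: Optional[dict] = None) -> Optional[dict]:
    """Mirror the branches of the reviewed snippet (illustrative helper).

    A quant_config supplied by the caller always wins. Otherwise the on-disk
    "quantization_config" entry is used when it is either a dict carrying
    "quant_method" or a plain method-name string.
    """
    if quant_config is not None or "quantization_config" not in config:
        return quant_config

    disk_qc = config["quantization_config"]
    if isinstance(disk_qc, dict) and "quant_method" in disk_qc:
        # Everything except the method name is forwarded as keyword arguments.
        kwargs = {k: v for k, v in disk_qc.items() if k != "quant_method"}
        return build_quant_config(disk_qc["quant_method"], **kwargs)
    if isinstance(disk_qc, str):
        return build_quant_config(disk_qc)
    return None  # unrecognised shape: nothing is auto-detected


dict_form = resolve_disk_quant_config(
    {"quantization_config": {"quant_method": "mxfp4_dualscale", "ignored_layers": ["blocks.0"]}}
)
str_form = resolve_disk_quant_config({"quantization_config": "mxfp4"})
```

Both on-disk shapes end up in the same factory call; only the dict form carries extra keyword arguments such as `ignored_layers`.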
    7. W4A4 carries higher quantization noise than W8A8 (16 vs 256 levels). The
       DualScale offline method mitigates this with calibrated `mul_scale` smooth
       quantization. Use `ignored_layers` and `num_bf16_fallback_layers` to trade
       off compression vs. accuracy for precision-sensitive layers.
It would be better to add a section explaining how to adapt MXFP4 to a model, which would help other developers integrate MXFP4 into other models quickly.
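As background for the "16 vs 256 levels" remark in the doc snippet, here is a minimal pure-Python sketch of single-scale MXFP4 quantize-dequantize for one block. It assumes the OCP Microscaling layout (FP4 E2M1 elements, 32 elements per block, one shared power-of-two scale). It is not the NPU kernel used in this PR, and it does not model the dual-scale scheme or `mul_scale` smoothing; `quantize_block` is an illustrative name.

```python
import math

# Non-negative magnitudes representable by an FP4 E2M1 element.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32    # elements that share one scale in MX formats

# A 4-bit code has 2**4 bit patterns, an 8-bit code has 2**8.
codes_w4, codes_w8 = 2 ** 4, 2 ** 8


def quantize_block(block):
    """Quantize then dequantize one block with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block), 1.0
    # Choose the power of two that places amax in the top E2M1 binade.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    dequantized = []
    for v in block:
        # Saturate at the largest magnitude, then round to the nearest level.
        mag = min(abs(v) / scale, E2M1_MAGNITUDES[-1])
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))
        dequantized.append(math.copysign(nearest * scale, v))
    return dequantized, scale


block = [0.1 * i for i in range(-16, 16)]  # 32 sample values in [-1.6, 1.5]
dequantized, scale = quantize_block(block)
max_error = max(abs(a - b) for a, b in zip(block, dequantized))
```

Because every element in the block shares one scale, a single outlier coarsens the grid for the other 31 values, which is one reason precision-sensitive layers may be better left in BF16.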
    )


    def _disk_marks_serialized(qc_kwargs: dict, quant_config: object) -> bool:
BTW, I think this method can be extracted to quantization/factory.py or quantization/utils.py. It should be common.
…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578) Signed-off-by: hyh_hh <huyinghong1@huawei.com> Co-authored-by: hyh_hh <huyinghong1@huawei.com>
…quantization support for Wan2.2 T2V / I2V inference on Ascend NPU (vllm-project#3578) Signed-off-by: hyh_hh <huyinghong1@huawei.com> Co-authored-by: hyh_hh <huyinghong1@huawei.com> Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Purpose
This PR adds W4A4 MXFP4 (Microscaling FP4) quantization support for Wan2.2 diffusion
transformers on Ascend NPU, building on the
`MXFPLinearMethodBase` framework introduced in the MXFP8 PR.
- […] (dual-scale online + offline with per-layer BF16 fallback), registered as mxfp4 and
  mxfp4_dualscale in factory.py
- […] dual-scale (NPUMxfp4DualScaleLinearMethod, loads weight / weight_scale /
  weight_dual_scale / mul_scale per layer), and online dual-scale
  (NPUMxfp4DualScaleOnlineLinearMethod); dual-scale online uses
  npu_dynamic_dual_level_mx_quant with the leading 5 blocks kept in BF16 by default
- […] (coarse leading-block rule, online only, default 5) and ignored_layers (explicit
  per-layer override, both online and offline)
- […] transformer's local config.json for cascade models with differing ignored_layers
- […] msModelSlim DualScale output to diffusers format, overlays MXFP4 tensors onto the BF16
  base checkpoint, and injects quantization_config (including ignored_layers in
  vllm-omni parameter naming) into each transformer/config.json for auto-detection
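The two BF16 fallback rules described above (a coarse leading-block rule for online quantization and an explicit `ignored_layers` override for both modes) can be pictured with the following sketch. It is an illustration rather than the PR's implementation: the helper name `keeps_bf16`, the dotted layer names such as `blocks.4.attn1.to_q`, and the prefix-matching semantics for `ignored_layers` are all assumptions.

```python
import re
from typing import Iterable

# Matches "blocks.<index>" as a dotted path component (assumed naming scheme).
_BLOCK_INDEX = re.compile(r"(?:^|\.)blocks\.(\d+)(?:\.|$)")


def keeps_bf16(
    layer_name: str,
    ignored_layers: Iterable[str] = (),
    num_bf16_fallback_layers: int = 5,
    online: bool = True,
) -> bool:
    """Return True if the layer should stay in BF16 instead of MXFP4."""
    # Explicit per-layer override, applied in both online and offline modes.
    # Assumption: an entry matches itself and any dotted child of it.
    for ignored in ignored_layers:
        if layer_name == ignored or layer_name.startswith(ignored + "."):
            return True
    # Coarse rule, online only: keep the leading N transformer blocks in BF16.
    if online:
        match = _BLOCK_INDEX.search(layer_name)
        if match and int(match.group(1)) < num_bf16_fallback_layers:
            return True
    return False


decisions = {
    name: keeps_bf16(name, ignored_layers=["blocks.7.ffn"])
    for name in ("blocks.0.attn1.to_q", "blocks.4.ffn", "blocks.5.ffn", "blocks.7.ffn.net")
}
```

With the default of 5, `blocks.0` through `blocks.4` stay in BF16 online, while an offline checkpoint relies only on the `ignored_layers` list stored in its `config.json`.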
Supported Models
mxfp4 | mxfp4_dualscale | blocks.0–4 (num_bf16_fallback_layers=5) | mxfp4_dualscale | ignored_layers in config.json
Test Plan
Quantization tool: https://gitcode.com/Ascend/msmodelslim
Weight quantization script:
    export ASCEND_RT_VISIBLE_DEVICES=0
    msmodelslim quant \
      --model_path /data/Wan2.2-T2V-A14B/ \
      --save_path /data/Wan2.2-T2V-A14B-W4A4-MXFP4-raw/ \
      --device npu \
      --model_type Wan2.2 \
      --config_path configs/wan2_2_w4a4_mxfp4_dualscale.yaml

Checkpoint preprocessing:
Server (offline / pre-quantized):
Server (online / BF16 checkpoint):
Client:
Test Result
Quantization Quality Benchmark for NPU
[Image comparison: BF16 | mxfp8 offline | mxfp8 online | mxfp4 online | mxfp4_dualscale online | mxfp4_dualscale offline]
[Image comparison: BF16 | mxfp8 offline | mxfp8 online | mxfp4 online | mxfp4_dualscale online | mxfp4_dualscale offline]
Memory Profiling
Wan2.2-I2V-A14B
matmul shape: "19800,5120;15360,5120;15360"
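To put the shape above in perspective, the following back-of-the-envelope calculation compares weight storage for BF16 and single-scale MXFP4. It rests on assumptions: the shape string is read as activation 19800×5120, weight 15360×5120, and bias 15360; MXFP4 is taken as 4 bits per element plus one 8-bit shared scale per 32 elements. The extra scale tensors of the dual-scale variant and any device-specific padding or packing are not counted, and `weight_bytes` is an illustrative helper.

```python
def weight_bytes(rows, cols, bits_per_element, block_size=None, scale_bits=0):
    """Bytes needed for a rows x cols weight, plus optional per-block scales."""
    elements = rows * cols
    payload = elements * bits_per_element // 8
    scales = (elements // block_size) * scale_bits // 8 if block_size else 0
    return payload + scales


rows, cols = 15360, 5120  # assumed weight shape from the matmul string
bf16_bytes = weight_bytes(rows, cols, bits_per_element=16)
mxfp4_bytes = weight_bytes(rows, cols, bits_per_element=4, block_size=32, scale_bits=8)

print(f"BF16 : {bf16_bytes / 2**20:.1f} MiB")
print(f"MXFP4: {mxfp4_bytes / 2**20:.1f} MiB")
print(f"ratio: {bf16_bytes / mxfp4_bytes:.2f}x")
```

Under these assumptions the shared scales add about 6% on top of the 4-bit payload, so the saving for this one weight is a little under 4x.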