[NPU] [Quant] Support HunyuanImage3 offline quantization with vLLM-Ascend on diffusion path #2979
Conversation
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Please show test results. This PR is subject to refactoring, since the ModelOpt infra is still in development and the YAML refactoring is not fully settled for this model. cc @hsliuustc0106
@lishunyang12 PTAL at the diffusion_quant_config: is this different from vllm_config.quant_config?
Could you upload your quantized model to a ModelScope repo, like the Intel team did in #1777? Update: we will upload it once msModelSlim accepts our request.
@hsliuustc0106 — |
Can you list a table of which layers you quantized and which layers you skipped, and why?
Have you uploaded the quantized weights to HF or any other platform? Is this the only model you are targeting? Any roadmap for follow-up PRs?
#1470 |
Force-pushed from a69e18a to 5a1a567
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Can you profile it using https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/ ?
I can, but what information do you need me to provide? |
Per-step torch profiler trace for bf16 vs mxfp8, kernel-level breakdown (attn, MoE MM, quant/dequant). Want to see where the 18% speedup comes from and whether there's dequant overhead left on the table. Flamegraph or top-N kernels table is fine. |
Done. Throughout the network, only the activation quantization operators remain; all dequantization operators have been fused away.
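For anyone reproducing this comparison, here is a minimal sketch of how such a kernel-level breakdown can be collected with the stock torch profiler. On Ascend NPUs the torch_npu profiling interface is typically used instead, so the CUDA activity below is an assumption carried over from the generic setup, and `run_one_step` is a placeholder for a single denoising step:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder for one denoising step; substitute the real model call.
def run_one_step():
    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)
    return a @ b

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    run_one_step()

# Top-20 kernels by self device time: attention, MoE matmuls, and any
# remaining quant/dequant ops show up here when comparing bf16 vs mxfp8.
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
```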
```python
try:
    hf_config = get_config(self.od_config.model, trust_remote_code=self.od_config.trust_remote_code)
except ValueError:
    hf_config = None
    logger.info("Skipping hf_config loading for diffusion model %r", self.od_config.model_class_name)
hf_text_config = get_hf_text_config(hf_config) if hf_config is not None else None
# Duck-typed stand-in for vllm's ModelConfig: only the attributes that the
# quantization path reads are populated here.
vllm_config.model_config = SimpleNamespace(
    hf_config=hf_config,
    hf_text_config=hf_text_config,
    enforce_eager=self.od_config.enforce_eager,
    dtype=self.od_config.dtype,
    enable_return_routed_experts=False,
)
vllm_config.quant_config = self.od_config.quantization_config
```
In this PR, only this block needs to be reviewed carefully. @SamitHuang @wtomin @ZJY0516
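For context on why a bare SimpleNamespace works here: the quantization path duck-types vllm's ModelConfig and only dereferences the attributes set above. A hypothetical consumer (not vllm's actual API) illustrating the idea:

```python
from types import SimpleNamespace

# Hypothetical consumer: it only reads attributes that the stub populates,
# so a full vllm ModelConfig is not required on the diffusion path.
def describe_model(model_config) -> str:
    backing = "no hf_config" if model_config.hf_config is None else "hf-backed"
    return f"{backing}, dtype={model_config.dtype}, eager={model_config.enforce_eager}"

stub = SimpleNamespace(hf_config=None, hf_text_config=None, dtype="bfloat16",
                       enforce_eager=True, enable_return_routed_experts=False)
print(describe_model(stub))  # no hf_config, dtype=bfloat16, eager=True
```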
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
gcanlin left a comment:
LGTM, I have no other concerns.
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
…cend on diffusion path (vllm-project#2979) Signed-off-by: betta18 <jiangmengyu1@huawei.com> Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com> Co-authored-by: betta18 <jiangmengyu1@huawei.com>
…cend on diffusion path (vllm-project#2979) Signed-off-by: betta18 <jiangmengyu1@huawei.com> Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com> Co-authored-by: betta18 <jiangmengyu1@huawei.com> Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>



Purpose
This PR adds HunyuanImage3 diffusion quantization support to vllm-omni, with a focus on reusing the vllm-ascend offline quantization flow.
Changes:
Supported Models
`ignored_layers`

Currently, quantized HunyuanImage-3.0 weights have not been uploaded to public model platforms such as Hugging Face. You can manually generate quantized weights with the msModelSlim tool provided below. We will upload the quantized weights as soon as possible.
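As a shape illustration only (the actual msModelSlim/vllm-ascend config schema and the exact layer names are assumptions, not this PR's settings), an offline quantization config with skipped layers might look like:

```python
# Hypothetical config shape; keys and layer patterns are illustrative.
quant_config = {
    "quant_method": "mxfp8",
    # Modules commonly kept in high precision for quality reasons:
    "ignored_layers": [
        "lm_head",              # final output projection
        "model.embed_tokens",   # embeddings
        "*.norm",               # normalization layers
    ],
}
```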
Test Plan
Quantization tool: https://gitcode.com/betta18/msmodelslim/tree/hyimage3_mxfp8
Weight quantization script:
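The actual script lives in the msModelSlim branch linked above. As a rough, self-contained sketch of what MXFP8 weight quantization computes (blocks of 32 values sharing one power-of-two scale, with FP8 E4M3 elements); the helper below is hypothetical and is not msModelSlim's API:

```python
import torch

def quantize_weight_mxfp8(w: torch.Tensor, block: int = 32):
    """Sketch of MXFP8 weight quantization (hypothetical helper): each block
    of 32 values shares one power-of-two scale; elements are float8_e4m3fn."""
    wb = w.reshape(-1, block).float()  # assumes numel is divisible by block
    amax = wb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # Shared exponent: 2^(floor(log2(amax)) - emax), with emax = 8 for E4M3.
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 8.0)
    # Clamp to E4M3's max normal (448) before casting, per the MX convention.
    q = (wb / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return q.reshape(w.shape), scale

w = torch.randn(128, 256)
q, scales = quantize_weight_mxfp8(w)
# Dequantized view for a quick sanity check against the original weight:
w_hat = (q.float().reshape(-1, 32) * scales).reshape(w.shape)
print((w - w_hat).abs().max())
```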
Server:
Client:
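A minimal client sketch, assuming an OpenAI-style image-generation route; the actual endpoint, model name, and payload depend on how the vllm-omni server is launched, so treat all of them as assumptions:

```python
import requests

# Hypothetical endpoint and payload; adjust to the actual server API.
resp = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={
        "model": "HunyuanImage-3.0",
        "prompt": "a photo of a red panda in a bamboo forest",
        "size": "1024x1024",
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```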
Test Result
Quantization Quality Benchmark for NPU
Quantization Gain Analysis
Memory Profiling
HunyuanImage-3.0

bf16

![image]

mxfp8

![image]
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.