
[NPU] [Quant] Support HunyuanImage3 offline quantization with vLLM-Ascend on diffusion path #2979

Merged
gcanlin merged 9 commits into vllm-project:main from jiangmengyu18:support-hyimage3-quant-main on Apr 25, 2026
Conversation

@jiangmengyu18 (Contributor) commented Apr 21, 2026


Purpose

This PR adds HunyuanImage3 diffusion quantization support to vllm-omni, focusing on reusing the vllm-ascend offline quantization flow.

Changes:

  • Framework-common
  1. Align diffusion quant config preparation with the vLLM / vLLM-Ascend flow
  2. Reuse maybe_update_config(...) and configure_quant_config(...) in the diffusion registry
  3. Add a HunyuanImage3 hf_to_vllm_mapper for quant description key normalization
  4. Clean up the HunyuanImage3 quant loading and packed QKV reshape logic
  5. Add platform-based packed_modules_mapping resolution for HunyuanImage3 (see the sketch after this list)
  • NPU-specific
  1. Bind the HunyuanImage3 packed module mapping for Ascend offline quantization
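
For context, a minimal sketch of what such a packed-module mapping looks like (the module names are illustrative assumptions drawn from the layer names in the gain-analysis table below, not this PR's exact code):

```python
# Hypothetical sketch: a packed_modules_mapping tells the quantized-weight
# loader which original HF projections were fused into one vLLM module, so
# offline quant descriptions keyed by the unfused names can be matched to
# the packed module.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],    # fused attention projections
    "gate_and_up_proj": ["gate_proj", "up_proj"],  # fused MLP projections
}
```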

Supported Models

| Model | HF Models | Recommendation | ignored_layers | Status |
|---|---|---|---|---|
| HunyuanImage-3.0 | - | All layers | None | Supported |
| HunyuanVideo-1.5 | - | - | - | Planned |
| Wan2_2 | - | - | - | Planned |

Quantized HunyuanImage-3.0 weights have not yet been uploaded to public model hubs such as Hugging Face. You can generate quantized weights yourself with the msModelSlim tool linked below; we will upload the quantized weights as soon as possible.

Test Plan

Quantization tool: https://gitcode.com/betta18/msmodelslim/tree/hyimage3_mxfp8
Weight quantization script:

```bash
export ASCEND_RT_VISIBLE_DEVICES=0

MODEL_PATH="/data/HunyuanImage-3.0/"
SAVE_PATH="/data/HunyuanImage-3.0-W8A8F8-MXFP"
MODEL_TYPE="HunyuanImage-3.0"
CONFIG_PATH="/lab_practice/hunyuan_image3/hunyuan_image3_w8a8f8_mxfp.yaml"

msmodelslim quant --model_path ${MODEL_PATH} \
                  --save_path ${SAVE_PATH} \
                  --device npu \
                  --model_type ${MODEL_TYPE} \
                  --config_path ${CONFIG_PATH} \
                  --trust_remote_code False
```

Server:

```bash
vllm serve "/data/HunyuanImage-3.0-W8A8F8-MXFP/" --omni --port "8091" \
    --log-stats \
    --quantization ascend \
    --stage-configs-path "vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml"
```

Client:

```bash
curl -X POST http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt":
        "A cinematic medium shot captures a single Asian woman seated on a chair within a dimly lit room, creating an intimate and theatrical atmosphere. The composition is focused on the subject, rendered with rich colors and intricate textures that evoke a nostalgic and moody feeling.\n\nThe primary subject is a young Asian woman with a thoughtful and expressive countenance, her gaze directed slightly away from the camera. She is seated in a relaxed yet elegant posture on an ornate, vintage armchair. The chair is upholstered in a deep red velvet, its fabric showing detailed, intricate textures and slight signs of wear. She wears a simple, elegant dress in a dark teal hue, the material catching the light in a way that reveals its fine-woven texture. Her skin has a soft, matte quality, and the light delicately models the contours of her face and arms.\n\nThe surrounding room is characterized by its vintage decor, which contributes to the historic and evocative mood. In the immediate background, partially blurred due to a shallow depth of field consistent with a f/2.8 aperture, the wall is covered with wallpaper featuring a subtle, damask pattern. The overall color palette is a carefully balanced interplay of deep teal and rich red hues, creating a visually compelling and cohesive environment. The entire scene is detailed, from the fibers of the upholstery to the subtle patterns on the wall.\n\nThe lighting is highly dramatic and artistic, defined by high contrast and pronounced shadow play. A single key light source, positioned off-camera, projects gobo lighting patterns onto the scene, casting intricate shapes of light and shadow across the woman and the back wall. These dramatic shadows create a strong sense of depth and a theatrical quality. While some shadows are deep and defined, others remain soft, gently wrapping around the subject and preventing the loss of detail in darker areas. The soft focus on the background enhances the intimate feeling, drawing all attention to the expressive subject. The overall image presents a cinematic, photorealistic photography style.",
    "num_inference_steps": 50,
    "guidance_scale": "1.0",
    "n": 1,
    "size": "1024x1024",
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > output.png
```

Test Result

  • HunyuanImage3.0 int8
  • HunyuanImage3.0 mxfp8
  • HunyuanImage3.0 bf16

Quantization Quality Benchmark for NPU

  • HunyuanImage-3.0
| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |
|---|---|---|---|---|---|
| BF16, TP=2 | 16.62s | - | 93.93 | - | (ref) |
| mxfp8, TP=2 | 13.59s | 18.23% | 61.35 | 34.68% | 0.0185 |
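
For reference, the speedup figure is the drop in average per-image latency relative to BF16 ((16.62 − 13.59) / 16.62 ≈ 18.23%), and the memory column matches the peak totals in the Memory Profiling table below.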

Quantified Gain Analysis

  • HunyuanImage-3.0: bf16 -> mxfp8
| Modules | Layers | % of Total Network | Reward | % of Total Network Gain |
|---|---|---|---|---|
| Attn | qkv_proj | 2.75% | 38.68% | 1.06% |
| Attn | o_proj | 1.79% | 39.62% | 0.71% |
| Attn | Total | 4.54% | - | 1.77% |
| MoE | gate | 0.15% | 5.12% | 0.01% |
| MoE | experts.gate_and_up_proj | 27.79% | 37.24% | 10.35% |
| MoE | experts.down_proj | 12.73% | 43.21% | 5.50% |
| MoE | shared_mlp.gate_and_up_proj | 2.74% | 34.19% | 0.94% |
| MoE | shared_mlp.down_proj | 1.36% | 43.30% | 0.59% |
| MoE | Total | 44.77% | - | 17.38% |
| Overall | Total | 49.31% | - | 19.16% |
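
Reading the table: the last column appears to be the product of a layer's share of total network time and its Reward (the per-layer speedup from quantization), e.g. qkv_proj: 2.75% × 38.68% ≈ 1.06%, and experts.gate_and_up_proj: 27.79% × 37.24% ≈ 10.35%.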

Memory Profiling

  • HunyuanImage-3.0, 1024x1024, 50 steps
| Config | Weights | Activations | Peak Total | Reduction |
|---|---|---|---|---|
| BF16, TP=2 | 79.75 GB | 14.18 GB | 93.93 GB | - |
| mxfp8, TP=2 | 43.73 GB | 17.62 GB | 61.35 GB | 34.68% |
| mxfp8, TP=1 | 82.57 GB | 15.05 GB | 97.62 GB | 48.03% |
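
The TP=2 reduction is computed against the BF16 peak at the same parallelism: (93.93 − 61.35) / 93.93 ≈ 34.68%. No BF16 TP=1 row is listed; the 48.03% figure is consistent with a projected BF16 TP=1 peak of roughly 2 × 93.93 GB ((187.86 − 97.62) / 187.86 ≈ 48.0%).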

HunyuanImage-3.0 sample outputs (images attached in the original PR):

  • bf16: output_bf16
  • mxfp8: output_mxfp8




@lishunyang12 (Collaborator) commented Apr 21, 2026

Please show the test results. This PR is likely to be refactored, since the modelopt infrastructure is still in development and the YAML refactoring is not fully settled for this model. cc @hsliuustc0106

@jiangmengyu18 (Contributor, Author) commented Apr 22, 2026

@hsliuustc0106 (Collaborator)
@lishunyang12 PTAL at the diffusion_quant_config: is it different from vllm_config.quant_config?

@gcanlin (Collaborator) commented Apr 22, 2026

Could you upload your quantized model to a ModelScope repo, like the Intel team did in #1777?

Update: will upload it once modelslim accepts our request.

@gcanlin added the ready label (trigger buildkite CI) on Apr 22, 2026
@lishunyang12 (Collaborator)
@hsliuustc0106 od_config.quantization_config is the diffusion-side namespace; vllm_config.quant_config is for the LLM stages. They are separate by design today (diffusion has its own loop). This PR just threads od_config's quantization_config through to HunyuanImage3, where it was previously hardcoded to None. Worth unifying eventually, but not in scope here.
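
A minimal sketch of the change being described, with simplified names (the actual wiring is in the reviewed diffusion_worker.py block further down):

```python
# Before (simplified): the diffusion path never saw quantization settings.
# quant_config = None
# After: the diffusion-side namespace is forwarded to the model loader,
# so HunyuanImage3 picks up the offline quantization description.
quant_config = self.od_config.quantization_config
```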

@lishunyang12 (Collaborator)
Can you provide a table showing which layers you quantized and which layers you skipped, and why?

@lishunyang12 (Collaborator)
Have you uploaded the quantized weights to HF or any other platform? Is this the only model you are targeting? Is there a roadmap for follow-up PRs?

@lishunyang12 (Collaborator) commented Apr 23, 2026

[screenshot] We should complete these two tables.

@lishunyang12 (Collaborator)

Can you provide a table like this for performance measurement? [screenshot]

@lishunyang12 (Collaborator)

For quality comparison, please provide a table like the one in #1470. [screenshot]

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
@jiangmengyu18 force-pushed the support-hyimage3-quant-main branch from a69e18a to 5a1a567 on April 23, 2026 08:05
@jiangmengyu18 jiangmengyu18 reopened this Apr 23, 2026
betta18 added 2 commits April 23, 2026 16:14
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
@lishunyang12 (Collaborator)

Can you profile it using https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/

@jiangmengyu18 (Contributor, Author)

I can, but what information do you need me to provide?

@lishunyang12 (Collaborator)

Per-step torch profiler traces for bf16 vs mxfp8, with a kernel-level breakdown (attn, MoE MM, quant/dequant). I want to see where the 18% speedup comes from and whether there is dequant overhead left on the table. A flamegraph or a top-N kernels table is fine.
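
For reference, a minimal sketch of how such a per-step trace could be collected with the stock torch.profiler API (CPU activity only here; the run_one_step helper is a hypothetical stand-in, and on Ascend the torch_npu plugin provides an analogous profiler for NPU-side kernels):

```python
import torch

def run_one_step():
    # Hypothetical stand-in for a single denoising step of the pipeline.
    ...

# Collect a kernel-level breakdown for one step.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,
) as prof:
    run_one_step()

# Top-N kernels table, as requested in the review.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```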

@jiangmengyu18 (Contributor, Author) commented Apr 24, 2026

> Per-step torch profiler traces for bf16 vs mxfp8, with a kernel-level breakdown (attn, MoE MM, quant/dequant).

Done. Throughout the network, only the activation quantization operators are present; all dequantization operators have been fused away.


Comment on lines +125 to +138
```python
try:
    hf_config = get_config(self.od_config.model, trust_remote_code=self.od_config.trust_remote_code)
except ValueError:
    hf_config = None
    logger.info("Skipping hf_config loading for diffusion model %r", self.od_config.model_class_name)
hf_text_config = get_hf_text_config(hf_config) if hf_config is not None else None
# Expose just the attributes downstream code reads, without building a full ModelConfig.
vllm_config.model_config = SimpleNamespace(
    hf_config=hf_config,
    hf_text_config=hf_text_config,
    enforce_eager=self.od_config.enforce_eager,
    dtype=self.od_config.dtype,
    enable_return_routed_experts=False,
)
# Forward the diffusion-side quantization config (previously hardcoded to None).
vllm_config.quant_config = self.od_config.quantization_config
```
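
For context (an editorial reading of the block above): the ValueError fallback covers diffusion checkpoints that do not ship a transformers-style config, in which case the code proceeds with hf_config=None and logs the skip message, and the final line is where the diffusion-side quantization config reaches the model loader.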
A Collaborator commented:

In this PR, only this block needs to be reviewed carefully. @SamitHuang @wtomin @ZJY0516

betta18 and others added 2 commits April 24, 2026 14:13
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
@gcanlin gcanlin left a comment

LGTM, I have no other concerns.

Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
@gcanlin gcanlin enabled auto-merge (squash) April 25, 2026 05:50
@gcanlin gcanlin merged commit c174c95 into vllm-project:main Apr 25, 2026
8 checks passed
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…cend on diffusion path (vllm-project#2979)

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
sphinxkkkbc pushed a commit to sphinxkkkbc/vllm-omni that referenced this pull request May 4, 2026
…cend on diffusion path (vllm-project#2979)

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…cend on diffusion path (vllm-project#2979)

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
