
[NPU] [Quant] Support HunyuanImage3 offline quantization with vLLM-Ascend on diffusion path #2979

Merged
gcanlin merged 9 commits into vllm-project:main from jiangmengyu18:support-hyimage3-quant-main on Apr 25, 2026
Conversation

@jiangmengyu18 (Contributor) commented Apr 21, 2026


Purpose

This PR adds HunyuanImage3 diffusion quantization support to vllm-omni, focusing on reusing the vllm-ascend offline quantization flow.

Changes:

  • Framework-common
  1. Align diffusion quant config preparation with the vLLM / vLLM-Ascend flow
  2. Reuse maybe_update_config(...) and configure_quant_config(...) in the diffusion registry
  3. Add a HunyuanImage3 hf_to_vllm_mapper for quant description key normalization
  4. Clean up the HunyuanImage3 quant loading and packed QKV reshape logic
  5. Add platform-based packed_modules_mapping resolution for HunyuanImage3 (see the sketch after this list)
  • NPU-specific
  1. Bind the HunyuanImage3 packed module mapping for Ascend offline quantization
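
For context, a minimal sketch of what such a packed-module mapping looks like (the module names are illustrative assumptions drawn from the layer names in the gain-analysis table below, not this PR's exact code):

```python
# Hypothetical sketch: a packed_modules_mapping tells the quantized-weight
# loader which original HF projections were fused into one vLLM module, so
# offline quant descriptions keyed by the unfused names can be matched to
# the packed module.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],    # fused attention projections
    "gate_and_up_proj": ["gate_proj", "up_proj"],  # fused MLP projections
}
```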

Supported Models

| Model | HF Models | Recommendation | ignored_layers | Status |
|---|---|---|---|---|
| HunyuanImage-3.0 | - | All layers | None | Supported |
| HunyuanVideo-1.5 | - | - | - | Planned |
| Wan2_2 | - | - | - | Planned |

Quantized HunyuanImage-3.0 weights have not yet been uploaded to public model hubs such as Hugging Face. You can generate quantized weights yourself with the msModelSlim tool linked below; we will upload the quantized weights as soon as possible.

Test Plan

Quantization tool: https://gitcode.com/betta18/msmodelslim/tree/hyimage3_mxfp8
Weight quantization script:

```bash
export ASCEND_RT_VISIBLE_DEVICES=0

MODEL_PATH="/data/HunyuanImage-3.0/"
SAVE_PATH="/data/HunyuanImage-3.0-W8A8F8-MXFP"
MODEL_TYPE="HunyuanImage-3.0"
CONFIG_PATH="/lab_practice/hunyuan_image3/hunyuan_image3_w8a8f8_mxfp.yaml"

msmodelslim quant --model_path ${MODEL_PATH} \
                  --save_path ${SAVE_PATH} \
                  --device npu \
                  --model_type ${MODEL_TYPE} \
                  --config_path ${CONFIG_PATH} \
                  --trust_remote_code False
```

Server:

```bash
vllm serve "/data/HunyuanImage-3.0-W8A8F8-MXFP/" --omni --port "8091" \
    --log-stats \
    --quantization ascend \
    --stage-configs-path "vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml"
```

Client:

```bash
curl -X POST http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt":
        "A cinematic medium shot captures a single Asian woman seated on a chair within a dimly lit room, creating an intimate and theatrical atmosphere. The composition is focused on the subject, rendered with rich colors and intricate textures that evoke a nostalgic and moody feeling.\n\nThe primary subject is a young Asian woman with a thoughtful and expressive countenance, her gaze directed slightly away from the camera. She is seated in a relaxed yet elegant posture on an ornate, vintage armchair. The chair is upholstered in a deep red velvet, its fabric showing detailed, intricate textures and slight signs of wear. She wears a simple, elegant dress in a dark teal hue, the material catching the light in a way that reveals its fine-woven texture. Her skin has a soft, matte quality, and the light delicately models the contours of her face and arms.\n\nThe surrounding room is characterized by its vintage decor, which contributes to the historic and evocative mood. In the immediate background, partially blurred due to a shallow depth of field consistent with a f/2.8 aperture, the wall is covered with wallpaper featuring a subtle, damask pattern. The overall color palette is a carefully balanced interplay of deep teal and rich red hues, creating a visually compelling and cohesive environment. The entire scene is detailed, from the fibers of the upholstery to the subtle patterns on the wall.\n\nThe lighting is highly dramatic and artistic, defined by high contrast and pronounced shadow play. A single key light source, positioned off-camera, projects gobo lighting patterns onto the scene, casting intricate shapes of light and shadow across the woman and the back wall. These dramatic shadows create a strong sense of depth and a theatrical quality. While some shadows are deep and defined, others remain soft, gently wrapping around the subject and preventing the loss of detail in darker areas. The soft focus on the background enhances the intimate feeling, drawing all attention to the expressive subject. The overall image presents a cinematic, photorealistic photography style.",
    "num_inference_steps": 50,
    "guidance_scale": "1.0",
    "n": 1,
    "size": "1024x1024",
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > output.png
```

Test Result

  • HunyuanImage3.0 int8
  • HunyuanImage3.0 mxfp8
  • HunyuanImage3.0 bf16

Quantization Quality Benchmark for NPU

  • HunyuanImage-3.0
| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |
|---|---|---|---|---|---|
| BF16, TP=2 | 16.62s | - | 93.93 | - | (ref) |
| mxfp8, TP=2 | 13.59s | 18.23% | 61.35 | 34.68% | 0.0185 |
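
For reference, the speedup figure is the drop in average per-image latency relative to BF16 ((16.62 − 13.59) / 16.62 ≈ 18.23%), and the memory column matches the peak totals in the Memory Profiling table below.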

Quantified Gain Analysis

  • HunyuanImage-3.0: bf16 -> mxfp8
| Modules | Layers | % of Total Network | Reward | % of Total Network Gain |
|---|---|---|---|---|
| Attn | qkv_proj | 2.75% | 38.68% | 1.06% |
| Attn | o_proj | 1.79% | 39.62% | 0.71% |
| Attn | Total | 4.54% | - | 1.77% |
| MoE | gate | 0.15% | 5.12% | 0.01% |
| MoE | experts.gate_and_up_proj | 27.79% | 37.24% | 10.35% |
| MoE | experts.down_proj | 12.73% | 43.21% | 5.50% |
| MoE | shared_mlp.gate_and_up_proj | 2.74% | 34.19% | 0.94% |
| MoE | shared_mlp.down_proj | 1.36% | 43.30% | 0.59% |
| MoE | Total | 44.77% | - | 17.38% |
| Overall | Total | 49.31% | - | 19.16% |
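
Reading the table: the last column appears to be the product of a layer's share of total network time and its Reward (the per-layer speedup from quantization), e.g. qkv_proj: 2.75% × 38.68% ≈ 1.06%, and experts.gate_and_up_proj: 27.79% × 37.24% ≈ 10.35%.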

Memory Profiling

  • HunyuanImage-3.0, 1024x1024, 50 steps
| Config | Weights | Activations | Peak Total | Reduction |
|---|---|---|---|---|
| BF16, TP=2 | 79.75 GB | 14.18 GB | 93.93 GB | - |
| mxfp8, TP=2 | 43.73 GB | 17.62 GB | 61.35 GB | 34.68% |
| mxfp8, TP=1 | 82.57 GB | 15.05 GB | 97.62 GB | 48.03% |
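
The TP=2 reduction is computed against the BF16 peak at the same parallelism: (93.93 − 61.35) / 93.93 ≈ 34.68%. No BF16 TP=1 row is listed; the 48.03% figure is consistent with a projected BF16 TP=1 peak of roughly 2 × 93.93 GB ((187.86 − 97.62) / 187.86 ≈ 48.0%).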

HunyuanImage-3.0 sample outputs (images attached in the original PR):

  • bf16: output_bf16
  • mxfp8: output_mxfp8




@lishunyang12 (Collaborator) commented Apr 21, 2026

Please show the test results. This PR is likely to be refactored, since the modelopt infrastructure is still in development and the YAML refactoring is not fully settled for this model. cc @hsliuustc0106

@jiangmengyu18 (Contributor, Author) commented Apr 22, 2026

@hsliuustc0106 (Collaborator)
@lishunyang12 PTAL at the diffusion_quant_config: is it different from vllm_config.quant_config?

@gcanlin (Collaborator) commented Apr 22, 2026

Could you upload your quantized model to a ModelScope repo, like the Intel team did in #1777?

Update: will upload it once modelslim accepts our request.

@gcanlin added the ready label (trigger buildkite CI) on Apr 22, 2026
@lishunyang12 (Collaborator)
@hsliuustc0106 od_config.quantization_config is the diffusion-side namespace; vllm_config.quant_config is for the LLM stages. They are separate by design today (diffusion has its own loop). This PR just threads od_config's quantization_config through to HunyuanImage3, where it was previously hardcoded to None. Worth unifying eventually, but not in scope here.
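
A minimal sketch of the change being described, with simplified names (the actual wiring is in the reviewed diffusion_worker.py block further down):

```python
# Before (simplified): the diffusion path never saw quantization settings.
# quant_config = None
# After: the diffusion-side namespace is forwarded to the model loader,
# so HunyuanImage3 picks up the offline quantization description.
quant_config = self.od_config.quantization_config
```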

@lishunyang12 (Collaborator)
Can you provide a table showing which layers you quantized and which layers you skipped, and why?

@lishunyang12 (Collaborator)
Have you uploaded the quantized weights to HF or any other platform? Is this the only model you are targeting? Is there a roadmap for follow-up PRs?

@lishunyang12 (Collaborator) commented Apr 23, 2026

[screenshot] We should complete these two tables.

@lishunyang12 (Collaborator)

Can you provide a table like this for performance measurement? [screenshot]

@lishunyang12 (Collaborator)

For quality comparison, please provide a table like the one in #1470. [screenshot]

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
@jiangmengyu18 force-pushed the support-hyimage3-quant-main branch from a69e18a to 5a1a567 on April 23, 2026 08:05
@jiangmengyu18 jiangmengyu18 reopened this Apr 23, 2026
betta18 added 2 commits April 23, 2026 16:14
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
@lishunyang12 (Collaborator)

Can you profile it using https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/

@jiangmengyu18 (Contributor, Author)

I can, but what information do you need me to provide?

@lishunyang12 (Collaborator)

Per-step torch profiler traces for bf16 vs mxfp8, with a kernel-level breakdown (attn, MoE MM, quant/dequant). I want to see where the 18% speedup comes from and whether there is dequant overhead left on the table. A flamegraph or a top-N kernels table is fine.
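
For reference, a minimal sketch of how such a per-step trace could be collected with the stock torch.profiler API (CPU activity only here; the run_one_step helper is a hypothetical stand-in, and on Ascend the torch_npu plugin provides an analogous profiler for NPU-side kernels):

```python
import torch

def run_one_step():
    # Hypothetical stand-in for a single denoising step of the pipeline.
    ...

# Collect a kernel-level breakdown for one step.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,
) as prof:
    run_one_step()

# Top-N kernels table, as requested in the review.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```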

@jiangmengyu18 (Contributor, Author) commented Apr 24, 2026

> Per-step torch profiler traces for bf16 vs mxfp8, with a kernel-level breakdown (attn, MoE MM, quant/dequant).

Done. Throughout the network, only the activation quantization operators are present; all dequantization operators have been fused away.


Comment on lines +125 to +138
```python
try:
    hf_config = get_config(self.od_config.model, trust_remote_code=self.od_config.trust_remote_code)
except ValueError:
    hf_config = None
    logger.info("Skipping hf_config loading for diffusion model %r", self.od_config.model_class_name)
hf_text_config = get_hf_text_config(hf_config) if hf_config is not None else None
# Expose just the attributes downstream code reads, without building a full ModelConfig.
vllm_config.model_config = SimpleNamespace(
    hf_config=hf_config,
    hf_text_config=hf_text_config,
    enforce_eager=self.od_config.enforce_eager,
    dtype=self.od_config.dtype,
    enable_return_routed_experts=False,
)
# Forward the diffusion-side quantization config (previously hardcoded to None).
vllm_config.quant_config = self.od_config.quantization_config
```
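
For context (an editorial reading of the block above): the ValueError fallback covers diffusion checkpoints that do not ship a transformers-style config, in which case the code proceeds with hf_config=None and logs the skip message, and the final line is where the diffusion-side quantization config reaches the model loader.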
A Collaborator commented:

In this PR, only this block needs to be reviewed carefully. @SamitHuang @wtomin @ZJY0516

betta18 and others added 2 commits April 24, 2026 14:13
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
@gcanlin gcanlin left a comment

LGTM, I have no other concerns.

Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
@gcanlin gcanlin enabled auto-merge (squash) April 25, 2026 05:50
@gcanlin gcanlin merged commit c174c95 into vllm-project:main Apr 25, 2026
8 checks passed
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…cend on diffusion path (vllm-project#2979)

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
sphinxkkkbc pushed a commit to sphinxkkkbc/vllm-omni that referenced this pull request May 4, 2026
…cend on diffusion path (vllm-project#2979)

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…cend on diffusion path (vllm-project#2979)

Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
