[FP8] enable hunyuan-image-3 diffusion model with fp8 online quant #1935
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@lishunyang12, is anyone working on hunyuan-image-3-moe? If not, I would like to open a PR to support it.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2bf5674f07
@lishunyang12 PTAL
LGTM on the implementation. @xuechendi, if you want to merge this before the quantization framework is settled, can you help test it afterwards in #1764?
@lishunyang12, sure, I can verify once #1764 lands.
Please rebase after #1908 is merged; this PR will be integrated before #1764.
Pending until #1908 lands.
Force-pushed from 9b8a2e1 to 8009cb0.
hsliuustc0106 left a comment
PR Review: [FP8] enable hunyuan-image-3 diffusion model with fp8 online quant
Gate Status: PASSING ✓
All CI checks pass.
What This PR Does
This is a well-scoped PR that:
- Threads `quant_config` through `HunyuanImage3DecoderLayer` and `HunyuanImage3Model`
- Updates weight loading to allow missing `weight_scale` weights (expected for online FP8)
- Uses `get_vllm_quant_config_for_layers` in the pipeline
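The first two bullets can be pictured with a toy model. This is a hypothetical sketch in plain Python, not the PR's code: `ToyDecoderLayer`, `ToyModel`, and `load_weights` are made-up names, parameters are dict entries rather than tensors, and we assume a quantized layer registers a `<prefix>.weight_scale` entry next to its weight. A missing regular weight still raises, while missing scales only log a warning, because an online FP8 method fills them in after loading.

```python
import logging
from typing import Dict, Iterable, List, Optional, Set, Tuple

logger = logging.getLogger(__name__)


class ToyDecoderLayer:
    """Hypothetical stand-in for a decoder layer that receives quant_config."""

    def __init__(self, prefix: str, quant_config: Optional[dict] = None):
        self.prefix = prefix
        self.quant_config = quant_config

    def param_names(self) -> List[str]:
        names = [f"{self.prefix}.mlp.weight"]
        if self.quant_config is not None:
            # Assumption: a quantized linear layer also owns a scale parameter.
            names.append(f"{self.prefix}.mlp.weight_scale")
        return names


class ToyModel:
    """Hypothetical model that threads one quant_config into every layer."""

    def __init__(self, num_layers: int, quant_config: Optional[dict] = None):
        self.layers = [
            ToyDecoderLayer(f"layers.{i}", quant_config) for i in range(num_layers)
        ]
        # Flat name -> value mapping; None means "not loaded yet".
        self.params: Dict[str, object] = {
            name: None for layer in self.layers for name in layer.param_names()
        }


def load_weights(
    params: Dict[str, object], checkpoint: Iterable[Tuple[str, object]]
) -> Set[str]:
    """Copy checkpoint entries into params; tolerate absent *weight_scale entries."""
    loaded: Set[str] = set()
    for name, value in checkpoint:
        if name in params:
            params[name] = value
            loaded.add(name)
    missing = set(params) - loaded
    # Online FP8 computes scales after loading, so these may be absent.
    scales_not_loaded = {n for n in missing if n.endswith("weight_scale")}
    required_missing = missing - scales_not_loaded
    if required_missing:
        raise ValueError(f"weights missing from checkpoint: {sorted(required_missing)}")
    if scales_not_loaded:
        logger.warning(
            "Following weight_scale weights were not initialized from checkpoint: %s",
            sorted(scales_not_loaded),
        )
    return loaded
```

Without a `quant_config`, `ToyModel` registers no scale entries at all, so the tolerance only matters on the quantized path.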
Minor Suggestions (Non-blocking)
1. Debug Log Level
`logger.info(f"quant_config: {quant_config}")`
Consider using `logger.debug()` to avoid noisy logs in production.
2. Warning Message Grammar
`logger.warning(f"Following weights scale were not initialized from checkpoint: {weights_scale_not_loaded}")`
Suggest: "Following weight_scale weights were not initialized from checkpoint: ..."
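With the standard `logging` module, the two suggestions could look roughly like this. The logger name is hypothetical, and passing values as `%s` arguments (instead of an f-string) is our addition, not something the review asked for; it defers formatting until a handler actually emits the record.

```python
import logging
from typing import Iterable

logger = logging.getLogger("hunyuan_image3_fp8_sketch")  # hypothetical logger name


def log_quant_config(quant_config: object) -> None:
    # DEBUG keeps this line out of default production logs; the value is only
    # formatted into the message if a DEBUG-enabled handler emits it.
    logger.debug("quant_config: %s", quant_config)


def warn_missing_scales(names: Iterable[str]) -> None:
    names = sorted(names)
    if names:
        # Wording follows the reviewer's suggested message.
        logger.warning(
            "Following weight_scale weights were not initialized from checkpoint: %s",
            names,
        )
```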
Note on "Remain Issue"
The missing `weight_scale` weights mentioned in the PR description are expected for online FP8 quantization: the scales are computed at load time rather than loaded from the checkpoint.
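To make "scales are computed at load time" concrete, here is a minimal per-tensor sketch, assuming a dynamic scale of `amax / 448` (448 is the largest finite float8 E4M3 value). We believe vLLM's online FP8 path derives a per-tensor scale from the weight's absolute maximum in a similar way, but this is a pure-Python illustration, not vLLM's code: the hypothetical `round_to_e4m3` emulates E4M3 round-to-nearest-even with saturation instead of using a real FP8 dtype.

```python
import math
from typing import List, Tuple

FP8_E4M3_MAX = 448.0             # largest finite float8 E4M3 value
FP8_E4M3_MIN_NORMAL = 2.0 ** -6  # smallest normal E4M3 magnitude
FP8_E4M3_SUBNORMAL_STEP = 2.0 ** -9


def round_to_e4m3(x: float) -> float:
    """Round to the nearest E4M3 value (ties to even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)
    if a < FP8_E4M3_MIN_NORMAL:
        # Subnormal range: evenly spaced at 2**-9.
        return sign * round(a / FP8_E4M3_SUBNORMAL_STEP) * FP8_E4M3_SUBNORMAL_STEP
    m, e = math.frexp(a)  # a = m * 2**e with 0.5 <= m < 1
    # 1 implicit + 3 mantissa bits -> 16 steps of m per binade; round() ties to even.
    return sign * min(round(m * 16) / 16 * 2.0 ** e, FP8_E4M3_MAX)


def quantize_per_tensor(weights: List[float]) -> Tuple[List[float], float]:
    """Compute a per-tensor scale from the data itself, then quantize."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(q: List[float], scale: float) -> List[float]:
    return [v * scale for v in q]
```

Because the scale depends only on the loaded weight, nothing needs to come from the checkpoint, which is why the missing `weight_scale` entries are harmless on this path.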
Summary
| Validated | Minor Issues |
|---|---|
| FP8 config properly threaded | Debug log level |
| Weight scale handling correct | Warning grammar |
| Manual test shows working output | |
🤖 Generated with Claude Code
@hsliuustc0106, @lishunyang12, I have rebased this PR now that #1908 is merged.
lishunyang12 left a comment
Looks good, rebase is clean. The two minor nits from the other review are non-blocking.
@lishunyang12, thanks, I've fixed the nit and rebased.
Since you are exploring quantization on a new diffusion architecture, can you help us find out which layer in hunyuan-image3 is the most sensitive to FP8 with respect to the final image output? That way users know the trade-off and can use this memory-saving feature more effectively. This is not a blocker: you can carry out the experiment after the integration and add your insights to the doc: https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/quantization/fp8/#configuration Also, could you please refer to #1470 and show us a table like this:
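One way to approach this experiment (the thread does not prescribe a method) is a one-layer-at-a-time sweep: quantize a single layer, keep every other layer in full precision, and rank layers by how far the output drifts from the full-precision reference. The sketch below uses a hypothetical toy network of small linear layers, output MSE as the metric, and a caller-supplied `quantize` function; for hunyuan-image3 the forward pass would be image generation and the metric an image comparison against the BF16 output.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def forward(layers: Sequence[Matrix], x: Vector) -> Vector:
    # Toy network: linear layers with ReLU between them (none after the last).
    for i, w in enumerate(layers):
        x = matvec(w, x)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def mse(a: Vector, b: Vector) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def layer_sensitivity(
    layers: Sequence[Matrix],
    inputs: Sequence[Vector],
    quantize: Callable[[Matrix], Matrix],
) -> List[Tuple[int, float]]:
    """Quantize one layer at a time; return (index, mean MSE), most sensitive first."""
    reference = [forward(layers, x) for x in inputs]
    results = []
    for i in range(len(layers)):
        trial = list(layers)  # shallow copy; the original layers stay untouched
        trial[i] = quantize(layers[i])
        err = sum(mse(forward(trial, x), ref) for x, ref in zip(inputs, reference))
        results.append((i, err / len(inputs)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The head of the returned list is the layer that would be the first candidate to keep in BF16 when trading memory for quality.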
hsliuustc0106 left a comment
@yjb767868009, please check whether INT8 works on NPU.
```python
    help="Diffusion model name or local path. Supported models: "
    "Qwen/Qwen-Image, Tongyi-MAI/Z-Image-Turbo, Qwen/Qwen-Image-2512, stepfun-ai/NextStep-1.1",
)
parser.add_argument(
```
@ZJY0516 @SamitHuang PTAL: do we have a stage config YAML for diffusion models?
We don't usually have one; it is only needed by Hunyuan. I proposed a change to introduce stage selection in #1826. If that is merged, users won't need to specify `stage_configs_yaml` here.

Purpose
#1854
Enable online FP8 quantization for the hunyuan-image-3-moe diffusion path. The Intel Arc B60 currently has only 24 GB of memory per card and cannot run the full Hunyuan model in BF16, so online FP8 quantization is required.
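A back-of-the-envelope check of why BF16 does not fit: the sketch counts weight bytes only (no activations, caches, or runtime overhead) and uses decimal GB. The parameter count is an assumption, roughly 80B total parameters, which is the figure we believe is published for HunyuanImage-3.0; verify it against the model card before relying on the numbers.

```python
import math


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_cards(num_params: float, bytes_per_param: float, card_gb: float = 24.0) -> int:
    """Lower bound on the number of cards needed just to hold the weights."""
    return math.ceil(weight_gb(num_params, bytes_per_param) / card_gb)


# Assumption: ~80B total parameters (check the model card).
NUM_PARAMS = 80e9
for dtype, nbytes in (("BF16", 2), ("FP8", 1)):
    print(dtype, weight_gb(NUM_PARAMS, nbytes), "GB, at least",
          min_cards(NUM_PARAMS, nbytes), "x 24 GB cards")
```

Under that assumption BF16 weights alone take about 160 GB (at least 7 cards), while FP8 halves this to about 80 GB (at least 4 cards); real deployments need extra headroom on top of these lower bounds.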
Test Plan
Remaining issue:
Output looks OK

Test Result