[Hunyuanimage-3.0] Accuracy fix by Bounty-hunter · Pull Request #3373 · vllm-project/vllm-omni

Bounty-hunter · 2026-05-06T03:00:54Z


Purpose

There is an accuracy problem after rebasing onto vLLM 0.20.0.
Changes:
(1) Remove the hook in the HunyuanFusedMoEDefault class that calls process_weights_after_loading. That method is already invoked centrally in diffusers_loader.py, and calling it a second time causes accuracy issues.

[DebugLog] module model.layers.1.mlp.experts call with quant UnquantizedFusedMoEMethod() in diffusers_loader.py

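(2) Adapt the FusedMoE call sites to vLLM's FusedMoE refactor. Since vLLM 0.20, FusedMoE merges the shared_experts output and performs the TP all-reduce internally, so the model-side merge and all-reduce are no longer needed.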

(3) After the transformers upgrade, Siglip2VisionModel also changed and no longer has a `vision_model` member. It is therefore replaced everywhere with the Siglip2VisionTransformer implemented in the AR stage.
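Change (1) comes down to running a non-idempotent, in-place post-load transform twice. The toy sketch below (plain Python, no vLLM or torch) shows that failure mode and one defensive pattern. `ToyFusedMoE`, `pad_by_fixed_amount`, `process_weights_once`, and the `weights_processed` flag are hypothetical names, and the real padding logic of vLLM's `UnquantizedFusedMoEMethod` is not reproduced. This PR's actual fix is to stop registering the hook and rely on the single central call; the flag guard only illustrates the idea behind the `_already_called_process_weights_after_loading` guard that the commit message later in this thread attributes to the FP8 quant method.

```python
# Toy illustration only: ToyFusedMoE and pad_by_fixed_amount are hypothetical
# stand-ins, not vLLM classes. They mimic an in-place post-load transform that
# is NOT idempotent, i.e. applying it twice gives a different result.


class ToyFusedMoE:
    def __init__(self, w13):
        self.w13 = list(w13)
        # Hypothetical "already processed" flag, checked by process_weights_once.
        self.weights_processed = False


def pad_by_fixed_amount(layer, pad=2):
    """Append `pad` zeros on every call, so a second call pads already-padded weights."""
    layer.w13 = layer.w13 + [0.0] * pad


def process_weights_once(layer, transform):
    """Run `transform` at most once per layer, however many times this is called."""
    if layer.weights_processed:
        return
    transform(layer)
    layer.weights_processed = True


if __name__ == "__main__":
    # Failure mode: the loader runs the transform, then a forward pre-hook runs it again.
    buggy = ToyFusedMoE([1.0, 2.0, 3.0])
    pad_by_fixed_amount(buggy)  # central call during loading
    pad_by_fixed_amount(buggy)  # stale hook on first forward
    print(len(buggy.w13))  # 7 instead of 5

    # Guarded: the second call is a no-op.
    safe = ToyFusedMoE([1.0, 2.0, 3.0])
    process_weights_once(safe, pad_by_fixed_amount)
    process_weights_once(safe, pad_by_fixed_amount)
    print(len(safe.w13))  # 5
```

Removing the second caller entirely, as the PR does, avoids needing the flag at all; the guard is only useful when a transform may legitimately be reached from more than one code path.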

Test Plan

Text-to-image (t2i) with the prompt: 'A brown and white dog is running on the grass'

Test Result

before:

after:


Signed-off-by: dengyunyang <584797741@qq.com>

Gaohan123

Please add a regression test. Thanks!

@dengyunyang

Adds image-to-image editing capability for tencent/HunyuanImage-3.0-Instruct, using the same two-stage AR -> DiT pipeline as the existing T2I path, with the AR stage receiving an additional condition image alongside the user prompt. Highlights:

* Pipeline & runtime
  - vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py: cond image VAE-encode, ViT-encode, and scatter the resulting features into the DiT prefill via instantiate_vae_image_tokens / instantiate_vit_image_tokens (matches HF reference modeling layout).
  - vllm_omni/model_executor/stage_input_processors/hunyuan_image3.py: ar2diffusion bridge forwards condition image + system_prompt + user prompt from AR stage to DiT stage.
  - vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml: 8-GPU IT2I stage config (4 AR + 4 DiT).
  - examples/offline_inference/hunyuan_image3/end2end.py + README.md: img2img modality entry; prompt_dict uses vllm-standard `prompt` key so the offline path receives the raw user prompt at the DiT stage (DiT pipeline reads `p.get("prompt")` only).
* DiT MoE accuracy fixes (stale 0.18-era code surfaced as bugs after the 0.20 rebase). Both addressed by aligning with the upstream PR vllm-project#3373 by @dengyunyang, who independently surfaced the same accuracy gap.
  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_fused_moe.py: HunyuanFusedMoEDefault used to register a forward pre-hook that called `self.quant_method.process_weights_after_loading(self)` on first forward, to compensate for the 0.18-era standard model loader not invoking it on FusedMoE layers. vLLM 0.20's standard loader (`model_executor/model_loader/base_loader.py`) now invokes `process_weights_after_loading` model-wide on init, so the hook fires a second time on first forward, double-applying non-idempotent in-place transforms (`UnquantizedFusedMoEMethod._maybe_pad_weight` re-pads w13/w2 in place; `_setup_kernel` re-registers the moe_kernel oracle on already-padded weights). Corrupted w13/w2 layout + wrong kernel oracle config produces a small per-token, per-layer expert-dispatch bias that accumulates across the 32 DiT MoE layers into a "painterly / oil texture" attractor on the generated image. The unquantized FusedMoE method has no `_already_called_process_weights_after_loading` guard (only the FP8 quant method does), so non-quantized HunyuanImage3 reliably trips this. Hook deliberately not registered.
  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py (HunYuanSparseMoeBlock): Drop external `shared_experts` merge + `maybe_all_reduce_tensor_model_parallel` in forward, and drop `reduce_results=False` on the FusedMoE init. Since vLLM 0.20, when `shared_experts` is passed to FusedMoE, the `shared_mlp` output is merged inside FusedMoE.forward and the TP all-reduce is done internally; the wrapper code that did both of these externally was a 0.18-era workaround that became a double op after 0.20. Net effect of double-reduce + double shared_mlp add was a small numerical bias on top of the painterly drift; removing the wrapper restores HF-reference parity. Verified on 4xL20X TP=2/2 (vllm 0.20.0 + torch 2.11.0+cu130): same cartoon-block input + cute orange cat prompt yields a clean flat-cartoon output, visually matching HF generate_image() reference.
* Tests
  - tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py: unit-level - AR prefill input_ids byte-equal HF chat template, image-tensor byte-equal AR-side processor.
  - tests/e2e/accuracy/test_hunyuan_image3_it2i.py: full-pipeline e2e - vllm-omni AR -> DiT vs HF generate_image() at PSNR >= 40 dB on the same (condition_image, prompt, seed) tuple.

Co-authored-by: dengyunyang <584797741@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
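The shared-experts part of the fix can be illustrated with a toy tensor-parallel simulation on plain floats. `all_reduce` here is a stand-in that sums per-rank values, and `fused_moe_forward` models, as an assumption, the post-refactor behavior described above (shared-expert output added and TP all-reduce done inside FusedMoE). None of these functions are vLLM APIs. The toy only shows that the stale wrapper applies both steps twice; it does not model the size of the real drift, which the commit message describes as a small numerical bias.

```python
# Toy TP simulation with plain floats; nothing here calls vLLM.
# Each rank holds a partial routed-expert output and a partial
# shared-expert output for the same token.


def all_reduce(per_rank_values):
    """Stand-in for a sum all-reduce: every rank receives the total."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


def fused_moe_forward(routed_partials, shared_partials):
    """Assumed post-refactor behavior: add shared experts, then all-reduce, internally."""
    combined = [r + s for r, s in zip(routed_partials, shared_partials)]
    return all_reduce(combined)


def buggy_wrapper(routed_partials, shared_partials):
    """Stale wrapper kept after the refactor: adds shared experts and reduces again."""
    out = fused_moe_forward(routed_partials, shared_partials)
    out = [o + s for o, s in zip(out, shared_partials)]  # second shared-expert add
    return all_reduce(out)  # second all-reduce


def fixed_wrapper(routed_partials, shared_partials):
    """After the fix: rely on FusedMoE to do both steps exactly once."""
    return fused_moe_forward(routed_partials, shared_partials)


if __name__ == "__main__":
    routed, shared = [1.0, 2.0], [0.5, 0.25]
    print("fixed:", fixed_wrapper(routed, shared))
    print("buggy:", buggy_wrapper(routed, shared))
```

With routed partials `[1.0, 2.0]` and shared partials `[0.5, 0.25]`, `fixed_wrapper` returns `[3.75, 3.75]` (the sum of all partials), while `buggy_wrapper` returns `[8.25, 8.25]`.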

TaffyOfficial · 2026-05-06T06:22:27Z

> (3) After upgrading transformers, Siglip2VisionModel has also been modified. It is now uniformly replaced with the Siglip2VisionTransformer implemented in the AR stage.

Could you add some detail on what exactly the problem is?


Bounty-hunter · 2026-05-06T06:30:13Z

> Please add a regression test. Thanks

Regression tests:
(1) Benchmark-dataset-level test: #3055
(2) Single-image-level test: tests/e2e/offline_inference/test_hunyuanimage3_text2img.py will be moved into CI in a subsequent PR.
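A single-image regression check is often written as a PSNR comparison against a stored reference image, the same metric the IT2I e2e test described earlier in this thread uses with a 40 dB threshold. The sketch below works on flat lists of 8-bit pixel values and leaves image loading out; `psnr` and `passes_regression` are illustrative helpers, not the contents of test_hunyuanimage3_text2img.py.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((float(a) - float(b)) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def passes_regression(reference, candidate, threshold_db=40.0):
    """Hypothetical helper: pass if the new output is within `threshold_db` PSNR of the reference."""
    return psnr(reference, candidate) >= threshold_db


if __name__ == "__main__":
    ref = [100] * 100
    one_pixel_off = [110] + [100] * 99
    all_pixels_off = [110] * 100
    print(round(psnr(ref, one_pixel_off), 2), passes_regression(ref, one_pixel_off))
    print(round(psnr(ref, all_pixels_off), 2), passes_regression(ref, all_pixels_off))
```

For a 100-pixel image where a single pixel differs by 10, the MSE is 1.0 and the PSNR is about 48.1 dB, which passes; shifting every pixel by 10 gives an MSE of 100 and about 28.1 dB, which fails. A drift like the "painterly" artifact described in this thread would be expected to fall well below such a threshold, though the exact value depends on the image.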

Bounty-hunter · 2026-05-06T06:32:33Z

> (3) After upgrading transformers, Siglip2VisionModel has also been modified. It is now uniformly replaced with the Siglip2VisionTransformer implemented in the AR stage.
> Could you add some detail on what exactly the problem is?

Siglip2VisionModel no longer includes the `vision_model` member.
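This attribute change is the kind of break a defensive accessor can absorb. The sketch below uses stand-in classes, not the real transformers classes: `OldStyleVisionModel` exposes its encoder under `vision_model`, while `NewStyleVisionModel` does not and is modeled, as an assumption, as being usable as the encoder itself. Code that hard-codes `model.vision_model` raises `AttributeError` on the new layout; the hypothetical `get_vision_transformer` falls back to the model. The PR takes a different and more robust route: it stops depending on the transformers wrapper and uses the Siglip2VisionTransformer implementation from the AR stage.

```python
# Stand-in classes for illustration only; they are not the transformers classes.


class VisionTransformerStub:
    """Pretend vision encoder: doubles each input value."""

    def forward(self, pixel_values):
        return [p * 2 for p in pixel_values]


class OldStyleVisionModel:
    """Older wrapper layout: the encoder lives under a `vision_model` member."""

    def __init__(self):
        self.vision_model = VisionTransformerStub()


class NewStyleVisionModel(VisionTransformerStub):
    """Newer layout (assumed): no `vision_model` member; the object is the encoder."""


def get_vision_transformer(model):
    """Return `model.vision_model` when present, otherwise the model itself."""
    return getattr(model, "vision_model", model)


if __name__ == "__main__":
    for model in (OldStyleVisionModel(), NewStyleVisionModel()):
        encoder = get_vision_transformer(model)
        print(type(model).__name__, encoder.forward([1, 2, 3]))
```

Both models print `[2, 4, 6]`. A fallback like this hides the layout change rather than pinning one implementation, which is why owning the encoder code (as the PR does) is easier to keep numerically stable across library upgrades.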

gcanlin

LGTM

@dengyunyang

Adds image-to-image editing capability for tencent/HunyuanImage-3.0-Instruct, using the same two-stage AR -> DiT pipeline as the existing T2I path, with the AR stage receiving an additional condition image alongside the user prompt. Highlights:

* Pipeline & runtime
  - vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py: cond image VAE-encode, ViT-encode, and scatter the resulting features into the DiT prefill via instantiate_vae_image_tokens / instantiate_vit_image_tokens (matches HF reference modeling layout).
  - vllm_omni/model_executor/stage_input_processors/hunyuan_image3.py: ar2diffusion bridge forwards condition image + system_prompt + user prompt from AR stage to DiT stage.
  - vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml: 8-GPU IT2I stage config (4 AR + 4 DiT).
  - examples/offline_inference/hunyuan_image3/end2end.py + README.md: img2img modality entry; prompt_dict uses vllm-standard `prompt` key so the offline path receives the raw user prompt at the DiT stage (DiT pipeline reads `p.get("prompt")` only).
* DiT MoE accuracy fixes (stale 0.18-era code surfaced as bugs after the 0.20 rebase). Both addressed by aligning with the upstream PR vllm-project#3373 by @dengyunyang, who independently surfaced the same accuracy gap.
  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_fused_moe.py: HunyuanFusedMoEDefault used to register a forward pre-hook that called `self.quant_method.process_weights_after_loading(self)` on first forward, to compensate for the 0.18-era standard model loader not invoking it on FusedMoE layers. vLLM 0.20's standard loader (`model_executor/model_loader/base_loader.py`) now invokes `process_weights_after_loading` model-wide on init, so the hook fires a second time on first forward, double-applying non-idempotent in-place transforms (`UnquantizedFusedMoEMethod._maybe_pad_weight` re-pads w13/w2 in place; `_setup_kernel` re-registers the moe_kernel oracle on already-padded weights). Corrupted w13/w2 layout + wrong kernel oracle config produces a small per-token, per-layer expert-dispatch bias that accumulates across the 32 DiT MoE layers into a "painterly / oil texture" attractor on the generated image. The unquantized FusedMoE method has no `_already_called_process_weights_after_loading` guard (only the FP8 quant method does), so non-quantized HunyuanImage3 reliably trips this. Hook deliberately not registered.
  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py (HunYuanSparseMoeBlock): Drop external `shared_experts` merge + `maybe_all_reduce_tensor_model_parallel` in forward, and drop `reduce_results=False` on the FusedMoE init. Since vLLM 0.20, when `shared_experts` is passed to FusedMoE, the `shared_mlp` output is merged inside FusedMoE.forward and the TP all-reduce is done internally; the wrapper code that did both of these externally was a 0.18-era workaround that became a double op after 0.20. Net effect of double-reduce + double shared_mlp add was a small numerical bias on top of the painterly drift; removing the wrapper restores HF-reference parity. Verified on 4xL20X TP=2/2 (vllm 0.20.0 + torch 2.11.0+cu130): same cartoon-block input + cute orange cat prompt yields a clean flat-cartoon output, visually matching HF generate_image() reference.
* Tests
  - tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py: unit-level - AR prefill input_ids byte-equal HF chat template, image-tensor byte-equal AR-side processor.
  - tests/e2e/accuracy/test_hunyuan_image3_it2i.py: full-pipeline e2e - vllm-omni AR -> DiT vs HF generate_image() at PSNR >= 40 dB on the same (condition_image, prompt, seed) tuple.

Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: skf <54565339+skf-1999@users.noreply.github.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>

Signed-off-by: dengyunyang <584797741@qq.com>

Bounty-hunter force-pushed the main_5_5_adapt branch 5 times, most recently from 9f70e8c to 1be83bf, May 6, 2026 05:17

accuracy fix

99c43eb

Signed-off-by: dengyunyang <584797741@qq.com>

Bounty-hunter force-pushed the main_5_5_adapt branch from 1be83bf to 99c43eb, May 6, 2026 05:18

Bounty-hunter changed the title from "Accuracy fix" to "[Hunyuanimage-3.0] Accuracy fix" May 6, 2026

Bounty-hunter marked this pull request as ready for review May 6, 2026 05:32

Bounty-hunter requested a review from hsliuustc0106 as a code owner May 6, 2026 05:32

Merge branch 'main' into main_5_5_adapt

b2b292e

Gaohan123 reviewed May 6, 2026


gcanlin added the "ready" label (used to trigger Buildkite CI) May 6, 2026

gcanlin approved these changes May 6, 2026


gcanlin enabled auto-merge (squash) May 6, 2026 06:54

Bounty-hunter mentioned this pull request May 6, 2026

[RFC]: Support Hunyuan image AR + DIT JiusiServe/vllm-omni#183 (Closed)

gcanlin merged commit 369a47d into vllm-project:main May 6, 2026
7 of 8 checks passed

This was referenced May 7, 2026

[Feature]: [Hunyuanimage]Support DIT reuse kv from AR stage JiusiServe/vllm-omni#216 (Open)

[RFC]: HunyuanImage Model deployment optimization #2015 (Open)

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Hunyuanimage-3.0] Accuracy fix (vllm-project#3373)

7a6ec34

Signed-off-by: dengyunyang <584797741@qq.com>

gcanlin merged 2 commits into vllm-project:main from Bounty-hunter:main_5_5_adapt
