[Hunyuanimage-3.0] Accuracy fix #3373

Merged: gcanlin merged 2 commits into vllm-project:main from Bounty-hunter:main_5_5_adapt on May 6, 2026

Contributor

@Bounty-hunter Bounty-hunter commented May 6, 2026


Purpose

Image accuracy regressed after the rebase onto vLLM 0.20.0.

Changes:
(1) Remove the hook in the HunyuanFusedMoEDefault class that called process_weights_after_loading, because that method is now invoked centrally in diffusers_loader.py. Calling it repeatedly can cause accuracy issues.

[DebugLog] module model.layers.1.mlp.experts call with quant UnquantizedFusedMoEMethod() in diffusers_loader.py

(2) Adapt the call sites to the FusedMoE refactoring in vLLM.

(3) Siglip2VisionModel changed after the transformers upgrade. It is now uniformly replaced with the Siglip2VisionTransformer implemented in the AR stage.
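
Why a repeated call in change (1) can hurt is easiest to see on a toy model. The sketch below is not vLLM code: `ToyLayer`, its interleaving transform, and the `processed` guard flag are hypothetical stand-ins. It assumes only that the post-load step rewrites weights in place and is not idempotent.

```python
class ToyLayer:
    """Hypothetical stand-in for a layer with in-place post-load processing."""

    def __init__(self, weight):
        self.weight = list(weight)
        self.processed = False  # hypothetical guard flag, not a vLLM attribute

    def process_weights_after_loading(self):
        # Toy layout change: interleave the two halves of the weight.
        # Applying it twice does NOT reproduce the once-applied layout.
        half = len(self.weight) // 2
        first, second = self.weight[:half], self.weight[half:]
        self.weight = [x for pair in zip(first, second) for x in pair]

    def process_once(self):
        # Guarded variant: later calls become no-ops.
        if self.processed:
            return
        self.process_weights_after_loading()
        self.processed = True


def load_centrally_only(weight):
    """The loader processes the layer exactly once."""
    layer = ToyLayer(weight)
    layer.process_weights_after_loading()
    return layer.weight


def load_with_stale_hook(weight):
    """The loader processes the layer, then a leftover hook does it again."""
    layer = ToyLayer(weight)
    layer.process_weights_after_loading()  # central loader
    layer.process_weights_after_loading()  # stale first-forward hook
    return layer.weight


if __name__ == "__main__":
    raw = [1, 2, 3, 4, 5, 6, 7, 8]
    print("processed once: ", load_centrally_only(raw))
    print("processed twice:", load_with_stale_hook(raw))
```

Removing the extra call site, as this PR does, and adding a guard like `process_once` are two ways to get the same single application. The guard is shown only for comparison.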

Test Plan

t2i with Prompt: 'A brown and white dog is running on the grass'

Test Result

before: [image: output_0_0]

after: [image: output_0_0]



@Bounty-hunter Bounty-hunter force-pushed the main_5_5_adapt branch 5 times, most recently from 9f70e8c to 1be83bf, May 6, 2026 05:17
Signed-off-by: dengyunyang <584797741@qq.com>
@Bounty-hunter Bounty-hunter changed the title from "Accuracy fix" to "[Hunyuanimage-3.0] Accuracy fix" May 6, 2026
@Bounty-hunter Bounty-hunter marked this pull request as ready for review May 6, 2026 05:32
Collaborator

@Gaohan123 Gaohan123 left a comment

Please add a regression test. Thanks

TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
Adds image-to-image editing capability for tencent/HunyuanImage-3.0-Instruct,
using the same two-stage AR -> DiT pipeline as the existing T2I path with
the AR stage receiving an additional condition image alongside the user
prompt.

Highlights:

* Pipeline & runtime
  - vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py:
    cond image VAE-encode, ViT-encode, and scatter the resulting features
    into the DiT prefill via instantiate_vae_image_tokens /
    instantiate_vit_image_tokens (matches HF reference modeling layout).
  - vllm_omni/model_executor/stage_input_processors/hunyuan_image3.py:
    ar2diffusion bridge forwards condition image + system_prompt + user
    prompt from AR stage to DiT stage.
  - vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml:
    8-GPU IT2I stage config (4 AR + 4 DiT).
  - examples/offline_inference/hunyuan_image3/end2end.py + README.md:
    img2img modality entry; prompt_dict uses vllm-standard `prompt` key
    so the offline path receives the raw user prompt at the DiT stage
    (DiT pipeline reads `p.get("prompt")` only).

* DiT MoE accuracy fixes (stale 0.18-era code surfaced as bugs after
  the 0.20 rebase). Both addressed by aligning with the upstream PR
  vllm-project#3373 by @dengyunyang who independently surfaced
  the same accuracy gap.

  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_fused_moe.py:
    HunyuanFusedMoEDefault used to register a forward pre-hook that
    called `self.quant_method.process_weights_after_loading(self)` on
    first forward, to compensate for the 0.18-era standard model loader
    not invoking it on FusedMoE layers. vLLM 0.20's standard loader
    (`model_executor/model_loader/base_loader.py`) now invokes
    `process_weights_after_loading` model-wide on init, so the hook
    fires a second time on first forward, double-applying non-idempotent
    in-place transforms (`UnquantizedFusedMoEMethod._maybe_pad_weight`
    re-pads w13/w2 in place; `_setup_kernel` re-registers the moe_kernel
    oracle on already-padded weights). Corrupted w13/w2 layout + wrong
    kernel oracle config produces a small per-token, per-layer expert-
    dispatch bias that accumulates across the 32 DiT MoE layers into a
    "painterly / oil texture" attractor on the generated image. The
    unquantized FusedMoE method has no
    `_already_called_process_weights_after_loading` guard (only the FP8
    quant method does), so non-quantized HunyuanImage3 reliably trips
    this. Hook deliberately not registered.

  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py
    (HunYuanSparseMoeBlock):
    Drop external `shared_experts` merge + `maybe_all_reduce_tensor_model_parallel`
    in forward, and drop `reduce_results=False` on the FusedMoE init.
    Since vLLM 0.20, when `shared_experts` is passed to FusedMoE, the
    `shared_mlp` output is merged inside FusedMoE.forward and the TP
    all-reduce is done internally; the wrapper code that did both of
    these externally was a 0.18-era workaround that became a double
    op after 0.20. Net effect of double-reduce + double shared_mlp add
    was a small numerical bias on top of the painterly drift; removing
    the wrapper restores HF-reference parity.

  Verified on 4xL20X TP=2/2 (vllm 0.20.0 + torch 2.11.0+cu130): same
  cartoon-block input + cute orange cat prompt yields a clean flat-
  cartoon output, visually matching HF generate_image() reference.

* Tests
  - tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py:
    unit-level - AR prefill input_ids byte-equal HF chat template,
    image-tensor byte-equal AR-side processor.
  - tests/e2e/accuracy/test_hunyuan_image3_it2i.py:
    full-pipeline e2e - vllm-omni AR -> DiT vs HF generate_image() at
    PSNR >= 40 dB on the same (condition_image, prompt, seed) tuple.
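
A PSNR >= 40 dB gate like the one named in the e2e test can be sketched as follows. This is a minimal stdlib-only illustration and not the code of the test file: the `psnr` helper, the flat pixel lists, and the 8-bit `max_value` are assumptions made for the example.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [0, 128, 255, 64]
    close = [1, 127, 254, 65]   # every pixel off by 1
    far = [10, 118, 245, 74]    # every pixel off by 10
    for name, img in (("close", close), ("far", far)):
        score = psnr(ref, img)
        print(f"{name}: {score:.2f} dB, passes 40 dB gate: {score >= 40.0}")
```

With 8-bit pixels, a uniform error of one level stays above the 40 dB threshold, while a uniform error of ten levels falls well below it.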

Co-authored-by: dengyunyang <584797741@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
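
The double shared-expert add and double all-reduce described in the HunYuanSparseMoeBlock item above can be illustrated with plain Python numbers. This is a toy model, not vLLM code: `all_reduce`, `fused_moe_forward`, and `stale_wrapper_forward` are hypothetical functions, and the assumption that the fused layer already merges the shared output and reduces across ranks is taken from the commit message. The values only show that the operations stack; they say nothing about the size of the error in the real model.

```python
def all_reduce(per_rank_values):
    """Toy tensor-parallel all-reduce: every rank ends up with the sum."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


def fused_moe_forward(routed, shared):
    """Assumed post-refactor behaviour: merge shared output, then reduce, inside the layer."""
    merged = [r + s for r, s in zip(routed, shared)]
    return all_reduce(merged)


def stale_wrapper_forward(routed, shared):
    """Old wrapper logic kept on top of the new layer: both steps happen twice."""
    out = fused_moe_forward(routed, shared)
    out = [o + s for o, s in zip(out, shared)]  # second shared-expert add
    return all_reduce(out)                      # second all-reduce


if __name__ == "__main__":
    routed = [1.0, 2.0]    # per-rank partial outputs of the routed experts
    shared = [0.5, 0.25]   # per-rank partial outputs of the shared expert
    print("layer only:   ", fused_moe_forward(routed, shared))
    print("stale wrapper:", stale_wrapper_forward(routed, shared))
```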
@TaffyOfficial
Contributor

"(3) After upgrading transformers, Siglip2VisionModel has also been modified. It is now uniformly replaced with the Siglip2VisionTransformer implemented in the AR stage."

Could you elaborate on what exactly the problem was?

@Bounty-hunter
Contributor Author

"Please add a regression test. Thanks"

Regression tests:
(1) Benchmark-dataset-level test: #3055
(2) Single-image-level test: tests/e2e/offline_inference/test_hunyuanimage3_text2img.py will be moved into CI in a subsequent PR.

@Bounty-hunter
Contributor Author

"(3) After upgrading transformers, Siglip2VisionModel has also been modified. It is now uniformly replaced with the Siglip2VisionTransformer implemented in the AR stage. Could you elaborate on what exactly the problem was?"

Siglip2VisionModel no longer has a vision_model member.

@gcanlin gcanlin added the "ready" label (label to trigger buildkite CI) May 6, 2026
Collaborator

@gcanlin gcanlin left a comment

LGTM

@gcanlin gcanlin enabled auto-merge (squash) May 6, 2026 06:54
@gcanlin gcanlin merged commit 369a47d into vllm-project:main May 6, 2026
7 of 8 checks passed
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
Co-authored-by: skf <54565339+skf-1999@users.noreply.github.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: dengyunyang <584797741@qq.com>