[BugFIX] enable Hunyuan image3 with stage selection among text_to_image/image_to_text by xuechendi · Pull Request #1826 · vllm-project/vllm-omni

xuechendi · 2026-03-12T01:06:00Z

Purpose

After #759 landed, hunyuan-image3 always seems to pick the AR path instead of the diffusion path, so final_output_type ends up as "text".

This PR proposes:

Enable a new config in Hunyuan-image-3-moe.yaml:

modes:
  - mode: text-to-image
    stages: [1]
  - mode: image-to-text
    stages: [0]

Add a filter_stages method that is applied when loading stage YAML files.
Pseudocode below:

# `mode` is the value passed as Omni(..., mode=...)
modes = {m['mode']: m['stages'] for m in stage_configs['modes']}
if mode == 'image-to-text':
    filter_stages = modes['image-to-text']
elif mode == 'text-to-image':
    filter_stages = modes['text-to-image']
else:
    filter_stages = all_stages
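
The pseudocode above could be fleshed out roughly as follows. This is a minimal sketch, not the PR's actual implementation: the function name `filter_stages_by_mode`, the `stage_id` key, and the `"llm"` stage type are assumptions made for illustration; only the `modes` / `mode` / `stages` keys come from the YAML fragment shown earlier. Unlike the pseudocode, this sketch rejects an unrecognized mode instead of silently loading all stages, which is a design choice of the sketch rather than something stated in the PR.

```python
from typing import Any, Optional


def filter_stages_by_mode(config: dict[str, Any], mode: Optional[str]) -> list[dict[str, Any]]:
    """Return the stage entries selected by `mode` (hypothetical helper).

    `config` mirrors a parsed stage YAML: an optional `modes` list and a
    `stages` list whose entries carry an assumed `stage_id` key.
    """
    all_stages = config.get("stages", [])
    if mode is None:
        # No mode requested: keep every stage.
        return list(all_stages)

    # Build a lookup such as {"text-to-image": [1], "image-to-text": [0]}.
    mode_to_ids = {m["mode"]: m["stages"] for m in config.get("modes", [])}
    if mode not in mode_to_ids:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(mode_to_ids)}")

    wanted = set(mode_to_ids[mode])
    return [s for s in all_stages if s["stage_id"] in wanted]


# Example config: the `modes` list from the YAML plus two placeholder stages.
example_config = {
    "modes": [
        {"mode": "text-to-image", "stages": [1]},
        {"mode": "image-to-text", "stages": [0]},
    ],
    "stages": [
        {"stage_id": 0, "stage_type": "llm"},        # assumed label for the AR stage
        {"stage_id": 1, "stage_type": "diffusion"},
    ],
}

selected = filter_stages_by_mode(example_config, "text-to-image")
print([s["stage_id"] for s in selected])
```

Because only the selected stage is kept, the other stage is never instantiated, which is the point of filtering at config-load time.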

Update the examples text_to_image.py and image_to_text.py with a --mode flag; if it is not set, text-to-image is used.
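
A flag with that fallback could look like the sketch below. It assumes the example scripts parse their arguments with `argparse`; the `choices` restriction and the help text are illustrative additions, not taken from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the --mode flag (illustrative subset)."""
    parser = argparse.ArgumentParser(description="mode selection sketch")
    parser.add_argument(
        "--mode",
        choices=["text-to-image", "image-to-text"],  # assumed restriction
        default="text-to-image",  # used when --mode is omitted
        help="Which stage subset to run.",
    )
    return parser


args = build_parser().parse_args([])  # simulate a call without --mode
print(args.mode)  # text-to-image
```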

Test Plan

text-to-image path

python -u examples/offline_inference/text_to_image/text_to_image.py   \
   --model /mnt/data/tencent/HunyuanImage-3.0/ \
   --prompt "A brown and white dog is running on the grass" \
   --output output_image_latest.png \
   --num-inference-steps 50 \
   --tensor-parallel-size 4 \
   --cfg-scale 4.0 \
   --enforce-eager 2>&1 | tee hunyuan-xpu-latest.log

image-to-text path

# --image is the dog image generated above
python examples/offline_inference/hunyuan_image3/image_to_text.py \
--image output_image_latest.png \
--prompt "describe this image" \
--model /mnt/data/tencent--HunyuanImage-3.0/

output

Prompt: <|startoftext|>You are an assistant that understands images and outputs text.<img>describe this image

Text:  The image is a close-up photograph of a tiger's face, capturing the intricate details of its fur and features with a shallow depth of field. The focus is sharp on the tiger's eyes, which are a striking yellow color with black stripes extending from them. The fur around the eyes is white, contrasting with the orange and black patterns that cover the rest of the face. The nose is pinkish-red, surrounded by white fur that extends down to the chin and throat. Long, stiff whiskers protrude from the muzzle area. The background is completely out of focus, presenting a blend of green and brown hues that suggest a natural, possibly forested environment. There are no discernible man-made objects or other subjects in the frame. The image does not contain any text.

Test Result

xuechendi · 2026-03-12T01:06:51Z

@gcanlin @hsliuustc0106 @usberkeley
Could you help take a look?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7da02cbbef

Gaohan123

Why can't we just load a custom config directly?

hsliuustc0106

Summary

Fix for the Hunyuan Image3 text-to-image path: adds a --disable-default-stage-cfg flag that prevents the default config from being loaded, which was causing the AR path to be selected instead of the diffusion path.

Validated

✅ DCO signed
✅ All CI checks passed
✅ Includes XPU stage config for hunyuan_image_3_moe
✅ Small fix for XPU path in autoencoder_kl_3d.py

Questions

The PR description mentions this is a fix, but the test plan section is empty. Can you add:
- Before/after comparison showing the wrong behavior (AR path) vs correct behavior (diffusion path)?
- Sample command and output demonstrating the fix works?
Is there a reason disable_default_stage_cfg defaults to False? For Hunyuan users calling text_to_image.py, wouldn't they always want this enabled to get the diffusion path?
The XPU config file adds 43 lines - is this a copy of the existing config with device changes, or new functionality?

Minor clarification needed before approval.

xuechendi · 2026-03-12T20:14:56Z

> Why can't we just load a custom config directly?

That also works, but it is a little complicated, since the user needs to provide a diffusion version of hunyuan-image-3-moe.
I am also thinking that, since one model might provide AR or diffusion, it would be worth adding config selection based on the workload. I'll start an RFC for that.

Meanwhile, this is a quick fix for users who don't want to provide their own config but still want to use Omni args to run text_to_image.

xuechendi · 2026-03-12T22:01:51Z

@hsliuustc0106 @yma11

> Before/after comparison showing the wrong behavior (AR path) vs correct behavior (diffusion path)?

Before this fix, regardless of whether the workload is text_to_image or image_to_text, it always uses the same hunyuan-image-3-moe.yaml, which is for AR.

With the fix, the user can explicitly pass --disable-default-stage-cfg; for the text_to_image workload it then ignores hunyuan-image-3-moe.yaml and generates the stage_config from Omni(**kwargs).

> Is there a reason disable_default_stage_cfg defaults to False? For Hunyuan users calling text_to_image.py, wouldn't they always want this enabled to get the diffusion path?

I don't want to change the default behavior: when a model has an associated yaml, we should load it.
=> I set disable_default_stage_cfg = False as the default, so it will try to load the default stage_config.
=> For a case like hunyuan-image-3-moe, which supports both text2image and image2text, the default goes to image2text, so we need '--disable-default-stage-cfg' to take the text2image path. As @Gaohan123 suggested, another way is to provide two stage files and tell the user which one to use. (But that means 4 new configs: GPU/ROCM/NPU/XPU.)

> The XPU config file adds 43 lines - is this a copy of the existing config with device changes, or new functionality?

Yes, this new config is for the image2text path.
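
The behavior described in this comment (the earlier revision of the PR, before the switch to `modes`) amounts to a two-way choice. A rough sketch under stated assumptions: every name other than `disable_default_stage_cfg` is hypothetical, and the layout of the generated stage is a placeholder, since the real generated config is not shown in this thread.

```python
from typing import Any, Optional


def resolve_stage_configs(
    default_stages: Optional[list[dict[str, Any]]],
    omni_kwargs: dict[str, Any],
    disable_default_stage_cfg: bool = False,
) -> list[dict[str, Any]]:
    """Pick between the model's default stage YAML and a kwargs-derived config."""
    if default_stages is not None and not disable_default_stage_cfg:
        # Default behavior: a model with an associated YAML uses it as-is.
        return default_stages
    # Otherwise synthesize a single stage from the caller's arguments
    # (placeholder layout, for illustration only).
    return [{"stage_id": 0, "engine_args": dict(omni_kwargs)}]


ar_yaml = [{"stage_id": 0, "stage_type": "llm"}]  # stands in for the default AR YAML
print(resolve_stage_configs(ar_yaml, {"tensor_parallel_size": 4}))
print(resolve_stage_configs(ar_yaml, {"tensor_parallel_size": 4}, disable_default_stage_cfg=True))
```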

gcanlin · 2026-03-13T04:15:13Z

Could we always make the diffusion part the default? I think vLLM-Omni should preferably provide multi-modality generation. If users want to run the AR-only part, they would need to pass a config themselves.

xuechendi · 2026-03-16T23:00:53Z

@Gaohan123 @hsliuustc0106, I switched to using a new argument in the yaml file to select the stages. Could you take a look?

xuechendi · 2026-03-17T15:15:15Z

@gcanlin, please take a look.

hsliuustc0106 · 2026-03-20T23:45:23Z

 # Stage 0: AR Model (vLLM implementation)

 # The following config has been verified on 8x L40S-48G GPU.
+modes:


Why do we need this mapping here? @Semmer2 @lishunyang12 @nussejzz PTAL

xuechendi · 2026-03-23

If we load both stages and let the workload decide which stage to use, device memory utilization doubles.
This PR suggests a simple fix: use modes to decide whether users want text-to-image or image-to-text.

I am thinking of a more aggressive fix that shares the same weights across stages. If that makes sense, I can start an RFC and have some discussion on that.

Gaohan123

LGTM. Thanks.
As a follow-up, I suggest adding a unit test to protect the filtering function. We can also discuss a better solution that combines modality control and stage filtering.
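
A unit test along the lines suggested here might look like the sketch below. The function under test, `filter_stage_ids`, is a compact stand-in with a hypothetical signature, since the real filtering function is not shown in this thread; the mode-to-stage mapping is the one from the PR description.

```python
import unittest


def filter_stage_ids(modes, all_stage_ids, mode=None):
    """Stand-in for the real filtering function (hypothetical signature)."""
    if mode is None:
        return list(all_stage_ids)
    lookup = {m["mode"]: m["stages"] for m in modes}
    return [i for i in all_stage_ids if i in lookup[mode]]


MODES = [
    {"mode": "text-to-image", "stages": [1]},
    {"mode": "image-to-text", "stages": [0]},
]


class FilterStagesTest(unittest.TestCase):
    def test_text_to_image_keeps_diffusion_stage(self):
        self.assertEqual(filter_stage_ids(MODES, [0, 1], "text-to-image"), [1])

    def test_image_to_text_keeps_ar_stage(self):
        self.assertEqual(filter_stage_ids(MODES, [0, 1], "image-to-text"), [0])

    def test_no_mode_keeps_all_stages(self):
        self.assertEqual(filter_stage_ids(MODES, [0, 1]), [0, 1])

    def test_unknown_mode_is_rejected(self):
        # A misspelled mode such as "text-to-images" should not pass silently.
        with self.assertRaises(KeyError):
            filter_stage_ids(MODES, [0, 1], "text-to-images")
```

The cases can be run with `python -m unittest` against the file that contains them.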

Bounty-hunter · 2026-03-28T01:23:39Z

Run with:

python -u examples/offline_inference/text_to_image/text_to_image.py \
   --model /home/models/tencent/HunyuanImage-3.0 \
   --prompt "A brown and white dog is running on the grass" \
   --output output_image_latest.png \
   --num-inference-steps 50 \
   --tensor-parallel-size 4 \
   --cfg-scale 4.0

It returns an error:

Orchestrator initialization failed: Stage 0 requires 8 devices, but only 4 devices are visible

It seems to use the yaml devices: "0,1,2,3,4,5,6,7" directly.

xuechendi · 2026-03-30T21:50:16Z

+    stage_type: diffusion
+    runtime:
+      process: true
+      devices: "0,1,2,3,4,5,6,7"


@Bounty-hunter, that is because the initial config uses all 8 cards.
If you want to use 4 cards, you need to manually update the devices here.
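
The mismatch can be illustrated with a small check that compares the `devices` string from the YAML against the number of visible devices. This is a sketch only: `parse_devices` and `check_stage_devices` are hypothetical helpers, not the orchestrator's real code, and the error text simply mirrors the message quoted above.

```python
def parse_devices(devices: str) -> list[int]:
    """Parse a YAML `devices` string such as "0,1,2,3" into device indices."""
    return [int(d) for d in devices.split(",") if d.strip()]


def check_stage_devices(stage_id: int, devices: str, visible_count: int) -> None:
    """Raise if a stage asks for more devices than are visible (hypothetical check)."""
    required = len(parse_devices(devices))
    if required > visible_count:
        raise RuntimeError(
            f"Stage {stage_id} requires {required} devices, "
            f"but only {visible_count} devices are visible"
        )


# With 4 visible devices, the shipped 8-device string fails...
try:
    check_stage_devices(0, "0,1,2,3,4,5,6,7", visible_count=4)
except RuntimeError as err:
    print(err)

# ...while a manually edited 4-device string passes.
check_stage_devices(0, "0,1,2,3", visible_count=4)
```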

xuechendi requested a review from hsliuustc0106 as a code owner March 12, 2026 01:06

xuechendi mentioned this pull request Mar 12, 2026

[Model] Extend NPU support for HunyuanImage3 Diffusion Model #1689

Merged

5 tasks

chatgpt-codex-connector Bot reviewed Mar 12, 2026

Comment thread vllm_omni/entrypoints/utils.py Outdated

yma11 reviewed Mar 12, 2026

Comment thread examples/offline_inference/text_to_image/text_to_image.py

Gaohan123 reviewed Mar 12, 2026

hsliuustc0106 reviewed Mar 12, 2026

xuechendi mentioned this pull request Mar 12, 2026

[RFC]: default stage config dispatch based on workload #1860

Open

1 task

fix AR path for xpu

8ce425f

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the hunyuan_image3_fix_default_config branch from 7da02cb to 8ce425f Compare March 16, 2026 21:14

xuechendi changed the title ~~[BugFIX] fix Hunyuan image3 text_to_image path~~ [BugFIX] enable Hunyuan image3 with stage selection among text_to_image/image_to_text Mar 16, 2026

xuechendi force-pushed the hunyuan_image3_fix_default_config branch from 2913934 to 8299459 Compare March 16, 2026 22:59

xuechendi force-pushed the hunyuan_image3_fix_default_config branch 2 times, most recently from 7372197 to 202be1f Compare March 16, 2026 23:27

Enable a new config - mode - to decide stage selection

0b72f26

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the hunyuan_image3_fix_default_config branch from 202be1f to 0b72f26 Compare March 17, 2026 01:03

Merge branch 'main' into hunyuan_image3_fix_default_config

30571a7

Gaohan123 added this to the v0.18.0 milestone Mar 19, 2026

xuechendi mentioned this pull request Mar 20, 2026

[FP8] enable hunyuan-image-3 diffusion model with fp8 online quant #1935

Merged

5 tasks

xuechendi added 2 commits March 20, 2026 18:43

Merge remote-tracking branch 'origin/main' into hunyuan_image3_fix_de…

16f50b8

…fault_config Signed-off-by: Chendi Xue <chendi.xue@intel.com>

update config to work with vllm-project#1935

c300c6d

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the hunyuan_image3_fix_default_config branch from be8a3b6 to c300c6d Compare March 20, 2026 19:39

hsliuustc0106 reviewed Mar 20, 2026

Gaohan123 added the ready label (label to trigger buildkite CI) Mar 24, 2026

Merge branch 'main' into hunyuan_image3_fix_default_config

76c691e

Merge branch 'main' into hunyuan_image3_fix_default_config

f1332e1

Gaohan123 approved these changes Mar 24, 2026

Gaohan123 merged commit d71c9ae into vllm-project:main Mar 24, 2026
8 checks passed

xiaohajiayou mentioned this pull request Mar 26, 2026

[Bugfix] Fix precedence between caller runtime args and default stage configs #2076

Merged

zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026

[BugFIX] enable Hunyuan image3 with stage selection among text_to_ima…

80e42cc

…ge/image_to_text (vllm-project#1826) Signed-off-by: Chendi Xue <chendi.xue@intel.com>

zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026

[BugFIX] enable Hunyuan image3 with stage selection among text_to_ima…

6b00f74

…ge/image_to_text (vllm-project#1826) Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Bounty-hunter mentioned this pull request Mar 28, 2026

[BugFix]config priority fix #2289

Closed

5 tasks

xuechendi commented Mar 30, 2026

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[BugFIX] enable Hunyuan image3 with stage selection among text_to_ima…

d9bf47a

…ge/image_to_text (vllm-project#1826) Signed-off-by: Chendi Xue <chendi.xue@intel.com>
