[BugFIX] enable Hunyuan image3 with stage selection among text_to_image/image_to_text#1826

Merged
Gaohan123 merged 7 commits into
vllm-project:mainfrom
xuechendi:hunyuan_image3_fix_default_config
Mar 24, 2026
Conversation

@xuechendi
Contributor

@xuechendi xuechendi commented Mar 12, 2026

Purpose

After #759 landed, hunyuan-image3 always seems to pick the AR path instead of the diffusion path, so final_output_type ends up as "text".

This PR proposes to:

  1. Enable a new config key, modes, in Hunyuan-image-3-moe.yaml:
modes:
  - mode: text-to-image
    stages: [1]
  - mode: image-to-text
    stages: [0]
  2. Add a filter_stages method that is applied when loading stage YAML files.
    Pseudocode (mode is the value passed as Omni(..., mode=...)):
modes = {m['mode']: m['stages'] for m in stage_configs['modes']}
if mode in modes:      # 'text-to-image' or 'image-to-text'
    filter_stages = modes[mode]
else:
    filter_stages = all_stages
  3. Update the examples text_to_image.py and image_to_text.py with a --mode argument; if --mode is not given, text-to-image is used.
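The stage filtering described in the list above can be sketched in Python as follows. This is a minimal illustration, not the code added by this PR: `select_stages` is a hypothetical helper, the `stage_args`, `stage_id`, and `name` keys are assumptions, and the config is shown as an already-parsed dict instead of being loaded from the YAML file.

```python
from typing import Any, Optional


def select_stages(stage_config: dict[str, Any], mode: Optional[str] = None) -> list[dict[str, Any]]:
    """Return the stage entries enabled for `mode`.

    Hypothetical helper. Assumes `stage_config["modes"]` is a list of
    {"mode": <name>, "stages": [<stage_id>, ...]} entries and that every
    stage entry carries a "stage_id" key. If `mode` is None or not listed,
    all stages are kept, mirroring the `else` branch of the pseudocode.
    """
    all_stages = list(stage_config.get("stage_args", []))
    modes = {m["mode"]: m["stages"] for m in stage_config.get("modes", [])}
    wanted = modes.get(mode)
    if wanted is None:
        return all_stages
    return [s for s in all_stages if s["stage_id"] in set(wanted)]


# Parsed form of the `modes` block, plus two placeholder stage entries.
config = {
    "modes": [
        {"mode": "text-to-image", "stages": [1]},
        {"mode": "image-to-text", "stages": [0]},
    ],
    "stage_args": [
        {"stage_id": 0, "name": "AR"},          # placeholder entry
        {"stage_id": 1, "name": "diffusion"},   # placeholder entry
    ],
}

print([s["stage_id"] for s in select_stages(config, "text-to-image")])  # [1]
```

Filtering before the stages are instantiated means only the selected stage is loaded, so a text-to-image run does not also pay for the AR stage.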

Test Plan

text-to-image path

python -u examples/offline_inference/text_to_image/text_to_image.py   \
   --model /mnt/data/tencent/HunyuanImage-3.0/ \
   --prompt "A brown and white dog is running on the grass" \
   --output output_image_latest.png \
   --num-inference-steps 50 \
   --tensor-parallel-size 4 \
   --cfg-scale 4.0 \
   --enforce-eager 2>&1 | tee hunyuan-xpu-latest.log
image

image-to-text path

# --image is the dog image generated above
python examples/offline_inference/hunyuan_image3/image_to_text.py \
  --image output_image_latest.png \
  --prompt "describe this image" \
  --model /mnt/data/tencent--HunyuanImage-3.0/

output

Prompt: <|startoftext|>You are an assistant that understands images and outputs text.<img>describe this image

Text:  The image is a close-up photograph of a tiger's face, capturing the intricate details of its fur and features with a shallow depth of field. The focus is sharp on the tiger's eyes, which are a striking yellow color with black stripes extending from them. The fur around the eyes is white, contrasting with the orange and black patterns that cover the rest of the face. The nose is pinkish-red, surrounded by white fur that extends down to the chin and throat. Long, stiff whiskers protrude from the muzzle area. The background is completely out of focus, presenting a blend of green and brown hues that suggest a natural, possibly forested environment. There are no discernible man-made objects or other subjects in the frame. The image does not contain any text.

Test Result


@xuechendi
Contributor Author

@gcanlin @hsliuustc0106 @usberkeley
Could you please take a look?


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7da02cbbef

Comment thread vllm_omni/entrypoints/utils.py Outdated
Comment thread examples/offline_inference/text_to_image/text_to_image.py
Collaborator

@Gaohan123 Gaohan123 left a comment

Why can't we just load a custom config directly?

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Summary

Fix for the Hunyuan Image3 text-to-image path: adds a --disable-default-stage-cfg flag to prevent the default config loading that was causing the AR path to be selected instead of the diffusion path.

Validated

  • ✅ DCO signed
  • ✅ All CI checks passed
  • ✅ Includes XPU stage config for hunyuan_image_3_moe
  • ✅ Small fix for XPU path in autoencoder_kl_3d.py

Questions

  1. The PR description mentions this is a fix, but the test plan section is empty. Can you add:

    • Before/after comparison showing the wrong behavior (AR path) vs correct behavior (diffusion path)?
    • Sample command and output demonstrating the fix works?
  2. Is there a reason disable_default_stage_cfg defaults to False? For Hunyuan users calling text_to_image.py, wouldn't they always want this enabled to get the diffusion path?

  3. The XPU config file adds 43 lines - is this a copy of the existing config with device changes, or new functionality?


Minor clarification needed before approval.

@xuechendi
Contributor Author

Why can't we just load a custom config directly?

That also works, but it is a little more complicated because the user needs to provide a diffusion version of hunyuan-image-3-moe.
Since one model might provide both AR and diffusion, I also think it would be worth adding config selection based on the workload. I'll start an RFC for that.

Meanwhile, this is a quick fix for users who don't want to provide their own config but still want to use Omni args to run text_to_image.

@xuechendi
Contributor Author

@hsliuustc0106 @yma11

  1. Before/after comparison showing the wrong behavior (AR path) vs correct behavior (diffusion path)?

    Before this fix, regardless of whether the workload is text_to_image or image_to_text, the same hunyuan-image-3-moe.yaml (AR) is always used.

    With the fix, the user can explicitly pass --disable-default-stage-cfg; for a text_to_image workload, hunyuan-image-3-moe.yaml is then ignored and the stage_config is generated from Omni(**kwargs).

  2. Is there a reason disable_default_stage_cfg defaults to False? For Hunyuan users calling text_to_image.py, wouldn't they always want this enabled to get the diffusion path?

    I don't want to change the default behavior: when a model has an associated YAML file, we should load it.
    => I set disable_default_stage_cfg = False by default, so the default stage_config is loaded.
    => For a model like hunyuan-image-3-moe that supports both text2image and image2text, the default goes to image2text, so '--disable-default-stage-cfg' is needed to take the text2image path. As @Gaohan123 suggested, another way is to provide two stage files and tell the user which one to use (but that means 4 new configs: GPU/ROCM/NPU/XPU).

  3. The XPU config file adds 43 lines - is this a copy of the existing config with device changes, or new functionality?

    Yes, this new config is for the image2text path.
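The flag-based behavior described in this comment (later replaced by the modes key in the YAML file) amounts to the decision sketched below. This is only an illustration under stated assumptions: `resolve_stage_config`, `load_default_yaml`, and `build_from_kwargs` are hypothetical names, not vLLM-Omni functions.

```python
from typing import Any, Callable, Optional


def resolve_stage_config(
    default_yaml: Optional[str],
    disable_default_stage_cfg: bool,
    load_default_yaml: Callable[[str], dict[str, Any]],
    build_from_kwargs: Callable[[], dict[str, Any]],
) -> dict[str, Any]:
    """Pick where the stage config comes from.

    Hypothetical sketch: the model's default YAML is used unless the caller
    opts out or the model has no associated YAML file.
    """
    if default_yaml is not None and not disable_default_stage_cfg:
        return load_default_yaml(default_yaml)
    return build_from_kwargs()


# Stand-in loaders so the sketch runs without any files on disk.
def fake_load(path: str) -> dict[str, Any]:
    return {"source": "yaml", "path": path}


def fake_build() -> dict[str, Any]:
    return {"source": "kwargs"}


default = resolve_stage_config("hunyuan-image-3-moe.yaml", False, fake_load, fake_build)
opted_out = resolve_stage_config("hunyuan-image-3-moe.yaml", True, fake_load, fake_build)
print(default["source"], opted_out["source"])  # yaml kwargs
```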

@gcanlin
Collaborator

gcanlin commented Mar 13, 2026

Could we always make the diffusion part the default? I think vLLM-Omni should preferably provide multi-modality generation. If users want to run the AR-only part, they can pass a config themselves.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the hunyuan_image3_fix_default_config branch from 7da02cb to 8ce425f Compare March 16, 2026 21:14
@xuechendi xuechendi changed the title [BugFIX] fix Hunyuan image3 text_to_image path [BugFIX] enable Hunyuan image3 with stage selection among text_to_image/image_to_text Mar 16, 2026
@xuechendi xuechendi force-pushed the hunyuan_image3_fix_default_config branch from 2913934 to 8299459 Compare March 16, 2026 22:59
@xuechendi
Contributor Author

@Gaohan123 @hsliuustc0106, I switched to using a new argument in the YAML file to select the stages. Could you please review?

@xuechendi xuechendi force-pushed the hunyuan_image3_fix_default_config branch 2 times, most recently from 7372197 to 202be1f Compare March 16, 2026 23:27
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the hunyuan_image3_fix_default_config branch from 202be1f to 0b72f26 Compare March 17, 2026 01:03
@xuechendi
Contributor Author

@gcanlin, please take a look.

…fault_config

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the hunyuan_image3_fix_default_config branch from be8a3b6 to c300c6d Compare March 20, 2026 19:39
# Stage 0: AR Model (vLLM implementation)

# The following config has been verified on 8x L40S-48G GPU.
modes:
Collaborator

Why do we need this mapping here? @Semmer2 @lishunyang12 @nussejzz PTAL

Contributor Author

If we load both stages and let the workload decide which stage to use, device memory utilization doubles.
This PR suggests a simple fix: use modes to decide whether the user wants text-to-image or image-to-text.

I am also considering a more aggressive fix that shares the same weights across stages. If that makes sense, I can open an RFC so we can discuss it.

@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Mar 24, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM. Thanks.
As a follow-up, I suggest adding a UT to protect the filtering function. We can also discuss a better solution that combines modality control and stage filtering.
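A unit test along the lines suggested here could look like the sketch below. `filter_stages` is a stand-in defined inline so the example runs on its own; the function added by this PR may have a different name, signature, and return type, and the `stage_args` and `stage_id` keys are assumptions.

```python
from typing import Any, Optional


def filter_stages(stage_config: dict[str, Any], mode: Optional[str]) -> list[int]:
    """Stand-in for the function under test: stage ids enabled for `mode`."""
    all_ids = [s["stage_id"] for s in stage_config["stage_args"]]
    for entry in stage_config.get("modes", []):
        if entry["mode"] == mode:
            return [i for i in all_ids if i in entry["stages"]]
    return all_ids  # no mode given, or mode not listed: keep every stage


CONFIG = {
    "modes": [
        {"mode": "text-to-image", "stages": [1]},
        {"mode": "image-to-text", "stages": [0]},
    ],
    "stage_args": [{"stage_id": 0}, {"stage_id": 1}],
}


# pytest-style tests: plain functions with bare asserts.
def test_text_to_image_keeps_only_stage_1():
    assert filter_stages(CONFIG, "text-to-image") == [1]


def test_image_to_text_keeps_only_stage_0():
    assert filter_stages(CONFIG, "image-to-text") == [0]


def test_missing_mode_keeps_all_stages():
    assert filter_stages(CONFIG, None) == [0, 1]


def test_config_without_modes_keeps_all_stages():
    assert filter_stages({"stage_args": [{"stage_id": 0}]}, "text-to-image") == [0]
```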

@Gaohan123 Gaohan123 merged commit d71c9ae into vllm-project:main Mar 24, 2026
8 checks passed
zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026
…ge/image_to_text (vllm-project#1826)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@Bounty-hunter
Contributor

Bounty-hunter commented Mar 28, 2026

run with:

 python -u examples/offline_inference/text_to_image/text_to_image.py      --model /home/models/tencent/HunyuanImage-3.0    --prompt "A brown and white dog is running on the grass"    --output output_image_latest.png    --num-inference-steps 50    --tensor-parallel-size 4    --cfg-scale 4.0

It returns an error:

Orchestrator initialization failed: Stage 0 requires 8 devices, but only 4 devices are visible

It seems to use the YAML setting devices: "0,1,2,3,4,5,6,7" directly.

@Bounty-hunter Bounty-hunter mentioned this pull request Mar 28, 2026
5 tasks
stage_type: diffusion
runtime:
  process: true
  devices: "0,1,2,3,4,5,6,7"
Contributor Author

@Bounty-hunter, that is because the initial config uses all 8 cards.
If you want to use 4 cards, you need to manually update the devices setting here.
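The mismatch reported above (a stage listing 8 devices when only 4 are visible) can be caught, or worked around, before launch. The helpers below are a hypothetical sketch: none of them exist in vLLM-Omni, and the orchestrator's actual check may differ.

```python
def parse_devices(devices: str) -> list[int]:
    """Parse a YAML devices string such as "0,1,2,3" into a list of ids."""
    return [int(d) for d in devices.split(",") if d.strip()]


def check_devices(devices: str, visible: int) -> None:
    """Raise if the stage asks for more devices than are visible."""
    required = len(parse_devices(devices))
    if required > visible:
        raise RuntimeError(
            f"Stage requires {required} devices, but only {visible} devices are visible"
        )


def trim_devices(devices: str, count: int) -> str:
    """Keep only the first `count` device ids, e.g. to match a tensor-parallel size of 4."""
    return ",".join(str(d) for d in parse_devices(devices)[:count])


yaml_devices = "0,1,2,3,4,5,6,7"
four_cards = trim_devices(yaml_devices, 4)
check_devices(four_cards, visible=4)  # passes: 4 required, 4 visible
print(four_cards)  # 0,1,2,3
```

The trimmed string is what would be written back into the devices field of the stage config when running on 4 cards.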

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…ge/image_to_text (vllm-project#1826)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>