[BugFIX] enable Hunyuan image3 with stage selection among text_to_image/image_to_text#1826
Conversation
|
@gcanlin @hsliuustc0106 @usberkeley |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7da02cbbef
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Gaohan123
left a comment
There was a problem hiding this comment.
Why can't we just load a custom config directly?
hsliuustc0106
left a comment
There was a problem hiding this comment.
Summary
Fix for Hunyuan Image3 text-to-image path by adding --disable-default-stage-cfg flag to prevent default config loading that was causing AR path selection instead of diffusion path.
Validated
- ✅ DCO signed
- ✅ All CI checks passed
- ✅ Includes XPU stage config for hunyuan_image_3_moe
- ✅ Small fix for XPU path in
autoencoder_kl_3d.py
Questions
-
The PR description mentions this is a fix, but the test plan section is empty. Can you add:
- Before/after comparison showing the wrong behavior (AR path) vs correct behavior (diffusion path)?
- Sample command and output demonstrating the fix works?
-
Is there a reason
disable_default_stage_cfgdefaults toFalse? For Hunyuan users callingtext_to_image.py, wouldn't they always want this enabled to get the diffusion path? -
The XPU config file adds 43 lines - is this a copy of the existing config with device changes, or new functionality?
Minor clarification needed before approval.
Also works, but will be a little complicated since user need to provide a diffusion version of hunyuan-image-3-moe. Meanwhile, this will be quick fix for user who don't want to provide their own config but still use OMNI args to run text_to_image. |
yes, this new config is for image2text path. |
|
Could we always make diffusion part default? I think vLLM-Omni should provide the multi-modality generation preferably. And if users want to run AR-only part, they need to pass config by themselves. |
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
7da02cb to
8ce425f
Compare
2913934 to
8299459
Compare
|
@Gaohan123 @hsliuustc0106 , I switched to use a new argument in yaml file to select. May you take a review |
7372197 to
202be1f
Compare
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
202be1f to
0b72f26
Compare
|
@gcanlin , please take a review |
…fault_config Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
be8a3b6 to
c300c6d
Compare
| # Stage 0: AR Model (vLLM implementation) | ||
|
|
||
| # The following config has been verified on 8x L40S-48G GPU. | ||
| modes: |
There was a problem hiding this comment.
why we need this mapping here? @Semmer2 @lishunyang12 @nussejzz PTAL
There was a problem hiding this comment.
If we load both stages and let workload to decide which stage to go.
Device memory utilization gets doubled.
This PR suggested a simple fix by using modes to decide if uses want to go text-to-image / image-to-text.
I am thinking a more aggressive fix by sharing same weight for different stages, if that makes sense, I can init a RFC and have some discussion on that?
Gaohan123
left a comment
There was a problem hiding this comment.
LGTM. Thanks
Later, I suggest you can supplement a UT to protect the filtering function. And we can discuss a better solution to combine modality control and stage filtering.
…ge/image_to_text (vllm-project#1826) Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…ge/image_to_text (vllm-project#1826) Signed-off-by: Chendi Xue <chendi.xue@intel.com>
|
run with: it return an error: It seem to use the yaml |
| stage_type: diffusion | ||
| runtime: | ||
| process: true | ||
| devices: "0,1,2,3,4,5,6,7" |
There was a problem hiding this comment.
@Bounty-hunter , that is because of the initial config is using all 8 cards.
If you want to use 4 cards, need to manual update here.
…ge/image_to_text (vllm-project#1826) Signed-off-by: Chendi Xue <chendi.xue@intel.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
After #759 landed,
hunyuan-image3 seems always pick AR path instead of diffusion path => Leads to final_output_type is "text"
This PR proposed
pseudo codes as below
text-to-imageTest Plan
text-to-image path
image-to-text path
output
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)