[Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup #3444
Merged: Gaohan123 merged 44 commits into vllm-project:main from TaffyOfficial:wt-hunyuan3-it2i-multi-image on May 14, 2026
Commits (44):
0f2ee2d [Feature] HunyuanImage-3.0 IT2I: support multi-image input (zuiho-kai)
46b3b84 [Refactor] HunyuanImage-3.0 prompt_utils: split task and bot_task (zuiho-kai)
f4d76d5 [Feature] HunyuanImage-3.0 IT2I: wire multi-image through online serving
c18f016 [Bugfix] HunyuanImage-3.0 ar2diffusion: honor AR-predicted output ratio
c5f2f9b [Chore] HunyuanImage-3.0 end2end: accept internal task names as --mod…
2ff92b7 feat(end2end): semantic output shape for multi-image IT2I (skf-1999)
74e5cac [Chore] Apply pre-commit formatting fixes (zuiho-kai)
d7400dc fix(hunyuan_image3): honor ar2diffusion's predicted shape in pre_proc… (zuiho-kai)
d7c760e refactor(end2end): drop multi-image regex shape heuristic (zuiho-kai)
2175a99 fix(hunyuan_image3): add official extra resolution buckets (idx 33-36) (zuiho-kai)
4aaa772 fix(hunyuan_image3): default cond image preprocessing to resize-stretch (zuiho-kai)
d0c2acb fix(hunyuan_image3): use real <timestep> token id at scaffold slot (zuiho-kai)
f83c281 fix(hunyuan_image3): include <joint_img_sep> in per-image MM region (zuiho-kai)
b7c968b fix(hunyuan_image3): pass extra resolutions to DiT-side reso_group (zuiho-kai)
3b73eab fix(hunyuan_image3 ar2diffusion): truncate AR cot_text at </recaption… (zuiho-kai)
2847839 chore(hunyuan_image3): apply ruff format (zuiho-kai)
3b4f885 fix(hunyuan_image3): online IT2I multi-image and AR bucket override
ca830c8 fix(hunyuan_image3): online IT2I HF byte-equivalent prompt path
c2ea079 fix(hunyuan_image3): align DiT tokenization with AR-sampled token IDs
1454f44 fix(hunyuan_image3): split task / bot_task / sys_type at /v1/images/e…
99c5eec fix(hunyuan_image3): align online edit AR input with offline path
4d8c600 fix(hunyuan_image3): address PR #3444 review feedback (zuiho-kai)
3298517 chore: appease ruff F841 / typos / ruff-format pre-commit (zuiho-kai)
808aca0 fix(hunyuan_image3): align AR cond image preprocessing with DiT (cent… (zuiho-kai)
297a2f5 test(hunyuan_image3): apply ruff format hook fixes (zuiho-kai)
4cf71f2 fix(hunyuan_image3): preserve legacy plain prompt tasks (zuiho-kai)
cf7e4a2 fix(hunyuan_image3): align prompt token tests with result API (zuiho-kai)
4fb78a3 fix(hunyuan_image3): harden edit bridge compatibility (zuiho-kai)
38668a6 revert(hunyuan_image3): roll cond preprocessing back to magnet_repro … (zuiho-kai)
9bc67cc fix(hunyuan_image3): stop AR on <|endoftext|> for image-output tasks (zuiho-kai)
dec1c43 [Bugfix][HunyuanImage3] cap AR KV snapshot at </recaption>, defer mid… (zuiho-kai)
b84bc2f fix(hunyuan_image3): cap IT2I input images at MAX_IMAGES_PER_REQUEST … (zuiho-kai)
029f567 chore: apply pre-commit ruff format / isort fixups (zuiho-kai)
d8b9263 chore: rename MAX_IMAGES_PER_REQUEST alias to uppercase (ruff N811) (zuiho-kai)
511b76c fix(hunyuan_image3): align AR stop / KV cap / edits Form with upstrea… (zuiho-kai)
8d90c17 chore: apply pre-commit isort split for resolve_stop_token_ids import (zuiho-kai)
b73b00f chore(hunyuan_image3): drop dead cot_token_ids plumbing and online ta… (zuiho-kai)
8d12ddd chore: apply ruff-format fixup for cot_text_list comprehension (zuiho-kai)
bfd17b3 chore: keep for-loop one-line in apply_chat_template (no spurious diff) (zuiho-kai)
1de9ec8 test: rename test_hunyuan_image3.py to avoid pytest basename collision (zuiho-kai)
58ce6d8 fix(hunyuan_image3): mark AR stage is_comprehension=true so online IT… (zuiho-kai)
be0c684 chore(hunyuan_image3): drop redundant hunyuan-specific task/stop logi… (zuiho-kai)
161ba50 test(hunyuan_image3): drop legacy task-as-bot_task tests after servin… (zuiho-kai)
b5b4d71 Merge branch 'main' into wt-hunyuan3-it2i-multi-image (TaffyOfficial)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,16 +1,5 @@ | ||
| """ | ||
| HunyuanImage-3.0-Instruct unified end-to-end inference script. | ||
|
|
||
| Supports all modalities through a single entry point: | ||
| - text2img: Text → AR → DiT → Image | ||
| - img2img: Text+Image → AR → DiT → Edited Image (IT2I) | ||
| - img2text: Image+Text → AR → Text description (I2T) | ||
| - text2text: Text → AR → Text (comprehension, no image) | ||
|
|
||
| Usage: | ||
| python end2end.py --modality text2img --prompts "A cute cat" | ||
| python end2end.py --modality img2img --image-path input.png --prompts "Make it snowy" | ||
| python end2end.py --modality img2text --image-path input.png --prompts "Describe this image" | ||
| """ | ||
|
|
||
| import argparse | ||
|
|
@@ -19,18 +8,25 @@ | |
| from pathlib import Path | ||
|
|
||
| from vllm_omni.diffusion.models.hunyuan_image3.prompt_utils import ( | ||
| _TASK_PRESETS, | ||
| MAX_IMAGES_PER_REQUEST, | ||
| build_prompt_tokens, | ||
| resolve_stop_token_ids, | ||
| resolve_sys_type, | ||
| ) | ||
| from vllm_omni.entrypoints.omni import Omni | ||
| from vllm_omni.inputs.data import OmniPromptType | ||
|
|
||
| # Default deploy configs are absolute so this example works from any cwd. | ||
| _REPO_ROOT = Path(__file__).resolve().parents[3] | ||
| _DEFAULT_DEPLOY_CONFIG = str(_REPO_ROOT / "vllm_omni" / "deploy" / "hunyuan_image3.yaml") | ||
| _DEFAULT_AR_DEPLOY_CONFIG = str(_REPO_ROOT / "vllm_omni" / "deploy" / "hunyuan_image3_ar.yaml") | ||
|
|
||
| _MODALITY_TASK_MAP: dict[str, tuple[str, str | None]] = { | ||
| "text2img": ("t2i", "think"), | ||
| "img2img": ("it2i", "think"), | ||
| "img2text": ("i2t", None), | ||
| "text2text": ("t2t", None), | ||
| } | ||
|
|
||
| _MODALITY_DEFAULT_DEPLOY_CONFIG = { | ||
| "text2img": _DEFAULT_DEPLOY_CONFIG, | ||
| "img2img": _DEFAULT_DEPLOY_CONFIG, | ||
|
|
@@ -45,73 +41,37 @@ | |
| "text2text": "text-to-text", | ||
| } | ||
|
|
||
| _MODALITY_TASK_MAP = { | ||
| "text2img": "t2i", | ||
| "img2img": "it2i", | ||
| "img2text": "i2t", | ||
| "text2text": "t2t", | ||
| } | ||
|
|
||
|
|
||
| def parse_args(): | ||
| parser = argparse.ArgumentParser(description="HunyuanImage-3.0-Instruct end-to-end inference.") | ||
| parser.add_argument( | ||
| "--model", | ||
| default="tencent/HunyuanImage-3.0-Instruct", | ||
| help="Model name or local path.", | ||
| ) | ||
| parser.add_argument("--model", default="tencent/HunyuanImage-3.0-Instruct", help="Model name or local path.") | ||
| parser.add_argument( | ||
| "--modality", | ||
| default="text2img", | ||
| choices=["text2img", "img2img", "img2text", "text2text"], | ||
| help="Modality mode to control stage execution.", | ||
| choices=list(_MODALITY_TASK_MAP), | ||
| ) | ||
| parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.") | ||
| parser.add_argument( | ||
| "--image-path", | ||
| type=str, | ||
| default=None, | ||
| help="Path to input image (for img2img/img2text).", | ||
| ) | ||
| parser.add_argument( | ||
| "--output", | ||
| type=str, | ||
| default=".", | ||
| help="Output directory to save results.", | ||
| help="Input image path(s) for img2img/img2text. Comma-separated for multi-image (up to 3).", | ||
| ) | ||
|
|
||
| # Generation parameters | ||
| parser.add_argument("--output", type=str, default=".", help="Output directory to save results.") | ||
| parser.add_argument("--steps", type=int, default=50, help="Number of inference steps.") | ||
| parser.add_argument("--guidance-scale", type=float, default=5.0, help="Classifier-free guidance scale.") | ||
| parser.add_argument("--seed", type=int, default=42, help="Random seed.") | ||
| parser.add_argument("--height", type=int, default=1024, help="Output image height.") | ||
| parser.add_argument("--width", type=int, default=1024, help="Output image width.") | ||
| parser.add_argument( | ||
| "--vae-use-tiling", | ||
| action="store_true", | ||
| help="Enable VAE tiling for memory optimization.", | ||
| ) | ||
|
|
||
| # Prompt configuration | ||
| parser.add_argument("--vae-use-tiling", action="store_true", help="Enable VAE tiling.") | ||
| parser.add_argument( | ||
| "--bot-task", | ||
| type=str, | ||
| default="auto", | ||
| choices=["auto", "think", "recaption", "think_recaption", "vanilla"], | ||
| help=( | ||
| "Prompt behavior. 'auto' selects the default for the modality; " | ||
| "'think' adds <think>; 'recaption' adds <recaption>; " | ||
| "'vanilla' uses the t2i pretrain template." | ||
| ), | ||
| ) | ||
| parser.add_argument( | ||
| "--sys-type", | ||
| type=str, | ||
| default=None, | ||
| help="Override system prompt type (e.g. en_unified, en_vanilla).", | ||
| choices=["none", "think", "recaption", "think_recaption", "vanilla"], | ||
| help="Override prompt mode. Default: auto from --modality.", | ||
| ) | ||
|
|
||
| # Omni init args | ||
| parser.add_argument("--sys-type", type=str, default=None, help="Override system prompt type.") | ||
| parser.add_argument("--deploy-config", type=str, default=None, help="Custom deploy YAML path.") | ||
| parser.add_argument("--stage-configs-path", type=str, default=None, help="Custom legacy stage config YAML path.") | ||
| parser.add_argument("--log-stats", action="store_true", default=False) | ||
|
|
@@ -157,22 +117,13 @@ def main(): | |
| os.makedirs(args.output, exist_ok=True) | ||
| additional_config = parse_additional_config(args.additional_config) | ||
|
|
||
| # Determine task for prompt formatting from modality + bot behavior. | ||
| task = _MODALITY_TASK_MAP[args.modality] | ||
| assert task is not None | ||
| bot_task = args.bot_task | ||
| if bot_task != "auto": | ||
| task = task + "_" + bot_task | ||
| if task not in _TASK_PRESETS: | ||
| valid_bot_tasks = { | ||
| "text2img": ["think", "recaption", "vanilla"], | ||
| "img2img": ["think", "recaption", "think_recaption"], | ||
| "img2text": ["auto"], | ||
| "text2text": ["auto"], | ||
| }[args.modality] | ||
| raise ValueError( | ||
| f"--bot-task {bot_task!r} is not supported for {args.modality}. Choose from: {valid_bot_tasks}" | ||
| ) | ||
| task, default_bot_task = _MODALITY_TASK_MAP[args.modality] | ||
| if args.bot_task is None: | ||
| bot_task: str | None = default_bot_task | ||
| elif args.bot_task == "none": | ||
| bot_task = None | ||
| else: | ||
| bot_task = args.bot_task | ||
|
|
||
| if args.deploy_config is not None and args.stage_configs_path is not None: | ||
| raise ValueError("--deploy-config and --stage-configs-path are mutually exclusive.") | ||
|
|
@@ -182,13 +133,13 @@ def main(): | |
| if deploy_config is None and stage_configs_path is None: | ||
| deploy_config = _MODALITY_DEFAULT_DEPLOY_CONFIG[args.modality] | ||
|
|
||
| # Build Omni | ||
| omni_kwargs = { | ||
| "model": args.model, | ||
| "vae_use_tiling": args.vae_use_tiling, | ||
| "log_stats": args.log_stats, | ||
| "init_timeout": args.init_timeout, | ||
| "enforce_eager": args.enforce_eager, | ||
| "mode": _MODALITY_MODE[args.modality], | ||
| } | ||
|
|
||
| if additional_config is not None: | ||
|
|
@@ -197,85 +148,80 @@ def main(): | |
| omni_kwargs["deploy_config"] = deploy_config | ||
| else: | ||
| omni_kwargs["stage_configs_path"] = stage_configs_path | ||
| omni_kwargs["mode"] = _MODALITY_MODE[args.modality] | ||
|
|
||
| omni = Omni(**omni_kwargs) | ||
|
|
||
| # Prepare prompts | ||
| prompts = args.prompts or ["A cute cat"] | ||
| if not prompts: | ||
| print("[Info] No prompts provided, using default.") | ||
| prompts = ["A cute cat"] | ||
|
|
||
| # Load image if needed | ||
| input_image = None | ||
| input_images: list = [] | ||
| if args.modality in ("img2img", "img2text"): | ||
| if not args.image_path or not os.path.exists(args.image_path): | ||
| if not args.image_path: | ||
| raise ValueError(f"--image-path required for {args.modality}, got: {args.image_path}") | ||
| from PIL import Image | ||
|
|
||
| input_image = Image.open(args.image_path).convert("RGB") | ||
| image_paths = [p.strip() for p in args.image_path.split(",") if p.strip()] | ||
| if len(image_paths) > MAX_IMAGES_PER_REQUEST: | ||
| raise ValueError( | ||
| f"--image-path accepts at most {MAX_IMAGES_PER_REQUEST} images for " | ||
| f"HunyuanImage-3.0 IT2I, got {len(image_paths)}: {args.image_path}" | ||
| ) | ||
| for image_path in image_paths: | ||
| if not os.path.exists(image_path): | ||
| raise ValueError(f"Image path does not exist: {image_path}") | ||
| input_images.append(Image.open(image_path).convert("RGB")) | ||
| if not input_images: | ||
| raise ValueError(f"--image-path produced no usable paths: {args.image_path!r}") | ||
|
|
||
| # Load tokenizer for segment-wise prompt tokenization (matches HF | ||
| # apply_chat_template byte-for-byte; see build_prompt_tokens docstring). | ||
| from transformers import AutoTokenizer | ||
|
|
||
| tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True) | ||
| mm_image_payload = (input_images[0] if len(input_images) == 1 else input_images) if input_images else None | ||
|
|
||
| # Format prompts | ||
| formatted_prompts: list[OmniPromptType] = [] | ||
|
Contributor
Should we explicitly enforce an early upper limit on the number of images? |
||
| for p in prompts: | ||
| result = build_prompt_tokens(p, tokenizer, task=task, sys_type=args.sys_type) | ||
| for prompt in prompts: | ||
| build_kwargs: dict = {"task": task, "bot_task": bot_task, "sys_type": args.sys_type} | ||
| if input_images: | ||
| build_kwargs["num_images"] = len(input_images) | ||
| result = build_prompt_tokens(prompt, tokenizer, **build_kwargs) | ||
| token_ids = result.token_ids | ||
| effective_sys_type = result.system_prompt_type | ||
| effective_sys_type = args.sys_type or resolve_sys_type(bot_task) | ||
|
|
||
| # `prompt_token_ids` drives the AR stage (matches HF byte-for-byte). | ||
| # `prompt` and `use_system_prompt` are forwarded by ar2diffusion to | ||
| # the DiT stage so the diffusion pipeline can rebuild the same | ||
| # system prefix when constructing its model inputs. | ||
| prompt_dict: dict = { | ||
| "prompt_token_ids": token_ids, | ||
| "prompt": p, | ||
| "prompt": prompt, | ||
| "use_system_prompt": effective_sys_type, | ||
| } | ||
|
|
||
| if args.modality == "text2img": | ||
| prompt_dict["modalities"] = ["image"] | ||
| elif args.modality == "img2img": | ||
| prompt_dict["modalities"] = ["image"] | ||
| prompt_dict["multi_modal_data"] = {"image": input_image} | ||
| prompt_dict["height"] = input_image.height | ||
| prompt_dict["width"] = input_image.width | ||
| prompt_dict["multi_modal_data"] = {"image": mm_image_payload} | ||
| prompt_dict["height"] = input_images[0].height | ||
| prompt_dict["width"] = input_images[0].width | ||
| elif args.modality == "img2text": | ||
| prompt_dict["modalities"] = ["text"] | ||
| prompt_dict["multi_modal_data"] = {"image": input_image} | ||
| elif args.modality == "text2text": | ||
| prompt_dict["multi_modal_data"] = {"image": mm_image_payload} | ||
| else: | ||
| prompt_dict["modalities"] = ["text"] | ||
|
|
||
| formatted_prompts.append(prompt_dict) | ||
|
|
||
| # Build sampling params from defaults | ||
| params_list = list(omni.default_sampling_params_list) | ||
|
|
||
| # Override diffusion params if applicable | ||
| from vllm_omni.inputs.data import OmniDiffusionSamplingParams | ||
|
|
||
| ar_stop_token_ids = resolve_stop_token_ids(task=task, bot_task=bot_task, tokenizer=tokenizer) | ||
| assert ar_stop_token_ids is not None | ||
| for sp in params_list: | ||
| if isinstance(sp, OmniDiffusionSamplingParams): | ||
| sp.num_inference_steps = args.steps | ||
| sp.guidance_scale = args.guidance_scale | ||
| sp.guidance_scale_provided = True | ||
| if args.seed is not None: | ||
| sp.seed = args.seed | ||
| if args.modality in ("text2img",): | ||
| if args.modality == "text2img": | ||
| sp.height = args.height | ||
| sp.width = args.width | ||
| elif hasattr(sp, "stop_token_ids"): | ||
| sp.stop_token_ids = ar_stop_token_ids | ||
|
|
||
| # Print configuration | ||
| print(f"\n{'=' * 60}") | ||
| print("HunyuanImage-3.0 Generation Configuration:") | ||
| print(f" Model: {args.model}") | ||
|
|
@@ -300,13 +246,9 @@ def main(): | |
| print(f" Prompts: {prompts}") | ||
| print(f"{'=' * 60}\n") | ||
|
|
||
| # Generate | ||
| omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list)) | ||
|
|
||
| # Process outputs | ||
| img_idx = 0 | ||
| for req_output in omni_outputs: | ||
| # Text output (AR stage or text-only) | ||
| ro = getattr(req_output, "request_output", None) | ||
| txt = "" | ||
| if ro and getattr(ro, "outputs", None): | ||
|
|
@@ -320,11 +262,9 @@ def main(): | |
| if txt: | ||
| print(f"[Output] Text:\n{txt}") | ||
|
|
||
| # Image output (DiT stage) | ||
| images = getattr(req_output, "images", None) | ||
| if not images and ro and hasattr(ro, "images"): | ||
| images = ro.images | ||
|
|
||
| if images: | ||
| for j, img in enumerate(images): | ||
| save_path = os.path.join(args.output, f"output_{img_idx}_{j}.png") | ||
|
|
||
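
The task / bot_task split in the `main()` hunk of the diff comes down to a three-way resolution of `--bot-task`. The following is a minimal sketch of that logic, not the PR's code. It assumes the flag is `None` when omitted (the `args.bot_task is None` check suggests this, but the final `default=` value is not visible in this view). `resolve_task` and the un-prefixed `MODALITY_TASK_MAP` are hypothetical names used only for illustration.

```python
from __future__ import annotations

# Copied from the new _MODALITY_TASK_MAP in the diff:
# modality -> (task, default bot_task).
MODALITY_TASK_MAP: dict[str, tuple[str, str | None]] = {
    "text2img": ("t2i", "think"),
    "img2img": ("it2i", "think"),
    "img2text": ("i2t", None),
    "text2text": ("t2t", None),
}


def resolve_task(modality: str, cli_bot_task: str | None) -> tuple[str, str | None]:
    """Return (task, bot_task) for a modality and an optional --bot-task value.

    Hypothetical helper; the PR inlines this logic in main().
    """
    task, default_bot_task = MODALITY_TASK_MAP[modality]
    if cli_bot_task is None:
        # Flag omitted: fall back to the per-modality default.
        return task, default_bot_task
    if cli_bot_task == "none":
        # Explicit opt-out: no bot_task at all.
        return task, None
    # Any other value is an explicit override.
    return task, cli_bot_task
```

Keeping `task` and `bot_task` as separate values avoids the earlier string concatenation (`task + "_" + bot_task`) and the lookup against `_TASK_PRESETS` that the old code needed to validate the combined name.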
I suggest adding a check on args.image_path in both the online and offline paths; so far you have only verified that there are no more than 3 input images.
Fixed.
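
The early check discussed in this thread could look roughly like the sketch below, which follows the offline `--image-path` handling in the diff. It is a sketch under assumptions, not the merged code. `MAX_IMAGES_PER_REQUEST = 3` is taken from the "up to 3" help text; the real constant is imported from `prompt_utils`. `parse_image_paths` and `to_mm_payload` are hypothetical helper names.

```python
from __future__ import annotations

import os

# Assumed value, based on the --image-path help text ("up to 3").
MAX_IMAGES_PER_REQUEST = 3


def parse_image_paths(raw: str | None, limit: int = MAX_IMAGES_PER_REQUEST) -> list[str]:
    """Split a comma-separated --image-path value and validate it before any image is opened."""
    if not raw:
        raise ValueError(f"--image-path is required, got: {raw!r}")
    # Drop empty segments such as a trailing comma.
    paths = [p.strip() for p in raw.split(",") if p.strip()]
    if not paths:
        raise ValueError(f"--image-path produced no usable paths: {raw!r}")
    if len(paths) > limit:
        raise ValueError(f"--image-path accepts at most {limit} images, got {len(paths)}: {raw}")
    for path in paths:
        if not os.path.exists(path):
            raise ValueError(f"Image path does not exist: {path}")
    return paths


def to_mm_payload(images: list):
    """Mirror mm_image_payload in the diff: a bare item for one image, a list for several."""
    if not images:
        return None
    return images[0] if len(images) == 1 else images
```

Running the count check before the existence check means an over-long request is rejected without touching the filesystem, which is the same order the diff uses.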