[Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup by TaffyOfficial · Pull Request #3444 · vllm-project/vllm-omni

TaffyOfficial · 2026-05-08T05:34:29Z

Summary

This PR makes HunyuanImage-3.0-Instruct's IT2I path support up to 3 reference
images per request ("Multi-Image Fusion", as upstream supports), and folds in
two follow-up fixes:

A prompt_utils API cleanup that splits the conflated task parameter
into two orthogonal axes.
Online (/v1/images/edits) ↔ offline (end2end.py img2img) AR alignment,
so identical (prompt, image, seed) produces identical AR output across the
two paths.

Both are included here rather than in separate PRs because they touch the same
prompt_utils.py / serving / pipeline surface and would otherwise create
3-way merge conflicts.

Logical commits in chronological order:

1. Multi-image IT2I support: N consecutive <img> placeholders, per-image
VAE buckets, ragged flat_from_sizes reconstruction.
2. prompt_utils API cleanup: split task and bot_task.
3. HF byte-equivalent prompt_token_ids in /v1/images/edits: segment-wise
tokenization through build_prompt_tokens, mirrors offline end2end.py.
4. task / bot_task / sys_type / system_prompt Form fields at the OpenAI
edits endpoint, so callers can drive the same (task, bot_task) axes the
offline example uses.
5. size="auto" no longer collapses non-square AR-predicted aspect:
defers width/height to the AR <img_ratio_*> bucket via
stage_input_processors/hunyuan_image3.py.
6. RGBA cond image normalization: RGBA → RGB via white-bg
alpha-composite, gated on Hunyuan-aware request params, fixes systematic
online/offline divergence on PNGs with transparency.
7. Cond VAE determinism: latent_dist.sample() (consumes torch global
RNG) → latent_dist.mode() (deterministic posterior mean, matches the
official cond encode path for clean t=0 conditioning).
8. Merge origin/main: picks up HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS,
resolve_stop_token_ids, and the AR stop-token plumbing from PR #3172
([Config] Add HunyuanImage3 deploy configs).
9. Cond preprocessing revert (magnet_repro baseline): AR
_resize_and_crop default back to crop_type='resize', cond VAE encode
back to .sample(generator=fixed) instead of .mode(). Center-crop was
visibly under-conditioning portrait-input → landscape-output edits;
restoring the IT2I demo's tuned preprocessing recovers the magnet repro.
Supersedes Commit 7 for the VAE path (determinism preserved via
fixed generator instead of posterior mean).
10. Stop AR on <|endoftext|> for image-output tasks:
resolve_stop_token_ids was returning <answer> for every (task,
bot_task). For it2i/t2i, that chops off the
<answer><boi><img_size_*><img_ratio_*><|endoftext|> tail forced by
_stage_transitions; _extract_ratio_index then finds no
<img_ratio_*> and silently collapses the DiT bucket to the first
reference image's shape (square logo → 1024x1024 even when AR's CoT
planned a landscape). Now returns <|endoftext|> for image-output
tasks; comprehension i2t/t2t still stop on <answer>.
11. Cap AR KV snapshot at </recaption>:
deploy/hunyuan_image3.yaml sets
kv_transfer_criteria.type=special_token, token_id=128019,
stop_after_transfer=false. Shipped KV now exactly matches the prefix
DiT reuses (S−N=1 invariant). AR keeps running past the snapshot so
it can still emit <img_ratio_*> for the bridge; orchestrator
_handle_kv_ready_raw_outputs defers the kv_ready forward when the
same raw_outputs batch doesn't yet contain a finished output for that
req_id, avoiding the ar_output.outputs[0] AttributeError that
bridges hit when kv_ready fires mid-decode.
12. Entry-layer cap on input image count (reviewer feedback from
@Gaohan123): --image-path (offline) and
_build_multistage_generation_inputs (online) now reject more than
MAX_IMAGES_PER_REQUEST (=3) images up front with an input-named error,
instead of letting the deeper _validate_num_images surface as
"num_images must be in [1, 3]".
13. Review iteration: align AR stop / KV cap / edits Form with
upstream (feedback from @Bounty-hunter); supersedes commits 10
and 11. resolve_stop_token_ids for image-output tasks now returns
the full <img_ratio_*> token range
(list(range(start_ratio, end_ratio + 1)) +
ratio_token_other_slices), mirroring upstream
modeling_hunyuan_image_3.py:3289-3303. AR stops AT the ratio
token; KV is capped naturally; bridge _truncate_at_cot_end already
strips the ratio tail for DiT. kv_transfer_criteria block in the
deploy yaml is removed; orchestrator finished_in_batch defer is
removed (no mid-decode kv_ready exists in the new flow).
serving_chat._build_multistage_generation_inputs now calls
resolve_stop_token_ids so online matches offline end2end.py.
The task Form field on /v1/images/edits is dropped (the
endpoint is always IT2I; bot_task / sys_type / system_prompt
are the remaining knobs). The cot_token_ids_list segment-token
forwarding in pipeline_hunyuan_image3 is also removed; see the
"Optimization leftover" note below.

Plus housekeeping: ratio extraction simplified to a pure token-id reverse scan
(regex path dropped — token-ids are source of truth, AR yamls run with
skip_special_tokens=True), stale compound task names cleaned out of the e2e
test, AR/DiT system_prompt body forwarding so sys_type='custom' works
end-to-end, duplicate engine_prompt["prompt_token_ids"] assignment removed.

Commit 1 — Multi-image IT2I support

HunyuanImage-3.0-Instruct supports up to 3 reference images per IT2I request
(README §200-216, §500). vllm-omni's DiT pipeline, AR processor, OpenAI
schema, and ar2diffusion bridge already accepted list-shaped
multi_modal_data["image"], but four call sites still encoded a hard "N=1"
assumption. End-to-end smoke (4× L20X) on the official input_1_0.png +
input_1_1.png demo pair runs cleanly and preserves each image's native
bucket.

Surgery points:

prompt_utils.build_prompt(_tokens) takes num_images: int (default 1,
validated 1 ≤ N ≤ 3 for image-input tasks) and emits N consecutive <img>
placeholders between User: and the user prompt, matching the official
apply_general_template "successive user message" wrapping.
HunyuanImage3Processor.process_image: each cond image keeps its own VAE
reso_group bucket. Per-image VAE pixel tensors are flattened to 1-D and
concatenated; vae_pixel_size declares per-image numel so vLLM splits the
buffer back per image at consumption time via
MultiModalFieldConfig.flat_from_sizes(..., vae_pixel_size) (mirrors the
GLM-Image / Ming-Flash-Omni pattern).
_parse_and_validate_image_input reconstructs a list of per-image
(3, H_i, W_i) tensors from vae_token_grid_hw; embed_multimodal loops
over the list for VAE encode + patch_embed.
examples/.../end2end.py: --image-path accepts comma-separated paths;
mm_image_payload is unwrapped to a single image when N=1 to keep the
legacy single-image call shape.
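
The flatten-and-split round trip in the second and third surgery points can be illustrated without vLLM. This is a minimal sketch in plain Python, not the processor code: `pack_images`, `unpack_images`, and `make_image` are hypothetical names, and images are modeled as nested lists rather than tensors. The per-image element counts play the role that `vae_pixel_size` plays in the PR.

```python
def pack_images(images):
    """Flatten each (C, H, W) nested-list image to 1-D and concatenate.

    Returns the flat buffer, the per-image element counts, and the
    per-image (C, H, W) shapes needed to rebuild each image.
    """
    flat, sizes, shapes = [], [], []
    for img in images:
        c, h, w = len(img), len(img[0]), len(img[0][0])
        values = [v for plane in img for row in plane for v in row]
        flat.extend(values)
        sizes.append(len(values))
        shapes.append((c, h, w))
    return flat, sizes, shapes


def unpack_images(flat, sizes, shapes):
    """Split the flat buffer by per-image sizes and restore (C, H, W)."""
    images, offset = [], 0
    for size, (c, h, w) in zip(sizes, shapes):
        if size != c * h * w:
            raise ValueError("declared size does not match declared shape")
        chunk = flat[offset:offset + size]
        offset += size
        images.append([
            [chunk[(ci * h + hi) * w:(ci * h + hi) * w + w] for hi in range(h)]
            for ci in range(c)
        ])
    return images


def make_image(c, h, w, start=0):
    """Build a (C, H, W) nested list filled with consecutive integers."""
    return [
        [[start + (ci * h + hi) * w + wi for wi in range(w)] for hi in range(h)]
        for ci in range(c)
    ]
```

Because each image carries its own size, two images with different buckets (for example 3×2×4 and 3×5×1) survive the round trip without padding to a common shape.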

Commit 2 — prompt_utils API cleanup: split `task` and `bot_task`

CR feedback observed that the old _TASK_PRESETS table conflated I/O
modality with prompting mode (e.g. it2i_think, t2i_recaption,
t2i_vanilla) and carried a bot_task field that was dead code under
every sys_type exercised in this codebase (only sys_type='dynamic'
consumed it, and nothing ever set that). Split into two orthogonal axes:

axis	values	controls
`task`	`t2t`, `i2t`, `it2i`, `t2i`	whether `<img>` placeholders are emitted
`bot_task`	`None`, `think`, `recaption`, `think_recaption`, `vanilla`	system prompt + trigger tag

Resolution table:

`bot_task`	`sys_type`	trigger tag
`None`	`en_unified`	(none)
`think`	`en_unified`	`<think>`
`recaption`	`en_unified`	`<recaption>`
`think_recaption`	`en_think_recaption`	`<think>`
`vanilla`	`en_vanilla`	(none, no chat template — `task='t2i'` only)

bot_task='vanilla' is validated to only combine with task='t2i';
unknown task / bot_task values raise ValueError. Public helpers
available_bot_tasks() and resolve_sys_type(bot_task) let callers derive
the default sys_type without re-encoding the table.
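
For readers who want the two-axis contract in one place, here is a minimal sketch of the resolution table and the validation rules. The names `available_bot_tasks` and `resolve_sys_type` come from this PR, but the bodies below are an assumption written for illustration, and `validate_task_pair` is a hypothetical helper, not a function in `prompt_utils.py`.

```python
TASKS = ("t2t", "i2t", "it2i", "t2i")

# bot_task -> (sys_type, trigger tag), copied from the resolution table.
_BOT_TASK_TABLE = {
    None: ("en_unified", None),
    "think": ("en_unified", "<think>"),
    "recaption": ("en_unified", "<recaption>"),
    "think_recaption": ("en_think_recaption", "<think>"),
    "vanilla": ("en_vanilla", None),
}


def available_bot_tasks():
    """Return every accepted bot_task value, including None."""
    return tuple(_BOT_TASK_TABLE)


def resolve_sys_type(bot_task):
    """Look up the default sys_type for a bot_task."""
    if bot_task not in _BOT_TASK_TABLE:
        raise ValueError(f"unknown bot_task: {bot_task!r}")
    return _BOT_TASK_TABLE[bot_task][0]


def validate_task_pair(task, bot_task):
    """Hypothetical helper: reject unknown values and vanilla outside t2i."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    resolve_sys_type(bot_task)  # raises on an unknown bot_task
    if bot_task == "vanilla" and task != "t2i":
        raise ValueError("bot_task='vanilla' only combines with task='t2i'")
```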

Migration mapping for any downstream caller:

old	new
`task='t2t'`	`task='t2t', bot_task=None`
`task='i2t'`	`task='i2t', bot_task=None`
`task='it2i_think'`	`task='it2i', bot_task='think'`
`task='it2i_recaption'`	`task='it2i', bot_task='recaption'`
`task='t2i_think'`	`task='t2i', bot_task='think'`
`task='t2i_recaption'`	`task='t2i', bot_task='recaption'`
`task='t2i_vanilla'`	`task='t2i', bot_task='vanilla'`
(newly accessible)	`task='t2i', bot_task='think_recaption'` → `en_think_recaption`
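
Downstream callers that still hold old compound strings could translate them with a small lookup like the one below. This is a sketch only: `migrate_task` is a hypothetical helper that is not part of this PR (the PR ships no aliases), and the mapping is copied from the migration table.

```python
# Old compound task string -> (task, bot_task), from the migration table.
_OLD_TO_NEW = {
    "t2t": ("t2t", None),
    "i2t": ("i2t", None),
    "it2i_think": ("it2i", "think"),
    "it2i_recaption": ("it2i", "recaption"),
    "t2i_think": ("t2i", "think"),
    "t2i_recaption": ("t2i", "recaption"),
    "t2i_vanilla": ("t2i", "vanilla"),
}


def migrate_task(old_task):
    """Translate a pre-split task string into the (task, bot_task) pair."""
    try:
        return _OLD_TO_NEW[old_task]
    except KeyError:
        raise ValueError(f"no migration known for task {old_task!r}") from None
```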

This is a hard breaking change with no aliases. Repo-wide grep across
tests/, examples/, vllm_omni/, and deploy/*.yaml confirms no
remaining references to the old compound strings or to _TASK_PRESETS.

Side fix on build_prompt: the legacy code stripped the system prompt's
leading whitespace while build_prompt_tokens did not. Invisible while every
system prompt was unified_system_prompt_en (no leading newline) but newly
observable now that bot_task='think_recaption' exposes en_think_recaption
(which starts with \n). build_prompt now keeps the system prompt verbatim,
matching the segment-by-segment tokenization path and HF's
apply_chat_template byte-for-byte.

end2end.py CLI changes: --bot-task choices are now
{none, think, recaption, think_recaption, vanilla}. The literal none is
the explicit way to request bot_task=None on a modality whose default is
think (e.g. text2img / img2img); leaving --bot-task unset still falls
back to the modality default. The duplicated _TASK_PRESETS literal in the
example script is removed in favor of resolve_sys_type(bot_task). AR stop
token ids are now resolved programmatically via resolve_stop_token_ids(task, bot_task, tokenizer) rather than hardcoded in the deploy yaml — keeps the
example self-contained and survives yaml drift.
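
The three-way distinction between an unset flag, the literal none, and a named mode can be sketched with argparse as follows. This is an illustration, not the contents of end2end.py: the `--modality` flag, `MODALITY_DEFAULT_BOT_TASK`, and `resolve_bot_task` are assumptions introduced here; only the `--bot-task` choices and the think default for text2img / img2img come from the description above.

```python
import argparse

# Assumed defaults; the description only states that text2img / img2img
# default to "think".
MODALITY_DEFAULT_BOT_TASK = {"text2img": "think", "img2img": "think"}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--modality",
        choices=sorted(MODALITY_DEFAULT_BOT_TASK),
        default="img2img",
    )
    parser.add_argument(
        "--bot-task",
        choices=["none", "think", "recaption", "think_recaption", "vanilla"],
        default=None,  # None here means "flag not given"
    )
    return parser


def resolve_bot_task(args):
    """Map the CLI value to the bot_task passed to the prompt builder."""
    if args.bot_task is None:       # flag unset: use the modality default
        return MODALITY_DEFAULT_BOT_TASK[args.modality]
    if args.bot_task == "none":     # literal none: explicit bot_task=None
        return None
    return args.bot_task
```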

Commits 3-5 — Online (`/v1/images/edits`) ↔ offline AR byte-alignment

The OpenAI edits endpoint built the AR prompt as a single string and let the
engine tokenizer run a whole-string BPE pass; offline end2end.py img2img
went through build_prompt_tokens segment-by-segment and fed the result via
prompt_token_ids. The two encodings differ on segment boundaries (e.g.
user-prompt-ends-with-。 + next-segment-\n\n → merged id 3490 vs
HF's [1811, 271]), so identical (prompt, image, seed) requests produced
diverging cot_text → diverging DiT input → diverging final image.
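
The boundary effect is easy to reproduce with a toy tokenizer. The sketch below is not the HunyuanImage tokenizer: it uses greedy longest-match over a three-entry vocabulary as a stand-in for BPE merges, reuses the ids quoted above (3490, 1811, 271), and maps every other character to a placeholder id. All function names are hypothetical.

```python
# Toy vocabulary; the three ids are the ones quoted in the paragraph above.
TOY_VOCAB = {"。\n\n": 3490, "。": 1811, "\n\n": 271}


def toy_encode(text):
    """Greedy longest-match encode; unknown characters get placeholder ids."""
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    ids, i = [], 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(TOY_VOCAB[piece])
                i += len(piece)
                break
        else:
            ids.append(1_000_000 + ord(text[i]))  # placeholder, not a real id
            i += 1
    return ids


def encode_whole(segments):
    """Online path before the fix: join first, then tokenize once."""
    return toy_encode("".join(segments))


def encode_segmentwise(segments):
    """Offline path: tokenize each template segment, then concatenate ids."""
    return [tid for seg in segments for tid in toy_encode(seg)]
```

With segments `["Make it red。", "\n\n"]`, the whole-string path ends in the merged id 3490 while the segment-wise path ends in 1811 followed by 271, which is the divergence the fix removes.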

serving_chat._build_multistage_generation_inputs now goes through
build_prompt_tokens when a tokenizer is plumbed, byte-for-byte matching
apply_chat_template. Also forwards use_system_prompt and (when the
caller sets sys_type=custom) the verbatim system_prompt body so DiT can
rebuild the same system prefix.
api_server.py exposes new Form fields on /v1/images/edits: task,
sys_type, system_prompt (the task field is dropped again in Commit 13).
Legacy callers that pass a task enum under
the bot_task field still work (normalized to the canonical split). This
subsumes the simpler tokenizer plumbing landed in main as PR #3500
([Bug][Hunyuanimage 3.0] fix different AR encode behavior between online and offline); we
additionally forward bot_task, sys_type, num_images, and
use_system_prompt.
size="auto" resolution now skips the gen_params / extra_body width/height
writes that would otherwise pin the bridge to the first reference image's
bucket and collapse non-square AR-predicted aspects to square in the
multi-image / mismatched-aspect case. stage_input_processors/hunyuan_image3.py
prefers the AR's predicted <img_size_*><img_ratio_*> tail (mirrors
upstream's reso_group[ratio_index] lookup) over the carried-through
height/width.

stage_input_processors/hunyuan_image3.py also drops the regex fallback on
generated_text for ratio extraction (only worked under
skip_special_tokens: False, which most deploy yamls don't set) and goes
straight to a token-id reverse scan against the tokenizer's <img_ratio_*>
id range — token-ids survive skip_special_tokens: True and are the source
of truth.
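
A token-id reverse scan of this kind can be sketched as follows. The id ranges are the ones listed under "Net additions" in Commit 13; the assumption that ids map to ratio indices in contiguous order, and the function names, are mine rather than taken from `stage_input_processors/hunyuan_image3.py`.

```python
# (id range, first ratio index) pairs, per the Commit 13 description:
# 128044..128076 -> <img_ratio_0..32>, 130103..130106 -> <img_ratio_33..36>.
RATIO_ID_RANGES = [
    (range(128044, 128077), 0),
    (range(130103, 130107), 33),
]


def build_ratio_id_lookup():
    """Map each <img_ratio_*> token id to its ratio index."""
    lookup = {}
    for ids, first_index in RATIO_ID_RANGES:
        for offset, token_id in enumerate(ids):
            lookup[token_id] = first_index + offset
    return lookup


def extract_ratio_index(token_ids, lookup):
    """Scan from the end and return the last ratio index, or None."""
    for token_id in reversed(token_ids):
        if token_id in lookup:
            return lookup[token_id]
    return None
```

Returning `None` rather than a default lets the caller decide whether to fall back to the carried-through height/width.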

Commit 6 — RGBA cond image normalization

Online edit requests submitting PNGs with transparency systematically
produced different AR recaption text than offline (online "3 magnets" vs
offline "1 magnet" on the same input_2_*.png pair). Root cause was
not CUDA / MoE non-determinism — it was a systematic preprocessing
divergence:

Offline end2end.py img2img calls Image.open(p).convert("RGB"), which
replaces transparent pixels with black background.
Online previously skipped the RGB conversion, so the PIL-decoded RGBA went
into the AR processor's vision encoder as-is, and downstream layers
composited transparent pixels over a white canvas.

That single bit of difference (black bg vs white bg) on 57,671 transparent
pixels in the test PNG was enough to flip the AR's caption from a 1-object
description to a 3-object description, and the DiT followed the AR's
direction. api_server._load_input_images(..., normalize_rgb=True) is now
opt-in via Hunyuan-aware params (task / bot_task / sys_type present in
the request), defaulting to the offline behavior of explicit .convert("RGB").
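
The black-versus-white difference comes down to what happens to a fully transparent pixel. The sketch below works on single RGBA tuples in plain Python so it needs no imaging library; `drop_alpha` models discarding the alpha channel (which, to my understanding, is what Pillow's `convert("RGB")` does for RGBA input) and `composite_over` models alpha-compositing onto a solid background. Both names are hypothetical.

```python
def drop_alpha(pixel):
    """Discard the alpha channel and keep the stored RGB values."""
    r, g, b, _a = pixel
    return (r, g, b)


def composite_over(pixel, background=(255, 255, 255)):
    """Alpha-composite one RGBA pixel over a solid background color."""
    r, g, b, a = pixel
    alpha = a / 255
    return tuple(
        round(c * alpha + bg * (1 - alpha))
        for c, bg in zip((r, g, b), background)
    )
```

A transparent pixel stored as `(0, 0, 0, 0)` becomes black under `drop_alpha` and white under `composite_over`, while fully opaque pixels are identical under both, so only the transparent region of the image differs between the two paths.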

The methodology lesson — systematic cross-path bias is not explainable
by stochastic CUDA/MoE non-determinism, and AR input alignment requires
three pillars (prompt token bytes, image tensor bytes, sampling params)
— is now codified as CLAUDE.md hard rule B21.

Commit 7 — Cond VAE determinism: `.sample()` → `.mode()`

Both the AR-side model_executor/.../hunyuan_image3.py::_vae_encode and the
DiT-side pipeline_hunyuan_image3.py previously called
vae_encode_result.latent_dist.sample() for cond image encoding. .sample()
without a generator consumes torch's global RNG, which is a silent
non-determinism source: cond latents drift between requests on a
long-running server while looking deterministic for fresh-process callers.

Cond image is clean (t=0) conditioning by design — the official upstream
takes the posterior mean for cond encode. Switched both call sites to
.mode() (added the method on the DiT-side DiagonalGaussianDistribution
to match the AR-side autoencoder_kl_3d shape).

Superseded by Commit 9 (preprocessing revert). The cond VAE path is now
back to .sample(generator=torch.Generator().manual_seed(0)) to match the
magnet_repro baseline; determinism is still maintained via the fixed seed
rather than the posterior mean. .mode() is no longer used in the DiT-side
pipeline_hunyuan_image3.py (DiagonalGaussianDistribution.mode() reverted
with it).
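
The three options discussed in this section (global RNG, fixed generator, posterior mean) can be compared with a stdlib analogy. This is not torch code: it uses Python's `random` module in place of torch's RNG, and the function names are hypothetical. It only illustrates why a per-call seeded generator is repeatable while a draw from the process-wide RNG depends on what ran before it.

```python
import random


def sample_global(mean, std):
    """Analog of .sample() with no generator: uses the process-wide RNG."""
    return [random.gauss(m, s) for m, s in zip(mean, std)]


def sample_fixed(mean, std, seed=0):
    """Analog of .sample(generator=...) with a fresh fixed-seed generator."""
    rng = random.Random(seed)
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def mode(mean, std):
    """Analog of .mode(): the posterior mean, with no randomness at all."""
    return list(mean)
```

Both `sample_fixed` and `mode` are deterministic across requests; they differ in which latent they return, which is the quality trade-off Commit 9 resolves in favor of the fixed-seed sample.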

Commit 9 — Cond preprocessing revert to magnet_repro baseline

The intermediate "AR/DiT center-crop alignment" introduced earlier in this
PR (b83962160 + companion DiT comments) made AR's _resize_and_crop and
DiT's _resize_and_crop_center both default to crop_type='center', with
the intent that AR and DiT condition on byte-identical pixels. Visually it
under-conditioned the IT2I demo — portrait input expanded into a landscape
output had the conditioning crop drop too much of the relevant content, and
the magnet repro regressed.

Rolled back to the magnet_repro state:

AR _resize_and_crop default is crop_type='resize' (the path
infer_align_image_size=True exercises in the IT2I demo: stretch the
cond image to the bucket dims so <img_ratio_*> and ViT/VAE features
stay aligned with the bucket rather than dropping content).
Cond VAE encode in pipeline_hunyuan_image3.py switches back to
latent_dist.sample(torch.Generator(device=image.device).manual_seed(0));
the global-RNG concern from Commit 7 is addressed by the fixed seed
rather than the posterior mean.
DiagonalGaussianDistribution.mode() and the now-obsolete
AR↔DiT-byte-match regression test
(test_ar_and_dit_condition_image_preprocessing_match_without_hf_cache)
are removed.

AR and DiT no longer share byte-identical conditioning pixels (AR stretches,
DiT center-crops), but the upstream magnet_repro tuning is faithfully
reproduced and the visual quality regression is gone.

Commit 10 — Stop AR on `<|endoftext|>` for image-output tasks

Superseded by Commit 13 (review iteration). resolve_stop_token_ids
for image-output tasks now returns the full <img_ratio_*> token range
instead of [<|endoftext|>], matching upstream
modeling_hunyuan_image_3.py:3289-3303. The motivating problem (square
bucket collapse from missing <img_ratio_*>) is still solved; AR now
stops earlier (at the ratio token itself) so no decode steps are wasted
on <|endoftext|> after the ratio is sampled.

Original Commit 10:

resolve_stop_token_ids returned [<answer>] (id 128025) for every (task,
bot_task) pair. For image-output tasks (it2i / t2i) the
_stage_transitions[</recaption>] rule force-emits
<answer><boi><img_size_*>, then _apply_ratio_restriction samples
<img_ratio_*>, then <|endoftext|>. Stopping on <answer> cuts off the
size/ratio tail; ar2diffusion::_extract_ratio_index then scans
cumulative_token_ids for any <img_ratio_*> id, finds none, and falls
back to the prompt-carried height/width — which is the first reference
image's bucket in multi-image IT2I. Effect: a 512×512 logo + 1179×685
fabric collapses to a square output even when AR's CoT planned a landscape;
width and texture regress simultaneously because DiT has to squeeze the
landscape-planned content into a square.

Online didn't trip this because the deploy yaml explicitly set
stop_token_ids: [127957] (= <|endoftext|>). end2end.py overrode the
yaml with resolve_stop_token_ids(...), so offline always hit the broken
stop regardless of yaml.

Fix: resolve_stop_token_ids returns [<|endoftext|>] for it2i / t2i
so AR runs through the forced tail and <img_ratio_*> reaches the bridge.
i2t / t2t keep [<answer>] — those are comprehension stages where the
response body sits inside <answer> and the answer-open is the natural
terminator. test_resolve_stop_token_ids_image_tasks_stop_on_eos_not_answer
pins the new split.

Commit 11 — Cap AR KV snapshot at `</recaption>`, defer mid-decode kv_ready

Superseded by Commit 13 (review iteration). With the ratio-range
stop in place AR finishes naturally at the ratio token, so the shipped
KV is automatically the prefix DiT reuses; there is no mid-decode
kv_ready to defer. The kv_transfer_criteria yaml block, the
stop_after_transfer=false flag, and the
orchestrator._handle_kv_ready_raw_outputs finished_in_batch defer
are all removed. The bridge already strips the trailing ratio token
from the cot it forwards to DiT (via
stage_input_processors/hunyuan_image3._truncate_at_cot_end).

Original Commit 11:

Before this commit, AR shipped its KV all the way through the
</recaption><answer><boi><img_size_*><img_ratio_*><|endoftext|> tail.
DiT then reused only the prefix up through </recaption> (the colleague-
confirmed positive_reuse_len invariant), so S − N == 6 instead of the
intended S − N == 1: six tail-token positions of KV were transferred and
immediately discarded, and the AR pipeline kept emitting tokens DiT would
never use.

deploy/hunyuan_image3.yaml:

omni_kv_config:
  need_send_cache: true
  kv_transfer_criteria:
    type: special_token
    token_id: 128019         # </recaption>
    stop_after_transfer: false

stop_after_transfer: false keeps the AR running past the snapshot so it
still emits <img_ratio_*> for ar2diffusion::_extract_ratio_index (which
derives output height/width). The mid-decode kv_ready signal that this
combination produces previously crashed bridges that read
ar_output.outputs[0] (no finished RequestOutput exists yet).
Orchestrator._handle_kv_ready_raw_outputs now defers the kv_ready
forward when the same raw_outputs batch doesn't yet contain a finished
output for that req_id; AR's natural completion later triggers the
forward through _route_output.

Net effect: KV transferred is byte-equivalent to what DiT actually reuses
(S − N == 1), AR no longer wastes 5 decode steps on tail tokens that DiT
discards, and <img_ratio_*> still reaches the bridge.

Commit 12 — Entry-layer cap on input image count (review feedback)

Per @Gaohan123's review on this PR: the MAX_IMAGES_PER_REQUEST = 3 cap
lived in prompt_utils._validate_num_images, which surfaced as
ValueError: num_images must be in [1, 3], got N deep inside the AR
prompt builder. The reviewer asked for a friendly, input-named error at
the entry boundary so users see the limit on the parameter they actually
typed.

Added in two places, both reusing MAX_IMAGES_PER_REQUEST (no hardcoded 3):

examples/offline_inference/hunyuan_image3/end2end.py — validate
--image-path count before opening any PIL image.
vllm_omni/entrypoints/openai/serving_chat.py::_build_multistage_generation_inputs
— validate reference_images count before building engine prompt data.

Behavior is otherwise unchanged: the deeper _validate_num_images cap is
still a hard backstop for any future callers that don't pass through these
entry points.
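
An entry-layer check of this shape might look like the sketch below. The constant name and its value come from this PR; `check_image_count`, `parse_image_paths`, and the error wording are assumptions for illustration and do not reproduce the actual messages in end2end.py or serving_chat.py.

```python
MAX_IMAGES_PER_REQUEST = 3  # value stated in this PR


def check_image_count(count, source):
    """Raise an error that names the user-facing input, not num_images."""
    if count < 1:
        raise ValueError(f"{source}: at least one image is required")
    if count > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"{source}: got {count} images, at most "
            f"{MAX_IMAGES_PER_REQUEST} are supported per request"
        )


def parse_image_paths(arg):
    """Split a comma-separated --image-path value and validate the count
    before any image file is opened."""
    paths = [p.strip() for p in arg.split(",") if p.strip()]
    check_image_count(len(paths), "--image-path")
    return paths
```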

Commit 13 — Review iteration: align AR stop / KV cap / edits Form with upstream

Per @Bounty-hunter's review, the AR-stop and KV-cap logic from Commits 10
and 11 is replaced with the upstream-faithful approach from
modeling_hunyuan_image_3.py:3289-3303 (with _ConditionalSliceVocabLogitsProcessor
forcing the next token after <img_size_base> into the ratio range):

final_stop_tokens = list(range(start_ratio_token_id, end_ratio_token_id + 1))
for start, end in ratio_token_other_slices:
    final_stop_tokens.extend(range(start, end))

AR's natural trajectory under _stage_transitions is
</recaption><answer><boi><img_size_base><img_ratio_X>. Stopping AT the
ratio token means:

KV ends exactly at the prefix DiT reuses; no need for kv_transfer_criteria
special_token block or stop_after_transfer=false in the deploy yaml.
ar2diffusion::_extract_ratio_index reads the last token to derive the
output H/W.
_truncate_at_cot_end (already in the bridge) trims the cot at
</recaption> before forwarding to DiT, so the trailing
<answer><boi><img_size_X><img_ratio_X> never contaminates DiT's prompt
builder.

Net deletions:

vllm_omni/deploy/hunyuan_image3.yaml — drop the
omni_kv_config.kv_transfer_criteria block (special_token + token_id +
stop_after_transfer: false).
vllm_omni/engine/orchestrator.py::_handle_kv_ready_raw_outputs — drop
the finished_in_batch defer; mid-decode kv_ready no longer happens.
vllm_omni/entrypoints/openai/api_server.py::edit_images — drop the
task: str | None = Form(None) field. The endpoint is always IT2I;
bot_task / sys_type / system_prompt cover the remaining knobs;
legacy bot_task=<task-enum> still works via chat-handler normalization.

Net additions:

vllm_omni/diffusion/models/hunyuan_image3/prompt_utils.py::resolve_stop_token_ids
— image-output tasks return the full ratio token range
(list(range(128044, 128077)) for <img_ratio_0..32> plus
range(130103, 130107) for <img_ratio_33..36>).
vllm_omni/entrypoints/openai/serving_chat.py::_build_multistage_generation_inputs
— after resolving (task, bot_task), call resolve_stop_token_ids and
inject into the AR-stage sampling params, matching offline end2end.py
behavior. Without this the yaml-side default would let AR generate to
max_tokens=8192.
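
Putting the stop-token split together, a simplified version could read as follows. This is a sketch under stated assumptions: the real `resolve_stop_token_ids` takes `(task, bot_task, tokenizer)` and resolves ids from the tokenizer, whereas `sketch_stop_token_ids` below is a hypothetical stand-in that hardcodes the ids quoted in this description (128025 for `<answer>`, and the two ratio ranges).

```python
ANSWER_ID = 128025                       # <answer>, per Commit 10
RATIO_START, RATIO_END = 128044, 128076  # <img_ratio_0..32>, inclusive
RATIO_OTHER_SLICES = [(130103, 130107)]  # <img_ratio_33..36>, end-exclusive
IMAGE_OUTPUT_TASKS = {"it2i", "t2i"}


def sketch_stop_token_ids(task):
    """Return stop ids: the ratio range for image output, <answer> otherwise."""
    if task in IMAGE_OUTPUT_TASKS:
        stop = list(range(RATIO_START, RATIO_END + 1))
        for start, end in RATIO_OTHER_SLICES:
            stop.extend(range(start, end))
        return stop
    return [ANSWER_ID]
```

Note the asymmetry carried over from the upstream snippet: the main range is built with an inclusive end (`end + 1`), while the extra slices are already end-exclusive.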

Optimization leftover: unified system/user/cot tokenization

pipeline_hunyuan_image3 previously forwarded AR-sampled ar_token_ids
through ar2diffusion -> extra["ar_token_ids"] -> prepare_model_inputs (cot_token_ids=...), preferring those token ids over re-encoding the
decoded cot_text. This avoided BPE re-merge drift across template
segment boundaries (e.g. "。\n\n" collapsing to a single id) that would
otherwise break positive_reuse_len and trigger the silent slice in
inject_ar_kv_into_layers.

Per @Bounty-hunter's review, this single-point optimization is out of
scope for the multi-image PR: the right unit of work is the whole prompt
(system prompt + images + user content) as one tokenization contract, and
the longer-term direction is to bypass DiT re-tokenization entirely by
reusing embeddings.

In this PR we delete the ar_token_ids plumbing in pipeline_hunyuan_image3
and ar2diffusion, but keep the lower-level tokenizer primitives
(apply_chat_template(batch_cot_token_ids=...) and
TokenizerWrapper.get_cot_sections_from_token_ids) intact for a follow-up
PR that will do the unified tokenization properly. The _kvreuse_alignment
regression tests still pin the tokenizer-level contract.

Test plan

offline (not kv)

online (not kv)

offline (use kv)

online (use kv)

princepride · 2026-05-08T05:50:07Z

Can you offer a more specific example and share the output (including multiple images input and output text or image)?

TaffyOfficial · 2026-05-08T05:54:47Z

> Can you offer a more specific example and share the output (including multiple images input and output text or image)?

princepride · 2026-05-08T05:56:41Z

@Bounty-hunter Can you help review it?

Bounty-hunter · 2026-05-08T06:15:39Z

> @Bounty-hunter Can you help review it?

ok

princepride · 2026-05-08T06:56:09Z

@TaffyOfficial pre-commit failed

Apply two rounds of code review fixes on the multi-image IT2I PR:

Cond VAE determinism

Replace `latent_dist.sample()` + `manual_seed(0)` hardcoding with `latent_dist.mode()` on both AR (`model_executor/.../hunyuan_image3.py::_vae_encode`) and DiT (`diffusion/.../pipeline_hunyuan_image3.py`) sides. Cond image is clean (t=0) conditioning by design; posterior mean is deterministic by construction and matches the official cond encode path. Adds `.mode()` to the DiT-side `DiagonalGaussianDistribution`.

Stale compound task names (two-axis API migration)

Repo-wide grep for `{t2t,i2t,it2i,t2i}x{think,recaption,think_recaption,vanilla}` cross-product turned up two residual compound names that the initial cleanup missed:

- tests/e2e/accuracy/test_hunyuan_image3.py: task='it2i_recaption' -> task='it2i', bot_task='recaption' (would have ValueErrored at _resolve_preset on the new two-axis API).
- tests/diffusion/.../test_prompt_utils.py: task='t2i_think' / task='t2i_recaption' -> (task='t2i', bot_task='think|recaption').

Custom system prompt body forwarding (producer -> consumer trace)

Online `/v1/images/edits` accepted `sys_type='custom'` + `system_prompt` body on the AR side via `build_prompt_tokens(custom_system_prompt=...)`, but only forwarded `use_system_prompt` to the engine_prompt. DiT's `get_system_prompt(use, "image", body)` reads the body as the third positional arg, so `sys_type='custom'` was silently falling back to an empty DiT system prefix -- AR/DiT divergence under a user-visible knob. Forward `system_prompt` through both `serving_chat` engine_prompt and `stage_input_processors/hunyuan_image3.py::ar2diffusion` -> DiT `diffusion_input`.

Ratio extraction simplification

Drop the regex path on `generated_text` -- only worked under `skip_special_tokens: False`, which most deploy yamls don't set. Pure token-id reverse scan against `_build_ratio_id_lookup` is the source of truth (AR `_stage_transitions` forces exactly one `<img_ratio_*>` emission). Drop unused `_RATIO_TOKEN_RE` constant, `re` import, and `generated_text` parameter from `_extract_ratio_index`.

Housekeeping

- Remove duplicate `engine_prompt["prompt_token_ids"]` assignment in serving_chat.py (merge residue, the second copy was added by the main-merge then re-introduced after the API split).
- `examples/.../end2end.py`: stale `_TASK_PRESETS` comment -> `available_tasks` helper (symbol no longer exists post-split).
- `process_image` comment in `model_executor/.../hunyuan_image3.py` clarifies the AR-side `_resize_and_crop` default vs the official `infer_align_image_size=False` (center crop) default.

Signed-off-by: TaffyOfficial <2324465096@qq.com>

CI feedback from the previous push:

- F841: drop unused `QKEY` in test_serving_chat_multistage_generation.py
- typos: avoid the dictionary trigger on "PNGs" plural -- the lowercased form lands in the crate-ci/typos dictionary as a misspelling; rephrase to "transparent-logo uploads" without changing meaning.
- ruff-format: collapse the `build_prompt_tokens(...)` call in the e2e accuracy test back to a single line (line is under the 120 char limit ruff-format enforces locally).

Signed-off-by: TaffyOfficial <2324465096@qq.com>

…er crop)

AR-side `HunyuanImage3Processor._resize_and_crop` previously defaulted to `crop_type="resize"` (stretch), while the DiT-side condition-image helper `_resize_and_crop_center` always center-crops. For any portrait input mapped to a landscape output bucket (or vice versa), AR and DiT then conditioned on **visibly different fabric regions**: AR saw the input stretched to fit, DiT saw the input center-cropped to fit. The two cond latents disagreed on what the surroundings should be, and DiT had to inpaint the lateral canvas extension on its own, producing seam-like vertical brightness bands at the AR/DiT-disagreement boundary (reported on `/tmp/rgbfix/result.png` IT2I run with 735x1104 input -> 1280x720 output).

Change AR-side default to `crop_type="center"`, matching:

- DiT-side `_resize_and_crop_center` (always center).
- Official `generate_image(..., infer_align_image_size=False)` (the default; reading `hunyuan3.0_ins/image_processor.py:355-358` maps the False branch to `random_crop="center"`).

Add a CPU-only regression test asserting AR and DiT preprocessing produce **byte-identical** pixels for 4 src sizes x 4 target buckets, covering portrait->landscape, landscape->portrait, and square aspects. No model weights / tokenizer / HF cache required, runs in CI.

Co-authored-by: Codex
Signed-off-by: TaffyOfficial <2324465096@qq.com>

Signed-off-by: zuiho <2324465096@qq.com>

Signed-off-by: TaffyOfficial <2324465096@qq.com>

…state

Restores the IT2I online image quality observed at the magnet_repro deploy. Two changes from the PR review-feedback round regressed image quality on multi-image edit prompts:

1. 4da2ff6 switched cond VAE from `latent_dist.sample(generator)` to `latent_dist.mode()` on both AR and DiT sides. The posterior mean produces visibly degraded conditioning vs the fixed-seed sample.
2. 1785580 changed AR `_resize_and_crop` default from `"resize"` to `"center"` to match a non-existent DiT center-crop default (DiT bridge actually defaults to `"resize"` too). This broke AR/DiT preprocessing alignment instead of fixing it.

Revert both:

- AR `_resize_and_crop` default back to `"resize"` and its docstring.
- AR/DiT `_vae_encode`/`vae_encode` back to fixed-generator sample.
- Remove the now-dead `.mode()` method on `DiagonalGaussianDistribution`.
- Remove the AR/DiT byte-identical preprocessing test added by 1785580 -- it asserted the wrong invariant (AR `"center"` == DiT `_resize_and_crop_center`), which no longer holds and was never the right alignment target.

Keeps the other 4da2ff6 fixes intact: system_prompt body forwarding, ratio extraction simplification, stale `it2i_recaption` compound name cleanup, duplicate `prompt_token_ids` assignment removal.

Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>

`resolve_stop_token_ids` returned `<answer>` (128025) for all (task, bot_task) combos. For image-output tasks (`it2i` / `t2i`) this stops the AR halfway through the size/ratio tail that `_stage_transitions[</recaption>]` forces:

    </recaption><answer><boi><img_size_*><img_ratio_*><|endoftext|>
                ^^^^^^^^ stopped here, ratio never emitted

Downstream `ar2diffusion::_extract_ratio_index` then scans `cumulative_token_ids` for any `<img_ratio_*>`, finds none, and falls back to the prompt-carried `height`/`width`. In `end2end.py` for multi-image IT2I that means the first reference image's shape -- e.g. a 512x512 logo + a 1179x685 fabric reference collapses the DiT bucket to 1024x1024 square even though the AR CoT planned image_2's landscape aspect. Width and texture both regress simultaneously because DiT has to squeeze the landscape-planned content into a square bucket.

Online didn't trip this because the deploy yaml explicitly sets `stop_token_ids: [127957]` (= `<|endoftext|>`) and end2end.py is not in that codepath. `end2end.py` overrides yaml with `resolve_stop_token_ids(...)`, so offline always hit the broken stop regardless of yaml.

Fix: return `[<|endoftext|>]` for `it2i` / `t2i` so AR runs through the forced tail and `<img_ratio_*>` reaches `ar2diffusion`. Keep `[<answer>]` for `i2t` / `t2t` -- those are comprehension stages where the response body sits inside `<answer>`, so the answer-open *is* the natural terminator.

Update `test_resolve_stop_token_ids_uses_answer_for_generation_tasks` to assert the new (correct) split.

Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>

…-decode kv_ready forward

Two coupled changes so HunyuanImage3 IT2I no longer ships KV for the <answer><boi><img_size><img_ratio><eos> tail that DiT discards anyway:

1. deploy/hunyuan_image3.yaml: add ``kv_transfer_criteria`` so AR's snapshot fires at </recaption> (token id 128019). ``stop_after_transfer: false`` keeps the AR running past the snapshot so it can still emit <img_ratio_*> for ``ar2diffusion._extract_ratio_index``. With this yaml + the orchestrator change below, the colleague-confirmed invariant S - N == 1 (where S is the shipped KV length and N is the DiT-side ``positive_reuse_len``) is restored. Without the yaml the AR ships KV all the way through <eos> and S - N collapses to 6.
2. engine/orchestrator.py: ``_handle_kv_ready_raw_outputs`` previously forwarded any kv_ready EngineCoreOutput straight to the next stage. With ``stop_after_transfer: false`` the kv_ready signal fires mid-decode (snapshot at </recaption>, AR still emitting tail), so the raw EngineCoreOutput has no ``.outputs[0]`` and bridges that read the AR's full text (HunyuanImage3 ``ar2diffusion``) hit ``AttributeError``. Skip the forward when no finished output for the same req_id is present in the same raw_outputs batch; the AR's eventual natural-finish RequestOutput will trigger the forward through ``_route_output``. Bagel's existing flow (kv_ready and the deferred-stop finish output co-emit in the same batch) is preserved.

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
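The deferral rule in item 2 (forward a kv_ready signal only when a finished output for the same request arrives in the same batch) can be sketched like this. The `RawOutput` dataclass and `kv_ready_requests_to_forward` are hypothetical simplifications, not the orchestrator's real types or method.

```python
from dataclasses import dataclass


@dataclass
class RawOutput:
    """Hypothetical, simplified stand-in for one raw engine output."""

    req_id: str
    kv_ready: bool = False
    finished: bool = False


def kv_ready_requests_to_forward(raw_outputs):
    """Return req_ids whose kv_ready signal co-occurs with a finished output."""
    finished_ids = {out.req_id for out in raw_outputs if out.finished}
    return [
        out.req_id
        for out in raw_outputs
        if out.kv_ready and out.req_id in finished_ids
    ]


# Mid-decode snapshot: kv_ready fires but the AR is still emitting its tail.
mid_decode_batch = [RawOutput("req-a", kv_ready=True)]

# Co-emitted case: kv_ready and the finish output share one batch.
co_emitted_batch = [RawOutput("req-b", kv_ready=True), RawOutput("req-b", finished=True)]

deferred = kv_ready_requests_to_forward(mid_decode_batch)
forwarded = kv_ready_requests_to_forward(co_emitted_batch)
```

In the first batch nothing is forwarded, so the later natural-finish output is what triggers the hand-off; in the second the forward happens immediately, matching the co-emit flow the commit says is preserved.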

…in entry layer

Per PR vllm-project#3444 review (Gaohan123): give a friendly, input-named error at the entry boundary instead of relying on the deeper `prompt_utils._validate_num_images` to surface as a `num_images must be in [1, 3]` message. Reuse `MAX_IMAGES_PER_REQUEST` so the cap stays defined in one place.

- offline `end2end.py`: validate `--image-path` count before opening PIL
- online `serving_chat._build_multistage_generation_inputs`: validate `reference_images` count before building engine prompt data

Signed-off-by: TaffyOfficial <2324465096@qq.com>
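An entry-layer check of this kind might look like the sketch below. The helper name and the error wording are hypothetical; the cap of 3 is taken from the `[1, 3]` range quoted above, and the sketch only enforces the upper bound, leaving the lower bound to the deeper validator.

```python
MAX_IMAGES_PER_REQUEST = 3  # upper bound implied by the "[1, 3]" message above


def validate_image_count(count, input_name):
    """Raise an input-named error when too many images are supplied."""
    if count > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"{input_name}: at most {MAX_IMAGES_PER_REQUEST} images are supported "
            f"per request, got {count}"
        )


def error_for(count, input_name):
    """Small helper for the demo: return the error text, or None if valid."""
    try:
        validate_image_count(count, input_name)
    except ValueError as exc:
        return str(exc)
    return None


offline_error = error_for(4, "--image-path")
online_error = error_for(2, "reference_images")
```

Naming the offending input in the message is what makes the failure readable at the boundary, before any image is opened or any prompt data is built.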

Signed-off-by: TaffyOfficial <2324465096@qq.com>

…m (review)

Addresses Bounty-hunter's PR review on vllm-project#3444:

1. resolve_stop_token_ids: image-output tasks now stop on the full <img_ratio_*> token range (ids 128044-128076 + 130103-130106), mirroring upstream modeling_hunyuan_image_3.py:3289-3303 (`final_stop_tokens = list(range(start_ratio, end_ratio + 1))`). Replaces the earlier `<|endoftext|>` stop which let AR waste decode steps past the ratio. test_prompt_utils.py renamed/updated to pin the new contract.
2. deploy/hunyuan_image3.yaml: drop the kv_transfer_criteria block. With the ratio-range stop in place AR finishes naturally at the ratio token, so KV is capped automatically -- no need for special_token criteria + stop_after_transfer=false.
3. orchestrator._handle_kv_ready_raw_outputs: drop the finished_in_batch defer. Mid-decode kv_ready only fired when stop_after_transfer=false was forcing AR past its natural stop; with item 2 removed there is no mid-decode kv_ready to defer. The ratio strip for DiT already lives in stage_input_processors/hunyuan_image3._truncate_at_cot_end.
4. serving_chat._build_multistage_generation_inputs: call resolve_stop_token_ids(task, bot_task) and inject into the AR-stage sampling params. Online now matches offline end2end.py rather than relying on yaml-side stop_token_ids.
5. api_server.edit_images: drop the redundant `task` Form field. /v1/images/edits is always IT2I; bot_task / sys_type / system_prompt remain. Legacy bot_task=<task-enum> still works via chat-handler normalization.
6. pipeline_hunyuan_image3 + stage_input_processors/hunyuan_image3: stop reading / writing the `ar_token_ids` extra. The tokenizer-level `batch_cot_token_ids` parameter is retained for a follow-up PR that will unify system/user/cot tokenization. See PR description for the optimization leftover note.

Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
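The ratio-range stop set from item 1 can be built directly from the two id ranges quoted there. The sketch below is an illustration only: the function names are hypothetical, and `find_ratio_token` returns the raw token id rather than a ratio index, because the id-to-index mapping is not given in this PR.

```python
# Inclusive <img_ratio_*> id ranges, as quoted in item 1.
RATIO_TOKEN_RANGES = [(128044, 128076), (130103, 130106)]


def ratio_stop_token_ids():
    """Flatten the inclusive ranges into one stop-token list."""
    return [
        token_id
        for start, end in RATIO_TOKEN_RANGES
        for token_id in range(start, end + 1)
    ]


def find_ratio_token(cumulative_token_ids):
    """Return the first ratio token id in the AR output, or None if absent."""
    ratio_ids = set(ratio_stop_token_ids())
    for token_id in cumulative_token_ids:
        if token_id in ratio_ids:
            return token_id
    return None


stop_ids = ratio_stop_token_ids()

# 128019 is </recaption> and 128025 is <answer>; neither is a ratio token.
with_ratio = find_ratio_token([128019, 128025, 128050])
without_ratio = find_ratio_token([128019, 128025])
```

A `None` result corresponds to the fallback case described earlier in the PR, where the bridge has to use the prompt-carried height and width instead of the AR-predicted aspect.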

Signed-off-by: TaffyOfficial <2324465096@qq.com>

…sk input

- Online chat handler: drop `task` from extra_body; derive task from reference_images presence. Legacy `bot_task=<task-enum>` still normalizes through to the right trigger.
- Remove the AR-token-id cot reuse path (`batch_cot_token_ids` in apply_chat_template, `ctx_type == "token_ids"` branch in process_successive_message, and `get_cot_sections_from_token_ids`); it has no caller after the optimization was rolled back per reviewer feedback.
- Simplify `_truncate_at_cot_end` to text-only; the token-id return was no longer consumed.
- Trim over-explanatory comments across serving_chat / api_server / pipeline / end2end.

Signed-off-by: TaffyOfficial <2324465096@qq.com>

Signed-off-by: TaffyOfficial <2324465096@qq.com>

Collided with tests/e2e/accuracy/test_hunyuan_image3.py under pytest's default 'prepend' import mode (no __init__.py in either dir). Rename this one to make basenames unique. Signed-off-by: TaffyOfficial <2324465096@qq.com>
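The collision comes from how pytest's `prepend` import mode names test modules: it walks up from the test file while the directories contain `__init__.py`, and the module name is whatever lies below the first directory without one. The sketch below models that rule in plain Python. The helper is hypothetical, and `tests/other_dir/` is a placeholder path because the second file's location is not stated here.

```python
from pathlib import PurePosixPath


def prepend_module_name(test_path, package_dirs):
    """Model the module name pytest derives for a test file in 'prepend' mode.

    package_dirs is the set of directories that contain an __init__.py.
    """
    path = PurePosixPath(test_path)
    parts = [path.stem]
    parent = path.parent
    # Climb while the directory is a package; stop at the first non-package dir.
    while str(parent) in package_dirs:
        parts.insert(0, parent.name)
        parent = parent.parent
    return ".".join(parts)


accuracy_test = "tests/e2e/accuracy/test_hunyuan_image3.py"
other_test = "tests/other_dir/test_hunyuan_image3.py"  # placeholder path

# No __init__.py anywhere: both files map to the same top-level module name.
clash_a = prepend_module_name(accuracy_test, set())
clash_b = prepend_module_name(other_test, set())

# Making one directory a package would also have disambiguated the names.
packaged = prepend_module_name(accuracy_test, {"tests/e2e/accuracy"})
```

Renaming one file, as this commit does, is the smallest change that makes the two basenames unique without adding `__init__.py` files or switching import modes.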

…2I keeps non-square AR shape

Online /v1/images/edits collapsed AR-predicted aspects to a square (e.g. 1024x1024) while offline end2end.py honored the predicted ratio (e.g. 1216x832).

Root cause is the AR stage in deploy/hunyuan_image3.yaml was marked ``is_comprehension: false`` (read literally as "this task generates an image, not text"), but ``is_comprehension`` inside vllm-omni is the tokenizer-owning AR-stage marker, not a user-visible task type. The serving path in entrypoints/openai/serving_chat.py looks up the AR stage by that flag to apply ``resolve_stop_token_ids`` (image-task stop set = ``<img_ratio_*>`` range). With the flag false the lookup returned None, the AR kept the YAML default ``stop_token_ids: [<answer>]``, and the HunyuanImage3 custom sampler's forced-transition step ``</recaption> -> <answer>`` triggered an immediate stop. The cumulative token ids never reached ``<img_size_BASE><img_ratio_X>``, so ``ar2diffusion._extract_ratio_index`` could not recover the AR aspect and fell back to the carried-through prompt size (1024x1024 for size=auto edits).

Offline avoided this because end2end.py overrides the AR stage's stop_token_ids directly without going through the comprehension-stage lookup. Other models did not hit it because their AR stage already had ``is_comprehension: true`` (the field's framework-internal meaning).

Fix is one line on the deploy config plus a comment explaining the flag's real semantics so the next model author does not repeat the same misread.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
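The failure mode described here, a stage lookup keyed on `is_comprehension` that silently returns `None`, can be reproduced with a small sketch. The stage dictionaries and both helper functions are hypothetical simplifications of the serving path, and `128044` is just one id from the ratio range quoted in this PR.

```python
ANSWER_ID = 128025  # <answer>, the YAML default stop in the broken configuration


def find_comprehension_stage(stages):
    """Return the first stage flagged is_comprehension, or None."""
    return next((stage for stage in stages if stage.get("is_comprehension")), None)


def apply_ar_stop_override(stages, stop_token_ids):
    """Apply the stop override to the AR stage; report whether it was applied."""
    stage = find_comprehension_stage(stages)
    if stage is None:
        return False  # lookup failed: the YAML default stop set stays in place
    stage["stop_token_ids"] = list(stop_token_ids)
    return True


ratio_stop = [128044]  # one ratio-range id, enough for illustration

misread_config = [
    {"name": "ar", "is_comprehension": False, "stop_token_ids": [ANSWER_ID]},
    {"name": "dit", "is_comprehension": False},
]
fixed_config = [
    {"name": "ar", "is_comprehension": True, "stop_token_ids": [ANSWER_ID]},
    {"name": "dit", "is_comprehension": False},
]

applied_misread = apply_ar_stop_override(misread_config, ratio_stop)
applied_fixed = apply_ar_stop_override(fixed_config, ratio_stop)
```

Because the failed lookup is silent, the AR stage keeps stopping on `<answer>`; flipping the flag is enough for the override to land.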

…c from serving_chat

PR vllm-project#3444 added 84 lines of HunyuanImage-3.0-specific handling to ``serving_chat._build_multistage_generation_inputs`` (task derivation from reference images, legacy task-enum mapping on ``bot_task``, ``MAX_IMAGES_PER_REQUEST`` cap, and an AR-stage ``stop_token_ids`` override via ``resolve_stop_token_ids``).

The endpoint dispatch in ``api_server.py`` (``/v1/images/edits`` vs ``/v1/images/generations``) already encodes the task split, and the AR-stage stop override is redundant: ``HunyuanImage3ForCausalMM.sample`` already forces an EOS after sampling a ratio token (``hunyuan_image3.py`` generation-mode branch), so leaving the YAML default stop set empty lets the AR run through ``</recaption><answer><boi><img_size><img_ratio>`` and stop naturally on EOS; ``ar2diffusion._extract_ratio_index`` then reads the ratio off ``cumulative_token_ids``. The production deploy (``vllm_omni/deploy/hunyuan_image3.yaml``) already omits ``stop_token_ids`` for stage-0.

Net effect on ``serving_chat.py``: +84/-19 -> +47/-19 (-37 lines).

Behavior verified end-to-end on ``/v1/images/edits`` with a non-square target after removal: ``ar2diffusion`` reports ``AR ratio_idx=19, target size=1216x832`` (matches the offline ``end2end.py`` path), identical to the result with the now-removed override in place.

Offline ``end2end.py`` still derives ``task`` and overrides ``stop_token_ids`` because it builds the params list directly without the endpoint-level task signal; that path is intentionally unchanged.

Signed-off-by: TaffyOfficial <2324465096@qq.com>

…g_chat cleanup

The serving_chat cleanup in the previous commit removed the legacy caller compatibility layer that translated ``bot_task in {"it2i", "t2i", "i2t", "t2t"}`` to ``None`` and ``bot_task in {"it2i_think", "it2i_recaption", ...}`` to the trailing ``think``/``recaption`` part. That translation existed because old callers stuffed task enums into the ``bot_task`` field; the new contract is that the endpoint dispatch (``/v1/images/edits`` vs ``/v1/images/generations``) and ``reference_images`` presence carry the task signal, and ``bot_task`` only takes the documented values (``None`` / ``recaption`` / ``think`` / ``think_recaption`` / ``vanilla``).

Two tests in ``test_serving_chat_multistage_generation.py`` were explicitly pinning the now-removed legacy form (``test_..._legacy_bot_task_form_unchanged``, ``test_..._legacy_composite_tasks_still_work``); deleting them. Three other tests passed ``bot_task="it2i"`` only to trigger the ``build_prompt`` path (the *value* did not matter, just non-None); switching them to ``bot_task="think"`` keeps the same intent against the new validator.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
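The new contract can be summarised in a short sketch: the task comes from whether reference images are present, and `bot_task` is checked against the documented values only. Both function names are hypothetical, and mapping "no reference images" to `t2i` is an assumption based on the endpoint split described above.

```python
# Documented bot_task values, as listed in the commit message.
ALLOWED_BOT_TASKS = (None, "recaption", "think", "think_recaption", "vanilla")


def derive_task(reference_images):
    """Derive the task from reference image presence (assumed it2i vs t2i)."""
    return "it2i" if reference_images else "t2i"


def validate_bot_task(bot_task):
    """Accept only documented bot_task values; legacy task enums are rejected."""
    if bot_task not in ALLOWED_BOT_TASKS:
        raise ValueError(
            f"bot_task={bot_task!r} is not supported; expected one of {ALLOWED_BOT_TASKS}"
        )
    return bot_task


def is_rejected(bot_task):
    """Demo helper: True when the validator raises for this value."""
    try:
        validate_bot_task(bot_task)
    except ValueError:
        return True
    return False


edit_task = derive_task(["logo.png", "fabric.png"])
generation_task = derive_task([])
legacy_rejected = is_rejected("it2i")
```

Under this contract a legacy value such as `bot_task="it2i"` fails validation, which is why the remaining tests switch to `bot_task="think"`.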

Bounty-hunter

LGTM

TaffyOfficial · 2026-05-14T02:21:02Z

@Gaohan123 @hsliuustc0106

Signed-off-by: TaffyOfficial <2587297563@qq.com>

Gaohan123

LGTM. Thanks

…up (vllm-project#3444)

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <wu15922848573@outlook.com>
Signed-off-by: skf1999 <13234016272@163.com>
Signed-off-by: zuiho <2324465096@qq.com>
Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2587297563@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: TaffyOfficial <wu15922848573@outlook.com>
Co-authored-by: skf1999 <13234016272@163.com>

TaffyOfficial requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, ZeldaHuang, david6666666, gcanlin, hsliuustc0106, linyueqian, princepride, tzhouam, wtomin, yenuo26, yuanheng-zhao and ywang96 as code owners May 8, 2026 05:34

TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch 2 times, most recently from 05e2f16 to 54caf74 Compare May 8, 2026 05:40

TaffyOfficial closed this May 8, 2026

TaffyOfficial reopened this May 8, 2026

TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch from 54caf74 to 4be3584 Compare May 8, 2026 05:49

TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch 3 times, most recently from 1687ff1 to 4ec4b46 Compare May 8, 2026 06:51

TaffyOfficial added 22 commits May 14, 2026 09:52

test(hunyuan_image3): apply ruff format hook fixes

297a2f5

Signed-off-by: zuiho <2324465096@qq.com>

fix(hunyuan_image3): preserve legacy plain prompt tasks

4cf71f2

Signed-off-by: TaffyOfficial <2324465096@qq.com>

fix(hunyuan_image3): align prompt token tests with result API

cf7e4a2

Signed-off-by: TaffyOfficial <2324465096@qq.com>

fix(hunyuan_image3): harden edit bridge compatibility

4fb78a3

Signed-off-by: TaffyOfficial <2324465096@qq.com>

chore: apply pre-commit ruff format / isort fixups

029f567

Signed-off-by: TaffyOfficial <2324465096@qq.com>

chore: rename MAX_IMAGES_PER_REQUEST alias to uppercase (ruff N811)

d8b9263

Signed-off-by: TaffyOfficial <2324465096@qq.com>

chore: apply pre-commit isort split for resolve_stop_token_ids import

8d90c17

Signed-off-by: TaffyOfficial <2324465096@qq.com>

chore: apply ruff-format fixup for cot_text_list comprehension

8d12ddd

Signed-off-by: TaffyOfficial <2324465096@qq.com>

chore: keep for-loop one-line in apply_chat_template (no spurious diff)

bfd17b3

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch from 141d59f to 161ba50 Compare May 14, 2026 01:53

Bounty-hunter approved these changes May 14, 2026

Merge branch 'main' into wt-hunyuan3-it2i-multi-image

b5b4d71

Signed-off-by: TaffyOfficial <2587297563@qq.com>

Gaohan123 approved these changes May 14, 2026

Gaohan123 merged commit 3f63aaf into vllm-project:main May 14, 2026
8 checks passed

Gaohan123 mentioned this pull request May 14, 2026

[Bugfix] Align Offline and Online Inference #3506

Open

5 tasks

Gaohan123 merged 44 commits into vllm-project:main from TaffyOfficial:wt-hunyuan3-it2i-multi-image

8 participants

TaffyOfficial commented May 8, 2026 • edited

Summary

Commit 1 — Multi-image IT2I support

Commit 2 — prompt_utils API cleanup: split `task` and `bot_task`

Commits 3-5 — Online (`/v1/images/edits`) ↔ offline AR byte-alignment

Commit 6 — RGBA cond image normalization

Commit 7 — Cond VAE determinism: `.sample()` → `.mode()`

Commit 9 — Cond preprocessing revert to magnet_repro baseline

Commit 10 — Stop AR on `<|endoftext|>` for image-output tasks

Commit 11 — Cap AR KV snapshot at `</recaption>`, defer mid-decode kv_ready

Commit 12 — Entry-layer cap on input image count (review feedback)

Commit 13 — Review iteration: align AR stop / KV cap / edits Form with upstream

Optimization leftover: unified system/user/cot tokenization

Test plan
