
[Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup #3444

Merged
Gaohan123 merged 44 commits into vllm-project:main from TaffyOfficial:wt-hunyuan3-it2i-multi-image
May 14, 2026

@TaffyOfficial (Contributor) commented May 8, 2026

Summary

This PR makes HunyuanImage-3.0-Instruct's IT2I path support up to 3 reference
images per request ("Multi-Image Fusion", as upstream supports), and folds in
two follow-up fixes:

  1. A prompt_utils API cleanup that splits the conflated task parameter
    into two orthogonal axes.
  2. Online (/v1/images/edits) ↔ offline (end2end.py img2img) AR alignment,
    so identical (prompt, image, seed) produces identical AR output across the
    two paths.

Both are included here rather than in separate PRs because they touch the same
prompt_utils.py / serving / pipeline surface and would otherwise create
3-way merge conflicts.

Logical commits in chronological order:

  1. Multi-image IT2I support — N consecutive <img> placeholders, per-image
    VAE buckets, ragged flat_from_sizes reconstruction.
  2. prompt_utils API cleanup — split task and bot_task.
  3. HF byte-equivalent prompt_token_ids in /v1/images/edits — segment-
    wise tokenization through build_prompt_tokens, mirrors offline end2end.py.
  4. task / bot_task / sys_type / system_prompt Form fields at the OpenAI
    edits endpoint, so callers can drive the same (task, bot_task) axes the
    offline example uses.
  5. size="auto" no longer collapses non-square AR-predicted aspect:
    defers width/height to the AR <img_ratio_*> bucket via
    stage_input_processors/hunyuan_image3.py.
  6. RGBA cond image normalization — RGBA → RGB via white-bg
    alpha-composite, gated on Hunyuan-aware request params, fixes systematic
    online/offline divergence on PNGs with transparency.
  7. Cond VAE determinism: latent_dist.sample() (consumes torch global
    RNG) → latent_dist.mode() (deterministic posterior mean, matches the
    official cond encode path for clean t=0 conditioning).
  8. Merge origin/main — picks up HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS,
    resolve_stop_token_ids, and the AR stop-token plumbing from PR [Config] Add HunyuanImage3 deploy configs #3172.
  9. Cond preprocessing revert (magnet_repro baseline) — AR
    _resize_and_crop default back to crop_type='resize', cond VAE encode
    back to .sample(generator=fixed) instead of .mode(). Center-crop was
    visibly under-conditioning portrait-input → landscape-output edits;
    restoring the IT2I demo's tuned preprocessing recovers the magnet repro.
    Supersedes Commit 7 for the VAE path (determinism preserved via
    fixed generator instead of posterior mean).
  10. Stop AR on <|endoftext|> for image-output tasks:
    resolve_stop_token_ids was returning <answer> for every (task,
    bot_task). For it2i/t2i, that chops off the
    <answer><boi><img_size_*><img_ratio_*><|endoftext|> tail forced by
    _stage_transitions; _extract_ratio_index then finds no
    <img_ratio_*> and silently collapses the DiT bucket to the first
    reference image's shape (square logo → 1024x1024 even when AR's CoT
    planned a landscape). Now returns <|endoftext|> for image-output
    tasks; comprehension i2t/t2t still stop on <answer>.
  11. Cap AR KV snapshot at </recaption>:
    deploy/hunyuan_image3.yaml sets
    kv_transfer_criteria.type=special_token, token_id=128019,
    stop_after_transfer=false. Shipped KV now exactly matches the prefix
    DiT reuses (S−N=1 invariant). AR keeps running past the snapshot so
    it can still emit <img_ratio_*> for the bridge; orchestrator
    _handle_kv_ready_raw_outputs defers the kv_ready forward when the
    same raw_outputs batch doesn't yet contain a finished output for that
    req_id, avoiding the ar_output.outputs[0] AttributeError that
    bridges hit when kv_ready fires mid-decode.
  12. Entry-layer cap on input image count (reviewer feedback from
    @Gaohan123): --image-path (offline) and
    _build_multistage_generation_inputs (online) now reject more than
    MAX_IMAGES_PER_REQUEST (=3) images up front with an input-named error,
    instead of letting the deeper _validate_num_images surface as
    "num_images must be in [1, 3]".
  13. Review iteration: align AR stop / KV cap / edits Form with upstream
    (feedback from @Bounty-hunter): supersedes commits 10
    and 11. resolve_stop_token_ids for image-output tasks now returns
    the full <img_ratio_*> token range
    (list(range(start_ratio, end_ratio + 1)) +
    ratio_token_other_slices), mirroring upstream
    modeling_hunyuan_image_3.py:3289-3303. AR stops AT the ratio
    token; KV is capped naturally; bridge _truncate_at_cot_end already
    strips the ratio tail for DiT. kv_transfer_criteria block in the
    deploy yaml is removed; orchestrator finished_in_batch defer is
    removed (no mid-decode kv_ready exists in the new flow).
    serving_chat._build_multistage_generation_inputs now calls
    resolve_stop_token_ids so online matches offline end2end.py.
    The task Form field on /v1/images/edits is dropped (the
    endpoint is always IT2I; bot_task / sys_type / system_prompt
    are the remaining knobs). The cot_token_ids_list segment-token
    forwarding in pipeline_hunyuan_image3 is also removed; see the
    "Optimization leftover" note below.

Plus housekeeping: ratio extraction simplified to a pure token-id reverse scan
(regex path dropped — token-ids are source of truth, AR yamls run with
skip_special_tokens=True), stale compound task names cleaned out of the e2e
test, AR/DiT system_prompt body forwarding so sys_type='custom' works
end-to-end, duplicate engine_prompt["prompt_token_ids"] assignment removed.


Commit 1 — Multi-image IT2I support

HunyuanImage-3.0-Instruct supports up to 3 reference images per IT2I request
(README §200-216, §500). vllm-omni's DiT pipeline, AR processor, OpenAI
schema, and ar2diffusion bridge already accepted list-shaped
multi_modal_data["image"], but four call sites still encoded a hard "N=1"
assumption. End-to-end smoke (4× L20X) on the official input_1_0.png +
input_1_1.png demo pair runs cleanly and preserves each image's native
bucket.

Surgery points:

  • prompt_utils.build_prompt(_tokens) takes num_images: int (default 1,
    validated 1 ≤ N ≤ 3 for image-input tasks) and emits N consecutive <img>
    placeholders between User: and the user prompt, matching the official
    apply_general_template "successive user message" wrapping.
  • HunyuanImage3Processor.process_image: each cond image keeps its own VAE
    reso_group bucket. Per-image VAE pixel tensors are flattened to 1-D and
    concatenated; vae_pixel_size declares per-image numel so vLLM splits the
    buffer back per image at consumption time via
    MultiModalFieldConfig.flat_from_sizes(..., vae_pixel_size) (mirrors the
    GLM-Image / Ming-Flash-Omni pattern).
  • _parse_and_validate_image_input reconstructs a list of per-image
    (3, H_i, W_i) tensors from vae_token_grid_hw; embed_multimodal loops
    over the list for VAE encode + patch_embed.
  • examples/.../end2end.py: --image-path accepts comma-separated paths;
    mm_image_payload is unwrapped to a single image when N=1 to keep the
    legacy single-image call shape.
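
The flatten-then-split bookkeeping described in the second and third bullets can be sketched in plain Python. This is only an illustration: `flatten_ragged`, `split_ragged`, and `make_image` are hypothetical names, nested lists stand in for tensors, and the recorded shapes stand in for `vae_token_grid_hw`. The real code relies on vLLM's `MultiModalFieldConfig.flat_from_sizes`, which is not reproduced here.

```python
def flatten_ragged(images):
    """Flatten per-image (C, H, W) nested lists into one 1-D buffer.

    Returns the buffer plus per-image element counts and shapes, which is
    the metadata a consumer needs to undo the concatenation.
    """
    flat, sizes, shapes = [], [], []
    for img in images:
        shape = (len(img), len(img[0]), len(img[0][0]))
        values = [v for channel in img for row in channel for v in row]
        flat.extend(values)
        sizes.append(len(values))
        shapes.append(shape)
    return flat, sizes, shapes


def split_ragged(flat, sizes, shapes):
    """Rebuild the list of per-image (C, H, W) nested lists."""
    images, offset = [], 0
    for size, (c, h, w) in zip(sizes, shapes):
        chunk = flat[offset:offset + size]
        offset += size
        images.append(
            [[chunk[(ci * h + r) * w:(ci * h + r + 1) * w] for r in range(h)]
             for ci in range(c)]
        )
    return images


def make_image(c, h, w, start):
    """Toy image filled with consecutive integers starting at `start`."""
    values = iter(range(start, start + c * h * w))
    return [[[next(values) for _ in range(w)] for _ in range(h)] for _ in range(c)]


# Two images with different (H, W), i.e. different buckets.
images = [make_image(3, 2, 4, 0), make_image(3, 3, 2, 100)]
flat, sizes, shapes = flatten_ragged(images)
restored = split_ragged(flat, sizes, shapes)
```

The point of the sketch is that the per-image sizes must travel with the flat buffer; without them the second image's bucket cannot be recovered.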

Commit 2 — prompt_utils API cleanup: split task and bot_task

CR feedback observed that the old _TASK_PRESETS table conflated I/O
modality with prompting mode (e.g. it2i_think, t2i_recaption,
t2i_vanilla) and carried a bot_task field that was dead code under
every sys_type exercised in this codebase (only sys_type='dynamic'
consumed it, and nothing ever set that). Split into two orthogonal axes:

| axis | values | controls |
| --- | --- | --- |
| task | t2t, i2t, it2i, t2i | whether <img> placeholders are emitted |
| bot_task | None, think, recaption, think_recaption, vanilla | system prompt + trigger tag |

Resolution table:

| bot_task | sys_type | trigger tag |
| --- | --- | --- |
| None | en_unified | (none) |
| think | en_unified | <think> |
| recaption | en_unified | <recaption> |
| think_recaption | en_think_recaption | <think> |
| vanilla | en_vanilla | (none; no chat template, task='t2i' only) |

bot_task='vanilla' is validated to only combine with task='t2i';
unknown task / bot_task values raise ValueError. Public helpers
available_bot_tasks() and resolve_sys_type(bot_task) let callers derive
the default sys_type without re-encoding the table.
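
A minimal sketch of how the two-axis resolution could look, assuming nothing beyond the tables above. `resolve` and the two module-level constants are hypothetical names for illustration, not the signatures of the helpers in `prompt_utils.py`.

```python
# Hypothetical mirror of the resolution table; not the project's code.
_TASKS = ("t2t", "i2t", "it2i", "t2i")
_BOT_TASK_TABLE = {
    None: ("en_unified", None),
    "think": ("en_unified", "<think>"),
    "recaption": ("en_unified", "<recaption>"),
    "think_recaption": ("en_think_recaption", "<think>"),
    "vanilla": ("en_vanilla", None),
}


def resolve(task, bot_task=None):
    """Return (sys_type, trigger_tag) for a (task, bot_task) pair.

    Raises ValueError for unknown values and for bot_task='vanilla'
    combined with anything other than task='t2i'.
    """
    if task not in _TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if bot_task not in _BOT_TASK_TABLE:
        raise ValueError(f"unknown bot_task: {bot_task!r}")
    if bot_task == "vanilla" and task != "t2i":
        raise ValueError("bot_task='vanilla' only combines with task='t2i'")
    return _BOT_TASK_TABLE[bot_task]
```

For example, `resolve("it2i", "think")` yields `("en_unified", "<think>")`, while `resolve("it2i", "vanilla")` raises.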

Migration mapping for any downstream caller:

| old | new |
| --- | --- |
| task='t2t' | task='t2t', bot_task=None |
| task='i2t' | task='i2t', bot_task=None |
| task='it2i_think' | task='it2i', bot_task='think' |
| task='it2i_recaption' | task='it2i', bot_task='recaption' |
| task='t2i_think' | task='t2i', bot_task='think' |
| task='t2i_recaption' | task='t2i', bot_task='recaption' |
| task='t2i_vanilla' | task='t2i', bot_task='vanilla' |
| (newly accessible) | task='t2i', bot_task='think_recaption' → en_think_recaption |

This is a hard breaking change with no aliases. Repo-wide grep across
tests/, examples/, vllm_omni/, and deploy/*.yaml confirms no
remaining references to the old compound strings or to _TASK_PRESETS.
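
Because there are no aliases, a downstream caller that still holds old compound strings has to translate them itself. The following is a hypothetical shim built only from the migration table; `migrate_task` is not part of this PR.

```python
# Hypothetical downstream shim derived from the migration table above.
_OLD_TO_NEW = {
    "t2t": ("t2t", None),
    "i2t": ("i2t", None),
    "it2i_think": ("it2i", "think"),
    "it2i_recaption": ("it2i", "recaption"),
    "t2i_think": ("t2i", "think"),
    "t2i_recaption": ("t2i", "recaption"),
    "t2i_vanilla": ("t2i", "vanilla"),
}


def migrate_task(old_task):
    """Translate a legacy compound task string into (task, bot_task)."""
    try:
        return _OLD_TO_NEW[old_task]
    except KeyError:
        raise ValueError(f"no migration known for task={old_task!r}") from None
```

A caller would then pass the two returned values as separate `task` and `bot_task` arguments.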

Side fix on build_prompt: the legacy code stripped the system prompt's
leading whitespace while build_prompt_tokens did not. Invisible while every
system prompt was unified_system_prompt_en (no leading newline) but newly
observable now that bot_task='think_recaption' exposes en_think_recaption
(which starts with \n). build_prompt now keeps the system prompt verbatim,
matching the segment-by-segment tokenization path and HF's
apply_chat_template byte-for-byte.

end2end.py CLI changes: --bot-task choices are now
{none, think, recaption, think_recaption, vanilla}. The literal none is
the explicit way to request bot_task=None on a modality whose default is
think (e.g. text2img / img2img); leaving --bot-task unset still falls
back to the modality default. The duplicated _TASK_PRESETS literal in the
example script is removed in favor of resolve_sys_type(bot_task). AR stop
token ids are now resolved programmatically via resolve_stop_token_ids(task, bot_task, tokenizer) rather than hardcoded in the deploy yaml — keeps the
example self-contained and survives yaml drift.


Commits 3-5 — Online (/v1/images/edits) ↔ offline AR byte-alignment

The OpenAI edits endpoint built the AR prompt as a single string and let the
engine tokenizer run a whole-string BPE pass; offline end2end.py img2img
went through build_prompt_tokens segment-by-segment and fed the result via
prompt_token_ids. The two encodings differ on segment boundaries (e.g.
user-prompt-ends-with- + next-segment-\n\n → merged id 3490 vs
HF's [1811, 271]), so identical (prompt, image, seed) requests produced
diverging cot_text → diverging DiT input → diverging final image.
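
The boundary effect can be reproduced with a toy greedy longest-match tokenizer. Everything here is an assumption for illustration: the three-entry vocabulary, the ids, and the choice of `.` as the boundary character are invented, and real BPE merging is more involved. The sketch only shows why encoding the joined string and encoding segment by segment can disagree.

```python
# Toy vocabulary: one merged piece spans what would be a segment boundary.
TOY_VOCAB = {".\n\n": 3, ".": 1, "\n\n": 2}


def encode(text):
    """Greedy longest-match encode; unknown characters get 1000 + ord(ch)."""
    ids, i = [], 0
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(TOY_VOCAB[piece])
                i += len(piece)
                break
        else:
            ids.append(1000 + ord(text[i]))
            i += 1
    return ids


def encode_segments(segments):
    """Encode each segment on its own, then concatenate the ids."""
    return [tid for seg in segments for tid in encode(seg)]


segments = ["make it red.", "\n\nAssistant: "]
whole_ids = encode("".join(segments))      # boundary merges into id 3
segment_ids = encode_segments(segments)    # boundary stays as ids 1, 2
```

Feeding `prompt_token_ids` built segment by segment on both paths removes this source of divergence, which is the approach the bullets below describe.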

  • serving_chat._build_multistage_generation_inputs now goes through
    build_prompt_tokens when a tokenizer is plumbed, byte-for-byte matching
    apply_chat_template. Also forwards use_system_prompt and (when the
    caller sets sys_type=custom) the verbatim system_prompt body so DiT can
    rebuild the same system prefix.
  • api_server.py exposes new Form fields on /v1/images/edits: task,
    sys_type, system_prompt. Legacy callers that pass a task enum under
    the bot_task field still work (normalized to the canonical split). This
    subsumes the simpler tokenizer plumbing landed in main as PR [Bug][Hunyuanimage 3.0] fix different AR encode behavior between online and offline #3500 — we
    additionally forward bot_task, sys_type, num_images, and
    use_system_prompt.
  • size="auto" resolution now skips the gen_params / extra_body width/height
    writes that would otherwise pin the bridge to the first reference image's
    bucket and collapse non-square AR-predicted aspects to square in the
    multi-image / mismatched-aspect case. stage_input_processors/hunyuan_image3.py
    prefers the AR's predicted <img_size_*><img_ratio_*> tail (mirrors
    upstream's reso_group[ratio_index] lookup) over the carried-through
    height/width.

stage_input_processors/hunyuan_image3.py also drops the regex fallback on
generated_text for ratio extraction (only worked under
skip_special_tokens: False, which most deploy yamls don't set) and goes
straight to a token-id reverse scan against the tokenizer's <img_ratio_*>
id range — token-ids survive skip_special_tokens: True and are the source
of truth.
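
A reverse scan of this kind might look like the sketch below. It uses the ratio id ranges quoted elsewhere in this description (128044 to 128076 for `<img_ratio_0..32>`, 130103 to 130106 for `<img_ratio_33..36>`) and assumes that ids map to ratio indices in order. The function names are illustrative and do not claim to match `_build_ratio_id_lookup` or `_extract_ratio_index`.

```python
def build_ratio_id_lookup():
    """Map each <img_ratio_*> token id to its ratio index.

    Assumption: ids are assigned in index order within each range.
    """
    lookup = {tid: idx for idx, tid in enumerate(range(128044, 128077))}
    lookup.update({tid: 33 + i for i, tid in enumerate(range(130103, 130107))})
    return lookup


def extract_ratio_index(token_ids, lookup):
    """Scan from the end and return the last ratio index, or None."""
    for tid in reversed(token_ids):
        if tid in lookup:
            return lookup[tid]
    return None


lookup = build_ratio_id_lookup()
```

Scanning token ids rather than decoded text keeps working when special tokens are stripped from the text output.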


Commit 6 — RGBA cond image normalization

Online edit requests submitting PNGs with transparency systematically
produced different AR recaption text than offline (online "3 magnets" vs
offline "1 magnet" on the same input_2_*.png pair). Root cause was
not CUDA / MoE non-determinism — it was a systematic preprocessing
divergence:

  • Offline end2end.py img2img calls Image.open(p).convert("RGB"), which
    drops the alpha channel without compositing, so fully transparent pixels
    keep their stored RGB values (black in the test PNG).
  • Online previously skipped the RGB conversion, so the PIL-decoded RGBA went
    into the AR processor's vision encoder as-is, and downstream layers
    composited transparent pixels over a white canvas.

That single bit of difference (black bg vs white bg) on 57,671 transparent
pixels in the test PNG was enough to flip the AR's caption from a 1-object
description to a 3-object description, and the DiT followed the AR's
direction. api_server._load_input_images(..., normalize_rgb=True) is now
opt-in via Hunyuan-aware params (task / bot_task / sys_type present in
the request), defaulting to the offline behavior of explicit .convert("RGB").
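
The black-versus-white difference can be shown per pixel without any imaging library. This is a pure-Python sketch with hypothetical helper names; it models "drop the alpha channel" and "alpha-composite over a background" as two separate conversions and does not reproduce `_load_input_images`.

```python
def drop_alpha(pixel):
    """RGBA -> RGB by discarding alpha; stored RGB values are kept as-is."""
    r, g, b, _alpha = pixel
    return (r, g, b)


def composite_over(pixel, background=(255, 255, 255)):
    """RGBA -> RGB by alpha-compositing the pixel over a solid background."""
    r, g, b, a = pixel
    alpha = a / 255
    return tuple(
        round(channel * alpha + bg * (1 - alpha))
        for channel, bg in zip((r, g, b), background)
    )


transparent_black = (0, 0, 0, 0)      # fully transparent, RGB stored as black
opaque_pixel = (10, 20, 30, 255)      # fully opaque

as_dropped = drop_alpha(transparent_black)
as_composited = composite_over(transparent_black)
```

Opaque pixels come out identical under both conversions; only pixels with alpha below 255 differ, which is why the divergence showed up only on inputs with transparency.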

The methodology lesson — systematic cross-path bias is not explainable
by stochastic CUDA/MoE non-determinism, and AR input alignment requires
three pillars (prompt token bytes, image tensor bytes, sampling params)
— is now codified as CLAUDE.md hard rule B21.


Commit 7 — Cond VAE determinism: .sample() → .mode()

Both the AR-side model_executor/.../hunyuan_image3.py::_vae_encode and the
DiT-side pipeline_hunyuan_image3.py previously called
vae_encode_result.latent_dist.sample() for cond image encoding. .sample()
without a generator consumes torch's global RNG, which is a silent
non-determinism source: cond latents drift between requests on a
long-running server while looking deterministic for fresh-process callers.

Cond image is clean (t=0) conditioning by design — the official upstream
takes the posterior mean for cond encode. Switched both call sites to
.mode() (added the method on the DiT-side DiagonalGaussianDistribution
to match the AR-side autoencoder_kl_3d shape).

Superseded by Commit 9 (preprocessing revert). The cond VAE path is now
back to .sample(generator=torch.Generator().manual_seed(0)) to match the
magnet_repro baseline; determinism is still maintained via the fixed seed
rather than the posterior mean. .mode() is no longer used in the DiT-side
pipeline_hunyuan_image3.py (DiagonalGaussianDistribution.mode() reverted
with it).
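
The three options discussed here (global RNG, fixed generator, posterior mean) can be compared with a standard-library analogy. In this sketch the `random` module stands in for torch's global RNG and `random.Random(0)` stands in for a freshly seeded `torch.Generator`; `sample_latent` and `mode_latent` are hypothetical names, not the VAE code.

```python
import random


def sample_latent(mean, std, rng=None):
    """Draw mean + std * eps. With rng=None the module-global RNG is used."""
    source = random if rng is None else rng
    return [m + s * source.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def mode_latent(mean, std):
    """Posterior mean: no randomness involved."""
    return list(mean)


mean, std = [0.5, -1.0, 2.0], [0.1, 0.2, 0.3]

# Global RNG: each call advances shared state, so repeated calls drift.
first_global = sample_latent(mean, std)
second_global = sample_latent(mean, std)

# Fresh generator with a fixed seed per call: repeated calls agree.
first_fixed = sample_latent(mean, std, random.Random(0))
second_fixed = sample_latent(mean, std, random.Random(0))
```

Both the fixed-seed draw and the mean are repeatable across requests; they differ in that the fixed-seed draw still adds a (constant) noise term to the mean.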


Commit 9 — Cond preprocessing revert to magnet_repro baseline

The intermediate "AR/DiT center-crop alignment" introduced earlier in this
PR (b83962160 + companion DiT comments) made AR's _resize_and_crop and
DiT's _resize_and_crop_center both default to crop_type='center', with
the intent that AR and DiT condition on byte-identical pixels. Visually it
under-conditioned the IT2I demo — portrait input expanded into a landscape
output had the conditioning crop drop too much of the relevant content, and
the magnet repro regressed.

Rolled back to the magnet_repro state:

  • AR _resize_and_crop default is crop_type='resize' (the path
    infer_align_image_size=True exercises in the IT2I demo: stretch the
    cond image to the bucket dims so <img_ratio_*> and ViT/VAE features
    stay aligned with the bucket rather than dropping content).
  • Cond VAE encode in pipeline_hunyuan_image3.py switches back to
    latent_dist.sample(torch.Generator(device=image.device).manual_seed(0));
    the global-RNG concern from Commit 7 is addressed by the fixed seed
    rather than the posterior mean.
  • DiagonalGaussianDistribution.mode() and the now-obsolete
    AR↔DiT-byte-match regression test
    (test_ar_and_dit_condition_image_preprocessing_match_without_hf_cache)
    are removed.

AR and DiT no longer share byte-identical conditioning pixels (AR stretches,
DiT center-crops), but the upstream magnet_repro tuning is faithfully
reproduced and the visual quality regression is gone.
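
How much content a center crop discards for a portrait-to-landscape edit can be estimated with simple geometry. The sketch assumes the center crop is a cover-resize (scale so the image fills the bucket) followed by a centered crop, which may differ in detail from `_resize_and_crop_center`. The function names are hypothetical, and the 735x1104 input with a 1280x720 bucket is the case mentioned in this PR's commit log.

```python
def center_crop_geometry(src_w, src_h, dst_w, dst_h):
    """Cover-resize then center crop.

    Returns (scale, left, top, kept_fraction), where kept_fraction is the
    share of the source image area that survives the crop.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    resized_w, resized_h = src_w * scale, src_h * scale
    left = (resized_w - dst_w) / 2
    top = (resized_h - dst_h) / 2
    kept_fraction = (dst_w * dst_h) / (resized_w * resized_h)
    return scale, left, top, kept_fraction


def stretch_scales(src_w, src_h, dst_w, dst_h):
    """Plain resize to the bucket: per-axis scale factors, nothing cropped."""
    return dst_w / src_w, dst_h / src_h


scale, left, top, kept = center_crop_geometry(735, 1104, 1280, 720)
scale_x, scale_y = stretch_scales(735, 1104, 1280, 720)
```

Under these assumptions the center crop keeps a bit over a third of the portrait input, while the stretch keeps all of it at the cost of distorting the aspect ratio.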


Commit 10 — Stop AR on <|endoftext|> for image-output tasks

Superseded by Commit 13 (review iteration). resolve_stop_token_ids
for image-output tasks now returns the full <img_ratio_*> token range
instead of [<|endoftext|>], matching upstream
modeling_hunyuan_image_3.py:3289-3303. The motivating problem (square
bucket collapse from missing <img_ratio_*>) is still solved; AR now
stops earlier (at the ratio token itself) so no decode steps are wasted
on <|endoftext|> after the ratio is sampled.

Original Commit 10:

resolve_stop_token_ids returned [<answer>] (id 128025) for every (task,
bot_task) pair. For image-output tasks (it2i / t2i) the
_stage_transitions[</recaption>] rule force-emits
<answer><boi><img_size_*>, then _apply_ratio_restriction samples
<img_ratio_*>, then <|endoftext|>. Stopping on <answer> cuts off the
size/ratio tail; ar2diffusion::_extract_ratio_index then scans
cumulative_token_ids for any <img_ratio_*> id, finds none, and falls
back to the prompt-carried height/width — which is the first reference
image's bucket in multi-image IT2I. Effect: a 512×512 logo + 1179×685
fabric collapses to a square output even when AR's CoT planned a landscape;
width and texture regress simultaneously because DiT has to squeeze the
landscape-planned content into a square.

Online didn't trip this because the deploy yaml explicitly set
stop_token_ids: [127957] (= <|endoftext|>). end2end.py overrode the
yaml with resolve_stop_token_ids(...), so offline always hit the broken
stop regardless of yaml.

Fix: resolve_stop_token_ids returns [<|endoftext|>] for it2i / t2i
so AR runs through the forced tail and <img_ratio_*> reaches the bridge.
i2t / t2t keep [<answer>] — those are comprehension stages where the
response body sits inside <answer> and the answer-open is the natural
terminator. test_resolve_stop_token_ids_image_tasks_stop_on_eos_not_answer
pins the new split.


Commit 11 — Cap AR KV snapshot at </recaption>, defer mid-decode kv_ready

Superseded by Commit 13 (review iteration). With the ratio-range
stop in place AR finishes naturally at the ratio token, so the shipped
KV is automatically the prefix DiT reuses; there is no mid-decode
kv_ready to defer. The kv_transfer_criteria yaml block, the
stop_after_transfer=false flag, and the
orchestrator._handle_kv_ready_raw_outputs finished_in_batch defer
are all removed. The bridge already strips the trailing ratio token
from the cot it forwards to DiT (via
stage_input_processors/hunyuan_image3._truncate_at_cot_end).

Original Commit 11:

Before this commit, AR shipped its KV all the way through the
</recaption><answer><boi><img_size_*><img_ratio_*><|endoftext|> tail.
DiT then reused only the prefix up through </recaption> (the colleague-
confirmed positive_reuse_len invariant), so S − N == 6 instead of the
intended S − N == 1: six tail-token positions of KV were transferred and
immediately discarded, and the AR pipeline kept emitting tokens DiT would
never use.

deploy/hunyuan_image3.yaml:

omni_kv_config:
  need_send_cache: true
  kv_transfer_criteria:
    type: special_token
    token_id: 128019         # </recaption>
    stop_after_transfer: false

stop_after_transfer: false keeps the AR running past the snapshot so it
still emits <img_ratio_*> for ar2diffusion::_extract_ratio_index (which
derives output height/width). The mid-decode kv_ready signal that this
combination produces previously crashed bridges that read
ar_output.outputs[0] (no finished RequestOutput exists yet).
Orchestrator._handle_kv_ready_raw_outputs now defers the kv_ready
forward when the same raw_outputs batch doesn't yet contain a finished
output for that req_id; AR's natural completion later triggers the
forward through _route_output.

Net effect: KV transferred is byte-equivalent to what DiT actually reuses
(S − N == 1), AR no longer wastes 5 decode steps on tail tokens that DiT
discards, and <img_ratio_*> still reaches the bridge.


Commit 12 — Entry-layer cap on input image count (review feedback)

Per @Gaohan123's review on this PR: the MAX_IMAGES_PER_REQUEST = 3 cap
lived in prompt_utils._validate_num_images, which surfaced as
ValueError: num_images must be in [1, 3], got N deep inside the AR
prompt builder. The reviewer asked for a friendly, input-named error at
the entry boundary so users see the limit on the parameter they actually
typed.

Added in two places, both reusing MAX_IMAGES_PER_REQUEST (no hardcoded 3):

  • examples/offline_inference/hunyuan_image3/end2end.py — validate
    --image-path count before opening any PIL image.
  • vllm_omni/entrypoints/openai/serving_chat.py::_build_multistage_generation_inputs
    — validate reference_images count before building engine prompt data.

Behavior is otherwise unchanged: the deeper _validate_num_images cap is
still a hard backstop for any future callers that don't pass through these
entry points.
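
An entry-layer check of this shape could be as small as the sketch below. `parse_image_paths` is a hypothetical helper, and the error wording is illustrative apart from the "accepts at most 3 images" phrase reported in the test plan.

```python
MAX_IMAGES_PER_REQUEST = 3  # value stated in this PR description


def parse_image_paths(arg):
    """Split a comma-separated --image-path value and enforce the cap.

    Validation happens before any image is opened, so the error names the
    parameter the user actually typed.
    """
    paths = [p.strip() for p in arg.split(",") if p.strip()]
    if not paths:
        raise ValueError("--image-path requires at least one path")
    if len(paths) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"--image-path accepts at most {MAX_IMAGES_PER_REQUEST} images, "
            f"got {len(paths)}"
        )
    return paths
```

Reusing the shared constant in the message keeps the entry check and the deeper backstop from drifting apart if the limit changes.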


Commit 13 — Review iteration: align AR stop / KV cap / edits Form with upstream

Per @Bounty-hunter's review, the AR-stop and KV-cap logic from Commits 10
and 11 is replaced with the upstream-faithful approach from
modeling_hunyuan_image_3.py:3289-3303 (with _ConditionalSliceVocabLogitsProcessor
forcing the next token after <img_size_base> into the ratio range):

final_stop_tokens = list(range(start_ratio_token_id, end_ratio_token_id + 1))
for start, end in ratio_token_other_slices:
    final_stop_tokens.extend(range(start, end))

AR's natural trajectory under _stage_transitions is
</recaption><answer><boi><img_size_base><img_ratio_X>. Stopping AT the
ratio token means:

  • KV ends exactly at the prefix DiT reuses; no need for kv_transfer_criteria
    special_token block or stop_after_transfer=false in the deploy yaml.
  • ar2diffusion::_extract_ratio_index reads the last token to derive the
    output H/W.
  • _truncate_at_cot_end (already in the bridge) trims the cot at
    </recaption> before forwarding to DiT, so the trailing
    <answer><boi><img_size_X><img_ratio_X> never contaminates DiT's prompt
    builder.
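
For concreteness, the stop-token list can be built with the id ranges quoted under "Net additions" in this description. The wrapper function `ratio_stop_token_ids` is a hypothetical name; only the loop body follows the upstream snippet shown above.

```python
def ratio_stop_token_ids(start_ratio_token_id, end_ratio_token_id, other_slices):
    """Build the stop list: an inclusive main range plus half-open extra slices."""
    stop_ids = list(range(start_ratio_token_id, end_ratio_token_id + 1))
    for start, end in other_slices:
        stop_ids.extend(range(start, end))
    return stop_ids


# 128044..128076 covers <img_ratio_0..32>; 130103..130106 covers
# <img_ratio_33..36>, per the ranges quoted in this PR description.
stop_ids = ratio_stop_token_ids(128044, 128076, [(130103, 130107)])
```

Note the asymmetry in the upstream snippet: the main range is given by an inclusive end id, while the extra slices are half-open.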

Net deletions:

  • vllm_omni/deploy/hunyuan_image3.yaml — drop the
    omni_kv_config.kv_transfer_criteria block (special_token + token_id +
    stop_after_transfer: false).
  • vllm_omni/engine/orchestrator.py::_handle_kv_ready_raw_outputs — drop
    the finished_in_batch defer; mid-decode kv_ready no longer happens.
  • vllm_omni/entrypoints/openai/api_server.py::edit_images — drop the
    task: str | None = Form(None) field. The endpoint is always IT2I;
    bot_task / sys_type / system_prompt cover the remaining knobs;
    legacy bot_task=<task-enum> still works via chat-handler normalization.

Net additions:

  • vllm_omni/diffusion/models/hunyuan_image3/prompt_utils.py::resolve_stop_token_ids
    — image-output tasks return the full ratio token range
    (list(range(128044, 128077)) for <img_ratio_0..32> plus
    range(130103, 130107) for <img_ratio_33..36>).
  • vllm_omni/entrypoints/openai/serving_chat.py::_build_multistage_generation_inputs
    — after resolving (task, bot_task), call resolve_stop_token_ids and
    inject into the AR-stage sampling params, matching offline end2end.py
    behavior. Without this the yaml-side default would let AR generate to
    max_tokens=8192.

Optimization leftover: unified system/user/cot tokenization

pipeline_hunyuan_image3 previously forwarded AR-sampled ar_token_ids
through ar2diffusion -> extra["ar_token_ids"] ->
prepare_model_inputs(cot_token_ids=...), preferring those token ids over
re-encoding the decoded cot_text. This avoided BPE re-merge drift across template
segment boundaries (e.g. "。\n\n" collapsing to a single id) that would
otherwise break positive_reuse_len and trigger the silent slice in
inject_ar_kv_into_layers.

Per @Bounty-hunter's review, this single-point optimization is out of
scope for the multi-image PR: the right unit of work is the whole prompt
(system prompt + images + user content) as one tokenization contract, and
the longer-term direction is to bypass DiT re-tokenization entirely by
reusing embeddings.

In this PR we delete the ar_token_ids plumbing in pipeline_hunyuan_image3
and ar2diffusion, but keep the lower-level tokenizer primitives
(apply_chat_template(batch_cot_token_ids=...) and
TokenizerWrapper.get_cot_sections_from_token_ids) intact for a follow-up
PR that will do the unified tokenization properly. The _kvreuse_alignment
regression tests still pin the tokenizer-level contract.


Test plan

  • tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_multi_image.py:
    5 invariants pinned (N consecutive <img> placeholders for N∈{1,2,3} on
    string + token paths; N=1 byte-identical to legacy single-image; going
    from N to N+1 extends the prompt by exactly one <img> id; out-of-range N
    rejected; text-only tasks ignore num_images).
  • tests/diffusion/models/hunyuan_image3/test_prompt_utils.py updated to the
    new (task, bot_task) parametrization. Adds
    test_available_bot_tasks_covers_all_modes,
    test_build_prompt_unknown_bot_task_raises,
    test_build_prompt_vanilla_rejects_non_t2i_task,
    and test_resolve_stop_token_ids_image_tasks_stop_on_ratio_range
    (pins Commit 13 — image-output tasks stop on the full <img_ratio_*>
    range, text-output on <answer>).
  • tests/entrypoints/openai_api/test_serving_chat_multistage_generation.py:
    4 new regression tests: multi-image placeholder count, tokenizer-plumbed
    byte-for-byte path, legacy bot_task=task-enum compat, sys_type override.
  • tests/entrypoints/openai_api/test_image_server.py: size="auto" no
    longer collapses bridge-resolved aspect.
  • tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py
    — DiT-side _resize_and_crop_center byte-matches official HF
    image_processor.resize_and_crop(crop_type='center'). (The earlier
    AR↔DiT byte-match assertion is removed in Commit 9 since AR is back to
    crop_type='resize'.)
  • Entry-layer image-count cap (Commit 12): manual smoke confirms a
    4-image --image-path img1,img2,img3,img4 and equivalent online
    reference_images list both raise
    ValueError: ... accepts at most 3 images ... before any model code
    runs. The deeper _validate_num_images cap (num_images must be in
    [1, 3]) remains as a backstop for direct build_prompt(_tokens) callers.
  • Commit 13 ratio-range stop verified by inspecting AR cumulative
    token output and DiT's positive_reuse_len on a fresh image-edit
    request: AR stops at the sampled <img_ratio_X> (last token), DiT's
    positive_reuse_len matches AR's KV length (no S − N drift), and
    ar2diffusion._extract_ratio_index recovers the correct ratio idx
    from cumulative_token_ids. Confirms the upstream
    modeling_hunyuan_image_3.py:3289-3303 flow.
  • tests/e2e/accuracy/test_hunyuan_image3.py — migrated to two-axis API
    (task='it2i', bot_task='recaption').
  • End-to-end smoke (4× L20X): official input_1_0.png + input_1_1.png
    demo pair, peak 95.52 GB reserved / 90.10 GB allocated; output PNG saved
    cleanly with second image's native aspect preserved.
  • Online (/v1/images/edits) ↔ offline (end2end.py img2img) parity on
    RGBA input: online now produces the same 1-magnet output as offline on
    input_2_*.png (was diverging into 3-magnet description before the RGB
    normalization fix).
  • Repo-wide grep across tests/, examples/, vllm_omni/, deploy/*.yaml
    confirms no remaining references to old compound task strings or
    _TASK_PRESETS. Cross-product enumeration of
    {t2t,i2t,it2i,t2i}×{think,recaption,think_recaption,vanilla} = 16 names
    — none of the 16 appears as an active call site.
  • 154 unit tests pass on remote (4× L20X container) after the merge with
    origin/main (HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS + resolve_stop_token_ids
    from PR [Config] Add HunyuanImage3 deploy configs #3172 reconciled against the two-axis API).

[Attached images: inputs input_1_0 and input_1_1; results for offline (not kv): magnet_revert_offline_v2; online (not kv): magnet_repro; offline (use kv): magnet_recap_cap; online (use kv): magnet_shm_reuse_online]


@TaffyOfficial TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch 2 times, most recently from 05e2f16 to 54caf74 Compare May 8, 2026 05:40
@TaffyOfficial TaffyOfficial reopened this May 8, 2026
@TaffyOfficial TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch from 54caf74 to 4be3584 Compare May 8, 2026 05:49
@princepride
Collaborator

Can you offer a more specific example and share the output(Including multiple images input and output text or image)?

@TaffyOfficial
Contributor Author

TaffyOfficial commented May 8, 2026

Can you offer a more specific example and share the output(Including multiple images input and output text or image)?

[Attached images: input_1_1, input_1_0, output_0_0]

@princepride
Collaborator

@Bounty-hunter Can you help review it?

@Bounty-hunter
Contributor

@Bounty-hunter Can you help review it?

ok

@TaffyOfficial TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch 3 times, most recently from 1687ff1 to 4ec4b46 Compare May 8, 2026 06:51
@princepride
Collaborator

@TaffyOfficial pre-commit failed

TaffyOfficial added 22 commits May 14, 2026 09:52
Apply two rounds of code review fixes on the multi-image IT2I PR:

Cond VAE determinism
  Replace `latent_dist.sample()` + `manual_seed(0)` hardcoding with
  `latent_dist.mode()` on both AR (`model_executor/.../hunyuan_image3.py
  ::_vae_encode`) and DiT (`diffusion/.../pipeline_hunyuan_image3.py`)
  sides. Cond image is clean (t=0) conditioning by design; posterior mean
  is deterministic by construction and matches the official cond encode
  path. Adds `.mode()` to the DiT-side `DiagonalGaussianDistribution`.

Stale compound task names (two-axis API migration)
  Repo-wide grep for `{t2t,i2t,it2i,t2i}x{think,recaption,think_recaption,
  vanilla}` cross-product turned up two residual compound names that the
  initial cleanup missed:
    - tests/e2e/accuracy/test_hunyuan_image3.py: task='it2i_recaption'
      -> task='it2i', bot_task='recaption' (would have ValueErrored at
      _resolve_preset on the new two-axis API).
    - tests/diffusion/.../test_prompt_utils.py: task='t2i_think' /
      task='t2i_recaption' -> (task='t2i', bot_task='think|recaption').

Custom system prompt body forwarding (producer -> consumer trace)
  Online `/v1/images/edits` accepted `sys_type='custom'` + `system_prompt`
  body on the AR side via `build_prompt_tokens(custom_system_prompt=...)`,
  but only forwarded `use_system_prompt` to the engine_prompt. DiT's
  `get_system_prompt(use, "image", body)` reads the body as the third
  positional arg, so `sys_type='custom'` was silently falling back to an
  empty DiT system prefix -- AR/DiT divergence under a user-visible knob.
  Forward `system_prompt` through both `serving_chat` engine_prompt and
  `stage_input_processors/hunyuan_image3.py::ar2diffusion` -> DiT
  `diffusion_input`.

Ratio extraction simplification
  Drop the regex path on `generated_text` -- only worked under
  `skip_special_tokens: False`, which most deploy yamls don't set. Pure
  token-id reverse scan against `_build_ratio_id_lookup` is the source of
  truth (AR `_stage_transitions` forces exactly one `<img_ratio_*>`
  emission). Drop unused `_RATIO_TOKEN_RE` constant, `re` import, and
  `generated_text` parameter from `_extract_ratio_index`.
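
The token-id reverse scan can be sketched as below. Both functions are hypothetical stand-ins for `_build_ratio_id_lookup` and `_extract_ratio_index`, and the token ids in the example are toy values, not the model's real vocabulary:

```python
def build_ratio_id_lookup_sketch(ratio_token_ids):
    """Map each ratio token id to its position in the ratio table."""
    return {tok: idx for idx, tok in enumerate(ratio_token_ids)}


def extract_ratio_index_sketch(token_ids, lookup):
    """Scan from the end and return the index of the last ratio token, or None."""
    for tok in reversed(token_ids):
        if tok in lookup:
            return lookup[tok]
    return None


toy_lookup = build_ratio_id_lookup_sketch([900, 901, 902])
found = extract_ratio_index_sketch([5, 17, 42, 901, 7], toy_lookup)
missing = extract_ratio_index_sketch([5, 17, 42], toy_lookup)
```

Scanning ids rather than decoded text means the result does not depend on whether special tokens survive detokenization.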

Housekeeping
  - Remove duplicate `engine_prompt["prompt_token_ids"]` assignment in
    serving_chat.py (merge residue, the second copy was added by the
    main-merge then re-introduced after the API split).
  - `examples/.../end2end.py`: stale `_TASK_PRESETS` comment ->
    `available_tasks` helper (symbol no longer exists post-split).
  - `process_image` comment in `model_executor/.../hunyuan_image3.py`
    clarifies the AR-side `_resize_and_crop` default vs the official
    `infer_align_image_size=False` (center crop) default.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
CI feedback from the previous push:
- F841: drop unused `QKEY` in test_serving_chat_multistage_generation.py
- typos: avoid the dictionary trigger on "PNGs" plural -- the lowercased
  form lands in the crate-ci/typos dictionary as a misspelling; rephrase
  to "transparent-logo uploads" without changing meaning.
- ruff-format: collapse the `build_prompt_tokens(...)` call in the e2e
  accuracy test back to a single line (line is under the 120 char limit
  ruff-format enforces locally).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…er crop)

AR-side `HunyuanImage3Processor._resize_and_crop` previously defaulted to
`crop_type="resize"` (stretch), while the DiT-side condition-image helper
`_resize_and_crop_center` always center-crops. For any portrait input
mapped to a landscape output bucket (or vice versa), AR and DiT then
conditioned on **visibly different fabric regions**: AR saw the input
stretched to fit, DiT saw the input center-cropped to fit. The two cond
latents disagreed on what the surroundings should be, and DiT had to
inpaint the lateral canvas extension on its own, producing seam-like
vertical brightness bands at the AR/DiT-disagreement boundary (reported
on the `/tmp/rgbfix/result.png` IT2I run with 735x1104 input -> 1280x720
output).

Change AR-side default to `crop_type="center"`, matching:

- DiT-side `_resize_and_crop_center` (always center).
- Official `generate_image(..., infer_align_image_size=False)` (the
  default; reading `hunyuan3.0_ins/image_processor.py:355-358` maps the
  False branch to `random_crop="center"`).
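
To make the stretch vs center-crop mismatch concrete, here is a geometry-only sketch. It assumes center crop means "scale to cover, then crop the middle"; the helper names are hypothetical and no image library is involved:

```python
def center_crop_box(src_w, src_h, dst_w, dst_h):
    """Source-space box (left, top, right, bottom) kept by cover-scale + center crop."""
    scale = max(dst_w / src_w, dst_h / src_h)
    keep_w, keep_h = dst_w / scale, dst_h / scale
    left, top = (src_w - keep_w) / 2, (src_h - keep_h) / 2
    return left, top, left + keep_w, top + keep_h


def stretch_scales(src_w, src_h, dst_w, dst_h):
    """Per-axis scale factors for a plain resize; unequal values mean distortion."""
    return dst_w / src_w, dst_h / src_h


box = center_crop_box(735, 1104, 1280, 720)
sx, sy = stretch_scales(735, 1104, 1280, 720)
```

For the 735x1104 -> 1280x720 case from the report, the center crop keeps the full width but only about 413 of the 1104 source rows, while the stretch scales width by roughly 1.74 and height by roughly 0.65, so the two paths feed the model visibly different content.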

Add a CPU-only regression test asserting AR and DiT preprocessing
produce **byte-identical** pixels for 4 src sizes x 4 target buckets,
covering portrait->landscape, landscape->portrait, and square aspects.
No model weights / tokenizer / HF cache required, runs in CI.

Co-authored-by: Codex
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: zuiho <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…state

Restores the IT2I online image quality observed at the magnet_repro
deploy. Two changes from the PR review-feedback round regressed image
quality on multi-image edit prompts:

1. 4da2ff6 switched cond VAE from `latent_dist.sample(generator)` to
   `latent_dist.mode()` on both AR and DiT sides. The posterior mean
   produces visibly degraded conditioning vs the fixed-seed sample.
2. 1785580 changed AR `_resize_and_crop` default from `"resize"` to
   `"center"` to match a non-existent DiT center-crop default (DiT
   bridge actually defaults to `"resize"` too). This broke AR/DiT
   preprocessing alignment instead of fixing it.

Revert both:
- AR `_resize_and_crop` default back to `"resize"` and its docstring.
- AR/DiT `_vae_encode`/`vae_encode` back to fixed-generator sample.
- Remove the now-dead `.mode()` method on
  `DiagonalGaussianDistribution`.
- Remove the AR/DiT byte-identical preprocessing test added by
  1785580 -- it asserted the wrong invariant (AR `"center"` == DiT
  `_resize_and_crop_center`), which no longer holds and was never the
  right alignment target.

Keeps the other 4da2ff6 fixes intact: system_prompt body forwarding,
ratio extraction simplification, stale `it2i_recaption` compound name
cleanup, duplicate `prompt_token_ids` assignment removal.

Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
`resolve_stop_token_ids` returned `<answer>` (128025) for all (task,
bot_task) combos. For image-output tasks (`it2i` / `t2i`) this stops
the AR halfway through the size/ratio tail that
`_stage_transitions[</recaption>]` forces:

    </recaption><answer><boi><img_size_*><img_ratio_*><|endoftext|>
                ^^^^^^^^^^^^ stopped here, ratio never emitted

Downstream `ar2diffusion::_extract_ratio_index` then scans
`cumulative_token_ids` for any `<img_ratio_*>`, finds none, and falls
back to the prompt-carried `height`/`width`. In `end2end.py` for
multi-image IT2I that means the first reference image's shape -- e.g.
a 512x512 logo + a 1179x685 fabric reference collapses the DiT bucket
to 1024x1024 square even though the AR CoT planned image_2's
landscape aspect. Width and texture both regress simultaneously
because DiT has to squeeze the landscape-planned content into a
square bucket.

Online didn't trip this because the deploy yaml explicitly sets
`stop_token_ids: [127957]` (= `<|endoftext|>`) and end2end.py is not
in that codepath. `end2end.py` overrides yaml with
`resolve_stop_token_ids(...)`, so offline always hit the broken stop
regardless of yaml.

Fix: return `[<|endoftext|>]` for `it2i` / `t2i` so AR runs through
the forced tail and `<img_ratio_*>` reaches `ar2diffusion`. Keep
`[<answer>]` for `i2t` / `t2t` -- those are comprehension stages
where the response body sits inside `<answer>`, so the answer-open
*is* the natural terminator.
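
The effect of the stop set on the forced tail can be simulated in a few lines. This is a sketch with token strings instead of ids; `decode_until_stop` and `stop_tokens_for` are hypothetical helpers, and emitting the stop token itself is an assumption of the sketch:

```python
FORCED_TAIL = [
    "</recaption>", "<answer>", "<boi>",
    "<img_size_*>", "<img_ratio_*>", "<|endoftext|>",
]


def decode_until_stop(tokens, stop_tokens):
    """Emit tokens in order, stopping after the first one found in stop_tokens."""
    emitted = []
    for tok in tokens:
        emitted.append(tok)
        if tok in stop_tokens:
            break
    return emitted


def stop_tokens_for(task):
    """Task split from this commit: image-output tasks run to end-of-text."""
    return {"<|endoftext|>"} if task in ("it2i", "t2i") else {"<answer>"}


old_run = decode_until_stop(FORCED_TAIL, {"<answer>"})
new_run = decode_until_stop(FORCED_TAIL, stop_tokens_for("it2i"))
```

`old_run` ends at `<answer>` and never contains the ratio token, while `new_run` runs through the whole tail.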

Update `test_resolve_stop_token_ids_uses_answer_for_generation_tasks`
to assert the new (correct) split.

Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…-decode kv_ready forward

Two coupled changes so HunyuanImage3 IT2I no longer ships KV for the
<answer><boi><img_size><img_ratio><eos> tail that DiT discards anyway:

1. deploy/hunyuan_image3.yaml: add ``kv_transfer_criteria`` so AR's
   snapshot fires at </recaption> (token id 128019). ``stop_after_transfer:
   false`` keeps the AR running past the snapshot so it can still emit
   <img_ratio_*> for ``ar2diffusion._extract_ratio_index``. With this
   yaml + the orchestrator change below, the colleague-confirmed
   invariant S - N == 1 (where S is the shipped KV length and N is the
   DiT-side ``positive_reuse_len``) is restored. Without the yaml the AR
   ships KV all the way through <eos> and S - N collapses to 6.

2. engine/orchestrator.py: ``_handle_kv_ready_raw_outputs`` previously
   forwarded any kv_ready EngineCoreOutput straight to the next stage.
   With ``stop_after_transfer: false`` the kv_ready signal fires
   mid-decode (snapshot at </recaption>, AR still emitting tail), so the
   raw EngineCoreOutput has no ``.outputs[0]`` and bridges that read
   the AR's full text (HunyuanImage3 ``ar2diffusion``) hit
   ``AttributeError``. Skip the forward when no finished output for the
   same req_id is present in the same raw_outputs batch; the AR's
   eventual natural-finish RequestOutput will trigger the forward
   through ``_route_output``. Bagel's existing flow (kv_ready and the
   deferred-stop finish output co-emit in the same batch) is preserved.
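
The deferral rule in item 2 can be sketched like this. `RawOutput` and `kv_ready_to_forward` are hypothetical stand-ins, not the orchestrator's actual types:

```python
from dataclasses import dataclass


@dataclass
class RawOutput:
    """Hypothetical stand-in for one engine output record in a batch."""
    req_id: str
    kv_ready: bool = False
    finished: bool = False


def kv_ready_to_forward(raw_outputs):
    """Return req_ids whose kv_ready signal has a finished output in the same batch."""
    finished = {o.req_id for o in raw_outputs if o.finished}
    return [o.req_id for o in raw_outputs if o.kv_ready and o.req_id in finished]


mid_decode = [RawOutput("a", kv_ready=True)]
co_emitted = [RawOutput("b", kv_ready=True), RawOutput("b", finished=True)]
```

A mid-decode `kv_ready` with no matching finished output is skipped, and a batch where both arrive together is forwarded as before.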

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…in entry layer

Per PR vllm-project#3444 review (Gaohan123): give a friendly, input-named error at the
entry boundary instead of relying on the deeper
`prompt_utils._validate_num_images` to surface as a `num_images must be in
[1, 3]` message. Reuse `MAX_IMAGES_PER_REQUEST` so the cap stays defined in
one place.

- offline `end2end.py`: validate `--image-path` count before opening PIL
- online `serving_chat._build_multistage_generation_inputs`: validate
  `reference_images` count before building engine prompt data

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…m (review)

Addresses Bounty-hunter's PR review on vllm-project#3444:

1. resolve_stop_token_ids: image-output tasks now stop on the full
   <img_ratio_*> token range (ids 128044-128076 + 130103-130106),
   mirroring upstream modeling_hunyuan_image_3.py:3289-3303
   (`final_stop_tokens = list(range(start_ratio, end_ratio + 1))`).
   Replaces the earlier `<|endoftext|>` stop which let AR waste decode
   steps past the ratio. test_prompt_utils.py renamed/updated to pin
   the new contract.

2. deploy/hunyuan_image3.yaml: drop the kv_transfer_criteria block.
   With the ratio-range stop in place AR finishes naturally at the
   ratio token, so KV is capped automatically -- no need for
   special_token criteria + stop_after_transfer=false.

3. orchestrator._handle_kv_ready_raw_outputs: drop the finished_in_batch
   deferral. Mid-decode kv_ready only fired when stop_after_transfer=false
   was forcing AR past its natural stop; with the kv_transfer_criteria
   block removed (item 2) there is no mid-decode kv_ready to defer. The
   ratio strip for DiT already lives in
   stage_input_processors/hunyuan_image3._truncate_at_cot_end.

4. serving_chat._build_multistage_generation_inputs: call
   resolve_stop_token_ids(task, bot_task) and inject into the AR-stage
   sampling params. Online now matches offline end2end.py rather than
   relying on yaml-side stop_token_ids.

5. api_server.edit_images: drop the redundant `task` Form field.
   /v1/images/edits is always IT2I; bot_task / sys_type / system_prompt
   remain. Legacy bot_task=<task-enum> still works via chat-handler
   normalization.

6. pipeline_hunyuan_image3 + stage_input_processors/hunyuan_image3:
   stop reading / writing the `ar_token_ids` extra. The tokenizer-level
   `batch_cot_token_ids` parameter is retained for a follow-up PR that
   will unify system/user/cot tokenization. See PR description for the
   optimization leftover note.
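
The ratio-range stop set from item 1 can be built as follows. The inclusive id ranges are the ones quoted in item 1; `ratio_stop_token_ids` is a hypothetical helper that mirrors the `list(range(start, end + 1))` pattern:

```python
def ratio_stop_token_ids(ranges):
    """Expand inclusive (start, end) id ranges into a flat stop-token list."""
    stop_ids = []
    for start, end in ranges:
        stop_ids.extend(range(start, end + 1))
    return stop_ids


RATIO_RANGES = [(128044, 128076), (130103, 130106)]  # ids quoted in item 1
STOP_IDS = ratio_stop_token_ids(RATIO_RANGES)
```

The `+ 1` matters because `range` excludes its upper bound, so without it the last ratio token of each range would not stop the AR.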

Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…sk input

- Online chat handler: drop `task` from extra_body; derive task from
  reference_images presence. Legacy `bot_task=<task-enum>` still
  normalizes through to the right trigger.
- Remove the AR-token-id cot reuse path (`batch_cot_token_ids` in
  apply_chat_template, `ctx_type == "token_ids"` branch in
  process_successive_message, and `get_cot_sections_from_token_ids`);
  it has no caller after the optimization was rolled back per reviewer
  feedback.
- Simplify `_truncate_at_cot_end` to text-only; the token-id return was
  no longer consumed.
- Trim over-explanatory comments across serving_chat / api_server /
  pipeline / end2end.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Collided with tests/e2e/accuracy/test_hunyuan_image3.py under pytest's
default 'prepend' import mode (no __init__.py in either dir). Rename
this one to make basenames unique.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…2I keeps non-square AR shape

Online /v1/images/edits collapsed AR-predicted aspects to a square
(e.g. 1024x1024) while offline end2end.py honored the predicted ratio
(e.g. 1216x832). Root cause is the AR stage in deploy/hunyuan_image3.yaml
was marked ``is_comprehension: false`` (read literally as "this task
generates an image, not text"), but ``is_comprehension`` inside vllm-omni
is the tokenizer-owning AR-stage marker, not a user-visible task type.

The serving path in entrypoints/openai/serving_chat.py looks up the AR
stage by that flag to apply ``resolve_stop_token_ids`` (image-task stop
set = ``<img_ratio_*>`` range). With the flag false the lookup returned
None, the AR kept the YAML default ``stop_token_ids: [<answer>]``, and
the HunyuanImage3 custom sampler's forced-transition step
``</recaption> -> <answer>`` triggered an immediate stop. The cumulative
token ids never reached ``<img_size_BASE><img_ratio_X>``, so
``ar2diffusion._extract_ratio_index`` could not recover the AR aspect
and fell back to the carried-through prompt size (1024x1024 for
size=auto edits).

Offline avoided this because end2end.py overrides the AR stage's
stop_token_ids directly without going through the comprehension-stage
lookup. Other models did not hit it because their AR stage already had
``is_comprehension: true`` (the field's framework-internal meaning).

The fix is a one-line change to the deploy config (marking the AR stage
``is_comprehension: true``) plus a comment explaining the flag's real
semantics so the next model author does not repeat the same misread.
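
The failure mode can be sketched with plain dicts standing in for stage configs. `find_ar_stage` and `apply_stop_override` are hypothetical helpers; only the `is_comprehension` and `stop_token_ids` keys and the `<answer>` id 128025 come from this PR:

```python
def find_ar_stage(stages):
    """Return the first stage flagged is_comprehension, or None."""
    return next((s for s in stages if s.get("is_comprehension")), None)


def apply_stop_override(stages, stop_ids):
    """Override stop_token_ids on the AR stage; report whether it was applied."""
    stage = find_ar_stage(stages)
    if stage is None:
        return False  # lookup missed, so the YAML default silently stays
    stage["stop_token_ids"] = list(stop_ids)
    return True


misflagged = [{"name": "ar", "is_comprehension": False, "stop_token_ids": [128025]}]
flagged = [{"name": "ar", "is_comprehension": True, "stop_token_ids": [128025]}]
applied_before = apply_stop_override(misflagged, [128044])
applied_after = apply_stop_override(flagged, [128044])
```

With the flag false the override is a silent no-op, which is why the symptom surfaced far downstream as a square output rather than as an error.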

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…c from serving_chat

PR vllm-project#3444 added 84 lines of HunyuanImage-3.0-specific handling to
``serving_chat._build_multistage_generation_inputs`` (task derivation
from reference images, legacy task-enum mapping on ``bot_task``,
``MAX_IMAGES_PER_REQUEST`` cap, and an AR-stage ``stop_token_ids``
override via ``resolve_stop_token_ids``). The endpoint dispatch in
``api_server.py`` (``/v1/images/edits`` vs ``/v1/images/generations``)
already encodes the task split, and the AR-stage stop override is
redundant: ``HunyuanImage3ForCausalMM.sample`` already forces an EOS
after sampling a ratio token (``hunyuan_image3.py`` generation-mode
branch), so leaving the YAML default stop set empty lets the AR run
through ``</recaption><answer><boi><img_size><img_ratio>`` and stop
naturally on EOS; ``ar2diffusion._extract_ratio_index`` then reads the
ratio off ``cumulative_token_ids``. The production deploy
(``vllm_omni/deploy/hunyuan_image3.yaml``) already omits
``stop_token_ids`` for stage-0.

Net effect on ``serving_chat.py``: +84/-19 -> +47/-19 (-37 lines).
Behavior verified end-to-end on ``/v1/images/edits`` with a non-square
target after removal: ``ar2diffusion`` reports ``AR ratio_idx=19,
target size=1216x832`` (matches the offline ``end2end.py`` path),
identical to the result with the now-removed override in place.

Offline ``end2end.py`` still derives ``task`` and overrides
``stop_token_ids`` because it builds the params list directly without
the endpoint-level task signal; that path is intentionally unchanged.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…g_chat cleanup

The serving_chat cleanup in the previous commit removed the legacy
caller compatibility layer that translated ``bot_task in {"it2i",
"t2i", "i2t", "t2t"}`` to ``None`` and ``bot_task in {"it2i_think",
"it2i_recaption", ...}`` to the trailing ``think``/``recaption`` part.
That translation existed because old callers stuffed task enums into
the ``bot_task`` field. Under the new contract, the endpoint dispatch
(``/v1/images/edits`` vs ``/v1/images/generations``) and
``reference_images`` presence carry the task signal, and ``bot_task``
only takes the documented values (``None`` / ``recaption`` / ``think``
/ ``think_recaption`` / ``vanilla``).
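
A sketch of the resulting contract, assuming the task is derived purely from whether reference images are present. `derive_task` and `validate_bot_task` are hypothetical names; the allowed values are the documented ones listed above:

```python
ALLOWED_BOT_TASKS = (None, "recaption", "think", "think_recaption", "vanilla")


def derive_task(reference_images):
    """Image edits carry reference images; plain generation does not."""
    return "it2i" if reference_images else "t2i"


def validate_bot_task(bot_task):
    """Reject legacy task enums such as 'it2i' in the bot_task field."""
    if bot_task not in ALLOWED_BOT_TASKS:
        raise ValueError(
            f"bot_task={bot_task!r} is not one of {ALLOWED_BOT_TASKS}"
        )
    return bot_task
```

Under this shape a legacy call with `bot_task="it2i"` fails loudly instead of being silently translated.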

Two tests in
``test_serving_chat_multistage_generation.py`` explicitly pinned the
now-removed legacy form
(``test_..._legacy_bot_task_form_unchanged``,
``test_..._legacy_composite_tasks_still_work``); both are deleted.

Three other tests passed ``bot_task="it2i"`` only to trigger the
``build_prompt`` path (the *value* did not matter, just non-None);
switching them to ``bot_task="think"`` keeps the same intent against
the new validator.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the wt-hunyuan3-it2i-multi-image branch from 141d59f to 161ba50 Compare May 14, 2026 01:53

@Bounty-hunter Bounty-hunter left a comment

LGTM

@TaffyOfficial

@Gaohan123 @hsliuustc0106

Signed-off-by: TaffyOfficial <2587297563@qq.com>

@Gaohan123 Gaohan123 left a comment

LGTM. Thanks

@Gaohan123 Gaohan123 merged commit 3f63aaf into vllm-project:main May 14, 2026
8 checks passed
MaciejBalaNV pushed a commit to MaciejBalaNV/vllm-omni that referenced this pull request May 14, 2026
…up (vllm-project#3444)

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <wu15922848573@outlook.com>
Signed-off-by: skf1999 <13234016272@163.com>
Signed-off-by: zuiho <2324465096@qq.com>
Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2587297563@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: TaffyOfficial <wu15922848573@outlook.com>
Co-authored-by: skf1999 <13234016272@163.com>
