[Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup#3444
Merged
Gaohan123 merged 44 commits on May 14, 2026
Collaborator
Can you offer a more specific example and share the output (including the multi-image input and the output text or image)?
Collaborator
@Bounty-hunter Can you help review it?
Contributor
ok
Collaborator
@TaffyOfficial pre-commit failed
added 22 commits
May 14, 2026 09:52
Apply two rounds of code review fixes on the multi-image IT2I PR:
Cond VAE determinism
Replace `latent_dist.sample()` + `manual_seed(0)` hardcoding with
`latent_dist.mode()` on both AR (`model_executor/.../hunyuan_image3.py
::_vae_encode`) and DiT (`diffusion/.../pipeline_hunyuan_image3.py`)
sides. Cond image is clean (t=0) conditioning by design; posterior mean
is deterministic by construction and matches the official cond encode
path. Adds `.mode()` to the DiT-side `DiagonalGaussianDistribution`.
Stale compound task names (two-axis API migration)
Repo-wide grep for `{t2t,i2t,it2i,t2i}x{think,recaption,think_recaption,
vanilla}` cross-product turned up two residual compound names that the
initial cleanup missed:
- tests/e2e/accuracy/test_hunyuan_image3.py: task='it2i_recaption'
-> task='it2i', bot_task='recaption' (would have ValueErrored at
_resolve_preset on the new two-axis API).
- tests/diffusion/.../test_prompt_utils.py: task='t2i_think' /
task='t2i_recaption' -> (task='t2i', bot_task='think|recaption').
Custom system prompt body forwarding (producer -> consumer trace)
Online `/v1/images/edits` accepted `sys_type='custom'` + `system_prompt`
body on the AR side via `build_prompt_tokens(custom_system_prompt=...)`,
but only forwarded `use_system_prompt` to the engine_prompt. DiT's
`get_system_prompt(use, "image", body)` reads the body as the third
positional arg, so `sys_type='custom'` was silently falling back to an
empty DiT system prefix -- AR/DiT divergence under a user-visible knob.
Forward `system_prompt` through both `serving_chat` engine_prompt and
`stage_input_processors/hunyuan_image3.py::ar2diffusion` -> DiT
`diffusion_input`.
Ratio extraction simplification
Drop the regex path on `generated_text` -- only worked under
`skip_special_tokens: False`, which most deploy yamls don't set. Pure
token-id reverse scan against `_build_ratio_id_lookup` is the source of
truth (AR `_stage_transitions` forces exactly one `<img_ratio_*>`
emission). Drop unused `_RATIO_TOKEN_RE` constant, `re` import, and
`generated_text` parameter from `_extract_ratio_index`.
Housekeeping
- Remove duplicate `engine_prompt["prompt_token_ids"]` assignment in
serving_chat.py (merge residue, the second copy was added by the
main-merge then re-introduced after the API split).
- `examples/.../end2end.py`: stale `_TASK_PRESETS` comment ->
`available_tasks` helper (symbol no longer exists post-split).
- `process_image` comment in `model_executor/.../hunyuan_image3.py`
clarifies the AR-side `_resize_and_crop` default vs the official
`infer_align_image_size=False` (center crop) default.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
CI feedback from the previous push:
- F841: drop unused `QKEY` in test_serving_chat_multistage_generation.py
- typos: avoid the dictionary trigger on "PNGs" plural -- the lowercased
form lands in the crate-ci/typos dictionary as a misspelling; rephrase to
"transparent-logo uploads" without changing meaning.
- ruff-format: collapse the `build_prompt_tokens(...)` call in the e2e
accuracy test back to a single line (line is under the 120 char limit
ruff-format enforces locally).
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…er crop)
AR-side `HunyuanImage3Processor._resize_and_crop` previously defaulted to
`crop_type="resize"` (stretch), while the DiT-side condition-image helper
`_resize_and_crop_center` always center-crops. For any portrait input mapped
to a landscape output bucket (or vice versa), AR and DiT then conditioned on
**visibly different fabric regions**: AR saw the input stretched to fit, DiT
saw the input center-cropped to fit. The two cond latents disagreed on what
the surroundings should be, and DiT had to inpaint the lateral canvas
extension on its own, producing seam-like vertical brightness bands at the
AR/DiT-disagreement boundary (reported on `/tmp/rgbfix/result.png` IT2I run
with 735x1104 input -> 1280x720 output).
Change AR-side default to `crop_type="center"`, matching:
- DiT-side `_resize_and_crop_center` (always center).
- Official `generate_image(..., infer_align_image_size=False)` (the default;
reading `hunyuan3.0_ins/image_processor.py:355-358` maps the False branch to
`random_crop="center"`).
Add a CPU-only regression test asserting AR and DiT preprocessing produce
**byte-identical** pixels for 4 src sizes x 4 target buckets, covering
portrait->landscape, landscape->portrait, and square aspects. No model
weights / tokenizer / HF cache required, runs in CI.
Co-authored-by: Codex
Signed-off-by: TaffyOfficial <2324465096@qq.com>
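To make the stretch-versus-center-crop disagreement concrete, here is a minimal geometry sketch for the 735x1104 -> 1280x720 case mentioned above. The helper names `center_crop_box` and `stretch_scale` are hypothetical and are not the repository's `_resize_and_crop` or `_resize_and_crop_center`; the rounding rule is also an assumption.

```python
def center_crop_box(src_w: int, src_h: int, dst_w: int, dst_h: int) -> tuple[int, int, int, int]:
    """Largest centered (left, top, right, bottom) region of the source
    that has the destination aspect ratio. Rounding is an assumption."""
    src_aspect = src_w / src_h
    dst_aspect = dst_w / dst_h
    if src_aspect > dst_aspect:
        # Source is wider than the target: trim left and right.
        crop_w = round(src_h * dst_aspect)
        left = (src_w - crop_w) // 2
        return (left, 0, left + crop_w, src_h)
    # Source is taller than (or equal to) the target: trim top and bottom.
    crop_h = round(src_w / dst_aspect)
    top = (src_h - crop_h) // 2
    return (0, top, src_w, top + crop_h)


def stretch_scale(src_w: int, src_h: int, dst_w: int, dst_h: int) -> tuple[float, float]:
    """Per-axis scale factors when the whole source is stretched to fit."""
    return (dst_w / src_w, dst_h / src_h)


box = center_crop_box(735, 1104, 1280, 720)
kept_rows = box[3] - box[1]
print("center-crop box:", box)
print("fraction of source rows kept:", kept_rows / 1104)
print("stretch scale (x, y):", stretch_scale(735, 1104, 1280, 720))
```

Under these assumptions the center crop keeps only a horizontal band of the portrait input, while the stretch keeps every pixel but scales the two axes very differently, so the two sides cannot see the same content.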
Signed-off-by: zuiho <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…state
Restores the IT2I online image quality observed at the magnet_repro deploy.
Two changes from the PR review-feedback round regressed image quality on
multi-image edit prompts:
1. 4da2ff6 switched cond VAE from `latent_dist.sample(generator)` to
`latent_dist.mode()` on both AR and DiT sides. The posterior mean produces
visibly degraded conditioning vs the fixed-seed sample.
2. 1785580 changed AR `_resize_and_crop` default from `"resize"` to
`"center"` to match a non-existent DiT center-crop default (DiT bridge
actually defaults to `"resize"` too). This broke AR/DiT preprocessing
alignment instead of fixing it.
Revert both:
- AR `_resize_and_crop` default back to `"resize"` and its docstring.
- AR/DiT `_vae_encode`/`vae_encode` back to fixed-generator sample.
- Remove the now-dead `.mode()` method on `DiagonalGaussianDistribution`.
- Remove the AR/DiT byte-identical preprocessing test added by 1785580 --
it asserted the wrong invariant (AR `"center"` == DiT
`_resize_and_crop_center`), which no longer holds and was never the right
alignment target.
Keeps the other 4da2ff6 fixes intact: system_prompt body forwarding, ratio
extraction simplification, stale `it2i_recaption` compound name cleanup,
duplicate `prompt_token_ids` assignment removal.
Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
`resolve_stop_token_ids` returned `<answer>` (128025) for all (task,
bot_task) combos. For image-output tasks (`it2i` / `t2i`) this stops
the AR halfway through the size/ratio tail that
`_stage_transitions[</recaption>]` forces:
</recaption><answer><boi><img_size_*><img_ratio_*><|endoftext|>
^^^^^^^^^^^^ stopped here, ratio never emitted
Downstream `ar2diffusion::_extract_ratio_index` then scans
`cumulative_token_ids` for any `<img_ratio_*>`, finds none, and falls
back to the prompt-carried `height`/`width`. In `end2end.py` for
multi-image IT2I that means the first reference image's shape -- e.g.
a 512x512 logo + a 1179x685 fabric reference collapses the DiT bucket
to 1024x1024 square even though the AR CoT planned image_2's
landscape aspect. Width and texture both regress simultaneously
because DiT has to squeeze the landscape-planned content into a
square bucket.
Online didn't trip this because the deploy yaml explicitly sets
`stop_token_ids: [127957]` (= `<|endoftext|>`) and end2end.py is not
in that codepath. `end2end.py` overrides yaml with
`resolve_stop_token_ids(...)`, so offline always hit the broken stop
regardless of yaml.
Fix: return `[<|endoftext|>]` for `it2i` / `t2i` so AR runs through
the forced tail and `<img_ratio_*>` reaches `ar2diffusion`. Keep
`[<answer>]` for `i2t` / `t2t` -- those are comprehension stages
where the response body sits inside `<answer>`, so the answer-open
*is* the natural terminator.
Update `test_resolve_stop_token_ids_uses_answer_for_generation_tasks`
to assert the new (correct) split.
Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
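The task split described in this commit message can be sketched as follows. This is an illustration only: `resolve_stop_ids_sketch` is a hypothetical stand-in, not the repository's `resolve_stop_token_ids`, and it hardcodes the two token ids quoted in the message instead of reading them from a tokenizer.

```python
ANSWER_ID = 128025      # <answer>, id quoted in the commit message
ENDOFTEXT_ID = 127957   # <|endoftext|>, id quoted in the commit message

IMAGE_OUTPUT_TASKS = {"it2i", "t2i"}    # AR must run through the size/ratio tail
COMPREHENSION_TASKS = {"i2t", "t2t"}    # response body sits inside <answer>


def resolve_stop_ids_sketch(task: str) -> list[int]:
    """Return the stop-token ids for a task, as of this commit's split."""
    if task in IMAGE_OUTPUT_TASKS:
        return [ENDOFTEXT_ID]
    if task in COMPREHENSION_TASKS:
        return [ANSWER_ID]
    raise ValueError(f"unknown task: {task!r}")


for name in ("it2i", "t2i", "i2t", "t2t"):
    print(name, resolve_stop_ids_sketch(name))
```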
…-decode kv_ready forward
Two coupled changes so HunyuanImage3 IT2I no longer ships KV for the
<answer><boi><img_size><img_ratio><eos> tail that DiT discards anyway:
1. deploy/hunyuan_image3.yaml: add ``kv_transfer_criteria`` so AR's snapshot
fires at </recaption> (token id 128019). ``stop_after_transfer: false``
keeps the AR running past the snapshot so it can still emit <img_ratio_*>
for ``ar2diffusion._extract_ratio_index``. With this yaml + the orchestrator
change below, the colleague-confirmed invariant S - N == 1 (where S is the
shipped KV length and N is the DiT-side ``positive_reuse_len``) is restored.
Without the yaml the AR ships KV all the way through <eos> and S - N
collapses to 6.
2. engine/orchestrator.py: ``_handle_kv_ready_raw_outputs`` previously
forwarded any kv_ready EngineCoreOutput straight to the next stage. With
``stop_after_transfer: false`` the kv_ready signal fires mid-decode
(snapshot at </recaption>, AR still emitting tail), so the raw
EngineCoreOutput has no ``.outputs[0]`` and bridges that read the AR's full
text (HunyuanImage3 ``ar2diffusion``) hit ``AttributeError``. Skip the
forward when no finished output for the same req_id is present in the same
raw_outputs batch; the AR's eventual natural-finish RequestOutput will
trigger the forward through ``_route_output``. Bagel's existing flow
(kv_ready and the deferred-stop finish output co-emit in the same batch) is
preserved.
Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…in entry layer
Per PR vllm-project#3444 review (Gaohan123): give a friendly, input-named
error at the entry boundary instead of relying on the deeper
`prompt_utils._validate_num_images` to surface as a
`num_images must be in [1, 3]` message. Reuse `MAX_IMAGES_PER_REQUEST` so
the cap stays defined in one place.
- offline `end2end.py`: validate `--image-path` count before opening PIL
- online `serving_chat._build_multistage_generation_inputs`: validate
`reference_images` count before building engine prompt data
Signed-off-by: TaffyOfficial <2324465096@qq.com>
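A minimal sketch of what such an entry-layer check might look like, assuming the cap is 3 as the `[1, 3]` message above implies. The function name `validate_image_count` and the error wording are hypothetical; only the constant name `MAX_IMAGES_PER_REQUEST` comes from the commit message.

```python
MAX_IMAGES_PER_REQUEST = 3  # assumed value, inferred from the [1, 3] range above


def validate_image_count(num_images: int, input_name: str) -> None:
    """Fail early with an error that names the user-facing input.

    `input_name` is the knob the caller actually used, for example
    "--image-path" offline or "reference_images" online.
    """
    if not 1 <= num_images <= MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"{input_name}: expected between 1 and {MAX_IMAGES_PER_REQUEST} "
            f"images, got {num_images}"
        )


validate_image_count(2, "--image-path")  # passes silently
try:
    validate_image_count(4, "reference_images")
except ValueError as exc:
    print(exc)
```

Keeping the cap in a single constant means the offline and online entry points cannot drift apart on the limit.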
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…m (review)
Addresses Bounty-hunter's PR review on vllm-project#3444:
1. resolve_stop_token_ids: image-output tasks now stop on the full
<img_ratio_*> token range (ids 128044-128076 + 130103-130106), mirroring
upstream modeling_hunyuan_image_3.py:3289-3303
(`final_stop_tokens = list(range(start_ratio, end_ratio + 1))`). Replaces
the earlier `<|endoftext|>` stop which let AR waste decode steps past the
ratio. test_prompt_utils.py renamed/updated to pin the new contract.
2. deploy/hunyuan_image3.yaml: drop the kv_transfer_criteria block. With the
ratio-range stop in place AR finishes naturally at the ratio token, so KV is
capped automatically -- no need for special_token criteria +
stop_after_transfer=false.
3. orchestrator._handle_kv_ready_raw_outputs: drop the finished_in_batch
defer. Mid-decode kv_ready only fired when stop_after_transfer=false was
forcing AR past its natural stop; with item 2 removed there is no mid-decode
kv_ready to defer. The ratio strip for DiT already lives in
stage_input_processors/hunyuan_image3._truncate_at_cot_end.
4. serving_chat._build_multistage_generation_inputs: call
resolve_stop_token_ids(task, bot_task) and inject into the AR-stage sampling
params. Online now matches offline end2end.py rather than relying on
yaml-side stop_token_ids.
5. api_server.edit_images: drop the redundant `task` Form field.
/v1/images/edits is always IT2I; bot_task / sys_type / system_prompt remain.
Legacy bot_task=<task-enum> still works via chat-handler normalization.
6. pipeline_hunyuan_image3 + stage_input_processors/hunyuan_image3: stop
reading / writing the `ar_token_ids` extra. The tokenizer-level
`batch_cot_token_ids` parameter is retained for a follow-up PR that will
unify system/user/cot tokenization. See PR description for the optimization
leftover note.
Signed-off-by: Claude Code <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…sk input
- Online chat handler: drop `task` from extra_body; derive task from
reference_images presence. Legacy `bot_task=<task-enum>` still normalizes
through to the right trigger.
- Remove the AR-token-id cot reuse path (`batch_cot_token_ids` in
apply_chat_template, `ctx_type == "token_ids"` branch in
process_successive_message, and `get_cot_sections_from_token_ids`); it has
no caller after the optimization was rolled back per reviewer feedback.
- Simplify `_truncate_at_cot_end` to text-only; the token-id return was no
longer consumed.
- Trim over-explanatory comments across serving_chat / api_server /
pipeline / end2end.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Collided with tests/e2e/accuracy/test_hunyuan_image3.py under pytest's default 'prepend' import mode (no __init__.py in either dir). Rename this one to make basenames unique. Signed-off-by: TaffyOfficial <2324465096@qq.com>
…2I keeps non-square AR shape
Online /v1/images/edits collapsed AR-predicted aspects to a square (e.g.
1024x1024) while offline end2end.py honored the predicted ratio (e.g.
1216x832). Root cause is the AR stage in deploy/hunyuan_image3.yaml was
marked ``is_comprehension: false`` (read literally as "this task generates
an image, not text"), but ``is_comprehension`` inside vllm-omni is the
tokenizer-owning AR-stage marker, not a user-visible task type.
The serving path in entrypoints/openai/serving_chat.py looks up the AR stage
by that flag to apply ``resolve_stop_token_ids`` (image-task stop set =
``<img_ratio_*>`` range). With the flag false the lookup returned None, the
AR kept the YAML default ``stop_token_ids: [<answer>]``, and the
HunyuanImage3 custom sampler's forced-transition step
``</recaption> -> <answer>`` triggered an immediate stop. The cumulative
token ids never reached ``<img_size_BASE><img_ratio_X>``, so
``ar2diffusion._extract_ratio_index`` could not recover the AR aspect and
fell back to the carried-through prompt size (1024x1024 for size=auto
edits).
Offline avoided this because end2end.py overrides the AR stage's
stop_token_ids directly without going through the comprehension-stage
lookup. Other models did not hit it because their AR stage already had
``is_comprehension: true`` (the field's framework-internal meaning).
Fix is one line on the deploy config plus a comment explaining the flag's
real semantics so the next model author does not repeat the same misread.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…c from serving_chat
PR vllm-project#3444 added 84 lines of HunyuanImage-3.0-specific handling to
``serving_chat._build_multistage_generation_inputs`` (task derivation from
reference images, legacy task-enum mapping on ``bot_task``,
``MAX_IMAGES_PER_REQUEST`` cap, and an AR-stage ``stop_token_ids`` override
via ``resolve_stop_token_ids``).
The endpoint dispatch in ``api_server.py`` (``/v1/images/edits`` vs
``/v1/images/generations``) already encodes the task split, and the AR-stage
stop override is redundant: ``HunyuanImage3ForCausalMM.sample`` already
forces an EOS after sampling a ratio token (``hunyuan_image3.py``
generation-mode branch), so leaving the YAML default stop set empty lets the
AR run through ``</recaption><answer><boi><img_size><img_ratio>`` and stop
naturally on EOS; ``ar2diffusion._extract_ratio_index`` then reads the ratio
off ``cumulative_token_ids``. The production deploy
(``vllm_omni/deploy/hunyuan_image3.yaml``) already omits ``stop_token_ids``
for stage-0.
Net effect on ``serving_chat.py``: +84/-19 -> +47/-19 (-37 lines).
Behavior verified end-to-end on ``/v1/images/edits`` with a non-square
target after removal: ``ar2diffusion`` reports
``AR ratio_idx=19, target size=1216x832`` (matches the offline
``end2end.py`` path), identical to the result with the now-removed override
in place.
Offline ``end2end.py`` still derives ``task`` and overrides
``stop_token_ids`` because it builds the params list directly without the
endpoint-level task signal; that path is intentionally unchanged.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
…g_chat cleanup
The serving_chat cleanup in the previous commit removed the legacy
caller compatibility layer that translated ``bot_task in {"it2i",
"t2i", "i2t", "t2t"}`` to ``None`` and ``bot_task in {"it2i_think",
"it2i_recaption", ...}`` to the trailing ``think``/``recaption`` part.
That translation existed because old callers stuffed task enums into
the ``bot_task`` field; the new contract is the endpoint dispatch
(``/v1/images/edits`` vs ``/v1/images/generations``) and
``reference_images`` presence carry the task signal, and ``bot_task``
only takes the documented values (``None`` / ``recaption`` / ``think``
/ ``think_recaption`` / ``vanilla``).
Two tests in
``test_serving_chat_multistage_generation.py`` were explicitly pinning
the now-removed legacy form
(``test_..._legacy_bot_task_form_unchanged``,
``test_..._legacy_composite_tasks_still_work``); deleting them.
Three other tests passed ``bot_task="it2i"`` only to trigger the
``build_prompt`` path (the *value* did not matter, just non-None);
switching them to ``bot_task="think"`` keeps the same intent against
the new validator.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2587297563@qq.com>
MaciejBalaNV pushed a commit to MaciejBalaNV/vllm-omni that referenced this pull request on May 14, 2026
…up (vllm-project#3444) Signed-off-by: TaffyOfficial <2324465096@qq.com> Signed-off-by: TaffyOfficial <wu15922848573@outlook.com> Signed-off-by: skf1999 <13234016272@163.com> Signed-off-by: zuiho <2324465096@qq.com> Signed-off-by: Claude Code <noreply@anthropic.com> Signed-off-by: zuiho <wu15922848573@outlook.com> Signed-off-by: TaffyOfficial <2587297563@qq.com> Co-authored-by: TaffyOfficial <2324465096@qq.com> Co-authored-by: TaffyOfficial <wu15922848573@outlook.com> Co-authored-by: skf1999 <13234016272@163.com>
Summary
This PR makes HunyuanImage-3.0-Instruct's IT2I path support up to 3 reference
images per request ("Multi-Image Fusion", as upstream supports), and folds in
two follow-up fixes:
- `prompt_utils` API cleanup that splits the conflated `task` parameter into
two orthogonal axes.
- Online (`/v1/images/edits`) ↔ offline (`end2end.py img2img`) AR alignment,
so identical (prompt, image, seed) produces identical AR output across the
two paths.
Both are included here rather than in separate PRs because they touch the same
`prompt_utils.py` / serving / pipeline surface and would otherwise create
3-way merge conflicts.
Logical commits in chronological order:
1. … `<img>` placeholders, per-image VAE buckets, ragged `flat_from_sizes` reconstruction.
2. … `task` and `bot_task`.
3. … `prompt_token_ids` in `/v1/images/edits`: segment-wise tokenization through `build_prompt_tokens`, mirrors offline `end2end.py`.
4. … edits endpoint, so callers can drive the same `(task, bot_task)` axes the offline example uses.
5. `size="auto"` no longer collapses non-square AR-predicted aspect: defers width/height to the AR `<img_ratio_*>` bucket via `stage_input_processors/hunyuan_image3.py`.
6. … alpha-composite, gated on Hunyuan-aware request params, fixes systematic online/offline divergence on PNGs with transparency.
7. … `latent_dist.sample()` (consumes torch global RNG) → `latent_dist.mode()` (deterministic posterior mean, matches the official cond encode path for clean `t=0` conditioning).
8. … `origin/main`: picks up `HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS`, `resolve_stop_token_ids`, and the AR stop-token plumbing from PR "[Config] Add HunyuanImage3 deploy configs" #3172.
9. … `_resize_and_crop` default back to `crop_type='resize'`, cond VAE encode back to `.sample(generator=fixed)` instead of `.mode()`. Center-crop was visibly under-conditioning portrait-input → landscape-output edits; restoring the IT2I demo's tuned preprocessing recovers the magnet repro. Supersedes Commit 7 for the VAE path (determinism preserved via fixed generator instead of posterior mean).
10. … `<|endoftext|>` for image-output tasks: `resolve_stop_token_ids` was returning `<answer>` for every (task, bot_task). For `it2i` / `t2i`, that chops off the `<answer><boi><img_size_*><img_ratio_*><|endoftext|>` tail forced by `_stage_transitions`; `_extract_ratio_index` then finds no `<img_ratio_*>` and silently collapses the DiT bucket to the first reference image's shape (square logo → 1024x1024 even when AR's CoT planned a landscape). Now returns `<|endoftext|>` for image-output tasks; comprehension `i2t` / `t2t` still stop on `<answer>`.
11. … `</recaption>`: `deploy/hunyuan_image3.yaml` sets `kv_transfer_criteria.type=special_token, token_id=128019, stop_after_transfer=false`. Shipped KV now exactly matches the prefix DiT reuses (S−N=1 invariant). AR keeps running past the snapshot so it can still emit `<img_ratio_*>` for the bridge; orchestrator `_handle_kv_ready_raw_outputs` defers the kv_ready forward when the same raw_outputs batch doesn't yet contain a finished output for that req_id, avoiding the `ar_output.outputs[0]` AttributeError that bridges hit when kv_ready fires mid-decode.
12. … @Gaohan123): `--image-path` (offline) and `_build_multistage_generation_inputs` (online) now reject …
13. … upstream (feedback from @Bounty-hunter): supersedes commits 10 and 11.
    - `resolve_stop_token_ids` for image-output tasks now returns the full `<img_ratio_*>` token range (`list(range(start_ratio, end_ratio + 1))` + `ratio_token_other_slices`), mirroring upstream `modeling_hunyuan_image_3.py:3289-3303`. AR stops AT the ratio token; KV is capped naturally; bridge `_truncate_at_cot_end` already strips the ratio tail for DiT.
    - The `kv_transfer_criteria` block in the deploy yaml is removed; orchestrator `finished_in_batch` defer is removed (no mid-decode kv_ready exists in the new flow).
    - `serving_chat._build_multistage_generation_inputs` now calls `resolve_stop_token_ids` so online matches offline `end2end.py`.
    - The `task` Form field on `/v1/images/edits` is dropped (the endpoint is always IT2I; `bot_task` / `sys_type` / `system_prompt` are the remaining knobs).
    - The `cot_token_ids_list` segment-token forwarding in `pipeline_hunyuan_image3` is also removed; see the "Optimization leftover" note below.
Plus housekeeping: ratio extraction simplified to a pure token-id reverse scan
(regex path dropped; token-ids are source of truth, AR yamls run with
`skip_special_tokens=True`), stale compound task names cleaned out of the e2e
test, AR/DiT `system_prompt` body forwarding so `sys_type='custom'` works
end-to-end, duplicate `engine_prompt["prompt_token_ids"]` assignment removed.
Commit 1 -- Multi-image IT2I support
HunyuanImage-3.0-Instruct supports up to 3 reference images per IT2I request
(README §200-216, §500). vllm-omni's DiT pipeline, AR processor, OpenAI
schema, and ar2diffusion bridge already accepted list-shaped
`multi_modal_data["image"]`, but four call sites still encoded a hard "N=1"
assumption. End-to-end smoke (4× L20X) on the official `input_1_0.png` +
`input_1_1.png` demo pair runs cleanly and preserves each image's native
bucket.
Surgery points:
- `prompt_utils.build_prompt(_tokens)` takes `num_images: int` (default 1, validated 1 ≤ N ≤ 3 for image-input tasks) and emits N consecutive `<img>` placeholders between `User:` and the user prompt, matching the official `apply_general_template` "successive user message" wrapping.
- `HunyuanImage3Processor.process_image`: each cond image keeps its own VAE `reso_group` bucket. Per-image VAE pixel tensors are flattened to 1-D and concatenated; `vae_pixel_size` declares per-image numel so vLLM splits the buffer back per image at consumption time via `MultiModalFieldConfig.flat_from_sizes(..., vae_pixel_size)` (mirrors the GLM-Image / Ming-Flash-Omni pattern).
- `_parse_and_validate_image_input` reconstructs a list of per-image `(3, H_i, W_i)` tensors from `vae_token_grid_hw`; `embed_multimodal` loops over the list for VAE encode + patch_embed.
- `examples/.../end2end.py`: `--image-path` accepts comma-separated paths; `mm_image_payload` is unwrapped to a single image when N=1 to keep the legacy single-image call shape.
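The flatten-then-split idea behind the ragged per-image buffers can be sketched in plain Python. This is only an illustration of the bookkeeping: it uses lists instead of tensors, and `flatten_ragged` / `split_by_sizes` are hypothetical names, not vLLM's `MultiModalFieldConfig.flat_from_sizes` machinery.

```python
from itertools import accumulate


def flatten_ragged(per_image: list[list[float]]) -> tuple[list[float], list[int]]:
    """Concatenate per-image 1-D buffers and record each image's element count."""
    flat = [value for image in per_image for value in image]
    sizes = [len(image) for image in per_image]
    return flat, sizes


def split_by_sizes(flat: list[float], sizes: list[int]) -> list[list[float]]:
    """Recover the per-image buffers from the flat buffer and the size list."""
    if sum(sizes) != len(flat):
        raise ValueError("sizes do not add up to the flat buffer length")
    offsets = [0, *accumulate(sizes)]
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]


# Two "images" with different element counts, standing in for different buckets.
images = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]
flat, sizes = flatten_ragged(images)
restored = split_by_sizes(flat, sizes)
print(sizes, restored == images)
```

Carrying the size list next to the flat buffer is what lets each image keep its own bucket instead of being padded or resized to a shared shape.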
Commit 2 -- prompt_utils API cleanup: split `task` and `bot_task`
CR feedback observed that the old `_TASK_PRESETS` table conflated I/O
modality with prompting mode (e.g. `it2i_think`, `t2i_recaption`,
`t2i_vanilla`) and carried a `bot_task` field that was dead code under every
`sys_type` exercised in this codebase (only `sys_type='dynamic'` consumed it,
and nothing ever set that). Split into two orthogonal axes:

| Axis | Values | … |
|---|---|---|
| `task` | `t2t`, `i2t`, `it2i`, `t2i` | … `<img>` placeholders are emitted |
| `bot_task` | `None`, `think`, `recaption`, `think_recaption`, `vanilla` | … |

Resolution table:

| `bot_task` | `sys_type` | … |
|---|---|---|
| `None` | `en_unified` | … |
| `think` | `en_unified` | `<think>` |
| `recaption` | `en_unified` | `<recaption>` |
| `think_recaption` | `en_think_recaption` | `<think>` |
| `vanilla` | `en_vanilla` | … (`task='t2i'` only) |

`bot_task='vanilla'` is validated to only combine with `task='t2i'`; unknown
`task` / `bot_task` values raise `ValueError`. Public helpers
`available_bot_tasks()` and `resolve_sys_type(bot_task)` let callers derive
the default sys_type without re-encoding the table.
Migration mapping for any downstream caller:

| Old | New |
|---|---|
| `task='t2t'` | `task='t2t', bot_task=None` |
| `task='i2t'` | `task='i2t', bot_task=None` |
| `task='it2i_think'` | `task='it2i', bot_task='think'` |
| `task='it2i_recaption'` | `task='it2i', bot_task='recaption'` |
| `task='t2i_think'` | `task='t2i', bot_task='think'` |
| `task='t2i_recaption'` | `task='t2i', bot_task='recaption'` |
| `task='t2i_vanilla'` | `task='t2i', bot_task='vanilla'` |
| … | `task='t2i', bot_task='think_recaption'` → `en_think_recaption` |

This is a hard breaking change with no aliases. Repo-wide grep across
`tests/`, `examples/`, `vllm_omni/`, and `deploy/*.yaml` confirms no
remaining references to the old compound strings or to `_TASK_PRESETS`.
Side fix on `build_prompt`: the legacy code stripped the system prompt's
leading whitespace while `build_prompt_tokens` did not. Invisible while every
system prompt was `unified_system_prompt_en` (no leading newline) but newly
observable now that `bot_task='think_recaption'` exposes `en_think_recaption`
(which starts with `\n`). `build_prompt` now keeps the system prompt verbatim,
matching the segment-by-segment tokenization path and HF's
`apply_chat_template` byte-for-byte.
`end2end.py` CLI changes: `--bot-task` choices are now
`{none, think, recaption, think_recaption, vanilla}`. The literal `none` is
the explicit way to request `bot_task=None` on a modality whose default is
`think` (e.g. `text2img` / `img2img`); leaving `--bot-task` unset still falls
back to the modality default. The duplicated `_TASK_PRESETS` literal in the
example script is removed in favor of `resolve_sys_type(bot_task)`. AR stop
token ids are now resolved programmatically via
`resolve_stop_token_ids(task, bot_task, tokenizer)` rather than hardcoded in
the deploy yaml; this keeps the example self-contained and survives yaml
drift.
Commits 3-5 -- Online (`/v1/images/edits`) ↔ offline AR byte-alignment
The OpenAI edits endpoint built the AR prompt as a single string and let the
engine tokenizer run a whole-string BPE pass; offline `end2end.py img2img`
went through `build_prompt_tokens` segment-by-segment and fed the result via
`prompt_token_ids`. The two encodings differ on segment boundaries (e.g.
user-prompt-ends-with-`。` + next-segment-`\n\n` → merged id `3490` vs HF's
`[1811, 271]`), so identical (prompt, image, seed) requests produced
diverging cot_text → diverging DiT input → diverging final image.
- `serving_chat._build_multistage_generation_inputs` now goes through `build_prompt_tokens` when a tokenizer is plumbed, byte-for-byte matching `apply_chat_template`. Also forwards `use_system_prompt` and (when the caller sets `sys_type=custom`) the verbatim `system_prompt` body so DiT can rebuild the same system prefix.
- `api_server.py` exposes new Form fields on `/v1/images/edits`: `task`, `sys_type`, `system_prompt`. Legacy callers that pass a `task` enum under the `bot_task` field still work (normalized to the canonical split). This subsumes the simpler tokenizer plumbing landed in main as PR "[Bug][Hunyuanimage 3.0] fix different AR encode behavior between online and offline" #3500; we additionally forward `bot_task`, `sys_type`, `num_images`, and `use_system_prompt`.
- `size="auto"` resolution now skips the gen_params / extra_body width/height writes that would otherwise pin the bridge to the first reference image's bucket and collapse non-square AR-predicted aspects to square in the multi-image / mismatched-aspect case. `stage_input_processors/hunyuan_image3.py` prefers the AR's predicted `<img_size_*><img_ratio_*>` tail (mirrors upstream's `reso_group[ratio_index]` lookup) over the carried-through height/width.
- `stage_input_processors/hunyuan_image3.py` also drops the regex fallback on `generated_text` for ratio extraction (only worked under `skip_special_tokens: False`, which most deploy yamls don't set) and goes straight to a token-id reverse scan against the tokenizer's `<img_ratio_*>` id range; token-ids survive `skip_special_tokens: True` and are the source of truth.
Commit 6 -- RGBA cond image normalization
Online edit requests submitting PNGs with transparency systematically
produced different AR recaption text than offline (online "3 magnets" vs
offline "1 magnet" on the same `input_2_*.png` pair). Root cause was not
CUDA / MoE non-determinism; it was a systematic preprocessing divergence:
- `end2end.py img2img` calls `Image.open(p).convert("RGB")`, which replaces transparent pixels with black background.
- … into the AR processor's vision encoder as-is, and downstream layers composited transparent pixels over a white canvas.
That single bit of difference (black bg vs white bg) on 57,671 transparent
pixels in the test PNG was enough to flip the AR's caption from a 1-object
description to a 3-object description, and the DiT followed the AR's
direction.
`api_server._load_input_images(..., normalize_rgb=True)` is now opt-in via
Hunyuan-aware params (`task` / `bot_task` / `sys_type` present in the
request), defaulting to the offline behavior of explicit `.convert("RGB")`.
The methodology lesson (systematic cross-path bias is not explainable by
stochastic CUDA/MoE non-determinism, and AR input alignment requires three
pillars: prompt token bytes, image tensor bytes, sampling params) is now
codified as CLAUDE.md hard rule B21.
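The black-versus-white background difference comes down to what happens to a fully transparent pixel. The per-pixel sketch below uses plain tuples rather than Pillow, and assumes, as the description implies, that the transparent pixels store black RGB values; `drop_alpha` and `composite_over` are hypothetical helper names.

```python
RGBA = tuple[int, int, int, int]
RGB = tuple[int, int, int]


def drop_alpha(pixel: RGBA) -> RGB:
    """Discard the alpha channel and keep the stored RGB values."""
    return pixel[:3]


def composite_over(pixel: RGBA, background: RGB) -> RGB:
    """Standard 'over' alpha compositing of one pixel onto an opaque background."""
    red, green, blue, alpha = pixel
    weight = alpha / 255
    return tuple(
        round(channel * weight + bg * (1 - weight))
        for channel, bg in zip((red, green, blue), background)
    )


transparent = (0, 0, 0, 0)     # fully transparent, black stored RGB (assumed)
opaque = (10, 20, 30, 255)     # fully opaque pixel

print(drop_alpha(transparent), composite_over(transparent, (255, 255, 255)))
print(drop_alpha(opaque), composite_over(opaque, (255, 255, 255)))
```

Opaque pixels agree on both paths; only the transparent region differs, which matches a bias that is systematic rather than random.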
Commit 7 -- Cond VAE determinism: `.sample()` → `.mode()`
Both the AR-side `model_executor/.../hunyuan_image3.py::_vae_encode` and the
DiT-side `pipeline_hunyuan_image3.py` previously called
`vae_encode_result.latent_dist.sample()` for cond image encoding. `.sample()`
without a generator consumes torch's global RNG, which is a silent
non-determinism source: cond latents drift between requests on a
long-running server while looking deterministic for fresh-process callers.
Cond image is clean (`t=0`) conditioning by design; the official upstream
takes the posterior mean for cond encode. Switched both call sites to
`.mode()` (added the method on the DiT-side `DiagonalGaussianDistribution`
to match the AR-side `autoencoder_kl_3d` shape).
Commit 9 -- Cond preprocessing revert to magnet_repro baseline
The intermediate "AR/DiT center-crop alignment" introduced earlier in this PR (`b83962160` + companion DiT comments) made AR's `_resize_and_crop` and DiT's `_resize_and_crop_center` both default to `crop_type='center'`, with the intent that AR and DiT condition on byte-identical pixels. Visually it under-conditioned the IT2I demo: a portrait input expanded into a landscape output had the conditioning crop drop too much of the relevant content, and the magnet repro regressed.
Rolled back to the magnet_repro state:

- `_resize_and_crop` default is `crop_type='resize'` (the path `infer_align_image_size=True` exercises in the IT2I demo: stretch the cond image to the bucket dims so `<img_ratio_*>` and ViT/VAE features stay aligned with the bucket rather than dropping content).
- `pipeline_hunyuan_image3.py` switches back to `latent_dist.sample(torch.Generator(device=image.device).manual_seed(0))`; the global-RNG concern from Commit 7 is addressed by the fixed seed rather than the posterior mean.
- `DiagonalGaussianDistribution.mode()` and the now-obsolete AR↔DiT-byte-match regression test (`test_ar_and_dit_condition_image_preprocessing_match_without_hf_cache`) are removed.

AR and DiT no longer share byte-identical conditioning pixels (AR stretches,
DiT center-crops), but the upstream magnet_repro tuning is faithfully
reproduced and the visual quality regression is gone.
Commit 10 — Stop AR on `<|endoftext|>` for image-output tasks

Original Commit 10:

`resolve_stop_token_ids` returned `[<answer>]` (id 128025) for every (task, bot_task) pair. For image-output tasks (`it2i` / `t2i`) the `_stage_transitions[</recaption>]` rule force-emits `<answer><boi><img_size_*>`, then `_apply_ratio_restriction` samples `<img_ratio_*>`, then `<|endoftext|>`. Stopping on `<answer>` cuts off the size/ratio tail; `ar2diffusion::_extract_ratio_index` then scans `cumulative_token_ids` for any `<img_ratio_*>` id, finds none, and falls back to the prompt-carried `height` / `width`, which is the first reference image's bucket in multi-image IT2I. Effect: a 512×512 logo + 1179×685 fabric collapses to a square output even when AR's CoT planned a landscape; width and texture regress simultaneously because DiT has to squeeze the landscape-planned content into a square.

Online didn't trip this because the deploy yaml explicitly set `stop_token_ids: [127957]` (= `<|endoftext|>`). `end2end.py` overrode the yaml with `resolve_stop_token_ids(...)`, so offline always hit the broken stop regardless of yaml.

Fix: `resolve_stop_token_ids` returns `[<|endoftext|>]` for `it2i` / `t2i` so AR runs through the forced tail and `<img_ratio_*>` reaches the bridge. `i2t` / `t2t` keep `[<answer>]`: those are comprehension stages where the response body sits inside `<answer>` and the answer-open is the natural terminator. `test_resolve_stop_token_ids_image_tasks_stop_on_eos_not_answer` pins the new split.
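The split described in this commit (later superseded by Commit 13) can be sketched as follows. The two token ids are the ones quoted above; the function name, its single-argument signature, and the task sets are assumptions for illustration, not the real `resolve_stop_token_ids`.

```python
from typing import List

ANSWER_TOKEN_ID = 128025       # <answer>, id quoted in this commit
END_OF_TEXT_TOKEN_ID = 127957  # <|endoftext|>, id quoted from the deploy yaml

IMAGE_OUTPUT_TASKS = {"it2i", "t2i"}
TEXT_OUTPUT_TASKS = {"i2t", "t2t"}


def resolve_stop_token_ids_sketch(task: str) -> List[int]:
    """Hypothetical per-task stop ids mirroring the Commit 10 split."""
    if task in IMAGE_OUTPUT_TASKS:
        # Let AR run through the forced size/ratio tail before stopping.
        return [END_OF_TEXT_TOKEN_ID]
    if task in TEXT_OUTPUT_TASKS:
        return [ANSWER_TOKEN_ID]
    raise ValueError(f"unknown task: {task!r}")
```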
Commit 11 — Cap AR KV snapshot at `</recaption>`, defer mid-decode kv_ready

Original Commit 11:

Before this commit, AR shipped its KV all the way through the `</recaption><answer><boi><img_size_*><img_ratio_*><|endoftext|>` tail. DiT then reused only the prefix up through `</recaption>` (the colleague-confirmed `positive_reuse_len` invariant), so `S − N == 6` instead of the intended `S − N == 1`: six tail-token positions of KV were transferred and immediately discarded, and the AR pipeline kept emitting tokens DiT would never use.

- `deploy/hunyuan_image3.yaml`: `stop_after_transfer: false` keeps the AR running past the snapshot so it still emits `<img_ratio_*>` for `ar2diffusion::_extract_ratio_index` (which derives output `height` / `width`). The mid-decode `kv_ready` signal that this combination produces previously crashed bridges that read `ar_output.outputs[0]` (no finished `RequestOutput` exists yet).
- `Orchestrator._handle_kv_ready_raw_outputs` now defers the `kv_ready` forward when the same `raw_outputs` batch doesn't yet contain a finished output for that `req_id`; AR's natural completion later triggers the forward through `_route_output`.

Net effect: KV transferred is byte-equivalent to what DiT actually reuses (`S − N == 1`), AR no longer wastes 5 decode steps on tail tokens that DiT discards, and `<img_ratio_*>` still reaches the bridge.

Commit 12 — Entry-layer cap on input image count (review feedback)
Per @Gaohan123's review on this PR: the `MAX_IMAGES_PER_REQUEST = 3` cap lived in `prompt_utils._validate_num_images`, which surfaced as `ValueError: num_images must be in [1, 3], got N` deep inside the AR prompt builder. The reviewer asked for a friendly, input-named error at the entry boundary so users see the limit on the parameter they actually typed.

Added in two places, both reusing `MAX_IMAGES_PER_REQUEST` (no hardcoded 3):

- `examples/offline_inference/hunyuan_image3/end2end.py`: validate `--image-path` count before opening any PIL image.
- `vllm_omni/entrypoints/openai/serving_chat.py::_build_multistage_generation_inputs`: validate `reference_images` count before building engine prompt data.

Behavior is otherwise unchanged: the deeper `_validate_num_images` cap is still a hard backstop for any future callers that don't pass through these entry points.
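A minimal sketch of an entry-layer check of this kind. `MAX_IMAGES_PER_REQUEST = 3` comes from this PR; the helper name `check_image_count` and the exact message wording are assumptions, and the sketch only covers the upper bound.

```python
from typing import Sequence

MAX_IMAGES_PER_REQUEST = 3  # value stated in this PR


def check_image_count(param_name: str, images: Sequence[str]) -> None:
    """Hypothetical entry-boundary check that names the user-facing parameter."""
    if len(images) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"{param_name} accepts at most {MAX_IMAGES_PER_REQUEST} images, "
            f"got {len(images)}"
        )


# Three images pass silently; the same helper serves both entry points.
check_image_count("--image-path", ["img1.png", "img2.png", "img3.png"])
check_image_count("reference_images", ["a.png"])
```

Naming the parameter in the message is the point of the change: the user sees `--image-path` or `reference_images` rather than the internal `num_images`.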
Commit 13 — Review iteration: align AR stop / KV cap / edits Form with upstream
Per @Bounty-hunter's review, the AR-stop and KV-cap logic from Commits 10 and 11 is replaced with the upstream-faithful approach from `modeling_hunyuan_image_3.py:3289-3303` (with `_ConditionalSliceVocabLogitsProcessor` forcing the next token after `<img_size_base>` into the ratio range).

AR's natural trajectory under `_stage_transitions` is `</recaption><answer><boi><img_size_base><img_ratio_X>`. Stopping AT the ratio token means:

- No `kv_transfer_criteria` `special_token` block or `stop_after_transfer=false` is needed in the deploy yaml.
- `ar2diffusion::_extract_ratio_index` reads the last token to derive the output H/W.
- `_truncate_at_cot_end` (already in the bridge) trims the cot at `</recaption>` before forwarding to DiT, so the trailing `<answer><boi><img_size_X><img_ratio_X>` never contaminates DiT's prompt builder.

Net deletions:

- `vllm_omni/deploy/hunyuan_image3.yaml`: drop the `omni_kv_config.kv_transfer_criteria` block (special_token + token_id + `stop_after_transfer: false`).
- `vllm_omni/engine/orchestrator.py::_handle_kv_ready_raw_outputs`: drop the `finished_in_batch` defer; mid-decode `kv_ready` no longer happens.
- `vllm_omni/entrypoints/openai/api_server.py::edit_images`: drop the `task: str | None = Form(None)` field. The endpoint is always IT2I; `bot_task` / `sys_type` / `system_prompt` cover the remaining knobs; legacy `bot_task=<task-enum>` still works via chat-handler normalization.

Net additions:

- `vllm_omni/diffusion/models/hunyuan_image3/prompt_utils.py::resolve_stop_token_ids`: image-output tasks return the full ratio token range (`list(range(128044, 128077))` for `<img_ratio_0..32>` plus `range(130103, 130107)` for `<img_ratio_33..36>`).
- `vllm_omni/entrypoints/openai/serving_chat.py::_build_multistage_generation_inputs`: after resolving `(task, bot_task)`, call `resolve_stop_token_ids` and inject into the AR-stage sampling params, matching offline `end2end.py` behavior. Without this the yaml-side default would let AR generate to `max_tokens=8192`.

Optimization leftover: unified system/user/cot tokenization
`pipeline_hunyuan_image3` previously forwarded AR-sampled `ar_token_ids` through `ar2diffusion -> extra["ar_token_ids"] -> prepare_model_inputs (cot_token_ids=...)`, preferring those token ids over re-encoding the decoded `cot_text`. This avoided BPE re-merge drift across template segment boundaries (e.g. `"。\n\n"` collapsing to a single id) that would otherwise break `positive_reuse_len` and trigger the silent slice in `inject_ar_kv_into_layers`.

Per @Bounty-hunter's review, this single-point optimization is out of scope for the multi-image PR: the right unit of work is the whole prompt (system prompt + images + user content) as one tokenization contract, and the longer-term direction is to bypass DiT re-tokenization entirely by reusing embeddings.

In this PR we delete the `ar_token_ids` plumbing in `pipeline_hunyuan_image3` and `ar2diffusion`, but keep the lower-level tokenizer primitives (`apply_chat_template(batch_cot_token_ids=...)` and `TokenizerWrapper.get_cot_sections_from_token_ids`) intact for a follow-up PR that will do the unified tokenization properly. The `_kvreuse_alignment` regression tests still pin the tokenizer-level contract.
Test plan
- `tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_multi_image.py`: 5 invariants pinned (N consecutive `<img>` placeholders for N∈{1,2,3} on string [...] one `<img>` id; out-of-range N rejected; text-only tasks ignore `num_images`).
- `tests/diffusion/models/hunyuan_image3/test_prompt_utils.py` updated to the new `(task, bot_task)` parametrization. Adds `test_available_bot_tasks_covers_all_modes`, `test_build_prompt_unknown_bot_task_raises`, `test_build_prompt_vanilla_rejects_non_t2i_task`, and `test_resolve_stop_token_ids_image_tasks_stop_on_ratio_range` (pins Commit 13: image-output tasks stop on the full `<img_ratio_*>` range, text-output on `<answer>`).
- `tests/entrypoints/openai_api/test_serving_chat_multistage_generation.py`: 4 new regression tests: multi-image placeholder count, tokenizer-plumbed byte-for-byte path, legacy `bot_task=task-enum` compat, `sys_type` override.
- `tests/entrypoints/openai_api/test_image_server.py`: `size="auto"` no longer collapses bridge-resolved aspect.
- `tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py`: DiT-side `_resize_and_crop_center` byte-matches official HF `image_processor.resize_and_crop(crop_type='center')`. (The earlier AR↔DiT byte-match assertion is removed in Commit 9 since AR is back to `crop_type='resize'`.)
- 4-image `--image-path img1,img2,img3,img4` and equivalent online `reference_images` list both raise `ValueError: ... accepts at most 3 images ...` before any model code runs. The deeper `_validate_num_images` cap (`num_images must be in [1, 3]`) remains as a backstop for direct `build_prompt(_tokens)` callers.
- [...] token output and DiT's `positive_reuse_len` on a fresh image-edit request: AR stops at the sampled `<img_ratio_X>` (last token), DiT's `positive_reuse_len` matches AR's KV length (no `S − N` drift), and `ar2diffusion._extract_ratio_index` recovers the correct ratio idx from `cumulative_token_ids`. Confirms the upstream `modeling_hunyuan_image_3.py:3289-3303` flow.
- `tests/e2e/accuracy/test_hunyuan_image3.py`: migrated to two-axis API (`task='it2i', bot_task='recaption'`).
- [...] `input_1_0.png` + `input_1_1.png` demo pair, peak 95.52 GB reserved / 90.10 GB allocated; output PNG saved cleanly with second image's native aspect preserved.
- Online (`/v1/images/edits`) ↔ offline (`end2end.py img2img`) parity on RGBA input: online now produces the same 1-magnet output as offline on `input_2_*.png` (was diverging into 3-magnet description before the RGB normalization fix).
- Repo-wide grep over `tests/`, `examples/`, `vllm_omni/`, `deploy/*.yaml` confirms no remaining references to old compound `task` strings or `_TASK_PRESETS`. Cross-product enumeration of `{t2t,i2t,it2i,t2i}×{think,recaption,think_recaption,vanilla}` = 16 names; none of the 16 appears as an active call site.
- [...] `origin/main` (`HUNYUAN_IMAGE3_SPECIAL_TOKEN_IDS` + `resolve_stop_token_ids` from PR [Config] Add HunyuanImage3 deploy configs #3172 reconciled against the two-axis API).

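The 16-name cross-product check from the grep item above can be sketched as follows. The two axes are taken from this description, and the `task_bot_task` naming follows the `it2i_recaption` / `t2i_think` examples in the commit message; `find_legacy_names` is a hypothetical helper, not part of the repo.

```python
import re
from itertools import product
from typing import List

TASKS = ("t2t", "i2t", "it2i", "t2i")
BOT_TASKS = ("think", "recaption", "think_recaption", "vanilla")

# Old-style compound names such as "it2i_recaption".
LEGACY_COMPOUND_NAMES = [
    f"{task}_{bot_task}" for task, bot_task in product(TASKS, BOT_TASKS)
]


def find_legacy_names(source_text: str) -> List[str]:
    """Return legacy compound names appearing as whole identifiers.

    Word boundaries keep "t2i_think" from matching inside "it2i_think"
    or "t2i_think_recaption".
    """
    return [
        name
        for name in LEGACY_COMPOUND_NAMES
        if re.search(rf"\b{re.escape(name)}\b", source_text)
    ]


old_call = "build_prompt(task='it2i_recaption')"
new_call = "build_prompt(task='it2i', bot_task='recaption')"
```

Running `find_legacy_names` over each file's contents flags only old-style call sites, while two-axis calls such as `new_call` come back empty.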
offline (not kv)

online (not kv)

offline (use kv)

online (use kv)