[Feature] HunyuanImage-3.0 IT2I (image editing) support #3107

Merged: hsliuustc0106 merged 4 commits into vllm-project:main from skf-1999:it2i on May 6, 2026

Conversation

@skf-1999 (Contributor) commented Apr 24, 2026

Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>

Purpose

Adds the image-to-image (image + text edit instruction -> edited image)
path for HunyuanImage-3.0-Instruct:

* Diffusion pipeline (pipeline_hunyuan_image3.py)
  - Source-image VAE + ViT encode via _encode_cond_image().
  - img2img forward path: threads batch_cond_image_info through
    prepare_model_inputs() so cond_vae_images/cond_vit_images/vit_kwargs
    reach the denoiser.
  - Unified helpers for PIL / tensor / path image loading and joint
    image-info serialisation.
  - Switch to the 'instruct' sequence_template (read from
    generation_config) instead of the hard-coded 'pretrain', so the AR
    text prefix matches the checkpoint's training distribution.

* Transformer (hunyuan_image3_transformer.py)
  - LightProjector.forward() for the ViT aligner used in IT2I
    source-image conditioning.

* Stage input processor (stage_input_processors/hunyuan_image3.py)
  - ar2diffusion() bridges AR output -> DiT input: forwards
    ar_generated_text (with AutoTokenizer fallback when detokenize=false
    on the AR stage), multi_modal_data, use_system_prompt and sampling
    params. Fixes #2590 'IT2I model ignores image' regression.

* Entry point (examples/offline_inference/hunyuan_image3/end2end.py)
  - Unified build_prompt() using the Instruct chat template for all
    tasks and modalities.
  - New img2img / img2text branches plumb multi_modal_data and
    use_system_prompt through to both stages.

* Stage configs
  - hunyuan_image3_{i2t,it2i,t2t}.yaml: declare runtime defaults and
    per-edge window_size/max_inflight for serial AR -> DiT execution.

* Worker / misc
  - diffusion_worker.py: disable cuDNN at device-init time to work
    around CUDNN_STATUS_NOT_INITIALIZED on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to the PyTorch native
    implementation.
  - rope.py: guard the optional flash_attn.ops.triton.rotary import so
    an ABI-incompatible flash-attn install does not break startup (see
    the sketch below).
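
For illustration, a minimal sketch of the guarded-import pattern described in the last bullet, assuming `apply_rotary` from `flash_attn.ops.triton.rotary` as the optional fast path; the actual fallback logic in rope.py may differ:

```python
import torch

# An ABI-incompatible flash-attn build can fail at import time (for example
# with an undefined-symbol ImportError), so treat any failure as "absent".
try:
    from flash_attn.ops.triton.rotary import apply_rotary as _flash_apply_rotary
except Exception:
    _flash_apply_rotary = None


def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Use the Triton kernel when importable, else a PyTorch-native rotation."""
    if _flash_apply_rotary is not None:
        return _flash_apply_rotary(x, cos, sin)
    # Native fallback (rotate-half convention, shown as an example).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
```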

Test Plan

python3 examples/offline_inference/hunyuan_image3/end2end.py \
    --model /data/HunyuanImage-3.0-Instruct \
    --modality img2img \
    --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml \
    --image-path /vllm-omni/edit_dog.png \
    --prompts "新年宠物海报,Q版圆润的可爱标题“新年快乐汪”,副标题“HAPPY NEW YEAR”。 鱼眼镜头,背景是房间门口,近景,上传的主体歪头笑,围着红色围巾,戴着红色毛线帽,高清,绒毛细节,面部特写。 宝丽莱相纸,超 surrealism,写实主义,胶片摄影,打印颗粒感肌理。肌理,超写实,复古感。" \
    --seed 42 \
    --steps 30 \
    --guidance-scale 5.0 \
    --output ./results

Test Result

PSNR: (screenshot attachment)

AR-stage "think" results (bot_task=think).
Official
The user wants to turn this cute golden retriever puppy photo into a New Year pet poster full of festive atmosphere. The reference image shows a puppy sitting on a wooden floor with white dandelions in the background, happily sticking its tongue out at the camera. The original instruction is very specific: add particular title text, change the puppy's accessories, adjust the composition and viewpoint, and apply a particular film-photography style. This is a medium-complexity task, since it combines text addition, object replacement, viewpoint change, and an overall style filter. First I need to handle the text, placing "新年快乐汪" ("Happy New Year, Woof") and "HAPPY NEW YEAR" at the top of the frame in a rounded, cute font. Next, for the puppy's accessories, the original pink collar should be replaced with a bright red knitted scarf plus a matching red wool hat, which greatly strengthens the New Year theme. For composition, the fisheye lens in the original instruction means the frame converges toward the puppy's head, producing an exaggerated close-up effect, with the background dandelions and floorboards bending under the perspective. Finally, to achieve the Polaroid-paper and film-photography look, I need to add a white photo-paper border around the whole image and introduce fine film grain and a retro tone so it looks like a developed old photograph. Through these steps, an ordinary pet photo becomes a highly engaging holiday poster.

This PR
The user wants to turn this cute golden retriever puppy photo into a New Year pet poster full of festive atmosphere. In the reference image, the puppy sits on a wooden floor against blurred white flowers, with a natural, fresh overall tone. To achieve the "New Year pet poster" goal, I need to completely replace the outdoor natural background with an indoor or courtyard scene with traditional Chinese New Year character. This involves a full color adjustment, shifting from natural wood and green tones to festive reds and golds. For visual elements, I should add classic symbols such as red lanterns, spring couplets, 福 (fortune) characters, and firecrackers, and place some dynamic fireworks or light-spot effects around the puppy to heighten the festive atmosphere. The puppy itself remains as the subject, but to blend into the new environment its lighting should be adjusted to a warm tone influenced by the red ambient light. The poster also needs prominent title text, such as "新年快乐" ("Happy New Year") and "宠物贺岁" ("Pets Ring in the New Year"), in a designed artistic typeface. Finally, to give the poster more ceremony, a red border with traditional patterns can be added around the frame. Through these concrete visual transformations, an ordinary pet photo becomes a thematically clear, richly detailed holiday poster.

E2E output pictures

Input / Official / This PR: (image attachments)
> **Note:** When running on an 8-card A100, please add `--vae-use-tiling` to prevent out-of-memory errors.

@skf-1999 skf-1999 requested a review from hsliuustc0106 as a code owner April 24, 2026 11:14

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00c4135ed4

return rotated.flatten(-2, -1).to(x.dtype)


def apply_rope_to_qk(

P1: Restore Wan rotary class export

This change removes RotaryEmbeddingWan from rope.py, but Wan 2.2 still imports and instantiates that symbol in vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py. In environments where Wan is loaded, the module import now fails before inference starts, so Wan pipelines become unusable. Please keep RotaryEmbeddingWan exported here (or migrate downstream imports in the same change).


"ltx2",
"pipeline_ltx2_image2video",
"LTX2I2VDMD2Pipeline",
),

P1: Re-register LTX-2.3 pipelines in diffusion registry

The commit drops LTX23Pipeline and LTX23ImageToVideoPipeline from _DIFFUSION_MODELS (and their post-process entries), so any config using those model_class_name values now fails model initialization with a registry lookup error. Since the LTX-2.3 pipeline implementations are still present in the repo, this is a functional regression in supported model loading.



def init_device(self) -> None:
"""Initialize the device and distributed environment."""
torch.backends.cudnn.enabled = False
Contributor

Why hardcode torch.backends.cudnn.enabled = False here?

Contributor Author

Indeed, there is quite a bit of redundant code at present.

# Detect KV-reuse stage configs. Note: the AR prompt layout is now the
# same Instruct template in both paths (see build_prompt); the flag is
# only used for informational purposes.
_kv_reuse_configs = {"hunyuan_image3_it2i_kv_reuse.yaml"}
Contributor

This PR seems unrelated to KV reuse.

Contributor Author

The code will be adjusted accordingly.

@skf-1999 skf-1999 force-pushed the it2i branch 2 times, most recently from 033ff7a to c238a8e on April 27, 2026 09:01
@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 27, 2026
@Bounty-hunter (Contributor)

Do we need an accuracy test like test_hunyuanimage3_text2img.py?

prompt_images = mm_data.get("images")
if prompt_images is not None:
diffusion_input["pil_image"] = prompt_images
diffusion_input["multi_modal_data"] = {"image": prompt_images}
Contributor

Why are these two duplicate fields needed?

Collaborator

solved?

Contributor Author

Yes. We removed the changes.

generated_token_ids = output.cumulative_token_ids
generated_text = getattr(output, "text", "") or ""

if not generated_text and generated_token_ids:
Contributor

Can we just detokenize in the AR stage by setting detokenize: true in the YAML?

TaffyOfficial pushed a commit to TaffyOfficial/vllm-omni that referenced this pull request Apr 30, 2026
HunyuanImage3TokenizerFast.apply_general_template uses Assistant: as
the bot role prefix in instruct sequence_template (verified by
decoding HF prepare_model_inputs output with system_prompt=en_unified
+ image + bot_task=think: token 72803 = "Assistant"). Switch
build_prompt() to use the full word so the AR prefill aligns with the
official HF tokenization.

Also unify T2T to the same en_unified + Assistant: template (PR vllm-project#3107
reference implementation does the same; the previous T2T-specific
branch was a workaround for an earlier prompt-format experiment).

Note: BPE merge across user_prompt/Assistant boundary still produces
1 merged token (e.g. "。\n\n" -> single id) where HF apply_chat_template
keeps them separate. Full byte-identical alignment requires passing
pre-tokenized prompt_token_ids — that path is supported by vllm-omni
(OmniTokensPrompt) but not yet plumbed through build_prompt().

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@Gaohan123 (Collaborator) left a comment


Please add an e2e functional L4 test. Thanks.

@skf-1999 (Contributor Author)

> Please add an e2e functional L4 test. Thanks.

We will align the precision as soon as possible, streamline the code, and add tests.

@Gaohan123 Gaohan123 added the high priority label Apr 30, 2026
raise TypeError(f"Unsupported image input type: {type(image)}")


def _resize_and_crop_center(image: PILImage.Image, target_width: int, target_height: int) -> PILImage.Image:
Contributor

The AR stage also has the same function, HunyuanImage3Processor::process_image. Can we inherit from it or use another approach to reuse that implementation, so their behavior remains consistent?

Collaborator

solved?

Contributor

Maybe this can be addressed in the next PR

@Bounty-hunter (Contributor)

Since AR and DiT call different functions to construct the input template (or input token ids), for example prepare_model_inputs for DiT, maybe we should add unit tests to verify that their behavior remains consistent.

TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 4, 2026
- it2i.yaml: switch AR stage to detokenize=true so ar_generated_text comes
  from the engine; drop the AutoTokenizer fallback in ar2diffusion that
  was a workaround for detokenize=false.
- ar2diffusion: send the conditioning image only via multi_modal_data
  (matching vLLM's standard schema); the pipeline pre-process already
  reads it from there. Removes the duplicate pil_image field.
- pipeline_hunyuan_image3._resize_and_crop_center: replace with the
  exact algorithm used by HunyuanImage3Processor._resize_and_crop on
  the AR side so AR and DiT preprocess condition images identically.

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 4, 2026
Address PR vllm-project#3107 review feedback from Bounty-hunter (2026-05-04):
'AR and DiT call different functions to construct the input template
(prepare_model_inputs for DiT). Maybe we should add unit tests to
verify that their behavior remains consistent.'

Two CPU-only structural guards added to test_prompt_utils.py:

1. test_dit_pipeline_reads_sequence_template_from_generation_config
   AST-verifies that HunyuanImage3Pipeline.prepare_model_inputs still
   pulls sequence_template from generation_config (not hardcoded) and
   forwards it into the tokenizer wrapper. Catches a regression where
   someone hardcodes 'pretrain' or removes the lookup -- the historic
   shape of the AR/DiT template-drift bug.

2. test_dit_tokenizer_wrapper_supports_instruct_branch
   Asserts the DiT-side TokenizerWrapper still recognizes
   sequence_template='instruct' (the AR-side instruct template anchors
   live in the model's chat-template definition, not in the wrapper
   module itself, so we route on the dispatch keyword instead of the
   literal anchor strings).

Both tests are pure-AST/string scans and require no GPU, model
weights, or HF cache, so they run in the same core_model+cpu lane as
the rest of test_prompt_utils.py.

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
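
As an illustration of guard 1, the core of such an AST scan can be as small as the following; this is a hypothetical reduction, and the real test asserts more than a bare attribute read:

```python
import ast
import inspect
import textwrap


def reads_attribute(func, attr: str) -> bool:
    """True if the function's source contains any `<expr>.<attr>` access."""
    source = textwrap.dedent(inspect.getsource(func))
    return any(
        isinstance(node, ast.Attribute) and node.attr == attr
        for node in ast.walk(ast.parse(source))
    )


# e.g. assert reads_attribute(HunyuanImage3Pipeline.prepare_model_inputs,
#                             "sequence_template")
```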
# Siglip2VisionModel; the top-level module is now itself the vision
# tower. Older transformers expose both, so dropping `.vision_model`
# is forward-compatible.
self.vision_model = Siglip2VisionModel(vision_config)
Contributor

Can we just import Siglip2VisionTransformer from vllm_omni.model_executor.models.hunyuan_image3.siglip2?

Contributor

We could, but I tested both paths end-to-end and they're numerically equivalent — so the choice is mostly stylistic. A couple of practical notes:

Numerical equivalence (verified). With identical weights + identical inputs (cartoon test image, dtype=bf16):

| | transformers.Siglip2VisionModel | local siglip2.Siglip2VisionTransformer |
| --- | --- | --- |
| last_hidden_state.std() | 0.06015 | 0.06015 |
| max abs diff vs the other | — | 0.01544 |
| mean abs diff vs the other | — | 0.00003 |

Both also produce 1119/1152 channels with cross-token std < 0.01 — that's the model's natural post-LayerNorm behavior, not a wrapper bug.

Trade-offs of the local import

  • ➕ Self-contained, no dependency on transformers internals; matches what the HF snapshot bundles, byte-for-byte
  • ➖ The local file uses _prepare_4d_attention_mask from transformers.modeling_attn_mask_utils, which is deprecated and slated for removal in transformers 5.10 (DEPRECATION warning fires today). The transformers impl already migrated to create_bidirectional_mask
  • ➖ Different forward signature (attention_mask= vs pixel_attention_mask=, requires explicit return_dict=True, takes a dict instead of Siglip2VisionConfig) — call sites need adapting (vit_kwargs is currently keyed pixel_attention_mask per transformers 5.x)
  • ➖ We'd carry two SigLIP2 implementations in-tree (already imported by model_executor.models.hunyuan_image3.hunyuan_image3 for the AR side)

Recommendation: Keep transformers.Siglip2VisionModel here. It's numerically equivalent, maintained upstream, and avoids the deprecated mask-utils path. If we want to drop the transformers dep entirely we should do it consistently across both AR and DiT paths in a separate refactor PR.

(Also note: I was investigating a separate painterly-style drift suspecting this site, and confirmed via the same test that the SigLIP2 wrapper isn't the cause — the cond features come out identical either way.)
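
For reference, a sketch of the kind of comparison behind the table above; the per-model input dicts reflect the different forward signatures noted in the trade-offs, and the function name and printing are illustrative, not the actual comparison script:

```python
import torch


@torch.no_grad()
def compare_towers(model_a, model_b, inputs_a: dict, inputs_b: dict) -> None:
    """Print std and absolute-difference stats for two vision towers."""
    # Separate input dicts because the two implementations take different
    # keywords (attention_mask= vs pixel_attention_mask=, return_dict=True).
    a = model_a(**inputs_a).last_hidden_state.float()
    b = model_b(**inputs_b).last_hidden_state.float()
    diff = (a - b).abs()
    print(f"std(a)={a.std():.5f} std(b)={b.std():.5f}")
    print(f"max|a-b|={diff.max():.5f} mean|a-b|={diff.mean():.5f}")
```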

Adds image-to-image editing capability for tencent/HunyuanImage-3.0-Instruct,
using the same two-stage AR -> DiT pipeline as the existing T2I path with
the AR stage receiving an additional condition image alongside the user
prompt.

Highlights:

* Pipeline & runtime
  - vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py:
    cond image VAE-encode, ViT-encode, and scatter the resulting features
    into the DiT prefill via instantiate_vae_image_tokens /
    instantiate_vit_image_tokens (matches HF reference modeling layout).
  - vllm_omni/model_executor/stage_input_processors/hunyuan_image3.py:
    ar2diffusion bridge forwards condition image + system_prompt + user
    prompt from AR stage to DiT stage.
  - vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml:
    8-GPU IT2I stage config (4 AR + 4 DiT).
  - examples/offline_inference/hunyuan_image3/end2end.py + README.md:
    img2img modality entry; prompt_dict uses vllm-standard `prompt` key
    so the offline path receives the raw user prompt at the DiT stage
    (DiT pipeline reads `p.get("prompt")` only).

* DiT MoE accuracy fixes (stale 0.18-era code surfaced as bugs after
  the 0.20 rebase). Both addressed by aligning with the upstream PR
  vllm-project#3373 by @dengyunyang, who independently surfaced the
  same accuracy gap.

  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_fused_moe.py:
    HunyuanFusedMoEDefault used to register a forward pre-hook that
    called `self.quant_method.process_weights_after_loading(self)` on
    first forward, to compensate for the 0.18-era standard model loader
    not invoking it on FusedMoE layers. vLLM 0.20's standard loader
    (`model_executor/model_loader/base_loader.py`) now invokes
    `process_weights_after_loading` model-wide on init, so the hook
    fires a second time on first forward, double-applying non-idempotent
    in-place transforms (`UnquantizedFusedMoEMethod._maybe_pad_weight`
    re-pads w13/w2 in place; `_setup_kernel` re-registers the moe_kernel
    oracle on already-padded weights). Corrupted w13/w2 layout + wrong
    kernel oracle config produces a small per-token, per-layer expert-
    dispatch bias that accumulates across the 32 DiT MoE layers into a
    "painterly / oil texture" attractor on the generated image. The
    unquantized FusedMoE method has no
    `_already_called_process_weights_after_loading` guard (only the FP8
    quant method does), so non-quantized HunyuanImage3 reliably trips
    this. Hook deliberately not registered.

  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py
    (HunYuanSparseMoeBlock):
    Drop external `shared_experts` merge + `maybe_all_reduce_tensor_model_parallel`
    in forward, and drop `reduce_results=False` on the FusedMoE init.
    Since vLLM 0.20, when `shared_experts` is passed to FusedMoE, the
    `shared_mlp` output is merged inside FusedMoE.forward and the TP
    all-reduce is done internally; the wrapper code that did both of
    these externally was a 0.18-era workaround that became a double
    op after 0.20. Net effect of double-reduce + double shared_mlp add
    was a small numerical bias on top of the painterly drift; removing
    the wrapper restores HF-reference parity.

  Verified on 4xL20X TP=2/2 (vllm 0.20.0 + torch 2.11.0+cu130): same
  cartoon-block input + cute orange cat prompt yields a clean flat-
  cartoon output, visually matching HF generate_image() reference.

* Tests
  - tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py:
    unit-level - AR prefill input_ids byte-equal HF chat template,
    image-tensor byte-equal AR-side processor.
  - tests/e2e/accuracy/test_hunyuan_image3_it2i.py:
    full-pipeline e2e - vllm-omni AR -> DiT vs HF generate_image() at
    PSNR >= 40 dB on the same (condition_image, prompt, seed) tuple.

Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: skf <54565339+skf-1999@users.noreply.github.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
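
To make the double-apply hazard concrete, here is a reconstruction of the 0.18-era one-shot pre-hook pattern the commit says it deliberately no longer registers; the exact original code may have differed:

```python
import torch.nn as nn


def register_one_shot_weight_processing(layer: nn.Module) -> None:
    """0.18-era workaround: run process_weights_after_loading on first forward.

    Since vLLM 0.20 the standard loader already calls
    process_weights_after_loading model-wide at init, so this hook would
    re-run non-idempotent in-place transforms (weight re-padding,
    kernel-oracle setup) a second time. The fix is simply not to register it.
    """
    def _hook(module: nn.Module, args: tuple) -> None:
        module.quant_method.process_weights_after_loading(module)
        handle.remove()  # one-shot: detach after the first forward

    handle = layer.register_forward_pre_hook(_hook)
```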
@skf-1999 skf-1999 changed the title [WIP][Feature] HunyuanImage-3.0 IT2I (image editing) support [Feature] HunyuanImage-3.0 IT2I (image editing) support May 6, 2026
pytestmark = [pytest.mark.full_model, pytest.mark.diffusion]


MODEL_NAME = "tencent/HunyuanImage-3.0-Instruct"
Contributor

Add to nightly CI.

Contributor

We will open another PR later and add accuracy metrics and functional test cases to CI together.

def __init__(self, *, prefix: str = "", **kwargs: Any) -> None:
super().__init__(prefix=prefix, **kwargs)
self._prefix = prefix
# NOTE: prior to vLLM 0.20, this class registered a forward pre-hook
Contributor

remove useless comments

Contributor Author

Changes applied

else:
self.shared_mlp = None

# Since vLLM 0.20, FusedMoE merges `shared_experts` output and runs
Contributor

remove useless comments

Contributor Author

Modifications completed

height = original_prompt.get("height", 1024)
width = original_prompt.get("width", 1024)
text_prompt = original_prompt.get("prompt", "")
text_prompt = original_prompt.get("user_prompt") or original_prompt.get("prompt", "")
Contributor

I can't see the user_prompt field initialized in end2end.py.

Contributor Author

Deleted the unnecessary changes.

TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
Address PR vllm-project#3107 review (Bounty-hunter / Gaohan123) requesting
AR-output-format and DiT-output-accuracy regression tests. Layout
mirrors PR vllm-project#2949's split (CPU unit test under tests/diffusion/...,
GPU accuracy test under tests/e2e/accuracy/...).

CPU unit test
  tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py
  - test_ar_prefill_tokens_match_hf_apply_chat_template_for_it2i:
    asserts build_prompt_tokens (the AR-side prefill builder) is
    token-id-identical to HF tokenizer.apply_chat_template for the
    same (system, user_prompt, image) triple. Catches drift between
    the AR's input distribution and the model's training distribution
    -- the same failure mode PR vllm-project#3243 fixed for T2I.
  - test_dit_condition_image_preprocessing_byte_matches_ar_processor:
    asserts the diffusion-side _resize_and_crop_center produces
    byte-identical pixels to the AR-side
    HunyuanImage3Processor._resize_and_crop on the canonical resize
    targets. Direct response to Bounty-hunter's PR vllm-project#3107 review.

Both tests gate on tencent/HunyuanImage-3.0-Instruct being in the local
HF cache (no GPU/model weights required at runtime, just the tokenizer
config + image processor).

GPU accuracy test
  tests/e2e/accuracy/test_hunyuan_image3_it2i.py
  - test_hunyuan_image3_it2i_matches_hf_reference_psnr_40:
    drives vllm-omni's offline IT2I path through Omni and runs the
    official HF reference via AutoModelForCausalLM.generate_image,
    compared via the shared assert_similarity helper at PSNR>=40 dB
    and SSIM>=0.92. Marked full_model + skipif<8 GPUs; the threshold
    follows PR vllm-project#2949's review discussion (40 dB gives slack for TP=2
    NCCL drift while still catching prompt/image-preprocessing bugs).

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
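
For orientation, a minimal PSNR gate in the spirit of the shared assert_similarity helper; the real helper's signature and its SSIM branch are not shown, and this reduction is an assumption:

```python
import numpy as np


def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)


def assert_similarity(img_a, img_b, min_psnr: float = 40.0) -> None:
    value = psnr(np.asarray(img_a), np.asarray(img_b))
    assert value >= min_psnr, f"PSNR {value:.2f} dB below the {min_psnr} dB gate"
```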
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
…output alignment

The previous CPU-side test in test_hunyuan_image3_it2i_ar_format.py
called the official tokenizer's apply_chat_template to render the AR
prefill prompt and compared its token id sequence to vllm-omni's
build_prompt_tokens output. Two problems:

  - it tested the *input* prompt only, not the AR's *generated output*
    (which is what 'AR output matches official' actually demands);
  - HunyuanImage3TokenizerFast.from_pretrained(snap) returns a
    byte-fallback (char-level) tokenizer in a vacuum, which is not the
    encoding the vllm-omni production path uses (AutoTokenizer with the
    BPE merges from tokenizer.json) -- so the comparison was apples vs
    char-bytes and could never pass on PyPI 0.20.x.

Replaced with a real GPU-required e2e test in
tests/e2e/accuracy/test_hunyuan_image3_it2i_ar_output.py that:

  - drives the HF reference via AutoModelForCausalLM.from_pretrained +
    model.prepare_model_inputs + model.generate(do_sample=False) (the
    pattern already in scripts/bench/bench_ar_hf.py);
  - drives vllm-omni AR via the i2t stage YAML with the prompt fed as
    prompt_token_ids = HF prefill (the alignment path documented in
    workflow-starter/memory/hf/hf_omni_alignment_method.md);
  - asserts prefill input_ids byte-equality and the first 8 of 16
    greedy AR-generated tokens match between HF and omni.

Skips cleanly when the snapshot is missing the two manual modeling
patches (RoPE broadcast / 2D attention_mask SDPA fall-through) that
the project's HF baseline runbook requires.

The CPU-only DiT/HF condition-image preprocessing byte-equality check
in test_hunyuan_image3_it2i_ar_format.py is preserved -- it passes
locally and guards Bounty-hunter's PR vllm-project#3107 review item directly.

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
Signed-off-by: zuiho <2324465096@qq.com>
@TaffyOfficial (Contributor)

@Bounty-hunter @Gaohan123 @hsliuustc0106 need review

@Bounty-hunter (Contributor)

LGTM now

token_ids = build_prompt_tokens(p, tokenizer, task=task, sys_type=args.sys_type)

prompt_dict: dict = {"prompt_token_ids": token_ids, "prompt": p}
preset_sys_type, _, _ = _TASK_PRESETS[task]
Collaborator

I suggest we list tasks explicitly in end2end.py rather than enumerating the private _TASK_PRESETS mapping.

Contributor Author

Suggestion adopted.

@Gaohan123 Gaohan123 added the ready label May 6, 2026
@hsliuustc0106 (Collaborator) left a comment


lgtm

@hsliuustc0106 hsliuustc0106 merged commit 576afb6 into vllm-project:main May 6, 2026
8 checks passed
gcanlin pushed a commit that referenced this pull request May 7, 2026
…rebase (#3395)

Signed-off-by: dengyunyang <584797741@qq.com>
TaffyOfficial pushed a commit to TaffyOfficial/vllm-omni that referenced this pull request May 9, 2026
Default IT2I (`hunyuan_image3_it2i.yaml`) and the AR+DiT T2I config
(`hunyuan_image3_t2i_2gpu.yaml`) left `skip_special_tokens` at the vLLM
default (True), so the AR engine stripped the trailing
`<img_size_BASE><img_ratio_Y>` markers from `output.text` before the
`ar2diffusion` bridge could read them. With the previous
ar2diffusion fix, that meant the bridge fell back to the prompt-carried
height/width — i.e. `pil_images[0].size` from the OpenAI edits path,
which collapsed to a square whenever the first reference image was
square.

The KV-reuse config (`hunyuan_image3_it2i_kv_reuse.yaml`) already sets
this flag (added in vllm-project#3346 because the KV reuse machinery needs the
exact AR token stream), but the original IT2I yaml from vllm-project#3107 did not
need it at the time and was never updated when ar2diffusion grew the
ratio-token consumer.

Aligns both configs with `_kv_reuse.yaml`. AR token-id fallback in
ar2diffusion still works for users who keep the default, but having
the text path live by default is cheaper (no tokenizer load) and
avoids the model-name/path ambiguity the token-id fallback hits when
the model is loaded from a local directory rather than the HF hub
identifier.

Signed-off-by: TaffyOfficial <wu15922848573@outlook.com>
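
A sketch of the text-path/token-path split this commit message describes, reusing the field names from the diff quoted earlier in the thread (`output.text`, `output.cumulative_token_ids`); the decode call is an assumption about the fallback, not the exact bridge code:

```python
from transformers import AutoTokenizer


def ar_text_with_fallback(output, model_path: str) -> str:
    # Preferred path: detokenize=true with skip_special_tokens=false keeps
    # the trailing <img_size_...><img_ratio_...> markers in output.text.
    text = getattr(output, "text", "") or ""
    if text:
        return text
    # Fallback: re-decode the token ids without stripping special tokens.
    # This costs a tokenizer load and is ambiguous when model_path is a
    # local directory rather than a hub identifier, as noted above.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    return tokenizer.decode(output.cumulative_token_ids, skip_special_tokens=False)
```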
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…#3107)

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: zuiho <2324465096@qq.com>
Signed-off-by: skf1999 <13234016272@163.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Labels

high priority, ready
