[Feature] HunyuanImage-3.0 IT2I (image editing) support #3107

Merged: hsliuustc0106 merged 4 commits into vllm-project:main from skf-1999:it2i on May 6, 2026

Conversation

@skf-1999 (Contributor) commented Apr 24, 2026

Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>

Purpose

Adds the image-to-image (image + text edit instruction -> edited image)
path for HunyuanImage-3.0-Instruct:

* Diffusion pipeline (pipeline_hunyuan_image3.py)
  - Source-image VAE + ViT encode via _encode_cond_image().
  - img2img forward path: threads batch_cond_image_info through
    prepare_model_inputs() so cond_vae_images/cond_vit_images/vit_kwargs
    reach the denoiser.
  - Unified helpers for PIL / tensor / path image loading and joint
    image-info serialisation.
  - Switch to the 'instruct' sequence_template (read from
    generation_config) instead of the hard-coded 'pretrain', so the AR
    text prefix matches the checkpoint's training distribution.

* Transformer (hunyuan_image3_transformer.py)
  - LightProjector.forward() for the ViT aligner used in IT2I
    source-image conditioning.

* Stage input processor (stage_input_processors/hunyuan_image3.py)
  - ar2diffusion() bridges AR output -> DiT input: forwards
    ar_generated_text (with AutoTokenizer fallback when detokenize=false
    on the AR stage), multi_modal_data, use_system_prompt and sampling
    params. Fixes #2590 'IT2I model ignores image' regression.

* Entry point (examples/offline_inference/hunyuan_image3/end2end.py)
  - Unified build_prompt() using the Instruct chat template for all
    tasks and modalities.
  - New img2img / img2text branches plumb multi_modal_data and
    use_system_prompt through to both stages.

* Stage configs
  - hunyuan_image3_{i2t,it2i,t2t}.yaml: declare runtime defaults and
    per-edge window_size/max_inflight for serial AR -> DiT execution.

* Worker / misc
  - diffusion_worker.py: disable cuDNN at device-init time to work
    around CUDNN_STATUS_NOT_INITIALIZED on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to the PyTorch native
    implementation.
  - rope.py: guard the optional flash_attn.ops.triton.rotary import so
    an ABI-incompatible flash-attn install does not break startup (see
    the sketch below).
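
For illustration, a minimal sketch of the guarded-import pattern described in the last bullet, assuming `apply_rotary` from `flash_attn.ops.triton.rotary` as the optional fast path; the actual fallback logic in rope.py may differ:

```python
import torch

# An ABI-incompatible flash-attn build can fail at import time (for example
# with an undefined-symbol ImportError), so treat any failure as "absent".
try:
    from flash_attn.ops.triton.rotary import apply_rotary as _flash_apply_rotary
except Exception:
    _flash_apply_rotary = None


def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Use the Triton kernel when importable, else a PyTorch-native rotation."""
    if _flash_apply_rotary is not None:
        return _flash_apply_rotary(x, cos, sin)
    # Native fallback (rotate-half convention, shown as an example).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
```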

Test Plan

python3 examples/offline_inference/hunyuan_image3/end2end.py \
    --model /data/HunyuanImage-3.0-Instruct \
    --modality img2img \
    --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml \
    --image-path /vllm-omni/edit_dog.png \
    --prompts "新年宠物海报,Q版圆润的可爱标题“新年快乐汪”,副标题“HAPPY NEW YEAR”。 鱼眼镜头,背景是房间门口,近景,上传的主体歪头笑,围着红色围巾,戴着红色毛线帽,高清,绒毛细节,面部特写。 宝丽莱相纸,超 surrealism,写实主义,胶片摄影,打印颗粒感肌理。肌理,超写实,复古感。" \
    --seed 42 \
    --steps 30 \
    --guidance-scale 5.0 \
    --output ./results

Test Result

PSNR: (screenshot attachment)

AR-stage "think" results (bot_task=think).
Official
The user wants to turn this cute golden retriever puppy photo into a New Year pet poster full of festive atmosphere. The reference image shows a puppy sitting on a wooden floor with white dandelions in the background, happily sticking its tongue out at the camera. The original instruction is very specific: add particular title text, change the puppy's accessories, adjust the composition and viewpoint, and apply a particular film-photography style. This is a medium-complexity task, since it combines text addition, object replacement, viewpoint change, and an overall style filter. First I need to handle the text, placing "新年快乐汪" ("Happy New Year, Woof") and "HAPPY NEW YEAR" at the top of the frame in a rounded, cute font. Next, for the puppy's accessories, the original pink collar should be replaced with a bright red knitted scarf plus a matching red wool hat, which greatly strengthens the New Year theme. For composition, the fisheye lens in the original instruction means the frame converges toward the puppy's head, producing an exaggerated close-up effect, with the background dandelions and floorboards bending under the perspective. Finally, to achieve the Polaroid-paper and film-photography look, I need to add a white photo-paper border around the whole image and introduce fine film grain and a retro tone so it looks like a developed old photograph. Through these steps, an ordinary pet photo becomes a highly engaging holiday poster.

This PR
The user wants to turn this cute golden retriever puppy photo into a New Year pet poster full of festive atmosphere. In the reference image, the puppy sits on a wooden floor against blurred white flowers, with a natural, fresh overall tone. To achieve the "New Year pet poster" goal, I need to completely replace the outdoor natural background with an indoor or courtyard scene with traditional Chinese New Year character. This involves a full color adjustment, shifting from natural wood and green tones to festive reds and golds. For visual elements, I should add classic symbols such as red lanterns, spring couplets, 福 (fortune) characters, and firecrackers, and place some dynamic fireworks or light-spot effects around the puppy to heighten the festive atmosphere. The puppy itself remains as the subject, but to blend into the new environment its lighting should be adjusted to a warm tone influenced by the red ambient light. The poster also needs prominent title text, such as "新年快乐" ("Happy New Year") and "宠物贺岁" ("Pets Ring in the New Year"), in a designed artistic typeface. Finally, to give the poster more ceremony, a red border with traditional patterns can be added around the frame. Through these concrete visual transformations, an ordinary pet photo becomes a thematically clear, richly detailed holiday poster.

E2E output pictures

Input / Official / This PR: (image attachments)
> **Note:** When running on an 8-card A100, please add `--vae-use-tiling` to prevent out-of-memory errors.

@skf-1999 skf-1999 requested a review from hsliuustc0106 as a code owner April 24, 2026 11:14

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00c4135ed4

return rotated.flatten(-2, -1).to(x.dtype)


def apply_rope_to_qk(

P1: Restore Wan rotary class export

This change removes RotaryEmbeddingWan from rope.py, but Wan 2.2 still imports and instantiates that symbol in vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py. In environments where Wan is loaded, the module import now fails before inference starts, so Wan pipelines become unusable. Please keep RotaryEmbeddingWan exported here (or migrate downstream imports in the same change).


"ltx2",
"pipeline_ltx2_image2video",
"LTX2I2VDMD2Pipeline",
),

P1: Re-register LTX-2.3 pipelines in diffusion registry

The commit drops LTX23Pipeline and LTX23ImageToVideoPipeline from _DIFFUSION_MODELS (and their post-process entries), so any config using those model_class_name values now fails model initialization with a registry lookup error. Since the LTX-2.3 pipeline implementations are still present in the repo, this is a functional regression in supported model loading.



def init_device(self) -> None:
"""Initialize the device and distributed environment."""
torch.backends.cudnn.enabled = False
Contributor

Why hardcode torch.backends.cudnn.enabled = False here?

Contributor Author

Indeed, there is quite a bit of redundant code at present.

# Detect KV-reuse stage configs. Note: the AR prompt layout is now the
# same Instruct template in both paths (see build_prompt); the flag is
# only used for informational purposes.
_kv_reuse_configs = {"hunyuan_image3_it2i_kv_reuse.yaml"}
Contributor

This PR seems unrelated to KV reuse.

Contributor Author

The code will be adjusted accordingly.

@skf-1999 skf-1999 force-pushed the it2i branch 2 times, most recently from 033ff7a to c238a8e on April 27, 2026 09:01
@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 27, 2026
@Bounty-hunter (Contributor)

Do we need an accuracy test like test_hunyuanimage3_text2img.py?

prompt_images = mm_data.get("images")
if prompt_images is not None:
diffusion_input["pil_image"] = prompt_images
diffusion_input["multi_modal_data"] = {"image": prompt_images}
Contributor

Why are these two duplicate fields needed?

Collaborator

solved?

Contributor Author

Yes. We removed the changes.

generated_token_ids = output.cumulative_token_ids
generated_text = getattr(output, "text", "") or ""

if not generated_text and generated_token_ids:
Contributor

Can we just detokenize in the AR stage by setting detokenize: true in the YAML?

TaffyOfficial pushed a commit to TaffyOfficial/vllm-omni that referenced this pull request Apr 30, 2026
HunyuanImage3TokenizerFast.apply_general_template uses Assistant: as
the bot role prefix in instruct sequence_template (verified by
decoding HF prepare_model_inputs output with system_prompt=en_unified
+ image + bot_task=think: token 72803 = "Assistant"). Switch
build_prompt() to use the full word so the AR prefill aligns with the
official HF tokenization.

Also unify T2T to the same en_unified + Assistant: template (PR vllm-project#3107
reference implementation does the same; the previous T2T-specific
branch was a workaround for an earlier prompt-format experiment).

Note: BPE merge across user_prompt/Assistant boundary still produces
1 merged token (e.g. "。\n\n" -> single id) where HF apply_chat_template
keeps them separate. Full byte-identical alignment requires passing
pre-tokenized prompt_token_ids — that path is supported by vllm-omni
(OmniTokensPrompt) but not yet plumbed through build_prompt().

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@Gaohan123 (Collaborator) left a comment


Please add an e2e functional L4 test. Thanks.

@skf-1999 (Contributor Author)

> Please add an e2e functional L4 test. Thanks.

We will align the precision as soon as possible, streamline the code, and add tests.

@Gaohan123 Gaohan123 added the high priority label Apr 30, 2026
raise TypeError(f"Unsupported image input type: {type(image)}")


def _resize_and_crop_center(image: PILImage.Image, target_width: int, target_height: int) -> PILImage.Image:
Contributor

The AR stage also has the same function, HunyuanImage3Processor::process_image. Can we inherit from it or use another approach to reuse that implementation, so their behavior remains consistent?

Collaborator

solved?

Contributor

Maybe this can be addressed in the next PR

@Bounty-hunter (Contributor)

Since AR and DiT call different functions to construct the input template (or input token ids), for example prepare_model_inputs for DiT, maybe we should add unit tests to verify that their behavior remains consistent.

TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 4, 2026
- it2i.yaml: switch AR stage to detokenize=true so ar_generated_text comes
  from the engine; drop the AutoTokenizer fallback in ar2diffusion that
  was a workaround for detokenize=false.
- ar2diffusion: send the conditioning image only via multi_modal_data
  (matching vLLM's standard schema); the pipeline pre-process already
  reads it from there. Removes the duplicate pil_image field.
- pipeline_hunyuan_image3._resize_and_crop_center: replace with the
  exact algorithm used by HunyuanImage3Processor._resize_and_crop on
  the AR side so AR and DiT preprocess condition images identically.

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 4, 2026
Address PR vllm-project#3107 review feedback from Bounty-hunter (2026-05-04):
'AR and DiT call different functions to construct the input template
(prepare_model_inputs for DiT). Maybe we should add unit tests to
verify that their behavior remains consistent.'

Two CPU-only structural guards added to test_prompt_utils.py:

1. test_dit_pipeline_reads_sequence_template_from_generation_config
   AST-verifies that HunyuanImage3Pipeline.prepare_model_inputs still
   pulls sequence_template from generation_config (not hardcoded) and
   forwards it into the tokenizer wrapper. Catches a regression where
   someone hardcodes 'pretrain' or removes the lookup -- the historic
   shape of the AR/DiT template-drift bug.

2. test_dit_tokenizer_wrapper_supports_instruct_branch
   Asserts the DiT-side TokenizerWrapper still recognizes
   sequence_template='instruct' (the AR-side instruct template anchors
   live in the model's chat-template definition, not in the wrapper
   module itself, so we route on the dispatch keyword instead of the
   literal anchor strings).

Both tests are pure-AST/string scans and require no GPU, model
weights, or HF cache, so they run in the same core_model+cpu lane as
the rest of test_prompt_utils.py.

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
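
As an illustration of guard 1, the core of such an AST scan can be as small as the following; this is a hypothetical reduction, and the real test asserts more than a bare attribute read:

```python
import ast
import inspect
import textwrap


def reads_attribute(func, attr: str) -> bool:
    """True if the function's source contains any `<expr>.<attr>` access."""
    source = textwrap.dedent(inspect.getsource(func))
    return any(
        isinstance(node, ast.Attribute) and node.attr == attr
        for node in ast.walk(ast.parse(source))
    )


# e.g. assert reads_attribute(HunyuanImage3Pipeline.prepare_model_inputs,
#                             "sequence_template")
```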
# Siglip2VisionModel; the top-level module is now itself the vision
# tower. Older transformers expose both, so dropping `.vision_model`
# is forward-compatible.
self.vision_model = Siglip2VisionModel(vision_config)
Contributor

Can we just import Siglip2VisionTransformer from vllm_omni.model_executor.models.hunyuan_image3.siglip2?

Contributor

We could, but I tested both paths end-to-end and they're numerically equivalent — so the choice is mostly stylistic. A couple of practical notes:

Numerical equivalence (verified). With identical weights + identical inputs (cartoon test image, dtype=bf16):

| | transformers.Siglip2VisionModel | local siglip2.Siglip2VisionTransformer |
| --- | --- | --- |
| last_hidden_state.std() | 0.06015 | 0.06015 |
| max abs diff vs the other | — | 0.01544 |
| mean abs diff vs the other | — | 0.00003 |

Both also produce 1119/1152 channels with cross-token std < 0.01 — that's the model's natural post-LayerNorm behavior, not a wrapper bug.

Trade-offs of the local import

  • ➕ Self-contained, no dependency on transformers internals; matches what the HF snapshot bundles, byte-for-byte
  • ➖ The local file uses _prepare_4d_attention_mask from transformers.modeling_attn_mask_utils, which is deprecated and slated for removal in transformers 5.10 (DEPRECATION warning fires today). The transformers impl already migrated to create_bidirectional_mask
  • ➖ Different forward signature (attention_mask= vs pixel_attention_mask=, requires explicit return_dict=True, takes a dict instead of Siglip2VisionConfig) — call sites need adapting (vit_kwargs is currently keyed pixel_attention_mask per transformers 5.x)
  • ➖ We'd carry two SigLIP2 implementations in-tree (already imported by model_executor.models.hunyuan_image3.hunyuan_image3 for the AR side)

Recommendation: Keep transformers.Siglip2VisionModel here. It's numerically equivalent, maintained upstream, and avoids the deprecated mask-utils path. If we want to drop the transformers dep entirely we should do it consistently across both AR and DiT paths in a separate refactor PR.

(Also note: I was investigating a separate painterly-style drift suspecting this site, and confirmed via the same test that the SigLIP2 wrapper isn't the cause — the cond features come out identical either way.)
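
For reference, a sketch of the kind of comparison behind the table above; the per-model input dicts reflect the different forward signatures noted in the trade-offs, and the function name and printing are illustrative, not the actual comparison script:

```python
import torch


@torch.no_grad()
def compare_towers(model_a, model_b, inputs_a: dict, inputs_b: dict) -> None:
    """Print std and absolute-difference stats for two vision towers."""
    # Separate input dicts because the two implementations take different
    # keywords (attention_mask= vs pixel_attention_mask=, return_dict=True).
    a = model_a(**inputs_a).last_hidden_state.float()
    b = model_b(**inputs_b).last_hidden_state.float()
    diff = (a - b).abs()
    print(f"std(a)={a.std():.5f} std(b)={b.std():.5f}")
    print(f"max|a-b|={diff.max():.5f} mean|a-b|={diff.mean():.5f}")
```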

Adds image-to-image editing capability for tencent/HunyuanImage-3.0-Instruct,
using the same two-stage AR -> DiT pipeline as the existing T2I path with
the AR stage receiving an additional condition image alongside the user
prompt.

Highlights:

* Pipeline & runtime
  - vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py:
    cond image VAE-encode, ViT-encode, and scatter the resulting features
    into the DiT prefill via instantiate_vae_image_tokens /
    instantiate_vit_image_tokens (matches HF reference modeling layout).
  - vllm_omni/model_executor/stage_input_processors/hunyuan_image3.py:
    ar2diffusion bridge forwards condition image + system_prompt + user
    prompt from AR stage to DiT stage.
  - vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml:
    8-GPU IT2I stage config (4 AR + 4 DiT).
  - examples/offline_inference/hunyuan_image3/end2end.py + README.md:
    img2img modality entry; prompt_dict uses vllm-standard `prompt` key
    so the offline path receives the raw user prompt at the DiT stage
    (DiT pipeline reads `p.get("prompt")` only).

* DiT MoE accuracy fixes (stale 0.18-era code surfaced as bugs after
  the 0.20 rebase). Both addressed by aligning with the upstream PR
  vllm-project#3373 by @dengyunyang, who independently surfaced the
  same accuracy gap.

  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_fused_moe.py:
    HunyuanFusedMoEDefault used to register a forward pre-hook that
    called `self.quant_method.process_weights_after_loading(self)` on
    first forward, to compensate for the 0.18-era standard model loader
    not invoking it on FusedMoE layers. vLLM 0.20's standard loader
    (`model_executor/model_loader/base_loader.py`) now invokes
    `process_weights_after_loading` model-wide on init, so the hook
    fires a second time on first forward, double-applying non-idempotent
    in-place transforms (`UnquantizedFusedMoEMethod._maybe_pad_weight`
    re-pads w13/w2 in place; `_setup_kernel` re-registers the moe_kernel
    oracle on already-padded weights). Corrupted w13/w2 layout + wrong
    kernel oracle config produces a small per-token, per-layer expert-
    dispatch bias that accumulates across the 32 DiT MoE layers into a
    "painterly / oil texture" attractor on the generated image. The
    unquantized FusedMoE method has no
    `_already_called_process_weights_after_loading` guard (only the FP8
    quant method does), so non-quantized HunyuanImage3 reliably trips
    this. Hook deliberately not registered.

  - vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py
    (HunYuanSparseMoeBlock):
    Drop external `shared_experts` merge + `maybe_all_reduce_tensor_model_parallel`
    in forward, and drop `reduce_results=False` on the FusedMoE init.
    Since vLLM 0.20, when `shared_experts` is passed to FusedMoE, the
    `shared_mlp` output is merged inside FusedMoE.forward and the TP
    all-reduce is done internally; the wrapper code that did both of
    these externally was a 0.18-era workaround that became a double
    op after 0.20. Net effect of double-reduce + double shared_mlp add
    was a small numerical bias on top of the painterly drift; removing
    the wrapper restores HF-reference parity.

  Verified on 4xL20X TP=2/2 (vllm 0.20.0 + torch 2.11.0+cu130): same
  cartoon-block input + cute orange cat prompt yields a clean flat-
  cartoon output, visually matching HF generate_image() reference.

* Tests
  - tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py:
    unit-level - AR prefill input_ids byte-equal HF chat template,
    image-tensor byte-equal AR-side processor.
  - tests/e2e/accuracy/test_hunyuan_image3_it2i.py:
    full-pipeline e2e - vllm-omni AR -> DiT vs HF generate_image() at
    PSNR >= 40 dB on the same (condition_image, prompt, seed) tuple.

Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: skf <54565339+skf-1999@users.noreply.github.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
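
To make the double-apply hazard concrete, here is a reconstruction of the 0.18-era one-shot pre-hook pattern the commit says it deliberately no longer registers; the exact original code may have differed:

```python
import torch.nn as nn


def register_one_shot_weight_processing(layer: nn.Module) -> None:
    """0.18-era workaround: run process_weights_after_loading on first forward.

    Since vLLM 0.20 the standard loader already calls
    process_weights_after_loading model-wide at init, so this hook would
    re-run non-idempotent in-place transforms (weight re-padding,
    kernel-oracle setup) a second time. The fix is simply not to register it.
    """
    def _hook(module: nn.Module, args: tuple) -> None:
        module.quant_method.process_weights_after_loading(module)
        handle.remove()  # one-shot: detach after the first forward

    handle = layer.register_forward_pre_hook(_hook)
```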
@skf-1999 skf-1999 changed the title [WIP][Feature] HunyuanImage-3.0 IT2I (image editing) support [Feature] HunyuanImage-3.0 IT2I (image editing) support May 6, 2026
pytestmark = [pytest.mark.full_model, pytest.mark.diffusion]


MODEL_NAME = "tencent/HunyuanImage-3.0-Instruct"
Contributor

Add to nightly CI.

Contributor

We will open another PR later and add accuracy metrics and functional test cases to CI together.

def __init__(self, *, prefix: str = "", **kwargs: Any) -> None:
super().__init__(prefix=prefix, **kwargs)
self._prefix = prefix
# NOTE: prior to vLLM 0.20, this class registered a forward pre-hook
Contributor

remove useless comments

Contributor Author

Changes applied

else:
self.shared_mlp = None

# Since vLLM 0.20, FusedMoE merges `shared_experts` output and runs
Contributor

remove useless comments

Contributor Author

Modifications completed

height = original_prompt.get("height", 1024)
width = original_prompt.get("width", 1024)
text_prompt = original_prompt.get("prompt", "")
text_prompt = original_prompt.get("user_prompt") or original_prompt.get("prompt", "")
Contributor

I can't see the user_prompt field initialized in end2end.py.

Contributor Author

Deleted the unnecessary changes.

TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
Address PR vllm-project#3107 review (Bounty-hunter / Gaohan123) requesting
AR-output-format and DiT-output-accuracy regression tests. Layout
mirrors PR vllm-project#2949's split (CPU unit test under tests/diffusion/...,
GPU accuracy test under tests/e2e/accuracy/...).

CPU unit test
  tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py
  - test_ar_prefill_tokens_match_hf_apply_chat_template_for_it2i:
    asserts build_prompt_tokens (the AR-side prefill builder) is
    token-id-identical to HF tokenizer.apply_chat_template for the
    same (system, user_prompt, image) triple. Catches drift between
    the AR's input distribution and the model's training distribution
    -- the same failure mode PR vllm-project#3243 fixed for T2I.
  - test_dit_condition_image_preprocessing_byte_matches_ar_processor:
    asserts the diffusion-side _resize_and_crop_center produces
    byte-identical pixels to the AR-side
    HunyuanImage3Processor._resize_and_crop on the canonical resize
    targets. Direct response to Bounty-hunter's PR vllm-project#3107 review.

Both tests gate on tencent/HunyuanImage-3.0-Instruct being in the local
HF cache (no GPU/model weights required at runtime, just the tokenizer
config + image processor).

GPU accuracy test
  tests/e2e/accuracy/test_hunyuan_image3_it2i.py
  - test_hunyuan_image3_it2i_matches_hf_reference_psnr_40:
    drives vllm-omni's offline IT2I path through Omni and runs the
    official HF reference via AutoModelForCausalLM.generate_image,
    compared via the shared assert_similarity helper at PSNR>=40 dB
    and SSIM>=0.92. Marked full_model + skipif<8 GPUs; the threshold
    follows PR vllm-project#2949's review discussion (40 dB gives slack for TP=2
    NCCL drift while still catching prompt/image-preprocessing bugs).

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
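
For orientation, a minimal PSNR gate in the spirit of the shared assert_similarity helper; the real helper's signature and its SSIM branch are not shown, and this reduction is an assumption:

```python
import numpy as np


def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)


def assert_similarity(img_a, img_b, min_psnr: float = 40.0) -> None:
    value = psnr(np.asarray(img_a), np.asarray(img_b))
    assert value >= min_psnr, f"PSNR {value:.2f} dB below the {min_psnr} dB gate"
```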
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
…output alignment

The previous CPU-side test in test_hunyuan_image3_it2i_ar_format.py
called the official tokenizer's apply_chat_template to render the AR
prefill prompt and compared its token id sequence to vllm-omni's
build_prompt_tokens output. Two problems:

  - it tested the *input* prompt only, not the AR's *generated output*
    (which is what 'AR output matches official' actually demands);
  - HunyuanImage3TokenizerFast.from_pretrained(snap) returns a
    byte-fallback (char-level) tokenizer in a vacuum, which is not the
    encoding the vllm-omni production path uses (AutoTokenizer with the
    BPE merges from tokenizer.json) -- so the comparison was apples vs
    char-bytes and could never pass on PyPI 0.20.x.

Replaced with a real GPU-required e2e test in
tests/e2e/accuracy/test_hunyuan_image3_it2i_ar_output.py that:

  - drives the HF reference via AutoModelForCausalLM.from_pretrained +
    model.prepare_model_inputs + model.generate(do_sample=False) (the
    pattern already in scripts/bench/bench_ar_hf.py);
  - drives vllm-omni AR via the i2t stage YAML with the prompt fed as
    prompt_token_ids = HF prefill (the alignment path documented in
    workflow-starter/memory/hf/hf_omni_alignment_method.md);
  - asserts prefill input_ids byte-equality and the first 8 of 16
    greedy AR-generated tokens match between HF and omni.

Skips cleanly when the snapshot is missing the two manual modeling
patches (RoPE broadcast / 2D attention_mask SDPA fall-through) that
the project's HF baseline runbook requires.

The CPU-only DiT/HF condition-image preprocessing byte-equality check
in test_hunyuan_image3_it2i_ar_format.py is preserved -- it passes
locally and guards Bounty-hunter's PR vllm-project#3107 review item directly.

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
Signed-off-by: zuiho <2324465096@qq.com>
@TaffyOfficial (Contributor)

@Bounty-hunter @Gaohan123 @hsliuustc0106 need review

@Bounty-hunter (Contributor)

LGTM now

token_ids = build_prompt_tokens(p, tokenizer, task=task, sys_type=args.sys_type)

prompt_dict: dict = {"prompt_token_ids": token_ids, "prompt": p}
preset_sys_type, _, _ = _TASK_PRESETS[task]
Collaborator

I suggest we list tasks explicitly in end2end.py rather than enumerating the private _TASK_PRESETS mapping.

Contributor Author

Suggestion adopted.

@Gaohan123 Gaohan123 added the ready label May 6, 2026
@hsliuustc0106 (Collaborator) left a comment


lgtm

@hsliuustc0106 hsliuustc0106 merged commit 576afb6 into vllm-project:main May 6, 2026
8 checks passed
gcanlin pushed a commit that referenced this pull request May 7, 2026
…rebase (#3395)

Signed-off-by: dengyunyang <584797741@qq.com>
TaffyOfficial pushed a commit to TaffyOfficial/vllm-omni that referenced this pull request May 9, 2026
Default IT2I (`hunyuan_image3_it2i.yaml`) and the AR+DiT T2I config
(`hunyuan_image3_t2i_2gpu.yaml`) left `skip_special_tokens` at the vLLM
default (True), so the AR engine stripped the trailing
`<img_size_BASE><img_ratio_Y>` markers from `output.text` before the
`ar2diffusion` bridge could read them. With the previous
ar2diffusion fix, that meant the bridge fell back to the prompt-carried
height/width — i.e. `pil_images[0].size` from the OpenAI edits path,
which collapsed to a square whenever the first reference image was
square.

The KV-reuse config (`hunyuan_image3_it2i_kv_reuse.yaml`) already sets
this flag (added in vllm-project#3346 because the KV reuse machinery needs the
exact AR token stream), but the original IT2I yaml from vllm-project#3107 did not
need it at the time and was never updated when ar2diffusion grew the
ratio-token consumer.

Aligns both configs with `_kv_reuse.yaml`. AR token-id fallback in
ar2diffusion still works for users who keep the default, but having
the text path live by default is cheaper (no tokenizer load) and
avoids the model-name/path ambiguity the token-id fallback hits when
the model is loaded from a local directory rather than the HF hub
identifier.

Signed-off-by: TaffyOfficial <wu15922848573@outlook.com>
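
A sketch of the text-path/token-path split this commit message describes, reusing the field names from the diff quoted earlier in the thread (`output.text`, `output.cumulative_token_ids`); the decode call is an assumption about the fallback, not the exact bridge code:

```python
from transformers import AutoTokenizer


def ar_text_with_fallback(output, model_path: str) -> str:
    # Preferred path: detokenize=true with skip_special_tokens=false keeps
    # the trailing <img_size_...><img_ratio_...> markers in output.text.
    text = getattr(output, "text", "") or ""
    if text:
        return text
    # Fallback: re-decode the token ids without stripping special tokens.
    # This costs a tokenizer load and is ambiguous when model_path is a
    # local directory rather than a hub identifier, as noted above.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    return tokenizer.decode(output.cumulative_token_ids, skip_special_tokens=False)
```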
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…#3107)

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: zuiho <2324465096@qq.com>
Signed-off-by: skf1999 <13234016272@163.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: dengyunyang <584797741@qq.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Labels

high priority, ready
