
[Example] Add Hunyuan-Image3 end2end.py and README.md#2590

Merged
hsliuustc0106 merged 3 commits into vllm-project:main from kechengliu97:ar-reuse
Apr 20, 2026
Conversation

@kechengliu97
Contributor

@kechengliu97 kechengliu97 commented Apr 8, 2026

This pull request introduces major improvements to the HunyuanImage-3.0-Instruct example for offline inference. It adds a new unified end-to-end inference script (end2end.py) supporting all modalities, significantly expands the documentation for setup and usage, and provides new and updated stage configuration files for advanced features like AR→DiT KV cache reuse and MoE. The changes make the example easier to use, more flexible, and better documented for different GPU setups and use cases.

Key changes:

1. New Unified Inference Script

  • Added end2end.py to examples/offline_inference/hunyuan_image3/, providing a single entry point for all modalities (text2img, img2img, img2text, text2text) with flexible command-line arguments and prompt formatting. This script handles prompt construction, image loading, output saving, and integrates with the vLLM-Omni pipeline.

2. Documentation Overhaul and Usage Instructions

  • Completely rewrote README.md in the hunyuan_image3 example folder to document setup, modality control, command-line arguments, inference steps, configuration options, prompt format, and troubleshooting tips. The documentation now clearly explains how to use the new script and stage configs for various GPU setups and tasks.

3. New and Updated Stage Configuration Files

  • Added hunyuan_image3_moe.yaml, a new stage config enabling AR→DiT KV cache reuse for efficient multi-GPU inference, with detailed inline documentation and settings for 8x L40S GPUs.
  • Updated hunyuan_image3_t2i.yaml to remove unnecessary omni_kv_config from the DiT stage, improving clarity and correctness of the configuration.

These changes make the HunyuanImage-3.0-Instruct example much more user-friendly, flexible, and ready for advanced multi-modal inference scenarios.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@kechengliu97 kechengliu97 force-pushed the ar-reuse branch 5 times, most recently from 73d4bd7 to 6e7ad32 Compare April 9, 2026 03:11
@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 15, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment

Review: [WIP][Perf] Enable AR KV prefix reuse in DiT image transformer (Hunyuan-Image-3)

Thanks for the work on AR-to-DiT KV cache reuse -- this is a meaningful optimization for the two-stage pipeline. Here are findings from reviewing the diff. Since this is marked WIP, I am leaving comments rather than requesting changes.


1. Missing cleanup of ar_kv_prefix after denoising loop (memory leak / stale state)

hunyuan_image_3_transformer.py -- __call__ method of the pipeline

The AR KV prefix tensors are injected onto each layer's image_attn via layer.self_attn.image_attn.ar_kv_prefix = (ar_key, ar_value) before the sampling loop, but they are never cleared after the loop finishes. This means:

  • The tensors remain pinned in GPU memory until the next request (or forever if no subsequent request uses KV reuse).
  • If a subsequent request does NOT use KV reuse, the stale prefix from the previous request would still be present on the attention modules (since getattr(self, "ar_kv_prefix", None) would find them).

Suggested fix: Add a finally block or post-loop cleanup that sets ar_kv_prefix = None on all layers after the denoising loop completes, regardless of success or failure:

# After the denoising loop
if ar_kv_reuse_len > 0:
    for layer in self.model.model.layers:
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "image_attn"):
            layer.self_attn.image_attn.ar_kv_prefix = None

2. Redundant pop-and-reassign pattern in _generate

In pipeline_hunyuan_image_3.py, _generate method (around lines 813-819):

ar_kv_cache = kwargs.pop("ar_kv_cache", None)
ar_kv_metadata = kwargs.pop("ar_kv_metadata", None)
if ar_kv_cache is not None:
    kwargs["ar_kv_cache"] = ar_kv_cache
    kwargs["ar_kv_metadata"] = ar_kv_metadata

This pops the keys from kwargs and immediately puts them back. The net effect is a no-op when ar_kv_cache is not None, and it silently drops ar_kv_metadata when ar_kv_cache is None. If the intent is to ensure these keys are only passed when the cache exists, a simpler approach would be:

if "ar_kv_cache" not in kwargs or kwargs.get("ar_kv_cache") is None:
    kwargs.pop("ar_kv_cache", None)
    kwargs.pop("ar_kv_metadata", None)

Or just leave them in kwargs and let downstream handle None.
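The no-op/drop behavior is easy to check in isolation. This standalone sketch replays the pattern outside the pipeline (the function name is illustrative, not from the diff):

```python
def pop_and_reassign(kwargs: dict) -> dict:
    # Replays the pattern from _generate: pop both keys,
    # then put them back only when the cache is present.
    ar_kv_cache = kwargs.pop("ar_kv_cache", None)
    ar_kv_metadata = kwargs.pop("ar_kv_metadata", None)
    if ar_kv_cache is not None:
        kwargs["ar_kv_cache"] = ar_kv_cache
        kwargs["ar_kv_metadata"] = ar_kv_metadata
    return kwargs

# Cache present: net effect is a no-op.
assert pop_and_reassign(
    {"ar_kv_cache": "kv", "ar_kv_metadata": "meta"}
) == {"ar_kv_cache": "kv", "ar_kv_metadata": "meta"}

# Cache absent: the metadata key is silently dropped.
assert pop_and_reassign({"ar_kv_metadata": "meta"}) == {}
```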

3. Shallow list multiplication for batch_cond_image_info

In pipeline_hunyuan_image_3.py forward method:

batch_cond_image_info = [cond_image_infos] * len(req.prompts)

This creates a list where every element is the same list object. If any downstream code mutates one element (e.g., appends or pops from it), all elements are affected. Consider using a list comprehension:

batch_cond_image_info = [list(cond_image_infos) for _ in range(len(req.prompts))]
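The aliasing hazard behind this suggestion can be demonstrated in two lines, independent of the pipeline:

```python
inner = ["img_info"]
shared = [inner] * 3                 # three references to the SAME list object
shared[0].append("extra")
assert shared[1] == ["img_info", "extra"]        # mutation visible in every slot

independent = [list(inner) for _ in range(3)]    # three distinct copies
independent[0].append("only_here")
assert independent[1] == ["img_info", "extra"]   # other slots unaffected
```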

4. Dynamic attribute injection (setattr pattern) on nn.Module

Setting ar_kv_prefix as a dynamic attribute on attention modules via attribute assignment is fragile:

  • It bypasses nn.Module's parameter/buffer tracking, so these tensors are invisible to .state_dict(), .to(), torch.save(), etc.
  • It creates an implicit contract between the pipeline's __call__ and the attention __call__ that is easy to break silently.

Consider instead passing the AR KV prefix through model_kwargs and threading it through the forward calls, or at minimum, initializing self.ar_kv_prefix = None in __init__ of the attention class so the attribute is always present and discoverable.

5. SP (Sequence Parallel) path silently disabled

elif ar_kv_prefix is not None:
    # SP case: AR KV reuse not yet supported, clear to avoid stale state
    self.ar_kv_prefix = None

This silently discards the prefix in SP mode. For a WIP this is acceptable, but it would be good to emit a one-time warning (e.g., via logger.warning_once) so users know KV reuse is not active when SP is enabled. Otherwise it is a silent performance regression from user expectation.
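If the example needs to stand alone without vLLM's `logger.warning_once`, a one-time warning can be sketched with the standard library (the helper name `warn_once` is hypothetical):

```python
import functools
import logging

logger = logging.getLogger(__name__)

@functools.lru_cache(maxsize=None)
def warn_once(msg: str) -> None:
    # lru_cache memoizes on the message string, so each distinct
    # message is emitted at most once per process.
    logger.warning(msg)

# Sketch of the SP branch:
# warn_once("AR KV prefix reuse is not supported with sequence parallel; "
#           "discarding prefix (no reuse speedup for this request).")
```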

6. Error handling gap in _get_cla_factor fallback

In the CLA fallback logic:

if (ar_key is None or ar_value is None) and cla_factor > 1:
    base_idx = (layer_idx // cla_factor) * cla_factor
    if base_idx < num_kv_layers:
        ar_key = ar_kv_cache.key_cache[base_idx]
        ar_value = ar_kv_cache.value_cache[base_idx]

If base_idx >= num_kv_layers, ar_key and ar_value remain None, and the code falls through to the if ar_key is not None check which then sets ar_kv_reuse_len = 0 and cleans up. This is correct behavior but the path is subtle. A comment noting this intentional fallthrough would help readability.

7. Minor: text_and_image_to_image.py example script

  • The import line `from vllm_omni.diffusion.data import DiffusionParallelConfig, logger` is unusual: pulling `logger` out of `vllm_omni.diffusion.data` couples the example script to an internal module. Consider using `logging.getLogger(__name__)` for the example, or just `print()`, which is already used elsewhere in the script.
  • action="store_true", default=True for --infer-align-image-size means the flag is always True and cannot be turned off from the CLI. Should probably be default=False if you want it to be opt-in, or remove the flag and hardcode it.

8. YAML config: hardcoded port number

In hunyuan_image_3_moe_mooncake.yaml:

zmq_port: 50051

Hardcoded ports can conflict in multi-tenant environments. Consider documenting that users should change this, or use port 0 / auto-detection if the connector supports it.
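Whether the connector accepts a runtime-chosen port is an assumption; if it does, OS-assigned port selection is a one-liner (helper name `pick_free_port` is hypothetical):

```python
import socket

def pick_free_port() -> int:
    # Binding to port 0 asks the OS for a free ephemeral port;
    # the chosen port is then read back from the socket.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```

Note the usual caveat: the port is only reserved while the probe socket is open, so there is a small race window between picking it and the connector binding it.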


Summary

The core approach -- injecting AR KV as a prefix into DiT attention at each denoising step -- is sound and well-structured. The CLA fallback logic and CFG-parallel awareness show good attention to the model's architecture. The main concern is the missing post-loop cleanup (issue 1), which is a real bug that would cause stale state and memory leaks. The other items are code quality improvements that should be addressed before removing the WIP label.

@kechengliu97 kechengliu97 force-pushed the ar-reuse branch 4 times, most recently from 45473bb to f2519c2 Compare April 17, 2026 03:36
@kechengliu97 kechengliu97 changed the title [WIP][Perf] Enable AR KV prefix reuse in DiT image transformer (Hunyuan-Image-3) [Feautre] Enable AR KV prefix reuse in DiT image transformer (Hunyuan-Image-3) Apr 17, 2026
@Gaohan123
Collaborator

Please fix precommit. Thanks

@kechengliu97 kechengliu97 force-pushed the ar-reuse branch 2 times, most recently from f497b95 to d091496 Compare April 17, 2026 04:35
@Gaohan123 Gaohan123 added the diffusion-x2iat-test label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI label Apr 17, 2026
@kechengliu97 kechengliu97 force-pushed the ar-reuse branch 2 times, most recently from b6853df to 4f805d5 Compare April 17, 2026 06:24
@hsliuustc0106
Collaborator

multi_modal_data = p.get("multi_modal_data", {})
images = multi_modal_data.get("image") or multi_modal_data.get("images")
if images is not None:
    from PIL import Image as PILImage
Collaborator

Move this import to the top of the file.

Contributor Author

Fixed

if ar_kv_reuse_len > 0:
    for layer in self.model.model.layers:
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "image_attn"):
            layer.self_attn.image_attn.ar_kv_prefix = None
Contributor

Blocker: cleanup is not exception-safe
The cleanup at line 3034 only runs on the happy path. If anything inside the denoising loop (scheduler step, per-layer attention, OOM, shape mismatch, connector error) raises, control never reaches this block and ar_kv_prefix stays pinned on every image_attn module. The next request then either (a) doesn't use KV reuse and the getattr(self, "ar_kv_prefix", None) check at line 1055 still returns the stale tensors, silently injecting them, or (b) uses KV reuse but a layer fails mid-injection (lines 2911–2920 break) and only some layers get cleaned up.
Please wrap the denoising loop + the self._ar_kv_reuse_len = ar_kv_reuse_len assignment in try, and move the cleanup into finally:

try:
    # denoising loop ...
    self._ar_kv_reuse_len = ar_kv_reuse_len
finally:
    for layer in self.model.model.layers:
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "image_attn"):
            layer.self_attn.image_attn.ar_kv_prefix = None

Cleanup should be unconditional, not gated on ar_kv_reuse_len > 0 — the break path at line 2920 may have partially injected before bailing.

ref_img.width, ref_img.height
)
height, width = aligned_h, aligned_w
break
Contributor

Blocker: break causes silent batch corruption for multi-prompt requests
The loop at lines 1024–1043 walks req.prompts, and on the first prompt containing an image it constructs batch_cond_image_info by replicating that prompt's images across the entire batch (line 1034), then breaks. This means:

If prompts 0 and 1 both have images but different ones, prompt 1's images are silently replaced by prompt 0's.
If prompt 0 has no image but prompt 1 does, prompt 1's images get broadcast to prompt 0, which did not request conditioning.
A batch with mixed conditioned/unconditioned prompts becomes fully conditioned.

This currently appears to work because test coverage is single-prompt only.
Suggested fix — build per-slot and don't break:

batch_cond_image_info = None
for i, p in enumerate(req.prompts):
    if not isinstance(p, dict):
        continue
    multi_modal_data = p.get("multi_modal_data", {})
    images = multi_modal_data.get("image") or multi_modal_data.get("images")
    if images is None:
        continue

    from PIL import Image as PILImage
    if isinstance(images, PILImage.Image):
        images = [images]

    if batch_cond_image_info is None:
        batch_cond_image_info = [[] for _ in range(len(req.prompts))]
    batch_cond_image_info[i] = [
        self.image_processor.build_joint_image_info(img) for img in images
    ]

Separately, the infer_align_image_size path at lines 1036–1042 derives a single height, width from the first-seen image. In a true batched call with different reference sizes, what's the intended semantics? Please either document this as "first prompt wins" or reject mixed-size batches.

model_kwargs=kwargs,
)
# [AR→DiT KV Reuse] Propagate reuse length for output metadata
self._ar_kv_reuse_len = getattr(self.pipeline, "_ar_kv_reuse_len", 0)
Contributor

Blocker: _ar_kv_reuse_len leaks across requests
Line 825: self._ar_kv_reuse_len = getattr(self.pipeline, "_ar_kv_reuse_len", 0) reads from the self.pipeline instance, which is long-lived. Line 3030 of hunyuan_image3_transformer.py writes self._ar_kv_reuse_len = ar_kv_reuse_len only when the pipeline's call is entered for this request — but nothing resets it on entry.
Consequence: Request A uses KV reuse and sets ar_kv_reuse_len = 500. Request B is baseline (no ar_kv_cache). Request B's pipeline call never touches _ar_kv_reuse_len, so it's still 500 from A. Line 1075 of pipeline_hunyuan_image3.py (forward()) then emits custom_output = {"ar_kv_reuse_len": 500} for B.
This also makes the e2e test at test_hunyuanimage3_rdma_kv_reuse.py (checking ar_kv_reuse_len > 0) unreliable on CI reruns in the same process.
Fix: reset at entry of the transformer call, before the AR injection block at line 2858:

self._ar_kv_reuse_len = 0
# ---- [AR→DiT KV Reuse] Prepare AR KV injection ----

Or — cleaner — remove the instance attribute and thread the value back through the existing return path.
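A toy sketch of the stale-state failure mode and the entry-reset fix (class names are illustrative, not the actual pipeline code):

```python
class LeakyPipeline:
    """Bug shape: the attribute is only written on the reuse path."""
    _ar_kv_reuse_len = 0

    def __call__(self, use_reuse: bool) -> int:
        if use_reuse:
            self._ar_kv_reuse_len = 500
        return getattr(self, "_ar_kv_reuse_len", 0)

p = LeakyPipeline()
assert p(True) == 500
assert p(False) == 500   # stale value from request A leaks into request B

class FixedPipeline(LeakyPipeline):
    def __call__(self, use_reuse: bool) -> int:
        self._ar_kv_reuse_len = 0   # reset unconditionally at entry
        return super().__call__(use_reuse)

q = FixedPipeline()
assert q(True) == 500
assert q(False) == 0     # baseline request now correctly reports 0
```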

parts.append("\n\nUser: ")
if has_image_input:
    parts.append("<img>")
parts.append(user_prompt)
Contributor

Blocker (needs justification): semantic change to build_prompt
This diff removes \n\nUser: / \n\nAssistant: — the entire Instruct conversation template — and reorders so that trigger_tag now comes before user_prompt instead of after \n\nAssistant:

# Before:
parts.append("\n\nUser: ")
if has_image_input: parts.append("<img>")
parts.append(user_prompt)
parts.append("\n\nAssistant: ")
if trigger_tag: parts.append(trigger_tag)

# After:
if has_image_input: parts.append("<img>")
if trigger_tag: parts.append(trigger_tag)
parts.append(user_prompt)

Two concerns:

The i2t docstring now says pretrain template (<|startoftext|>{system}{question}), but prompt_utils.py is shared across all tasks (t2i_think, it2i_think, etc.). Do all those tasks expect pretrain format as well? If the model was trained with Instruct format for some of them, this will silently degrade quality with no obvious error.
trigger_tag (e.g. , ) used to come after the Assistant: turn marker — semantically "I, the assistant, begin with ." After the change it's {user_prompt}, which reads as "the user prompt starts with ." Is that the intended format per the base model's training?

Please either add a comment/commit message explaining why pretrain template is correct for all tasks this touches, or keep the Instruct branch available.

"modalities": ["image"],
"height": 512,
"width": 512,
}
Contributor

Should-fix: modalities field is inconsistent with the request payload
The prompt_dict at lines 162–167:

prompt_dict = {
    "prompt": prompt,
    "modalities": ["image"],
    "height": 512,
    "width": 512,
}

has modalities=["image"] but no multi_modal_data — i.e. no image input. For a T2I request, either the field is a weak hint (in which case it's misleading to future readers) or it's a real routing input (in which case this may be passing by accident). Please either remove the field or populate it correctly.
Same issue in test_hunyuanimage3_i2t.py line 50: modalities=["text"] paired with multi_modal_data={"image": input_image}. Mirror image — the test has an image input but claims text modality.
These tests are meant to guard against regression; self-inconsistent payloads undermine that.

@@ -0,0 +1,251 @@
# SPDX-License-Identifier: Apache-2.0
Contributor

Should-fix: overlapping scope with examples/offline_inference/hunyuan_image3/image_to_image.py
This PR introduces two TI2I examples in different directories:

examples/offline_inference/hunyuan_image3/image_to_image.py (115 lines, minimal)
examples/offline_inference/text_to_image/text_and_image_to_image.py (251 lines, full-featured)

run.sh ti2i invokes the second one, and the README's run.sh ti2i row is the only TI2I entry point. Two questions for future maintainers:

Why are there two TI2I entrypoints?
Which is canonical?

If the shorter one is a minimal demo and the longer one is the full example, please say so in each file's module docstring. Otherwise consider deleting the shorter one and pointing users to the single canonical script, to avoid divergence over time.

name: MooncakeTransferEngineConnector
extra:
  host: "auto"  # Auto-detect local RDMA IP
  zmq_port: 50051  # ZMQ base port; change in multi-tenant envs to avoid conflicts
Contributor

Should-fix: hardcoded zmq_port: 50051
The CI config (tests/e2e/offline_inference/stage_configs/hunyuan_image3_mooncake_rdma_ci.yaml line 107) correctly uses a ${ZMQ_PORT} placeholder that gets substituted to a free port at runtime. The production YAML here hardcodes 50051, which will collide in any multi-tenant or concurrent-job environment.
The inline comment does note "change in multi-tenant envs to avoid conflicts", but this is easy to miss. Suggest either:

Add a more prominent comment at the top of the file ("IMPORTANT: change this port before deploying alongside other instances"), or
If Mooncake supports it, use port 0 / auto-allocation here as well.

# CLA fallback: use base layer's KV if this layer's is None
if (ar_key is None or ar_value is None) and cla_factor > 1:
    base_idx = (layer_idx // cla_factor) * cla_factor
    if base_idx < num_kv_layers:
Contributor

Nit: dead conditional

if layer_idx < num_kv_layers:                        # line 2893
    ...
    if (ar_key is None or ar_value is None) and cla_factor > 1:
        base_idx = (layer_idx // cla_factor) * cla_factor
        if base_idx < num_kv_layers:                 # line 2899 — always True

Because layer_idx < num_kv_layers is already asserted at line 2893, and base_idx = (layer_idx // cla_factor) * cla_factor ≤ layer_idx, base_idx < num_kv_layers is a tautology. The comment at lines 2902–2903 says "If base_idx >= num_kv_layers, ar_key/ar_value stay None…" but that branch is unreachable.
Either drop the inner check, or (if you want defensive coding) convert the comment into an assert.
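The tautology is mechanical: integer floor-division followed by multiplication never increases the value, so `base_idx <= layer_idx` always holds. A quick exhaustive check (variable names mirror the snippet above):

```python
def base_index(layer_idx: int, cla_factor: int) -> int:
    # Base layer whose KV is shared under cross-layer attention (CLA).
    return (layer_idx // cla_factor) * cla_factor

num_kv_layers = 8
for cla_factor in (1, 2, 4):
    for layer_idx in range(num_kv_layers):   # outer guard: layer_idx < num_kv_layers
        base_idx = base_index(layer_idx, cla_factor)
        # Floor-multiply can only round down, so the inner
        # base_idx < num_kv_layers check is always True here.
        assert base_idx <= layer_idx < num_kv_layers
```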

Contributor Author

Fixed.

# Negative branch or sequential CFG: use common prefix
ar_kv_reuse_len = min(common_prefix_len, ar_kv_seq_len)

if ar_kv_reuse_len > 0:
Contributor

Is it possible to extract some functions to avoid excessively high function complexity?

Contributor Author

Done.

kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 17, 2026
Blockers fixed:
- Add try/finally wrapper around denoising loop for exception-safe cleanup
- Fix batch corruption: build per-slot cond_image_info instead of broadcasting
- Reset _ar_kv_reuse_len=0 at entry to prevent stale state leaking across requests
- Add docstring explaining pretrain template format choice in prompt_utils.py
- Convert CLA fallback dead conditional to defensive assert

Should-fix items:
- Add prominent zmq_port warning comment in YAML config for multi-tenant environments
- Document TI2I scripts to clarify canonical vs simplified versions
@kechengliu97 kechengliu97 force-pushed the ar-reuse branch 4 times, most recently from 41d5980 to 2b223f9 Compare April 20, 2026 06:07
seq_len,
cfg_parallel_ready,
cfg_rank,
)
Contributor

P1: cfg_rank is used before initialization in the non-CFG-parallel path

In `HunyuanImage3Text2ImagePipeline.__call__`, cfg_rank is only assigned inside the if cfg_parallel_ready: branch at lines 2950-2952, but it
is later passed unconditionally into compute_reuse_len(...) at line 2990. In the else: branch (lines 2963-2964), only cfg_factor is
initialized.

That means whenever ar_kv_cache is not None and cfg_parallel_ready == False (which is the normal sequential CFG / non-CFG-parallel path),
this code will raise UnboundLocalError before sampling starts.
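A minimal repro of the control-flow shape (the function and its body are illustrative, not the actual pipeline code):

```python
def prepare_reuse(cfg_parallel_ready: bool) -> int:
    # Mirrors the reported bug: cfg_rank is bound on only one branch.
    if cfg_parallel_ready:
        cfg_rank = 0
        cfg_factor = 2
    else:
        cfg_factor = 2          # cfg_rank is never assigned here
    return cfg_rank             # UnboundLocalError on the sequential-CFG path

assert prepare_reuse(True) == 0  # CFG-parallel path works

raised = False
try:
    prepare_reuse(False)         # normal sequential-CFG path
except UnboundLocalError:
    raised = True
assert raised
```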

@kechengliu97 kechengliu97 changed the title [Feautre] Enable AR KV prefix reuse in DiT image transformer (Hunyuan-Image-3) [Example] Add Hunyuan-Image3 end2end.py and README.md Apr 20, 2026
@kechengliu97 kechengliu97 force-pushed the ar-reuse branch 2 times, most recently from f148592 to 7638461 Compare April 20, 2026 06:58
Introduce a unified end-to-end inference entrypoint (examples/offline_inference/hunyuan_image3/end2end.py) that supports text2img, img2img, img2text and text2text modalities, consolidating previous example utilities. Rewrite README to document modality control, CLI args, stage configs, usage examples and MoE notes. Add a new MoE stage config (vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml) to enable AR→DiT KV reuse across GPUs. Remove older per-task example scripts (image_to_text.py, prompt_utils.py) in favor of the unified workflow.

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Apr 20, 2026
Remove several engine-related and batching settings from vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml. The removed keys are: gpu_memory_utilization, engine_output_type, enable_prefix_caching, and max_num_batched_tokens. This cleans up the stage config to rely on defaults or external runtime settings and reduces redundant configuration entries.

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
@hsliuustc0106 hsliuustc0106 merged commit 6128f6d into vllm-project:main Apr 20, 2026
8 checks passed
nainiu258 pushed a commit to nainiu258/vllm-omni that referenced this pull request Apr 21, 2026

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: nainiu258 <cperfect02@163.com>
@mephisto1484

HunyuanImage-3.0 IT2I Inference Issue: Model Not Reading Image Information

Issue Description

When using it2i_inference.py with hunyuan_it2i_4gpu.yaml for IT2I (image+text-to-image) inference with HunyuanImage-3.0, the model fails to properly read and process the input image information, resulting in unexpected generation results.

Environment Information

  • Model: HunyuanImage-3.0
  • Inference Script: it2i_inference.py
  • Configuration File: hunyuan_it2i_4gpu.yaml
  • Hardware Configuration: 4x H20 96GB GPU

Reproduction Steps

  1. Run inference using it2i_inference.py script
  2. Provide image input and text prompt
  3. Observe the generated results

Abnormal Behavior

When using it2i_inference.py + hunyuan_it2i_4gpu.yaml for HunyuanImage-3.0 inference:

  • ❌ Model does not see the image information
  • ❌ Generated image is unrelated to the input image
  • ❌ Appears to generate based solely on text prompt, ignoring the image
  • ⚠️ Key observation: Log shows text length=0 during AR generation (line 302)

Problematic Scenario - Execution Log

# Paste the complete log from running it2i_inference.py here

$ bash infer.sh
======================================
HunyuanImage-3.0 IT2I Test Script
======================================

Configuration:
  Input image: cat.png
  Prompt: 给图中的人画个发型,其他的一切都不变 (translation: "give the person in the image a hairstyle; keep everything else unchanged")
  Model: /My/local/model/path
  Config: ./hunyuan_it2i_4gpu.yaml
  Output: ./output_it2i/07

Starting IT2I inference...
This may take 5-10 minutes for first run (model loading + compilation)


============================================================
Loading input image...
  Image path: cat.png
  Image size: 1024x1024
  Image mode: RGB

Prompt Configuration:
  User prompt: 给图中的人画个发型,其他的一切都不变 (translation: "give the person in the image a hairstyle; keep everything else unchanged")
  Task mode: think
  System type: en_unified

Initializing Omni with 4x H20-96G GPUs...
  Model: /My/local/model/path
  Stage config: ./hunyuan_it2i_4gpu.yaml
  Init timeout: 600s
INFO 04-22 14:47:34 [omni_base.py:146] [Omni] Initializing with model ./hunyuan/HunyuanImage-3.0-Instruct
INFO 04-22 14:47:34 [async_omni_engine.py:274] [AsyncOmniEngine] Initializing with model ./hunyuan/HunyuanImage-3.0-Instruct
WARNING 04-22 14:47:34 [utils.py:584] --stage-configs-path is deprecated; migrate './hunyuan_it2i_4gpu.yaml' and use --deploy-config.
INFO 04-22 14:47:34 [async_omni_engine.py:331] [AsyncOmniEngine] Launching Orchestrator thread with 2 stages
INFO 04-22 14:47:34 [stage_init_utils.py:370] [stage_init] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
INFO 04-22 14:47:34 [initialization.py:314] Auto-configuring SharedMemoryConnector for edge ('0', '1')
INFO 04-22 14:47:34 [initialization.py:351] Loaded OmniTransferConfig with 1 connector configurations
INFO 04-22 14:47:34 [async_omni_engine.py:735] [AsyncOmniEngine] Initializing stage 0
INFO 04-22 14:47:34 [stage_init_utils.py:385] [stage_init] Stage-0 set runtime devices: 0,1
INFO 04-22 14:47:34 [async_omni_engine.py:735] [AsyncOmniEngine] Initializing stage 1
WARNING 04-22 14:47:34 [config.py:347] Config format `mistral` is already registered, and will be overwritten by the new parser class `<class 'vllm_omni.model_executor.models.voxtral_tts.configuration_voxtral_tts.VoxtralTTSConfigParser'>`.
INFO 04-22 14:47:34 [config.py:358] Registered config parser `<class 'vllm_omni.model_executor.models.voxtral_tts.configuration_voxtral_tts.VoxtralTTSConfigParser'>` with config format `mistral`
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-22 14:47:34 [config.py:446] Replacing legacy 'type' key with 'rope_type'
INFO 04-22 14:47:46 [model.py:549] Resolved architecture: HunyuanImage3ForCausalMM
INFO 04-22 14:47:46 [model.py:1678] Using max model len 22800
INFO 04-22 14:47:46 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 04-22 14:47:46 [vllm.py:790] Asynchronous scheduling is enabled.
WARNING 04-22 14:47:46 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 04-22 14:47:46 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 04-22 14:47:46 [vllm.py:1025] Cudagraph is disabled under eager mode
WARNING 04-22 14:47:46 [cuda.py:199] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
INFO 04-22 14:47:46 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant, allreduce_rms
INFO 04-22 14:47:46 [async_omni_engine.py:447] [AsyncOmniEngine] Stage 0 engine launch started
INFO 04-22 14:47:46 [stage_init_utils.py:385] [stage_init] Stage-1 set runtime devices: 2,3
(StageEngineCoreProc pid=140220) INFO 04-22 14:47:57 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='./hunyuan/HunyuanImage-3.0-Instruct', speculative_config=None, tokenizer='./hunyuan/HunyuanImage-3.0-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=22800, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=./hunyuan/HunyuanImage-3.0-Instruct, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 
'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(StageEngineCoreProc pid=140220) WARNING 04-22 14:47:57 [multiproc_executor.py:1014] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(StageEngineCoreProc pid=140220) INFO 04-22 14:47:57 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=11.38.176.233 (local), world_size=2, local_world_size=2
INFO 04-22 14:47:57 [multiproc_executor.py:138] Starting server...
INFO 04-22 14:48:08 [diffusion_worker.py:527] Worker 0 created result MessageQueue
INFO 04-22 14:48:08 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-22 14:48:08 [vllm.py:790] Asynchronous scheduling is enabled.
INFO 04-22 14:48:08 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-22 14:48:08 [vllm.py:790] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 04-22 14:48:08 [diffusion_worker.py:131] Worker 0: Initialized device and distributed environment.
INFO 04-22 14:48:08 [diffusion_worker.py:131] Worker 1: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-22 14:48:08 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-22 14:48:08 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-22 14:48:08 [parallel_state.py:630] SP group details for rank 1: sp_group=[1], ulysses_group=[1], ring_group=[1]
INFO 04-22 14:48:08 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-22 14:48:08 [config.py:446] Replacing legacy 'type' key with 'rope_type'
INFO 04-22 14:48:08 [config.py:446] Replacing legacy 'type' key with 'rope_type'
INFO 04-22 14:48:08 [pipeline_hunyuan_image3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-22 14:48:08 [pipeline_hunyuan_image3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-22 14:48:09 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-22 14:48:09 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-22 14:48:09 [layer.py:396] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
INFO 04-22 14:48:09 [unquantized.py:165] FlashInfer MoE is available for EP but not enabled, consider setting VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.
INFO 04-22 14:48:09 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(Worker pid=141094) INFO 04-22 14:48:09 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57511 backend=nccl
(Worker pid=141095) INFO 04-22 14:48:09 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57511 backend=nccl
(Worker pid=141094) INFO 04-22 14:48:09 [pynccl.py:111] vLLM is using nccl==2.27.5
Multi-thread loading shards:   0% Completed | 0/32 [00:00<?, ?it/s]
(Worker pid=141094) INFO 04-22 14:48:11 [parallel_state.py:1716] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=141095) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=141095) INFO 04-22 14:48:11 [hunyuan_image3.py:887] Successfully processed 1 image(s). Final tensor shapes: {'vit_pixel_values': (1, 1024, 768), 'vit_pixel_attention_mask': (1, 1024), 'vit_spatial_shapes': (1, 2), 'vae_pixel_values': (1, 3, 1024, 1024), 'vae_token_grid_hw': (1, 2), 'base_size': (1,), 'ratio_index': (1,)}
(Worker pid=141094) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=141095) INFO 04-22 14:48:11 [kv_transfer_manager.py:428] Initializing OmniConnector type=SharedMemoryConnector role=sender
(Worker pid=141095) INFO 04-22 14:48:11 [factory.py:46] Created connector: SharedMemoryConnector
(Worker pid=141095) INFO 04-22 14:48:11 [kv_transfer_manager.py:333] Sender connector eagerly initialized
(Worker_TP1 pid=141095) WARNING 04-22 14:48:11 [base.py:188] [LLM Worker 1] Sleep Mode DISABLED.
(Worker_TP1 pid=141095) WARNING 04-22 14:48:11 [base.py:188] [LLM Worker 1] Sleep Mode DISABLED.
(Worker pid=141094) INFO 04-22 14:48:12 [hunyuan_image3.py:887] Successfully processed 1 image(s). Final tensor shapes: {'vit_pixel_values': (1, 1024, 768), 'vit_pixel_attention_mask': (1, 1024), 'vit_spatial_shapes': (1, 2), 'vae_pixel_values': (1, 3, 1024, 1024), 'vae_token_grid_hw': (1, 2), 'base_size': (1,), 'ratio_index': (1,)}
(Worker pid=141094) INFO 04-22 14:48:12 [kv_transfer_manager.py:428] Initializing OmniConnector type=SharedMemoryConnector role=sender
(Worker pid=141094) INFO 04-22 14:48:12 [factory.py:46] Created connector: SharedMemoryConnector
(Worker pid=141094) INFO 04-22 14:48:12 [kv_transfer_manager.py:333] Sender connector eagerly initialized
(Worker_TP0 pid=141094) WARNING 04-22 14:48:12 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(Worker_TP0 pid=141094) WARNING 04-22 14:48:12 [base.py:188] [LLM Worker 0] Sleep Mode DISABLED.
(Worker_TP0 pid=141094) INFO 04-22 14:48:12 [gpu_model_runner.py:4735] Starting to load model ./hunyuan/HunyuanImage-3.0-Instruct...
Multi-thread loading shards:   3% Completed | 1/32 [00:01<00:38,  1.23s/it]
(Worker_TP0 pid=141094) INFO 04-22 14:48:12 [cuda.py:334] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker_TP0 pid=141094) INFO 04-22 14:48:12 [unquantized.py:186] Using TRITON backend for Unquantized MoE
Multi-thread loading shards:   6% Completed | 2/32 [00:02<00:39,  1.32s/it]
(Worker_TP1 pid=141095) INFO 04-22 14:48:13 [hunyuan_image3.py:1332] Replaced 32 rotary embeddings with HunyuanImage3RotaryEmbedding (interleaved 2D RoPE, head_dim=128, rope_theta=10000.0)
(Worker_TP0 pid=141094) INFO 04-22 14:48:13 [hunyuan_image3.py:1332] Replaced 32 rotary embeddings with HunyuanImage3RotaryEmbedding (interleaved 2D RoPE, head_dim=128, rope_theta=10000.0)
Loading safetensors checkpoint shards:   0% Completed | 0/32 [00:00<?, ?it/s]
Multi-thread loading shards:   9% Completed | 3/32 [00:05<01:05,  2.24s/it]
Loading safetensors checkpoint shards:   3% Completed | 1/32 [00:03<01:40,  3.23s/it]
Multi-thread loading shards:  12% Completed | 4/32 [00:07<00:53,  1.91s/it]
Loading safetensors checkpoint shards:   6% Completed | 2/32 [00:05<01:22,  2.75s/it]
Multi-thread loading shards:  16% Completed | 5/32 [00:08<00:46,  1.73s/it]
Multi-thread loading shards:  19% Completed | 6/32 [00:10<00:41,  1.60s/it]
Loading safetensors checkpoint shards:   9% Completed | 3/32 [00:08<01:16,  2.64s/it]
Multi-thread loading shards:  22% Completed | 7/32 [00:11<00:37,  1.52s/it]
Multi-thread loading shards:  25% Completed | 8/32 [00:12<00:35,  1.47s/it]
Loading safetensors checkpoint shards:  12% Completed | 4/32 [00:10<01:10,  2.52s/it]
Multi-thread loading shards:  28% Completed | 9/32 [00:14<00:32,  1.43s/it]
Multi-thread loading shards:  31% Completed | 10/32 [00:15<00:31,  1.43s/it]
Loading safetensors checkpoint shards:  16% Completed | 5/32 [00:12<01:07,  2.48s/it]
Multi-thread loading shards:  34% Completed | 11/32 [00:16<00:29,  1.39s/it]
Multi-thread loading shards:  38% Completed | 12/32 [00:18<00:26,  1.32s/it]
Loading safetensors checkpoint shards:  19% Completed | 6/32 [00:15<01:03,  2.45s/it]
Multi-thread loading shards:  41% Completed | 13/32 [00:19<00:24,  1.27s/it]
Multi-thread loading shards:  44% Completed | 14/32 [00:20<00:22,  1.25s/it]
Loading safetensors checkpoint shards:  22% Completed | 7/32 [00:17<01:00,  2.43s/it]
Multi-thread loading shards:  47% Completed | 15/32 [00:21<00:20,  1.23s/it]
Multi-thread loading shards:  50% Completed | 16/32 [00:22<00:19,  1.21s/it]
Loading safetensors checkpoint shards:  25% Completed | 8/32 [00:19<00:56,  2.36s/it]
Multi-thread loading shards:  53% Completed | 17/32 [00:24<00:18,  1.21s/it]
Loading safetensors checkpoint shards:  28% Completed | 9/32 [00:22<00:53,  2.32s/it]
Multi-thread loading shards:  56% Completed | 18/32 [00:25<00:17,  1.27s/it]
Multi-thread loading shards:  59% Completed | 19/32 [00:26<00:16,  1.25s/it]
Loading safetensors checkpoint shards:  31% Completed | 10/32 [00:24<00:50,  2.28s/it]
Multi-thread loading shards:  62% Completed | 20/32 [00:27<00:15,  1.27s/it]
Multi-thread loading shards:  66% Completed | 21/32 [00:29<00:14,  1.30s/it]
Loading safetensors checkpoint shards:  34% Completed | 11/32 [00:26<00:47,  2.24s/it]
Multi-thread loading shards:  69% Completed | 22/32 [00:30<00:12,  1.30s/it]
Loading safetensors checkpoint shards:  38% Completed | 12/32 [00:28<00:44,  2.21s/it]
Multi-thread loading shards:  72% Completed | 23/32 [00:31<00:11,  1.30s/it]
Multi-thread loading shards:  75% Completed | 24/32 [00:33<00:10,  1.34s/it]
Loading safetensors checkpoint shards:  41% Completed | 13/32 [00:30<00:42,  2.22s/it]
Multi-thread loading shards:  78% Completed | 25/32 [00:34<00:09,  1.39s/it]
Loading safetensors checkpoint shards:  44% Completed | 14/32 [00:32<00:39,  2.19s/it]
Multi-thread loading shards:  81% Completed | 26/32 [00:36<00:08,  1.36s/it]
Multi-thread loading shards:  84% Completed | 27/32 [00:37<00:06,  1.34s/it]
Loading safetensors checkpoint shards:  47% Completed | 15/32 [00:35<00:36,  2.15s/it]
Multi-thread loading shards:  88% Completed | 28/32 [00:38<00:05,  1.28s/it]
Loading safetensors checkpoint shards:  50% Completed | 16/32 [00:37<00:34,  2.15s/it]
Multi-thread loading shards:  91% Completed | 29/32 [00:40<00:04,  1.38s/it]
Multi-thread loading shards:  94% Completed | 30/32 [00:41<00:02,  1.43s/it]
Loading safetensors checkpoint shards:  53% Completed | 17/32 [00:39<00:31,  2.12s/it]
INFO 04-22 14:48:54 [diffusion_model_runner.py:142] Model loading took 79.7585 GiB and 45.696062 seconds
INFO 04-22 14:48:54 [diffusion_model_runner.py:147] Model runner: Model loaded successfully.
INFO 04-22 14:48:54 [diffusion_model_runner.py:188] Model runner: Initialization complete.
INFO 04-22 14:48:54 [diffusion_worker.py:183] Worker 1: Process-scoped GPU memory after model loading: 0.00 GiB.
INFO 04-22 14:48:54 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:1, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-22 14:48:54 [diffusion_worker.py:95] Worker 1: Initialization complete.
INFO 04-22 14:48:54 [diffusion_worker.py:687] Worker 1: Scheduler loop started.
INFO 04-22 14:48:54 [diffusion_worker.py:597] Worker 1 ready to receive requests via shared memory
Multi-thread loading shards:  97% Completed | 31/32 [00:43<00:01,  1.62s/it]
Loading safetensors checkpoint shards:  56% Completed | 18/32 [00:41<00:28,  2.06s/it]
Multi-thread loading shards: 100% Completed | 32/32 [00:45<00:00,  1.60s/it]
Multi-thread loading shards: 100% Completed | 32/32 [00:45<00:00,  1.42s/it]

INFO 04-22 14:48:56 [diffusers_loader.py:324] Loading weights took 45.43 seconds
Loading safetensors checkpoint shards:  59% Completed | 19/32 [00:42<00:25,  1.98s/it]
INFO 04-22 14:48:57 [diffusion_model_runner.py:142] Model loading took 79.7585 GiB and 48.908604 seconds
INFO 04-22 14:48:57 [diffusion_model_runner.py:147] Model runner: Model loaded successfully.
INFO 04-22 14:48:57 [diffusion_model_runner.py:188] Model runner: Initialization complete.
INFO 04-22 14:48:57 [diffusion_worker.py:183] Worker 0: Process-scoped GPU memory after model loading: 0.00 GiB.
INFO 04-22 14:48:57 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-22 14:48:57 [diffusion_worker.py:95] Worker 0: Initialization complete.
INFO 04-22 14:48:57 [diffusion_worker.py:687] Worker 0: Scheduler loop started.
INFO 04-22 14:48:57 [diffusion_worker.py:597] Worker 0 ready to receive requests via shared memory
INFO 04-22 14:48:58 [diffusion_engine.py:446] dummy run to warm up the model
INFO 04-22 14:48:58 [kv_transfer_manager.py:1268] Rank-aware KV receive: rank 1 independently receiving (from_tp=2, to_tp=2)
INFO 04-22 14:48:58 [kv_transfer_manager.py:428] Initializing OmniConnector type=SharedMemoryConnector role=receiver
INFO 04-22 14:48:58 [factory.py:46] Created connector: SharedMemoryConnector
INFO 04-22 14:48:58 [kv_transfer_manager.py:1010] Wait for KV cache for request dummy_req_id from stage 0 to 1 via 1 key(s)...
INFO 04-22 14:48:58 [kv_transfer_manager.py:1268] Rank-aware KV receive: rank 0 independently receiving (from_tp=2, to_tp=2)
INFO 04-22 14:48:58 [kv_transfer_manager.py:428] Initializing OmniConnector type=SharedMemoryConnector role=receiver
INFO 04-22 14:48:58 [factory.py:46] Created connector: SharedMemoryConnector
INFO 04-22 14:48:58 [kv_transfer_manager.py:1010] Wait for KV cache for request dummy_req_id from stage 0 to 1 via 1 key(s)...
Loading safetensors checkpoint shards:  62% Completed | 20/32 [00:44<00:22,  1.90s/it]
Loading safetensors checkpoint shards:  66% Completed | 21/32 [00:46<00:20,  1.82s/it]
Loading safetensors checkpoint shards:  69% Completed | 22/32 [00:47<00:17,  1.77s/it]
Loading safetensors checkpoint shards:  72% Completed | 23/32 [00:49<00:15,  1.73s/it]
Loading safetensors checkpoint shards:  75% Completed | 24/32 [00:51<00:13,  1.72s/it]
Loading safetensors checkpoint shards:  78% Completed | 25/32 [00:52<00:11,  1.69s/it]
Loading safetensors checkpoint shards:  81% Completed | 26/32 [00:54<00:09,  1.66s/it]
Loading safetensors checkpoint shards:  84% Completed | 27/32 [00:56<00:08,  1.63s/it]
Loading safetensors checkpoint shards:  88% Completed | 28/32 [00:57<00:06,  1.63s/it]
Loading safetensors checkpoint shards:  91% Completed | 29/32 [00:59<00:04,  1.65s/it]
Loading safetensors checkpoint shards:  94% Completed | 30/32 [01:00<00:03,  1.64s/it]
Loading safetensors checkpoint shards:  97% Completed | 31/32 [01:03<00:01,  1.85s/it]
Loading safetensors checkpoint shards: 100% Completed | 32/32 [01:05<00:00,  1.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 32/32 [01:05<00:00,  2.03s/it]
(Worker_TP0 pid=141094) 
(Worker_TP0 pid=141094) INFO 04-22 14:49:19 [default_loader.py:384] Loading weights took 65.14 seconds
(Worker_TP0 pid=141094) INFO 04-22 14:49:19 [gpu_model_runner.py:4820] Model loading took 79.09 GiB memory and 66.695218 seconds
(Worker_TP0 pid=141094) INFO 04-22 14:49:20 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_TP0 pid=141094) WARNING 04-22 14:49:21 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1536,device_name=NVIDIA_H20.json
(Worker_TP0 pid=141094) INFO 04-22 14:49:24 [base.py:163] Available KV cache memory: 5.24 GiB (profiling fallback)
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:24 [kv_cache_utils.py:1319] GPU KV cache size: 85,888 tokens
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:24 [kv_cache_utils.py:1324] Maximum concurrency for 22,800 tokens per request: 3.77x
(Worker_TP0 pid=141094) 2026-04-22 14:49:24,733 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=141095) 2026-04-22 14:49:24,733 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=141095) 2026-04-22 14:49:24,985 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP0 pid=141094) 2026-04-22 14:49:24,985 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:25 [core.py:283] init engine (profile, create kv cache, warmup model) took 5.36 seconds
(StageEngineCoreProc pid=140220) WARNING 04-22 14:49:26 [scheduler.py:180] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=140220) `torch_dtype` is deprecated! Use `dtype` instead!
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:26 [hunyuan_image3.py:887] Successfully processed 1 image(s). Final tensor shapes: {'vit_pixel_values': (1, 1024, 768), 'vit_pixel_attention_mask': (1, 1024), 'vit_spatial_shapes': (1, 2), 'vae_pixel_values': (1, 3, 1024, 1024), 'vae_token_grid_hw': (1, 2), 'base_size': (1,), 'ratio_index': (1,)}
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:27 [vllm.py:790] Asynchronous scheduling is enabled.
(StageEngineCoreProc pid=140220) WARNING 04-22 14:49:27 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
INFO 04-22 14:49:27 [async_omni_engine.py:464] [AsyncOmniEngine] Stage 0 engine startup completed
(StageEngineCoreProc pid=140220) WARNING 04-22 14:49:27 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:27 [vllm.py:1025] Cudagraph is disabled under eager mode
(StageEngineCoreProc pid=140220) INFO 04-22 14:49:27 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant, allreduce_rms
ERROR 04-22 14:49:28 [kv_transfer_manager.py:1111] Timeout waiting for KV cache for request dummy_req_id after 30.0s
ERROR 04-22 14:49:28 [kv_transfer_manager.py:1111] Timeout waiting for KV cache for request dummy_req_id after 30.0s
  0%|                                                                                                                                                 | 0/1 [00:00<?, ?it/s]WARNING 04-22 14:49:30 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=32,N=3072,device_name=NVIDIA_H20.json
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.85s/it]
INFO 04-22 14:49:33 [diffusion_model_runner.py:213] Peak GPU memory (this request): 93.12 GB reserved, 89.08 GB allocated, 4.04 GB pool overhead (4.3%)
INFO 04-22 14:49:33 [stage_diffusion_proc.py:67] StageDiffusionProc initialized with model: ./hunyuan/HunyuanImage-3.0-Instruct
INFO 04-22 14:49:33 [stage_diffusion_client.py:144] [StageDiffusionClient] Stage-1 initialized (owns_process=True, batch_size=1)
INFO 04-22 14:49:33 [async_omni_engine.py:791] [AsyncOmniEngine] Stage 1 initialized (diffusion, batch_size=1)
INFO 04-22 14:49:33 [stage_engine_core_client.py:131] [StageEngineCoreClient] Stage-0 initializing EngineCore
INFO 04-22 14:49:33 [stage_engine_core_client.py:171] [StageEngineCoreClient] Stage-0 EngineCore running
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 04-22 14:49:35 [hunyuan_image3.py:887] Successfully processed 1 image(s). Final tensor shapes: {'vit_pixel_values': (1, 1024, 768), 'vit_pixel_attention_mask': (1, 1024), 'vit_spatial_shapes': (1, 2), 'vae_pixel_values': (1, 3, 1024, 1024), 'vae_token_grid_hw': (1, 2), 'base_size': (1,), 'ratio_index': (1,)}
INFO 04-22 14:49:35 [async_omni_engine.py:670] [AsyncOmniEngine] Stage 0 initialized
INFO 04-22 14:49:35 [orchestrator.py:187] [Orchestrator] Starting event loop
INFO 04-22 14:49:35 [async_omni_engine.py:358] [AsyncOmniEngine] Orchestrator ready with 2 stages
INFO 04-22 14:49:35 [omni_base.py:159] [Omni] AsyncOmniEngine initialized in 121.50 seconds
INFO 04-22 14:49:35 [omni_base.py:178] [Omni] Initialized with 2 stages for model ./hunyuan/HunyuanImage-3.0-Instruct
  Num stages: 2
  Initialization complete!

Generation Parameters:
  Inference steps: 50
  Guidance scale: 5.0
  Seed: 42
============================================================

Starting IT2I generation...
WARNING 04-22 14:49:35 [input_processor.py:235] Passing raw prompts to InputProcessor is deprecated and will be removed in v0.18. You should instead pass the outputs of Renderer.render_cmpl() or Renderer.render_chat().
INFO 04-22 14:49:35 [hunyuan_image3.py:887] Successfully processed 1 image(s). Final tensor shapes: {'vit_pixel_values': (1, 1024, 768), 'vit_pixel_attention_mask': (1, 1024), 'vit_spatial_shapes': (1, 2), 'vae_pixel_values': (1, 3, 1024, 1024), 'vae_token_grid_hw': (1, 2), 'base_size': (1,), 'ratio_index': (1,)}
INFO 04-22 14:49:36 [orchestrator.py:859] [Orchestrator] _handle_add_request: stage=0 req=0_969eda72-d283-476e-9ff6-34bb29196427 prompt_type=OmniEngineCoreRequest original_prompt_type=dict final_stage=1 num_sampling_params=2
INFO 04-22 14:49:36 [stage_engine_core_client.py:227] [StageEngineCoreClient] Stage-0 adding request: 0_969eda72-d283-476e-9ff6-34bb29196427
Processed prompts:   0%|                                                                                                                              | 0/1 [00:00<?, ?it/s](Worker_TP0 pid=141094) WARNING 04-22 14:49:36 [gpu_model_runner.py:386] additional_information on request data is deprecated, use model_intermediate_buffer
INFO 04-22 14:50:06 [hunyuan_image3.py:78] [ar2diffusion] Request 0: AR generated 343 tokens, text length=0, target size=1024x1024
INFO 04-22 14:50:06 [orchestrator.py:681] [Orchestrator] ar2diffusion req=0_969eda72-d283-476e-9ff6-34bb29196427 wall_time=0.690ms stage=0->1
INFO 04-22 14:50:06 [kv_transfer_manager.py:1268] Rank-aware KV receive: rank 1 independently receiving (from_tp=2, to_tp=2)
INFO 04-22 14:50:06 [kv_transfer_manager.py:713] Sender info updated: host=11.38.176.233, base_port=50151, adjusted_port=50167 (local_rank=1)
INFO 04-22 14:50:06 [kv_transfer_manager.py:1010] Wait for KV cache for request 0_969eda72-d283-476e-9ff6-34bb29196427 from stage 0 to 1 via 1 key(s)...
INFO 04-22 14:50:06 [kv_transfer_manager.py:1268] Rank-aware KV receive: rank 0 independently receiving (from_tp=2, to_tp=2)
INFO 04-22 14:50:06 [kv_transfer_manager.py:713] Sender info updated: host=11.38.176.233, base_port=50151, adjusted_port=50151 (local_rank=0)
INFO 04-22 14:50:06 [kv_transfer_manager.py:1010] Wait for KV cache for request 0_969eda72-d283-476e-9ff6-34bb29196427 from stage 0 to 1 via 1 key(s)...
(Worker_TP1 pid=141095) INFO 04-22 14:50:06 [kv_transfer_manager.py:906] KV cache serialized for 0_969eda72-d283-476e-9ff6-34bb29196427 in 695.6 ms
(Worker_TP0 pid=141094) INFO 04-22 14:50:07 [kv_transfer_manager.py:906] KV cache serialized for 0_969eda72-d283-476e-9ff6-34bb29196427 in 765.2 ms
(Worker_TP1 pid=141095) INFO 04-22 14:50:07 [kv_transfer_manager.py:920] KV transfer OK: 0_969eda72-d283-476e-9ff6-34bb29196427, 439295033 bytes across 1 key(s), 0.530s, 789.7 MB/s
(Worker_TP0 pid=141094) INFO 04-22 14:50:07 [kv_transfer_manager.py:920] KV transfer OK: 0_969eda72-d283-476e-9ff6-34bb29196427, 439295033 bytes across 1 key(s), 0.586s, 714.4 MB/s
/usr/local/lib/python3.12/dist-packages/vllm_omni/distributed/omni_connectors/kv_transfer_manager.py:257: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  return torch.frombuffer(tensor_data_mv, dtype=torch.uint8, offset=offset, count=nbytes)
INFO 04-22 14:50:08 [kv_transfer_manager.py:1100] Successfully received KV cache for 0_969eda72-d283-476e-9ff6-34bb29196427, 439295033 bytes across 1 key(s), wait=1.765s, link=634.9ms
/usr/local/lib/python3.12/dist-packages/vllm_omni/distributed/omni_connectors/kv_transfer_manager.py:257: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  return torch.frombuffer(tensor_data_mv, dtype=torch.uint8, offset=offset, count=nbytes)
INFO 04-22 14:50:08 [kv_transfer_manager.py:1100] Successfully received KV cache for 0_969eda72-d283-476e-9ff6-34bb29196427, 439295033 bytes across 1 key(s), wait=1.931s, link=800.1ms
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:00<00:00,  1.21s/it]
INFO 04-22 14:51:10 [diffusion_model_runner.py:213] Peak GPU memory (this request): 93.12 GB reserved, 90.10 GB allocated, 3.02 GB pool overhead (3.2%)
INFO 04-22 14:51:10 [diffusion_engine.py:127] Generation completed successfully.
INFO 04-22 14:51:10 [diffusion_engine.py:174] Post-processing completed in 0.0000 seconds
INFO 04-22 14:51:10 [diffusion_engine.py:177] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=64084.63 ms, postprocess=0.00 ms, total=64084.82 ms
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:34<00:00, 94.40s/it]INFO 04-22 14:51:10 [omni_base.py:251] [Summary] {}
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:34<00:00, 94.40s/it]

Processing outputs...

[DiT Stage Output]
  Saved edited image to: ./output_it2i/07/cat_edited_0.png
  Size: 1024x1024

============================================================
IT2I generation complete!
============================================================

INFO 04-22 14:51:10 [async_omni_engine.py:1723] [AsyncOmniEngine] Shutting down Orchestrator
INFO 04-22 14:51:10 [orchestrator.py:247] [Orchestrator] Received shutdown signal
INFO 04-22 14:51:10 [orchestrator.py:1163] [Orchestrator] Shutting down all stages
(Worker_TP0 pid=141094) INFO 04-22 14:51:10 [multiproc_executor.py:764] Parent process exited, terminating worker queues
(Worker_TP1 pid=141095) INFO 04-22 14:51:10 [multiproc_executor.py:859] WorkerProc shutting down.
(Worker_TP0 pid=141094) INFO 04-22 14:51:10 [multiproc_executor.py:859] WorkerProc shutting down.
INFO 04-22 14:51:14 [orchestrator.py:1167] [Orchestrator] Stage 0 shut down
INFO 04-22 14:51:14 [diffusion_worker.py:637] Worker 0: Received shutdown message
INFO 04-22 14:51:14 [diffusion_worker.py:658] event loop terminated.
INFO 04-22 14:51:14 [diffusion_worker.py:637] Worker 1: Received shutdown message
INFO 04-22 14:51:14 [diffusion_worker.py:658] event loop terminated.
INFO 04-22 14:51:14 [diffusion_worker.py:695] Worker 0: Shutdown complete.
INFO 04-22 14:51:14 [diffusion_worker.py:695] Worker 1: Shutdown complete.
INFO 04-22 14:51:18 [orchestrator.py:1167] [Orchestrator] Stage 1 shut down

======================================
Test complete! Check outputs in: ./output_it2i/07
======================================
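As a quick sanity check on the KV transfer numbers in the log above, the reported rate can be reproduced from the logged byte count and duration. The figure only works out if the logged "MB/s" is computed with binary megabytes (MiB/s); the small residual difference comes from the duration being rounded to three digits.

```python
# Reproduce the logged KV transfer rate:
#   "439295033 bytes across 1 key(s), 0.530s, 789.7 MB/s"
nbytes = 439_295_033
seconds = 0.530

mib_per_s = nbytes / 2**20 / seconds   # binary megabytes (MiB)
mb_per_s = nbytes / 1e6 / seconds      # decimal megabytes (MB)

# ~790 MiB/s matches the logged 789.7; the decimal value (~829 MB/s) does not,
# so the log's "MB/s" is evidently MiB/s.
print(f"{mib_per_s:.1f} MiB/s vs {mb_per_s:.1f} MB/s (decimal)")
```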

Expected Behavior

When using Transformers.py for HunyuanImage-3.0 inference:

  • ✅ Runs successfully
  • ✅ The model correctly reads the image information
  • ✅ Generates the expected results

Working Scenario - Execution Log


$ python Transformers.py
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [01:14<00:00,  2.34s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'HunyuanImage3TokenizerFast'.
==================================================
Model input info:
           token shape: torch.Size([1, 1238])
            context[0]: <|startoftext|>You are an advanced multimodal model whose core mission is to analyze user intent and generate high-quality text and images.

#### Four Core Capabilities
1.  **Text-to-Text (T2T):** Generate coherent text responses from text prompts.
2.  **Text-to-Image (T2I):** Generate high-quality images from text prompts.
3.  **Text & Image to Text (TI2T):** Generate accurate text responses based on a combination of images and text.
4.  **Text & Image to Image (TI2I):** Generate modified images based on a reference image and editing instructions.

---
### Image Generation Protocol (for T2I & TI2I)
You will operate in one of two modes, determined by the user's starting tag:
#### **<recaption> Mode (Prompt Rewriting)**:
*   **Trigger:** Input begins with `<recaption>`.
*   **Task:** Immediately rewrite the user's text into a structured, objective, and detail-rich professional-grade prompt.
*   **Output:** Output only the rewritten prompt within `<recaption>` tags: `<recaption>Rewritten professional-grade prompt</recaption>`

#### **<think> Mode (Think + Rewrite)**:
*   **Trigger:** Input begins with `<think>`.
*   **Task:** First, conduct a structured analysis of the request within `<think>` tags. Then, output the professional prompt, rewritten based on the analysis, within `<recaption>` tags.
*   **Output:** Strictly adhere to the format: `<think>Analysis process</think><recaption>Rewritten prompt</recaption>`

---
### Execution Standards and Guidelines
#### **`<think>` Phase: Analysis Guidelines**
**For T2I (New Image Generation):**
Deconstruct the user's request into the following core visual components:
*   **Subject:** Key features of the main character/object, including appearance, pose, expression, and emotion.
*   **Composition:** Camera angle, lens type, and layout.
*   **Environment/Background:** The setting, time of day, weather, and background elements.
*   **Lighting:** Technical details such as light source type, direction, and quality.
*   **Color Palette:** The dominant hues and overall color scheme.
*   **Style/Quality:** The artistic style, clarity, depth of field, and other technical details.
*   **Text:** Identify any text to be rendered in the image, including its content, style, and position.
*   **Details:** Small elements that add narrative depth and realism.

**For TI2I (Image Editing):**
Adopt a task-diagnostic approach:
1.  **Diagnose Task:** Identify the edit type and analyze key requirements.
2.  **Prioritize Analysis:**
    *   **Adding:** Analyze the new element's position and appearance, ensuring seamless integration with the original image's lighting, shadows, and style.
    *   **Removing:** Identify the target for removal and determine how to logically fill the resulting space using surrounding textures and lighting.
    *   **Modifying:** Analyze what to change and what it should become, while emphasizing which elements must remain unchanged.
    *   **Style Transfer:** Deconstruct the target style into specific features (e.g., brushstrokes, color palette) and apply them to the original image.
    *   **Text Editing:** Ensure correct content and format. Consider the text's visual style (e.g., font, color, material) and how it adapts to the surface's perspective, curvature, and lighting.
    *   **Reference Editing:** Extract specific visual elements (e.g., appearance, posture, composition, lines, depth) from the reference image to generate an image that aligns with the text description while also incorporating the referenced content.
    *   **Inferential Editing:** Identify vague requests (e.g., "make it more professional") and translate them into concrete visual descriptions.

#### `<recaption>` Phase: Professional-Grade Prompt Generation Rules
**General Rewriting Principles (for T2I & TI2I):**
1.  **Structure & Logic:** Start with a global description. Use positional words (e.g., "foreground", "background") to define the layout.
2.  **Absolute Objectivity:** Avoid subjective terms. Convey aesthetics through precise descriptions of color, light, shadow, and materials.
3.  **Physical & Logical Consistency:** Ensure all descriptions adhere to the laws of physics and common sense.
4.  **Fidelity to User Intent:** Preserve the user's core concepts, subjects, and attributes. Text to be rendered in the image **must be enclosed in double quotes ("")**.
5.  **Camera & Resolution:** Translate camera parameters into descriptions of visual effects. Convert resolution information into natural language.

**T2I-Specific Guidelines:**
*   **Style Adherence & Inference:** Strictly follow the specified style. If none is given, infer the most appropriate style and detail it using professional terminology.
*   **Style Detailing:**
    *   **Photography/Realism:** Use professional photography terms to describe lighting, lens effects, and material textures.
    *   **Painting/Illustration:** Specify the art movement or medium's characteristics.
    *   **UI/Design:** Objectively describe the final product. Define layout, elements, and typography. Text content must be specific and unambiguous.

**TI2I-Specific Guidelines:**
*   **Preserve Unchanged Elements:** Emphasize elements that **remain unchanged**. Unless explicitly instructed, never alter a character's identity/appearance, the core background, camera angle, or overall style.
*   **Clear Editing Instructions:**
    *   **Replacement:** Use the logic "**replace B with A**," and provide a detailed description of A.
    *   **Addition:** Clearly state what to add, where, and what it looks like.
*   **Unambiguous Referencing:** Avoid vague references (e.g., "that person"). Use specific descriptions of appearance.

User: 给图中的猫换个品种

Assistant: <think>
             do_sample: True
        max_new_tokens: 2048
                 top_k: 1024
                 top_p: 0.95
           temperature: 0.6
    repetition_penalty: 1.0
--------------------------------------------------
用户的指令是“给图中的猫换个品种”,这是一个开放性的编辑请求。参考图中的猫是一只典型的短毛虎斑猫,毛色以棕色和黑色条纹为主。为了使编辑效果显著且符合逻辑,我需要构思一个与原图在毛色、花纹和体型上都有明确对比的新品种。

首先,我需要将“换个品种”这个抽象概念具体化。我构思的目标品种应该具有鲜明的视觉特征,以便与原图的虎斑猫形成强烈反差。因此,我决定引入一个高对比度的色彩方案——将原本的棕色系替换为深灰色与白色的组合。这种色彩对比是让变化一目了然的最有效方式。

其次,仅仅改变颜色是不够的,品种的差异体现在毛发质地和身体形态上。为了进一步强化“新品种”的感觉,我需要细化毛发的描述。我设想新猫的毛发不是短而贴身的,而是长而蓬松的,这会立刻改变它的轮廓,使其显得更丰满、更有体积感。同时,为了保持面部特征的一致性,我需要明确指出哪些部分应该保留,例如眼睛的颜色和形状,这样可以确保编辑的核心主体依然是“一只猫”,而不是一只完全陌生的动物。

接着,为了让这个新形象更加生动和真实,我需要为其添加一些能体现其生活习性的细节。一个佩戴项圈和铃铛的设定,能含蓄地表明它是一只家养宠物,这为图像增添了故事感和可信度。

最后,为了使编辑的焦点完全集中在猫本身,我必须明确指出哪些元素应该保持不变。这包括猫的姿势、它所在的平面(白色表面)以及背景环境(模糊的室内场景)。通过限定这些不变的元素,可以确保AI的修改仅限于猫的品种特征,从而生成一个既符合指令又保持了原始场景和谐性的高质量图像。</think><recaption>一只长毛猫趴在一个白色的平面上。这只猫的毛色是深灰色和白色的组合。它的背部、头顶和耳朵主要是深灰色,而胸部、下巴和鼻子周围则是纯白色。猫的毛发很长,尤其是在颈部周围,形成了一圈蓬松的鬃毛。它有一双明亮的、杏仁状的绿色眼睛,瞳孔是黑色的。它的鼻子是淡粉色的,嘴巴紧闭。几根长长的白色胡须从它的口鼻两侧伸出。猫的耳朵是三角形的,内侧有浅色的毛发。在它毛茸茸的脖子上,可以看到一个深色的项圈,上面挂着一个圆形的金色小铃铛。猫的身体姿态是放松的,前爪可能收拢在身体下方,但被长毛遮挡。背景是一个模糊的室内环境,可以看到一个白色的垂直物体,可能是门框或墙壁的边缘。</recaption><answer><boi><img_size_1024><img_ratio_18>
Generation completed in 195.39 seconds.
==================================================
Model input info:
       token shape: torch.Size([2, 5853])
        context[0]: <|startoftext|>You are an advanced multimodal model whose core mission is to analyze user intent and generate high-quality text and images.

#### Four Core Capabilities
1.  **Text-to-Text (T2T):** Generate coherent text responses from text prompts.
2.  **Text-to-Image (T2I):** Generate high-quality images from text prompts.
3.  **Text & Image to Text (TI2T):** Generate accurate text responses based on a combination of images and text.
4.  **Text & Image to Image (TI2I):** Generate modified images based on a reference image and editing instructions.

---
### Image Generation Protocol (for T2I & TI2I)
You will operate in one of two modes, determined by the user's starting tag:
#### **<recaption> Mode (Prompt Rewriting)**:
*   **Trigger:** Input begins with `<recaption>`.
*   **Task:** Immediately rewrite the user's text into a structured, objective, and detail-rich professional-grade prompt.
*   **Output:** Output only the rewritten prompt within `<recaption>` tags: `<recaption>Rewritten professional-grade prompt</recaption>`

#### **<think> Mode (Think + Rewrite)**:
*   **Trigger:** Input begins with `<think>`.
*   **Task:** First, conduct a structured analysis of the request within `<think>` tags. Then, output the professional prompt, rewritten based on the analysis, within `<recaption>` tags.
*   **Output:** Strictly adhere to the format: `<think>Analysis process</think><recaption>Rewritten prompt</recaption>`

---
### Execution Standards and Guidelines
#### **`<think>` Phase: Analysis Guidelines**
**For T2I (New Image Generation):**
Deconstruct the user's request into the following core visual components:
*   **Subject:** Key features of the main character/object, including appearance, pose, expression, and emotion.
*   **Composition:** Camera angle, lens type, and layout.
*   **Environment/Background:** The setting, time of day, weather, and background elements.
*   **Lighting:** Technical details such as light source type, direction, and quality.
*   **Color Palette:** The dominant hues and overall color scheme.
*   **Style/Quality:** The artistic style, clarity, depth of field, and other technical details.
*   **Text:** Identify any text to be rendered in the image, including its content, style, and position.
*   **Details:** Small elements that add narrative depth and realism.

**For TI2I (Image Editing):**
Adopt a task-diagnostic approach:
1.  **Diagnose Task:** Identify the edit type and analyze key requirements.
2.  **Prioritize Analysis:**
    *   **Adding:** Analyze the new element's position and appearance, ensuring seamless integration with the original image's lighting, shadows, and style.
    *   **Removing:** Identify the target for removal and determine how to logically fill the resulting space using surrounding textures and lighting.
    *   **Modifying:** Analyze what to change and what it should become, while emphasizing which elements must remain unchanged.
    *   **Style Transfer:** Deconstruct the target style into specific features (e.g., brushstrokes, color palette) and apply them to the original image.
    *   **Text Editing:** Ensure correct content and format. Consider the text's visual style (e.g., font, color, material) and how it adapts to the surface's perspective, curvature, and lighting.
    *   **Reference Editing:** Extract specific visual elements (e.g., appearance, posture, composition, lines, depth) from the reference image to generate an image that aligns with the text description while also incorporating the referenced content.
    *   **Inferential Editing:** Identify vague requests (e.g., "make it more professional") and translate them into concrete visual descriptions.

#### `<recaption>` Phase: Professional-Grade Prompt Generation Rules
**General Rewriting Principles (for T2I & TI2I):**
1.  **Structure & Logic:** Start with a global description. Use positional words (e.g., "foreground", "background") to define the layout.
2.  **Absolute Objectivity:** Avoid subjective terms. Convey aesthetics through precise descriptions of color, light, shadow, and materials.
3.  **Physical & Logical Consistency:** Ensure all descriptions adhere to the laws of physics and common sense.
4.  **Fidelity to User Intent:** Preserve the user's core concepts, subjects, and attributes. Text to be rendered in the image **must be enclosed in double quotes ("")**.
5.  **Camera & Resolution:** Translate camera parameters into descriptions of visual effects. Convert resolution information into natural language.

**T2I-Specific Guidelines:**
*   **Style Adherence & Inference:** Strictly follow the specified style. If none is given, infer the most appropriate style and detail it using professional terminology.
*   **Style Detailing:**
    *   **Photography/Realism:** Use professional photography terms to describe lighting, lens effects, and material textures.
    *   **Painting/Illustration:** Specify the art movement or medium's characteristics.
    *   **UI/Design:** Objectively describe the final product. Define layout, elements, and typography. Text content must be specific and unambiguous.

**TI2I-Specific Guidelines:**
*   **Preserve Unchanged Elements:** Emphasize elements that **remain unchanged**. Unless explicitly instructed, never alter a character's identity/appearance, the core background, camera angle, or overall style.
*   **Clear Editing Instructions:**
    *   **Replacement:** Use the logic "**replace B with A**," and provide a detailed description of A.
    *   **Addition:** Clearly state what to add, where, and what it looks like.
*   **Unambiguous Referencing:** Avoid vague references (e.g., "that person"). Use specific descriptions of appearance.

User: 给图中的猫换个品种

Assistant: <think>用户的指令是“给图中的猫换个品种”,这是一个开放性的编辑请求。参考图中的猫是一只典型的短毛虎斑猫,毛色以棕色和黑色条纹为主。为了使编辑效果显著且符合逻辑,我需要构思一个与原图在毛色、花纹和体型上都有明确对比的新品种。

首先,我需要将“换个品种”这个抽象概念具体化。我构思的目标品种应该具有鲜明的视觉特征,以便与原图的虎斑猫形成强烈反差。因此,我决定引入一个高对比度的色彩方案——将原本的棕色系替换为深灰色与白色的组合。这种色彩对比是让变化一目了然的最有效方式。

其次,仅仅改变颜色是不够的,品种的差异体现在毛发质地和身体形态上。为了进一步强化“新品种”的感觉,我需要细化毛发的描述。我设想新猫的毛发不是短而贴身的,而是长而蓬松的,这会立刻改变它的轮廓,使其显得更丰满、更有体积感。同时,为了保持面部特征的一致性,我需要明确指出哪些部分应该保留,例如眼睛的颜色和形状,这样可以确保编辑的核心主体依然是“一只猫”,而不是一只完全陌生的动物。

接着,为了让这个新形象更加生动和真实,我需要为其添加一些能体现其生活习性的细节。一个佩戴项圈和铃铛的设定,能含蓄地表明它是一只家养宠物,这为图像增添了故事感和可信度。

最后,为了使编辑的焦点完全集中在猫本身,我必须明确指出哪些元素应该保持不变。这包括猫的姿势、它所在的平面(白色表面)以及背景环境(模糊的室内场景)。通过限定这些不变的元素,可以确保AI的修改仅限于猫的品种特征,从而生成一个既符合指令又保持了原始场景和谐性的高质量图像。</think><recaption>一只长毛猫趴在一个白色的平面上。这只猫的毛色是深灰色和白色的组合。它的背部、头顶和耳朵主要是深灰色,而胸部、下巴和鼻子周围则是纯白色。猫的毛发很长,尤其是在颈部周围,形成了一圈蓬松的鬃毛。它有一双明亮的、杏仁状的绿色眼睛,瞳孔是黑色的。它的鼻子是淡粉色的,嘴巴紧闭。几根长长的白色胡须从它的口鼻两侧伸出。猫的耳朵是三角形的,内侧有浅色的毛发。在它毛茸茸的脖子上,可以看到一个深色的项圈,上面挂着一个圆形的金色小铃铛。猫的身体姿态是放松的,前爪可能收拢在身体下方,但被长毛遮挡。背景是一个模糊的室内环境,可以看到一个白色的垂直物体,可能是门框或墙壁的边缘。</recaption><answer><boi><img_size_1024><img_ratio_18><timestep>[<img>]{4032}<eoi></answer>


              seed: [42]
        image_size: ['1152x896']
       infer_steps: 50
    guidance_scale: 2.5
        flow_shift: 3.0
--------------------------------------------------
***use_taylor_cache: False, cache_dic: None
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [02:16<00:00,  2.73s/it]
Generation completed in 137.69 seconds.

Comparison Analysis

| Item | it2i_inference.py | Transformers.py |
| --- | --- | --- |
| Image Reading | ❌ Failed | ✅ Success |
| Generation Result | ❌ Incorrect | ✅ Correct |
| Execution Status | ⚠️ Runs but produces wrong output | ✅ Normal |

Configuration Context

My hunyuan_it2i_4gpu.yaml configuration is adapted from the PR's original setup:

  • Original PR Configuration: 8x L40s GPUs
  • My Configuration: 4x H20 96GB GPUs

The main differences due to this hardware change:

  • GPU count: 8 → 4
  • GPU model: L40s → H20
  • Tensor parallelism and resource allocation adjusted accordingly

Request for Support

I would appreciate clarification on the following questions:

  1. Is the current PR ready for IT2I inference with image understanding?

    • Can it2i_inference.py properly handle and understand input images in the current implementation?
  2. Could my 4-GPU setup with 2-GPU AR stage cause this issue?

    • In my configuration, I use 2 GPUs for the AR stage (instead of 4)
    • To avoid OOM, I configured max_num_batched_tokens: 8192 (instead of the default 32768)
    • Could this configuration lead to image information loss? (Though I have doubts about this being the root cause)

Additional Information

  • Related configuration file: hunyuan_it2i_4gpu.yaml, which is almost identical to the one provided in this PR:
# HunyuanImage-3.0 IT2I Configuration for 4x H20-96G GPUs
# Based on official 8-GPU L40-48G config, adapted for 4-GPU H20-96G
# Stage 0 (AR): GPUs 0,1 - BF16 precision
# Stage 1 (DiT): GPUs 2,3 - FP8 quantization

stage_args:
  # ============================================================================
  # Stage 0: AR Model - Autoregressive Generation
  # ============================================================================
  - stage_id: 0
    stage_type: llm
    runtime:
      process: true
      devices: "0,1"  # First 2 GPUs (changed from 0,1,2,3)
      max_batch_size: 1
      requires_multimodal_data: true  # AR needs the input image
    engine_args:
      model_stage: AR
      model_arch: HunyuanImage3ForCausalMM
      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.95  # Match official 8xL40S config - H20 has good memory management
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent  # AR outputs latent for DiT
      enable_prefix_caching: false
      max_num_batched_tokens: 8192  # Reduced to leave room for VAE encoder (6GB needed)
      tensor_parallel_size: 2  # Changed from 4 to 2
      pipeline_parallel_size: 1
      omni_kv_config:
        need_send_cache: true  # AR sends KV cache to DiT stage
      hf_overrides:
        rope_parameters:
          mrope_section: [0, 32, 32]
          rope_type: default
    is_comprehension: false
    final_output: false
    default_sampling_params:
      temperature: 0.6
      top_p: 0.95
      top_k: 1024
      max_tokens: 8192
      stop_token_ids: [127957]  # <|endoftext|>
      detokenize: false

  # ============================================================================
  # Stage 1: DiT (Diffusion) - Denoising + VAE Decode
  # ============================================================================
  - stage_id: 1
    stage_type: diffusion
    runtime:
      process: true
      devices: "2,3"  # Last 2 GPUs (changed from 4,5,6,7)
      max_batch_size: 1
      requires_multimodal_data: true  # May need condition images
    engine_args:
      model_stage: dit
      model_arch: HunyuanImage3ForCausalMM
      enforce_eager: true
      trust_remote_code: true
      distributed_executor_backend: "mp"
      parallel_config:
        tensor_parallel_size: 2  # Changed from 4 to 2
        enable_expert_parallel: true
      omni_kv_config:
        need_recv_cache: true
    engine_input_source: [0]  # Input from AR stage
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.hunyuan_image3.ar2diffusion
    final_output: true
    final_output_type: image
    default_sampling_params:
      num_inference_steps: 50
      guidance_scale: 2.5

# ============================================================================
# Runtime Configuration - Define Stage Connections
# ============================================================================
runtime:
  enabled: true
  edges:
    - from: 0  # AR → Diffusion
      to: 1
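When adapting this config to a different GPU count, the two easiest mistakes to make are a runtime edge that points at a stage_id that no longer exists and two stages claiming the same GPU index. The sketch below is a hypothetical helper (not part of vLLM-Omni) that checks both; it operates on the dict you would get from `yaml.safe_load` applied to a stage config shaped like the one above:

```python
# Hypothetical sanity checker for a hand-edited stage config (not part of
# vLLM-Omni). `cfg` mirrors the parsed YAML structure above, e.g. the result
# of yaml.safe_load(open("hunyuan_it2i_4gpu.yaml")).
def check_stage_config(cfg: dict) -> list[str]:
    stages = {s["stage_id"]: s for s in cfg["stage_args"]}
    problems = []
    # Every runtime edge must connect two declared stages.
    for edge in cfg.get("runtime", {}).get("edges", []):
        for key in ("from", "to"):
            if edge[key] not in stages:
                problems.append(f"edge references unknown stage {edge[key]}")
    # No GPU index may be claimed by two stages at once.
    owner: dict[str, int] = {}
    for sid, stage in stages.items():
        for dev in stage["runtime"]["devices"].split(","):
            dev = dev.strip()
            if dev in owner:
                problems.append(f"GPU {dev} used by stages {owner[dev]} and {sid}")
            owner[dev] = sid
    return problems


# Shape of the 4-GPU config above, reduced to the fields the check needs.
EXAMPLE_CFG = {
    "stage_args": [
        {"stage_id": 0, "runtime": {"devices": "0,1"}},
        {"stage_id": 1, "runtime": {"devices": "2,3"}},
    ],
    "runtime": {"edges": [{"from": 0, "to": 1}]},
}

if __name__ == "__main__":
    print(check_stage_config(EXAMPLE_CFG))  # [] → config is structurally consistent
```

Running the checker before launching the pipeline is cheap and catches the device-overlap errors that otherwise only surface as CUDA initialization failures at startup.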
  • Related script: it2i_inference.py:
"""
HunyuanImage-3.0-Instruct Image-to-Image (IT2I) Inference Script
Optimized for 4x H20-96G GPUs

This script performs image-to-image generation using HunyuanImage-3.0-Instruct model.
Pipeline: Input Image + Text Prompt → AR Stage → DiT Stage → Edited Image

Usage:
    python it2i_inference.py --image-path input.png --prompt "Make the sky sunset orange"
    python it2i_inference.py --image-path input.jpg --prompt "Turn into a watercolor painting" --steps 100
"""

import argparse
import os
from pathlib import Path

from PIL import Image
from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import get_system_prompt
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams, OmniPromptType

# task → (sys_type, bot_task, trigger_tag)
_TASK_PRESETS = {
    "think": ("en_unified", "think", "<think>"),
    "recaption": ("en_unified", "recaption", "<recaption>"),
}


def build_it2i_prompt(
    user_prompt: str,
    task: str = "think",  # "think" or "recaption"
    sys_type: str = "en_unified",
) -> str:
    """
    Build IT2I prompt with HunyuanImage-3.0 pretrain template format.
    
    Format: <|startoftext|>{system_prompt}<img>{trigger_tag}{user_prompt}
    """
    preset_sys_type, bot_task, trigger_tag = _TASK_PRESETS[task]
    effective_sys_type = sys_type or preset_sys_type
    system_prompt = get_system_prompt(effective_sys_type, bot_task, None)
    sys_text = system_prompt.strip() if system_prompt else ""
    
    parts = ["<|startoftext|>"]
    if sys_text:
        parts.append(sys_text)
    parts.append("<img>")  # Image placeholder - IT2I always has image input
    parts.append(trigger_tag)
    parts.append(user_prompt)
    
    return "".join(parts)


def parse_args():
    parser = argparse.ArgumentParser(
        description="HunyuanImage-3.0 IT2I Inference on 4x H20-96G GPUs"
    )
    
    # Required arguments
    parser.add_argument(
        "--image-path",
        type=str,
        required=True,
        help="Path to input image (PNG/JPG)",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        required=True,
        help="Text prompt describing desired image edits",
    )
    
    # Model configuration
    parser.add_argument(
        "--model",
        type=str,
        default="tencent/HunyuanImage-3.0-Instruct",
        help="Model name or local path",
    )
    parser.add_argument(
        "--stage-config",
        type=str,
        default="hunyuan_it2i_4gpu.yaml",
        help="Stage configuration YAML file",
    )
    
    # Generation parameters
    parser.add_argument(
        "--steps",
        type=int,
        default=50,
        help="Number of diffusion inference steps (default: 50)",
    )
    parser.add_argument(
        "--guidance-scale",
        type=float,
        default=5.0,
        help="Classifier-free guidance scale (default: 5.0)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed for reproducibility",
    )
    
    # Prompt configuration
    parser.add_argument(
        "--task",
        type=str,
        choices=["think", "recaption"],
        default="think",
        help="IT2I task mode: 'think' (reason, then rewrite the prompt) or 'recaption' (direct prompt rewriting)",
    )
    parser.add_argument(
        "--sys-type",
        type=str,
        default="en_unified",
        help="System prompt type (default: en_unified)",
    )
    
    # Output configuration
    parser.add_argument(
        "--output",
        type=str,
        default="./outputs",
        help="Output directory to save generated images",
    )
    parser.add_argument(
        "--output-name",
        type=str,
        default=None,
        help="Output filename (default: auto-generated from input)",
    )
    
    # Omni configuration
    parser.add_argument(
        "--init-timeout",
        type=int,
        default=600,
        help="Initialization timeout in seconds (default: 600)",
    )
    parser.add_argument(
        "--enforce-eager",
        action="store_true",
        help="Disable torch.compile for debugging",
    )
    parser.add_argument(
        "--log-stats",
        action="store_true",
        help="Enable detailed logging statistics",
    )
    
    return parser.parse_args()


def main():
    args = parse_args()
    
    # Validate input image
    if not os.path.exists(args.image_path):
        raise FileNotFoundError(f"Input image not found: {args.image_path}")
    
    # Create output directory
    os.makedirs(args.output, exist_ok=True)
    
    # Load input image
    print(f"\n{'='*60}")
    print("Loading input image...")
    input_image = Image.open(args.image_path).convert("RGB")
    print(f"  Image path: {args.image_path}")
    print(f"  Image size: {input_image.width}x{input_image.height}")
    print(f"  Image mode: {input_image.mode}")
    
    # Build prompt
    formatted_prompt = build_it2i_prompt(
        user_prompt=args.prompt,
        task=args.task,
        sys_type=args.sys_type,
    )
    
    print(f"\nPrompt Configuration:")
    print(f"  User prompt: {args.prompt}")
    print(f"  Task mode: {args.task}")
    print(f"  System type: {args.sys_type}")
    
    # Initialize Omni
    print(f"\nInitializing Omni with 4x H20-96G GPUs...")
    print(f"  Model: {args.model}")
    print(f"  Stage config: {args.stage_config}")
    print(f"  Init timeout: {args.init_timeout}s")
    
    omni = Omni(
        model=args.model,
        mode="text-to-image",
        stage_configs_path=args.stage_config,
        log_stats=args.log_stats,
        init_timeout=args.init_timeout,
        enforce_eager=args.enforce_eager,
    )
    
    print(f"  Num stages: {omni.num_stages}")
    print(f"  Initialization complete!")
    
    # Prepare prompt dict
    prompt_dict: OmniPromptType = {
        "prompt": formatted_prompt,
        "modalities": ["image"],
        "multi_modal_data": {"image": input_image},
        "mm_processor_kwargs": {
            "infer_align_image_size": True,  # Align output to input image size
        },
        "height": input_image.height,
        "width": input_image.width,
    }
    
    # Configure sampling parameters
    params_list = list(omni.default_sampling_params_list)
    for sp in params_list:
        if isinstance(sp, OmniDiffusionSamplingParams):
            sp.num_inference_steps = args.steps
            sp.guidance_scale = args.guidance_scale
            if args.seed is not None:
                sp.seed = args.seed
    
    print(f"\nGeneration Parameters:")
    print(f"  Inference steps: {args.steps}")
    print(f"  Guidance scale: {args.guidance_scale}")
    print(f"  Seed: {args.seed}")
    print(f"{'='*60}\n")
    
    # Generate
    print("Starting IT2I generation...")
    omni_outputs = list(omni.generate(prompts=[prompt_dict], sampling_params_list=params_list))
    
    # Process outputs
    print("\nProcessing outputs...")
    
    for req_output in omni_outputs:
        # Check for text output (AR stage reasoning)
        ro = getattr(req_output, "request_output", None)
        if ro and getattr(ro, "outputs", None):
            txt = "".join(getattr(o, "text", "") or "" for o in ro.outputs)
            if txt:
                print(f"\n[AR Stage Output]")
                print(f"{txt[:500]}{'...' if len(txt) > 500 else ''}")
        
        # Extract and save images (DiT stage output)
        images = getattr(req_output, "images", None)
        if not images and ro and hasattr(ro, "images"):
            images = ro.images
        
        if images:
            for j, img in enumerate(images):
                # Generate output filename
                if args.output_name:
                    output_filename = args.output_name
                else:
                    input_stem = Path(args.image_path).stem
                    output_filename = f"{input_stem}_edited_{j}.png"
                
                save_path = os.path.join(args.output, output_filename)
                img.save(save_path)
                print(f"\n[DiT Stage Output]")
                print(f"  Saved edited image to: {save_path}")
                print(f"  Size: {img.width}x{img.height}")
        else:
            print("\n[Warning] No images generated!")
    
    print(f"\n{'='*60}")
    print("IT2I generation complete!")
    print(f"{'='*60}\n")


if __name__ == "__main__":
    main()

Transformers.py

from transformers import AutoModelForCausalLM

model_id = "./HunyuanImage-3.0-Instruct"

kwargs = dict(
    attn_implementation="sdpa", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",
    moe_drop_tokens=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)

# Fix: manually set model_version when the checkpoint config omits it
if not hasattr(model.config, 'model_version'):
    model.config.model_version = "3.0"  # set according to the model version in use

model.load_tokenizer(model_id)

# Image-to-Image generation (TI2I)
prompt = "给图中的猫换个品种"

input_img1 = "cat.png"
imgs_input = [input_img1]

cot_text, samples = model.generate_image(
    prompt=prompt,
    image=imgs_input,  # pass the loaded input image(s); leaving this as None turns the run into plain T2I
    seed=42,
    image_size="auto",
    use_system_prompt="en_unified",
    bot_task="think_recaption",  # Use "think_recaption" for reasoning and enhancement
    infer_align_image_size=False,  # set True to align the output image size to the input image size
    diff_infer_steps=50, 
    verbose=2
)

# Save the generated image
samples[0].save("2.png")

kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 22, 2026
…T2I)

Enables the AR (language) stage to share its prefilled text KV cache with
the DiT (diffusion) stage, so the DiT no longer re-encodes the prompt from
scratch.  End-to-end denoise time for a 1216x832 IT2I request drops from
~57 s to ~27 s on 4xL20X (TP=2 for both stages) while preserving the
image quality of the non-reuse path.

What's in this change
---------------------

* Diffusion pipeline (`pipeline_hunyuan_image3.py`)
  - New `_forward_with_kv_reuse()` path that injects AR-produced K/V into
    each layer's `ImageKVCacheManager` and runs every denoising step as a
    non-first step (`kv_injected=True` / `first_step=False`).
  - Builds the correct `[BOS|sys|user|cot] + [<boi>|<img_size>|<img_ratio>]
    + [<timestep>|img*N] + [<eoi>]` token layout; zero-pads the three DiT
    special tokens the AR doesn't emit and masks them in attention.
  - Reads `sequence_template` from `generation_config` (defaults to
    `"instruct"` for HunyuanImage-3.0-Instruct) instead of hard-coding
    `"pretrain"`, matching the checkpoint's training distribution.

* Transformer (`hunyuan_image3_transformer.py`)
  - `ImageKVCacheManager.inject_prompt_kv_cache()` prepares
    `image_kv_cache_map` in the exact layout `_save_image_kv_caches()`
    produces, including pos/neg branches, special-token pad and eoi slot,
    so `_update_image_kv_caches()` works unchanged on subsequent steps.
  - `forward(..., kv_injected=...)` propagates the flag to every layer's
    attention call.

* CFG companion (`stage_input_processors/hunyuan_image3.py`)
  - `expand_cfg_prompts()` now mirrors the positive prompt's structure
    (same system prompt, same image, same assistant/trigger) with the
    user text replaced by `<cfg>`.  This fixes the `L_pos=6833 / L_neg=1`
    degeneracy that produced visibly degraded images (PSNR 6.5 dB); with
    the fix the KV-reuse output closely matches the non-reuse baseline
    (PSNR ~9.7 dB, consistent with the residue seen in the official
    reference implementation).
  - `collect_cfg_kv_caches()` retrieves the companion KV via
    `OmniKVTransferManager` and attaches it as
    `sampling_params.cfg_text_past_key_values`.
  - `ar2diffusion()` forwards `ar_generated_text` plus user metadata
    (system prompt, height/width, multi-modal data) to the DiT, and now
    lazily decodes the AR tokens via `AutoTokenizer` when `detokenize:
    false` on the AR stage leaves `output.text` empty.  Without this
    fallback the DiT silently received an empty CoT string and dropped
    the image conditioning entirely — the "text length=0, image ignored"
    symptom reported on vllm-project#2590 for `it2i_inference.py` +
    `hunyuan_it2i_4gpu.yaml`.

* Entry point (`examples/offline_inference/hunyuan_image3/end2end.py`)
  - Unified `build_prompt()` for all modalities using the Instruct chat
    template (`<|startoftext|>{sys}\n\nUser: [<img>]{q}\n\nAssistant:
    [trigger]`); removes the earlier pretrain-vs-instruct split that
    silently drifted from the model's training distribution.
  - New `img2img` / `img2text` branches plumb multi-modal data and
    `use_system_prompt` through to both stages.

* Stage configs
  - Adds `hunyuan_image3_it2i_kv_reuse.yaml` with `need_send_cache` on
    stage-0 (AR), `need_recv_cache` on stage-1 (DiT), the CFG companion
    expand/collect hooks, and `requires_multimodal_data=true` so the
    source image is forwarded to the DiT for VAE conditioning.
  - Updates existing `hunyuan_image3_{i2t,it2i,t2t,moe}.yaml` to declare
    the new `need_send_cache` / `need_recv_cache` fields so the
    non-reuse paths stay consistent with the transport layer changes.

* Transport
  - `kv_transfer_manager.py` exposes the per-request receive call used by
    `collect_cfg_kv_caches`.
  - `mooncake_transfer_engine_connector.py` small adjustments for the
    cross-node KV-reuse path.

* Worker / misc
  - `diffusion_worker.py`: disable cuDNN at device-init time to work
    around `CUDNN_STATUS_NOT_INITIALIZED` on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to PyTorch native impl.
  - `rope.py`: guard the optional `flash_attn.ops.triton.rotary` import
    so an ABI-incompatible flash-attn install does not break startup.

Validation
----------

Hardware: 4xNVIDIA L20X (143 GB), driver 570.133.20, TP=2 per stage.
Prompt / image: official `assets/demo_instruct_imgs/input_0_0.png` with
the "新年宠物海报" prompt from `run_demo_instruct.sh`; seed 42,
50 inference steps, guidance 5.0.

No-reuse baseline:
  - `[ar2diffusion] Request 0: AR generated 424 tokens, text length=749`
    (was `text length=0` before the tokenizer-fallback fix)
  - DiT denoise           56.9 s
  - Image saved           1216x832, reflects both the input image and
                          the edit prompt (no more "image ignored").

KV-reuse (`hunyuan_image3_it2i_kv_reuse.yaml`):
  - CFG companion KV      407 MB transferred (1019 MB/s)
  - Primary KV            445 MB transferred (1039 MB/s)
  - `L_pos=6793, L_neg=6214` (was `L_neg=1` before the CFG fix)
  - DiT denoise+inject    27.7 s  (2.05x speed-up vs baseline denoise)
  - PSNR vs baseline      9.71 dB, MAE 53.1, |diff|<=50 on 66.6% pixels;
                          matches the residue seen in the reference
                          implementation where KV-reuse is reported as
                          visually indistinguishable from the non-reuse
                          path.
@kechengliu97
Contributor Author

Thanks for the detailed repro — the symptom you described (text length=0 in the [ar2diffusion] log, DiT ignoring the input image and producing an unrelated picture) is a known bug in the AR → DiT bridge and is fixed in PR #2949.

Root cause. `hunyuan_image3_it2i.yaml` (and the 4-GPU variant you're using) sets `detokenize: false` on the AR stage to avoid the cost of streaming text during generation. With that flag `output.text` is empty, so the old `ar2diffusion()` handed an empty `ar_generated_text` to the diffusion pipeline. The DiT then built its joint sequence via `apply_chat_template` with no user-text section at all, so the model "sees" only the system prompt + (image) + empty assistant content, and the image tokens are effectively conditioning-free. Hence the output is unrelated to both the edit prompt and the input image, matching exactly the `text length=0` line in your log.

Fix. In `vllm_omni/model_executor/stage_input_processors/hunyuan_image3.py::ar2diffusion`, when `output.text` is empty but `output.token_ids` is non-empty we now lazily load the AR tokenizer (via `AutoTokenizer`, cached per model path) and decode the tokens on the fly:

if not generated_text and generated_token_ids:
    tokenizer = _resolve_ar_tokenizer(stage_list[source_stage_id])
    if tokenizer is not None:
        generated_text = tokenizer.decode(list(generated_token_ids), skip_special_tokens=False)

So no change is required in your `hunyuan_it2i_4gpu.yaml` or `it2i_inference.py`; after pulling the fix the same command just works.

Verification on the PR #2949 branch (4× L20X, TP=2 per stage, official `assets/demo_instruct_imgs/input_0_0.png` + the "新年宠物海报" ("New Year pet poster") prompt):

| metric | before | after |
| --- | --- | --- |
| `[ar2diffusion]` log | `text length=0` | `text length=749` |
| edited image | unrelated to input (model ignores image) | correctly reflects both the input image and the edit instruction |

The same PR also fixes the KV-reuse path's CFG companion (`L_pos` / `L_neg` mismatch), so if you want the ~2x denoise speed-up you can switch the stage config to `hunyuan_image3_it2i_kv_reuse.yaml` once the PR lands.

Closing the loop here — please try the PR branch (or wait for merge) and let me know if your scenario is fully resolved.

@Bounty-hunter
Contributor


Additional issue:
(1) The AR outputs have precision problems: in some cases the generated results keep repeating. This can also be reproduced when running the i2t task.

qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
)

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 23, 2026
…T2I)

Enables the AR (language) stage to share its prefilled text KV cache with
the DiT (diffusion) stage, so the DiT no longer re-encodes the prompt from
scratch.  End-to-end denoise time for a 1216x832 IT2I request drops from
~57 s to ~27 s on 4xL20X (TP=2 for both stages) while preserving the
image quality of the non-reuse path.

What's in this change
---------------------

* Diffusion pipeline (`pipeline_hunyuan_image3.py`)
  - New `_forward_with_kv_reuse()` path that injects AR-produced K/V into
    each layer's `ImageKVCacheManager` and runs every denoising step as a
    non-first step (`kv_injected=True` / `first_step=False`).
  - Builds the correct `[BOS|sys|user|cot] + [<boi>|<img_size>|<img_ratio>]
    + [<timestep>|img*N] + [<eoi>]` token layout; zero-pads the three DiT
    special tokens the AR doesn't emit and masks them in attention.
  - Reads `sequence_template` from `generation_config` (defaults to
    `"instruct"` for HunyuanImage-3.0-Instruct) instead of hard-coding
    `"pretrain"`, matching the checkpoint's training distribution.
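The sequence bookkeeping in that bullet can be illustrated with a toy sketch (function name and sizes are illustrative; the real pipeline works on tensors and per-layer KV, not strings):

```python
def build_joint_layout(n_text: int, n_img: int):
    """Sketch of the KV-reuse token layout: text KV comes from the AR
    stage; the three DiT-only special tokens (<boi>, <img_size>,
    <img_ratio>) are zero-padded slots masked out of attention; the
    <timestep>, image latents, and <eoi> are computed fresh by the DiT."""
    layout = (
        ["text"] * n_text
        + ["<boi>", "<img_size>", "<img_ratio>"]  # zero-padded slots
        + ["<timestep>"]
        + ["img"] * n_img
        + ["<eoi>"]
    )
    # Attention mask: 1 = attend, 0 = masked (the padded special tokens).
    mask = [0 if t in ("<boi>", "<img_size>", "<img_ratio>") else 1
            for t in layout]
    return layout, mask
```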

* Transformer (`hunyuan_image3_transformer.py`)
  - `ImageKVCacheManager.inject_prompt_kv_cache()` prepares
    `image_kv_cache_map` in the exact layout `_save_image_kv_caches()`
    produces, including pos/neg branches, special-token pad and eoi slot,
    so `_update_image_kv_caches()` works unchanged on subsequent steps.
  - `forward(..., kv_injected=...)` propagates the flag to every layer's
    attention call.

* CFG companion (`stage_input_processors/hunyuan_image3.py`)
  - `expand_cfg_prompts()` now mirrors the positive prompt's structure
    (same system prompt, same image, same assistant/trigger) with the
    user text replaced by `<cfg>`.  This fixes the `L_pos=6833 / L_neg=1`
    degeneracy that produced visibly degraded images (PSNR 6.5 dB); with
    the fix the KV-reuse output closely matches the non-reuse baseline
    (PSNR ~9.7 dB, consistent with the residue seen in the official
    reference implementation).
  - `collect_cfg_kv_caches()` retrieves the companion KV via
    `OmniKVTransferManager` and attaches it as
    `sampling_params.cfg_text_past_key_values`.
  - `ar2diffusion()` forwards `ar_generated_text` plus user metadata
    (system prompt, height/width, multi-modal data) to the DiT, and now
    lazily decodes the AR tokens via `AutoTokenizer` when `detokenize:
    false` on the AR stage leaves `output.text` empty.  Without this
    fallback the DiT silently received an empty CoT string and dropped
    the image conditioning entirely — the "text length=0, image ignored"
    symptom reported on vllm-project#2590 for `it2i_inference.py` +
    `hunyuan_it2i_4gpu.yaml`.
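The `expand_cfg_prompts()` mirroring rule above can be sketched in a few lines (the template string follows the entry-point description below; helper names are illustrative):

```python
def build_positive(system: str, has_image: bool, user_text: str) -> str:
    """Positive prompt in the Instruct chat-template layout."""
    img = "<img>" if has_image else ""
    return f"<|startoftext|>{system}\n\nUser: {img}{user_text}\n\nAssistant:"


def build_cfg_companion(system: str, has_image: bool) -> str:
    """The CFG companion mirrors the positive prompt's structure (same
    system prompt, same image slot, same trigger) and swaps only the
    user text for <cfg>, so L_neg stays comparable to L_pos instead of
    degenerating to a single token."""
    return build_positive(system, has_image, "<cfg>")
```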

* Entry point (`examples/offline_inference/hunyuan_image3/end2end.py`)
  - Unified `build_prompt()` for all modalities using the Instruct chat
    template (`<|startoftext|>{sys}\n\nUser: [<img>]{q}\n\nAssistant:
    [trigger]`); removes the earlier pretrain-vs-instruct split that
    silently drifted from the model's training distribution.
  - New `img2img` / `img2text` branches plumb multi-modal data and
    `use_system_prompt` through to both stages.

* Stage configs
  - Adds `hunyuan_image3_it2i_kv_reuse.yaml` with `need_send_cache` on
    stage-0 (AR), `need_recv_cache` on stage-1 (DiT), the CFG companion
    expand/collect hooks, and `requires_multimodal_data=true` so the
    source image is forwarded to the DiT for VAE conditioning.
  - Updates existing `hunyuan_image3_{i2t,it2i,t2t,moe}.yaml` to declare
    the new `need_send_cache` / `need_recv_cache` fields so the
    non-reuse paths stay consistent with the transport layer changes.
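In sketch form, the reuse config wires the two stages roughly like this (field placement and stage names are illustrative; see `hunyuan_image3_it2i_kv_reuse.yaml` for the actual file):

```yaml
stages:
  - name: ar                       # stage 0: autoregressive LM
    need_send_cache: true          # export prefilled text KV to the DiT
    need_recv_cache: false
  - name: dit                      # stage 1: diffusion transformer
    need_send_cache: false
    need_recv_cache: true          # inject the AR KV instead of re-encoding
    requires_multimodal_data: true # forward source image for VAE conditioning
```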

* Transport
  - `kv_transfer_manager.py` exposes the per-request receive call used by
    `collect_cfg_kv_caches`.
  - `mooncake_transfer_engine_connector.py` small adjustments for the
    cross-node KV-reuse path.

* Worker / misc
  - `diffusion_worker.py`: disable cuDNN at device-init time to work
    around `CUDNN_STATUS_NOT_INITIALIZED` on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to PyTorch native impl.
  - `rope.py`: guard the optional `flash_attn.ops.triton.rotary` import
    so an ABI-incompatible flash-attn install does not break startup.

Validation
----------

Hardware: 4x NVIDIA L20X (143 GB), driver 570.133.20, TP=2 per stage.
Prompt / image: official `assets/demo_instruct_imgs/input_0_0.png` with
the "新年宠物海报" ("New Year pet poster") prompt from
`run_demo_instruct.sh`; seed 42, 50 inference steps, guidance 5.0.

No-reuse baseline:
  - `[ar2diffusion] Request 0: AR generated 424 tokens, text length=749`
    (was `text length=0` before the tokenizer-fallback fix)
  - DiT denoise           56.9 s
  - Image saved           1216x832, reflects both the input image and
                          the edit prompt (no more "image ignored").

KV-reuse (`hunyuan_image3_it2i_kv_reuse.yaml`):
  - CFG companion KV      407 MB transferred (1019 MB/s)
  - Primary KV            445 MB transferred (1039 MB/s)
  - `L_pos=6793, L_neg=6214` (was `L_neg=1` before the CFG fix)
  - DiT denoise+inject    27.7 s  (2.05x speed-up vs baseline denoise)
  - PSNR vs baseline      9.71 dB, MAE 53.1, |diff|<=50 on 66.6% pixels;
                          matches the residue seen in the reference
                          implementation where KV-reuse is reported as
                          visually indistinguishable from the non-reuse
                          path.
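Roughly, the PSNR / MAE figures above can be computed as follows (a minimal pure-Python sketch over flat pixel lists; the actual evaluation presumably ran on full-resolution image arrays):

```python
import math


def psnr_mae(a, b, max_val=255.0):
    """PSNR (dB) and mean absolute error between two equal-size images
    given as flat sequences of pixel values."""
    diffs = [x - y for x, y in zip(a, b)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    # PSNR = 20*log10(MAX) - 10*log10(MSE); identical images give +inf.
    psnr = float("inf") if mse == 0 else (
        20 * math.log10(max_val) - 10 * math.log10(mse)
    )
    return psnr, mae
```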

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 23, 2026
…T2I)

Enables the AR (language) stage to share its prefilled text KV cache with
the DiT (diffusion) stage, so the DiT no longer re-encodes the prompt from
scratch.  End-to-end denoise time for a 1216x832 IT2I request drops from
~57 s to ~27 s on 4xL20X (TP=2 for both stages) while preserving the
image quality of the non-reuse path.

What's in this change
---------------------

* Diffusion pipeline (`pipeline_hunyuan_image3.py`)
  - New `_forward_with_kv_reuse()` path that injects AR-produced K/V into
    each layer's `ImageKVCacheManager` and runs every denoising step as a
    non-first step (`kv_injected=True` / `first_step=False`).
  - Builds the correct `[BOS|sys|user|cot] + [<boi>|<img_size>|<img_ratio>]
    + [<timestep>|img*N] + [<eoi>]` token layout; zero-pads the three DiT
    special tokens the AR doesn't emit and masks them in attention.
  - Reads `sequence_template` from `generation_config` (defaults to
    `"instruct"` for HunyuanImage-3.0-Instruct) instead of hard-coding
    `"pretrain"`, matching the checkpoint's training distribution.

* Transformer (`hunyuan_image3_transformer.py`)
  - `ImageKVCacheManager.inject_prompt_kv_cache()` prepares
    `image_kv_cache_map` in the exact layout `_save_image_kv_caches()`
    produces, including pos/neg branches, special-token pad and eoi slot,
    so `_update_image_kv_caches()` works unchanged on subsequent steps.
  - `forward(..., kv_injected=...)` propagates the flag to every layer's
    attention call.

* CFG companion (`stage_input_processors/hunyuan_image3.py`)
  - `expand_cfg_prompts()` now mirrors the positive prompt's structure
    (same system prompt, same image, same assistant/trigger) with the
    user text replaced by `<cfg>`.  This fixes the `L_pos=6833 / L_neg=1`
    degeneracy that produced visibly degraded images (PSNR 6.5 dB); with
    the fix the KV-reuse output closely matches the non-reuse baseline
    (PSNR ~9.7 dB, consistent with the residue seen in the official
    reference implementation).
  - `collect_cfg_kv_caches()` retrieves the companion KV via
    `OmniKVTransferManager` and attaches it as
    `sampling_params.cfg_text_past_key_values`.
  - `ar2diffusion()` forwards `ar_generated_text` plus user metadata
    (system prompt, height/width, multi-modal data) to the DiT, and now
    lazily decodes the AR tokens via `AutoTokenizer` when `detokenize:
    false` on the AR stage leaves `output.text` empty.  Without this
    fallback the DiT silently received an empty CoT string and dropped
    the image conditioning entirely — the "text length=0, image ignored"
    symptom reported on vllm-project#2590 for `it2i_inference.py` +
    `hunyuan_it2i_4gpu.yaml`.

* Entry point (`examples/offline_inference/hunyuan_image3/end2end.py`)
  - Unified `build_prompt()` for all modalities using the Instruct chat
    template (`<|startoftext|>{sys}\n\nUser: [<img>]{q}\n\nAssistant:
    [trigger]`); removes the earlier pretrain-vs-instruct split that
    silently drifted from the model's training distribution.
  - New `img2img` / `img2text` branches plumb multi-modal data and
    `use_system_prompt` through to both stages.

* Stage configs
  - Adds `hunyuan_image3_it2i_kv_reuse.yaml` with `need_send_cache` on
    stage-0 (AR), `need_recv_cache` on stage-1 (DiT), the CFG companion
    expand/collect hooks, and `requires_multimodal_data=true` so the
    source image is forwarded to the DiT for VAE conditioning.
  - Updates existing `hunyuan_image3_{i2t,it2i,t2t,moe}.yaml` to declare
    the new `need_send_cache` / `need_recv_cache` fields so the
    non-reuse paths stay consistent with the transport layer changes.

* Transport
  - `kv_transfer_manager.py` exposes the per-request receive call used by
    `collect_cfg_kv_caches`.
  - `mooncake_transfer_engine_connector.py` small adjustments for the
    cross-node KV-reuse path.

* Worker / misc
  - `diffusion_worker.py`: disable cuDNN at device-init time to work
    around `CUDNN_STATUS_NOT_INITIALIZED` on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to PyTorch native impl.
  - `rope.py`: guard the optional `flash_attn.ops.triton.rotary` import
    so an ABI-incompatible flash-attn install does not break startup.

Validation
----------

Hardware: 4xNVIDIA L20X (143 GB), driver 570.133.20, TP=2 per stage.
Prompt / image: official `assets/demo_instruct_imgs/input_0_0.png` with
the "新年宠物海报" prompt from `run_demo_instruct.sh`; seed 42,
50 inference steps, guidance 5.0.

No-reuse baseline:
  - `[ar2diffusion] Request 0: AR generated 424 tokens, text length=749`
    (was `text length=0` before the tokenizer-fallback fix)
  - DiT denoise           56.9 s
  - Image saved           1216x832, reflects both the input image and
                          the edit prompt (no more "image ignored").

KV-reuse (`hunyuan_image3_it2i_kv_reuse.yaml`):
  - CFG companion KV      407 MB transferred (1019 MB/s)
  - Primary KV            445 MB transferred (1039 MB/s)
  - `L_pos=6793, L_neg=6214` (was `L_neg=1` before the CFG fix)
  - DiT denoise+inject    27.7 s  (2.05x speed-up vs baseline denoise)
  - PSNR vs baseline      9.71 dB, MAE 53.1, |diff|<=50 on 66.6% pixels;
                          matches the residue seen in the reference
                          implementation where KV-reuse is reported as
                          visually indistinguishable from the non-reuse
                          path.

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 23, 2026
…T2I)

Enables the AR (language) stage to share its prefilled text KV cache with
the DiT (diffusion) stage, so the DiT no longer re-encodes the prompt from
scratch.  End-to-end denoise time for a 1216x832 IT2I request drops from
~57 s to ~27 s on 4xL20X (TP=2 for both stages) while preserving the
image quality of the non-reuse path.

What's in this change
---------------------

* Diffusion pipeline (`pipeline_hunyuan_image3.py`)
  - New `_forward_with_kv_reuse()` path that injects AR-produced K/V into
    each layer's `ImageKVCacheManager` and runs every denoising step as a
    non-first step (`kv_injected=True` / `first_step=False`).
  - Builds the correct `[BOS|sys|user|cot] + [<boi>|<img_size>|<img_ratio>]
    + [<timestep>|img*N] + [<eoi>]` token layout; zero-pads the three DiT
    special tokens the AR doesn't emit and masks them in attention.
  - Reads `sequence_template` from `generation_config` (defaults to
    `"instruct"` for HunyuanImage-3.0-Instruct) instead of hard-coding
    `"pretrain"`, matching the checkpoint's training distribution.

* Transformer (`hunyuan_image3_transformer.py`)
  - `ImageKVCacheManager.inject_prompt_kv_cache()` prepares
    `image_kv_cache_map` in the exact layout `_save_image_kv_caches()`
    produces, including pos/neg branches, special-token pad and eoi slot,
    so `_update_image_kv_caches()` works unchanged on subsequent steps.
  - `forward(..., kv_injected=...)` propagates the flag to every layer's
    attention call.

* CFG companion (`stage_input_processors/hunyuan_image3.py`)
  - `expand_cfg_prompts()` now mirrors the positive prompt's structure
    (same system prompt, same image, same assistant/trigger) with the
    user text replaced by `<cfg>`.  This fixes the `L_pos=6833 / L_neg=1`
    degeneracy that produced visibly degraded images (PSNR 6.5 dB); with
    the fix the KV-reuse output closely matches the non-reuse baseline
    (PSNR ~9.7 dB, consistent with the residue seen in the official
    reference implementation).
  - `collect_cfg_kv_caches()` retrieves the companion KV via
    `OmniKVTransferManager` and attaches it as
    `sampling_params.cfg_text_past_key_values`.
  - `ar2diffusion()` forwards `ar_generated_text` plus user metadata
    (system prompt, height/width, multi-modal data) to the DiT, and now
    lazily decodes the AR tokens via `AutoTokenizer` when `detokenize:
    false` on the AR stage leaves `output.text` empty.  Without this
    fallback the DiT silently received an empty CoT string and dropped
    the image conditioning entirely — the "text length=0, image ignored"
    symptom reported on vllm-project#2590 for `it2i_inference.py` +
    `hunyuan_it2i_4gpu.yaml`.

* Entry point (`examples/offline_inference/hunyuan_image3/end2end.py`)
  - Unified `build_prompt()` for all modalities using the Instruct chat
    template (`<|startoftext|>{sys}\n\nUser: [<img>]{q}\n\nAssistant:
    [trigger]`); removes the earlier pretrain-vs-instruct split that
    silently drifted from the model's training distribution.
  - New `img2img` / `img2text` branches plumb multi-modal data and
    `use_system_prompt` through to both stages.

* Stage configs
  - Adds `hunyuan_image3_it2i_kv_reuse.yaml` with `need_send_cache` on
    stage-0 (AR), `need_recv_cache` on stage-1 (DiT), the CFG companion
    expand/collect hooks, and `requires_multimodal_data=true` so the
    source image is forwarded to the DiT for VAE conditioning.
  - Updates existing `hunyuan_image3_{i2t,it2i,t2t,moe}.yaml` to declare
    the new `need_send_cache` / `need_recv_cache` fields so the
    non-reuse paths stay consistent with the transport layer changes.

* Transport
  - `kv_transfer_manager.py` exposes the per-request receive call used by
    `collect_cfg_kv_caches`.
  - `mooncake_transfer_engine_connector.py` small adjustments for the
    cross-node KV-reuse path.

* Worker / misc
  - `diffusion_worker.py`: disable cuDNN at device-init time to work
    around `CUDNN_STATUS_NOT_INITIALIZED` on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to PyTorch native impl.
  - `rope.py`: guard the optional `flash_attn.ops.triton.rotary` import
    so an ABI-incompatible flash-attn install does not break startup.

Validation
----------

Hardware: 4xNVIDIA L20X (143 GB), driver 570.133.20, TP=2 per stage.
Prompt / image: official `assets/demo_instruct_imgs/input_0_0.png` with
the "新年宠物海报" prompt from `run_demo_instruct.sh`; seed 42,
50 inference steps, guidance 5.0.

No-reuse baseline:
  - `[ar2diffusion] Request 0: AR generated 424 tokens, text length=749`
    (was `text length=0` before the tokenizer-fallback fix)
  - DiT denoise           56.9 s
  - Image saved           1216x832, reflects both the input image and
                          the edit prompt (no more "image ignored").

KV-reuse (`hunyuan_image3_it2i_kv_reuse.yaml`):
  - CFG companion KV      407 MB transferred (1019 MB/s)
  - Primary KV            445 MB transferred (1039 MB/s)
  - `L_pos=6793, L_neg=6214` (was `L_neg=1` before the CFG fix)
  - DiT denoise+inject    27.7 s  (2.05x speed-up vs baseline denoise)
  - PSNR vs baseline      9.71 dB, MAE 53.1, |diff|<=50 on 66.6% pixels;
                          matches the residue seen in the reference
                          implementation where KV-reuse is reported as
                          visually indistinguishable from the non-reuse
                          path.

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 23, 2026
…T2I)

Enables the AR (language) stage to share its prefilled text KV cache with
the DiT (diffusion) stage, so the DiT no longer re-encodes the prompt from
scratch.  End-to-end denoise time for a 1216x832 IT2I request drops from
~57 s to ~27 s on 4xL20X (TP=2 for both stages) while preserving the
image quality of the non-reuse path.

What's in this change
---------------------

* Diffusion pipeline (`pipeline_hunyuan_image3.py`)
  - New `_forward_with_kv_reuse()` path that injects AR-produced K/V into
    each layer's `ImageKVCacheManager` and runs every denoising step as a
    non-first step (`kv_injected=True` / `first_step=False`).
  - Builds the correct `[BOS|sys|user|cot] + [<boi>|<img_size>|<img_ratio>]
    + [<timestep>|img*N] + [<eoi>]` token layout; zero-pads the three DiT
    special tokens the AR doesn't emit and masks them in attention.
  - Reads `sequence_template` from `generation_config` (defaults to
    `"instruct"` for HunyuanImage-3.0-Instruct) instead of hard-coding
    `"pretrain"`, matching the checkpoint's training distribution.

* Transformer (`hunyuan_image3_transformer.py`)
  - `ImageKVCacheManager.inject_prompt_kv_cache()` prepares
    `image_kv_cache_map` in the exact layout `_save_image_kv_caches()`
    produces, including pos/neg branches, special-token pad and eoi slot,
    so `_update_image_kv_caches()` works unchanged on subsequent steps.
  - `forward(..., kv_injected=...)` propagates the flag to every layer's
    attention call.

* CFG companion (`stage_input_processors/hunyuan_image3.py`)
  - `expand_cfg_prompts()` now mirrors the positive prompt's structure
    (same system prompt, same image, same assistant/trigger) with the
    user text replaced by `<cfg>`.  This fixes the `L_pos=6833 / L_neg=1`
    degeneracy that produced visibly degraded images (PSNR 6.5 dB); with
    the fix the KV-reuse output closely matches the non-reuse baseline
    (PSNR ~9.7 dB, consistent with the residue seen in the official
    reference implementation).
  - `collect_cfg_kv_caches()` retrieves the companion KV via
    `OmniKVTransferManager` and attaches it as
    `sampling_params.cfg_text_past_key_values`.
  - `ar2diffusion()` forwards `ar_generated_text` plus user metadata
    (system prompt, height/width, multi-modal data) to the DiT, and now
    lazily decodes the AR tokens via `AutoTokenizer` when `detokenize:
    false` on the AR stage leaves `output.text` empty.  Without this
    fallback the DiT silently received an empty CoT string and dropped
    the image conditioning entirely — the "text length=0, image ignored"
    symptom reported on vllm-project#2590 for `it2i_inference.py` +
    `hunyuan_it2i_4gpu.yaml`.

* Entry point (`examples/offline_inference/hunyuan_image3/end2end.py`)
  - Unified `build_prompt()` for all modalities using the Instruct chat
    template (`<|startoftext|>{sys}\n\nUser: [<img>]{q}\n\nAssistant:
    [trigger]`); removes the earlier pretrain-vs-instruct split that
    silently drifted from the model's training distribution.
  - New `img2img` / `img2text` branches plumb multi-modal data and
    `use_system_prompt` through to both stages.

* Stage configs
  - Adds `hunyuan_image3_it2i_kv_reuse.yaml` with `need_send_cache` on
    stage-0 (AR), `need_recv_cache` on stage-1 (DiT), the CFG companion
    expand/collect hooks, and `requires_multimodal_data=true` so the
    source image is forwarded to the DiT for VAE conditioning.
  - Updates existing `hunyuan_image3_{i2t,it2i,t2t,moe}.yaml` to declare
    the new `need_send_cache` / `need_recv_cache` fields so the
    non-reuse paths stay consistent with the transport layer changes.

* Transport
  - `kv_transfer_manager.py` exposes the per-request receive call used by
    `collect_cfg_kv_caches`.
  - `mooncake_transfer_engine_connector.py` small adjustments for the
    cross-node KV-reuse path.

* Worker / misc
  - `diffusion_worker.py`: disable cuDNN at device-init time to work
    around `CUDNN_STATUS_NOT_INITIALIZED` on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to PyTorch native impl.
  - `rope.py`: guard the optional `flash_attn.ops.triton.rotary` import
    so an ABI-incompatible flash-attn install does not break startup.

Validation
----------

Hardware: 4xNVIDIA L20X (143 GB), driver 570.133.20, TP=2 per stage.
Prompt / image: official `assets/demo_instruct_imgs/input_0_0.png` with
the "新年宠物海报" prompt from `run_demo_instruct.sh`; seed 42,
50 inference steps, guidance 5.0.

No-reuse baseline:
  - `[ar2diffusion] Request 0: AR generated 424 tokens, text length=749`
    (was `text length=0` before the tokenizer-fallback fix)
  - DiT denoise           56.9 s
  - Image saved           1216x832, reflects both the input image and
                          the edit prompt (no more "image ignored").

KV-reuse (`hunyuan_image3_it2i_kv_reuse.yaml`):
  - CFG companion KV      407 MB transferred (1019 MB/s)
  - Primary KV            445 MB transferred (1039 MB/s)
  - `L_pos=6793, L_neg=6214` (was `L_neg=1` before the CFG fix)
  - DiT denoise+inject    27.7 s  (2.05x speed-up vs baseline denoise)
  - PSNR vs baseline      9.71 dB, MAE 53.1, |diff|<=50 on 66.6% pixels;
                          matches the residue seen in the reference
                          implementation where KV-reuse is reported as
                          visually indistinguishable from the non-reuse
                          path.

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 23, 2026
…T2I)

Enables the AR (language) stage to share its prefilled text KV cache with
the DiT (diffusion) stage, so the DiT no longer re-encodes the prompt from
scratch.  End-to-end denoise time for a 1216x832 IT2I request drops from
~57 s to ~27 s on 4xL20X (TP=2 for both stages) while preserving the
image quality of the non-reuse path.

What's in this change
---------------------

* Diffusion pipeline (`pipeline_hunyuan_image3.py`)
  - New `_forward_with_kv_reuse()` path that injects AR-produced K/V into
    each layer's `ImageKVCacheManager` and runs every denoising step as a
    non-first step (`kv_injected=True` / `first_step=False`).
  - Builds the correct `[BOS|sys|user|cot] + [<boi>|<img_size>|<img_ratio>]
    + [<timestep>|img*N] + [<eoi>]` token layout; zero-pads the three DiT
    special tokens the AR doesn't emit and masks them in attention.
  - Reads `sequence_template` from `generation_config` (defaults to
    `"instruct"` for HunyuanImage-3.0-Instruct) instead of hard-coding
    `"pretrain"`, matching the checkpoint's training distribution.

* Transformer (`hunyuan_image3_transformer.py`)
  - `ImageKVCacheManager.inject_prompt_kv_cache()` prepares
    `image_kv_cache_map` in the exact layout `_save_image_kv_caches()`
    produces, including pos/neg branches, special-token pad and eoi slot,
    so `_update_image_kv_caches()` works unchanged on subsequent steps.
  - `forward(..., kv_injected=...)` propagates the flag to every layer's
    attention call.

* CFG companion (`stage_input_processors/hunyuan_image3.py`)
  - `expand_cfg_prompts()` now mirrors the positive prompt's structure
    (same system prompt, same image, same assistant/trigger) with the
    user text replaced by `<cfg>`.  This fixes the `L_pos=6833 / L_neg=1`
    degeneracy that produced visibly degraded images (PSNR 6.5 dB); with
    the fix the KV-reuse output closely matches the non-reuse baseline
    (PSNR ~9.7 dB, consistent with the residue seen in the official
    reference implementation).
  - `collect_cfg_kv_caches()` retrieves the companion KV via
    `OmniKVTransferManager` and attaches it as
    `sampling_params.cfg_text_past_key_values`.
  - `ar2diffusion()` forwards `ar_generated_text` plus user metadata
    (system prompt, height/width, multi-modal data) to the DiT, and now
    lazily decodes the AR tokens via `AutoTokenizer` when `detokenize:
    false` on the AR stage leaves `output.text` empty.  Without this
    fallback the DiT silently received an empty CoT string and dropped
    the image conditioning entirely — the "text length=0, image ignored"
    symptom reported on vllm-project#2590 for `it2i_inference.py` +
    `hunyuan_it2i_4gpu.yaml`.

* Entry point (`examples/offline_inference/hunyuan_image3/end2end.py`)
  - Unified `build_prompt()` for all modalities using the Instruct chat
    template (`<|startoftext|>{sys}\n\nUser: [<img>]{q}\n\nAssistant:
    [trigger]`); removes the earlier pretrain-vs-instruct split that
    silently drifted from the model's training distribution.
  - New `img2img` / `img2text` branches plumb multi-modal data and
    `use_system_prompt` through to both stages.

* Stage configs
  - Adds `hunyuan_image3_it2i_kv_reuse.yaml` with `need_send_cache` on
    stage-0 (AR), `need_recv_cache` on stage-1 (DiT), the CFG companion
    expand/collect hooks, and `requires_multimodal_data=true` so the
    source image is forwarded to the DiT for VAE conditioning.
  - Updates existing `hunyuan_image3_{i2t,it2i,t2t,moe}.yaml` to declare
    the new `need_send_cache` / `need_recv_cache` fields so the
    non-reuse paths stay consistent with the transport layer changes.

* Transport
  - `kv_transfer_manager.py` exposes the per-request receive call used by
    `collect_cfg_kv_caches`.
  - `mooncake_transfer_engine_connector.py` small adjustments for the
    cross-node KV-reuse path.

* Worker / misc
  - `diffusion_worker.py`: disable cuDNN at device-init time to work
    around `CUDNN_STATUS_NOT_INITIALIZED` on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to PyTorch native impl.
  - `rope.py`: guard the optional `flash_attn.ops.triton.rotary` import
    so an ABI-incompatible flash-attn install does not break startup.

Validation
----------

Hardware: 4xNVIDIA L20X (143 GB), driver 570.133.20, TP=2 per stage.
Prompt / image: official `assets/demo_instruct_imgs/input_0_0.png` with
the "新年宠物海报" prompt from `run_demo_instruct.sh`; seed 42,
50 inference steps, guidance 5.0.

No-reuse baseline:
  - `[ar2diffusion] Request 0: AR generated 424 tokens, text length=749`
    (was `text length=0` before the tokenizer-fallback fix)
  - DiT denoise           56.9 s
  - Image saved           1216x832, reflects both the input image and
                          the edit prompt (no more "image ignored").

KV-reuse (`hunyuan_image3_it2i_kv_reuse.yaml`):
  - CFG companion KV      407 MB transferred (1019 MB/s)
  - Primary KV            445 MB transferred (1039 MB/s)
  - `L_pos=6793, L_neg=6214` (was `L_neg=1` before the CFG fix)
  - DiT denoise+inject    27.7 s  (2.05x speed-up vs baseline denoise)
  - PSNR vs baseline      9.71 dB, MAE 53.1, |diff|<=50 on 66.6% of pixels;
                          matches the residue seen in the reference
                          implementation where KV-reuse is reported as
                          visually indistinguishable from the non-reuse
                          path.
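
The similarity numbers above can be reproduced with a small helper; the `tol=50` threshold matches the reported `|diff|<=50` metric, everything else is standard:

```python
import numpy as np

def compare_images(a: np.ndarray, b: np.ndarray, tol: int = 50):
    """Return (psnr_db, mae, fraction of pixels with |diff| <= tol)
    for two uint8 images of identical shape."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    diff = np.abs(a - b)
    mse = float(np.mean(diff ** 2))
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
    return psnr, float(diff.mean()), float((diff <= tol).mean())
```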

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 23, 2026
…T2I)

@kechengliu97 deleted the ar-reuse branch April 24, 2026 01:37
kechengliu97 added a commit to kechengliu97/vllm-omni that referenced this pull request Apr 25, 2026
Adds the image-to-image (image + text edit instruction -> edited image)
path for HunyuanImage-3.0-Instruct:

* Diffusion pipeline (pipeline_hunyuan_image3.py)
  - Source-image VAE + ViT encode via _encode_cond_image().
  - img2img forward path: threads batch_cond_image_info through
    prepare_model_inputs() so cond_vae_images/cond_vit_images/vit_kwargs
    reach the denoiser.
  - Unified helpers for PIL / tensor / path image loading and joint
    image-info serialisation.
  - Switch to the 'instruct' sequence_template (read from
    generation_config) instead of the hard-coded 'pretrain', so the AR
    text prefix matches the checkpoint's training distribution.

* Transformer (hunyuan_image3_transformer.py)
  - LightProjector.forward() for the ViT aligner used in IT2I
    source-image conditioning.

* Stage input processor (stage_input_processors/hunyuan_image3.py)
  - ar2diffusion() bridges AR output -> DiT input: forwards
    ar_generated_text (with AutoTokenizer fallback when detokenize=false
    on the AR stage), multi_modal_data, use_system_prompt and sampling
    params. Fixes vllm-project#2590 'IT2I model ignores image' regression.

* Entry point (examples/offline_inference/hunyuan_image3/end2end.py)
  - Unified build_prompt() using the Instruct chat template for all
    tasks and modalities.
  - New img2img / img2text branches plumb multi_modal_data and
    use_system_prompt through to both stages.

* Stage configs
  - hunyuan_image3_{i2t,it2i,t2t}.yaml: declare runtime defaults and
    per-edge window_size/max_inflight for serial AR -> DiT execution.
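
A hedged sketch of the per-edge flow-control block; the `window_size` / `max_inflight` field names come from the bullet above, while the edge layout and stage names are illustrative:

```yaml
# Illustrative fragment only.
edges:
  - from: ar
    to: dit
    window_size: 1    # serial AR -> DiT execution: one request at a time
    max_inflight: 1
```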

* Worker / misc
  - diffusion_worker.py: disable cuDNN at device-init time to work
    around CUDNN_STATUS_NOT_INITIALIZED on certain driver / cuDNN
    combinations; VAE 3D convolutions fall back to the PyTorch native
    implementation.
  - rope.py: guard the optional flash_attn.ops.triton.rotary import so
    an ABI-incompatible flash-attn install does not break startup.

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Labels

high priority (high priority issue, needs to be done asap), ready (label to trigger buildkite CI)

10 participants