
Feat/Add HunyuanImage-3.0-Instruct ar part support:#2713

Merged
hsliuustc0106 merged 6 commits into vllm-project:main from TaffyOfficial:feat/hunyuan-image3-model
Apr 16, 2026

Conversation

@TaffyOfficial (Contributor) commented Apr 13, 2026

Purpose

HunyuanImage-3.0-Instruct is a multimodal model by Tencent that supports image understanding (I2T), image editing (IT2I), and text-to-image generation (T2I). The initial model registration was added in #759, but it included only the basic model files and a single T2I diffusion config. The AR side was missing critical runtime logic: sampling-based decoding produced empty outputs, image tokens used the wrong attention type, and there were no stage configs for the I2T/IT2I/T2T pipelines.

This PR fills those gaps to make the AR side actually functional.

Depends on #2712 (rename fix).

Custom sampler with logits processors

The official HunyuanImage-3.0 AR model generates tokens in a fixed internal sequence: <think></think><recaption></recaption><boi> → size token → ratio token. This is enforced by two logits processors inside the official generate_image():

  • _StageTransitionLogitsProcessor: checks which phase the model is in and forces the next special token at phase boundaries (e.g. after </think>, force <recaption>)
  • _ConditionalSliceVocabLogitsProcessor: after a size token is emitted, masks the vocabulary to only allow ratio tokens

Without these processors, sampling-based decoding (temperature=0.6, top_k=1024, top_p=0.95) breaks immediately — the model samples </answer> or <|endoftext|> as the first token and outputs an empty string. Greedy decoding happened to work by luck but doesn't match official behavior.

This PR ports both processors into HunyuanImage3ForConditionalGeneration.sample() with prefer_model_sampler=True, so vLLM-Omni's standard sampling pipeline calls our custom sampler before applying temperature/top_k/top_p. Also adds ratio-token EOS forcing: once a ratio token is selected, the next token is forced to EOS, matching the official behavior where all ratio tokens are final_stop_tokens.
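A minimal sketch of the stage-transition forcing, using hypothetical token IDs and a simplified phase table (the real implementation lives in HunyuanImage3ForConditionalGeneration.sample() and reads the IDs from the tokenizer):

```python
import torch

# Hypothetical special-token IDs for illustration only; the real values
# come from the HunyuanImage-3.0 tokenizer.
THINK_END, RECAPTION_START, RECAPTION_END, BOI = 1001, 1002, 1003, 1004

# Phase boundaries: when the last emitted token is the key, force the value.
FORCED_NEXT = {
    THINK_END: RECAPTION_START,   # </think>     -> <recaption>
    RECAPTION_END: BOI,           # </recaption> -> <boi>
}

def force_stage_transitions(logits: torch.Tensor, last_token: int) -> torch.Tensor:
    """At a phase boundary, mask the whole vocab except the forced token."""
    forced = FORCED_NEXT.get(last_token)
    if forced is None:
        return logits
    masked = torch.full_like(logits, float("-inf"))
    masked[forced] = 0.0  # survives any later temperature/top_k/top_p
    return masked
```

Because the mask is applied before vLLM-Omni's standard sampling step, the forced token wins regardless of the sampling parameters.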

Bidirectional attention for image tokens

HunyuanImage-3.0 uses bidirectional (non-causal) attention for image token positions — text tokens remain causal. This is controlled by cond_token_attn_type: "joint_full" in the HF config. vLLM-Omni routes this through ModelConfig.is_mm_prefix_lm, but hunyuan_image_3_moe was not in the allowlist. This PR patches is_mm_prefix_lm to include it. Without this fix, image tokens only attend to preceding tokens, which degrades image understanding quality.

The patch also fixes a cached_property.__get__ crash in vllm 0.19.0+ pydantic dataclasses by using __dict__ access instead of attribute access.
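A sketch of the patching approach on a stand-in class (names and structure are illustrative, not the actual vLLM internals):

```python
from functools import cached_property
from types import SimpleNamespace

class ModelConfig:  # stand-in for the real vLLM class
    def __init__(self, model_type: str):
        self.hf_config = SimpleNamespace(model_type=model_type)

    @cached_property
    def is_mm_prefix_lm(self) -> bool:
        return False

# Grab the original descriptor via __dict__; plain attribute access on
# vllm 0.19.0+ pydantic dataclasses can hit cached_property.__get__ and crash.
_orig_func = ModelConfig.__dict__["is_mm_prefix_lm"].func

def _patched(self) -> bool:
    if getattr(self.hf_config, "model_type", "") == "hunyuan_image_3_moe":
        return True
    return _orig_func(self)

_patched_cp = cached_property(_patched)
# __set_name__ only runs automatically inside a class body, so call it by
# hand when installing the descriptor after class creation.
_patched_cp.__set_name__(ModelConfig, "is_mm_prefix_lm")
ModelConfig.is_mm_prefix_lm = _patched_cp

assert ModelConfig("hunyuan_image_3_moe").is_mm_prefix_lm
```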

Stage configs

  • hunyuan_image3_i2t.yaml: Image-to-Text — single LLM stage, TP4, gpu_memory_utilization=0.95
  • hunyuan_image3_t2t.yaml: Text-to-Text — single LLM stage for pure text generation

For IT2I, a test script has been added, but full end-to-end validation is not yet possible. The current DiT side does not fully consume or validate the AR-produced content path yet, so this PR only lands the bridging logic and test scaffolding for follow-up integration.

AR→DiT bridge for IT2I

stage_input_processors/hunyuan_image3.py provides ar2diffusion(): takes the AR stage's output (latent token IDs + prompt text), decodes the token IDs back to continuous latent vectors via the AR model's embedding table, concatenates with text embeddings from the tokenizer, and packages everything into the format DiT expects.
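A rough sketch of the bridge's shape; the function exists as described, but the field names and exact packaging below are assumptions for illustration:

```python
import torch

def ar2diffusion(ar_output, ar_model, tokenizer) -> dict:
    """Convert AR-stage output into the input dict the DiT stage expects (sketch)."""
    # 1. Latent token IDs -> continuous latent vectors, looked up in the
    #    AR model's embedding table.
    latent_ids = torch.tensor(ar_output.latent_token_ids)
    image_latents = ar_model.get_input_embeddings()(latent_ids)

    # 2. Text embeddings for the prompt, via the tokenizer + embedding table.
    text_ids = tokenizer(ar_output.prompt_text, return_tensors="pt").input_ids[0]
    text_embeds = ar_model.get_input_embeddings()(text_ids)

    # 3. Concatenate and package in the format the DiT stage consumes.
    return {
        "condition_embeds": torch.cat([text_embeds, image_latents], dim=0),
        "prompt": ar_output.prompt_text,
    }
```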

I2T output alignment with official HF model

Verified vLLM-Omni I2T output against the official tencent/HunyuanImage-3.0-Instruct HF model (greedy decoding, 4×H800, bf16, SDPA):

  • prompt_utils.py: aligned build_prompt with the official instruct template (\n\nUser: ... \n\nAssistant: format, trigger_tag after Assistant; sketched after this list)
  • hunyuan_image3.py: removed <answer>/</answer> from blocked_token_ids in comprehension mode — the model naturally generates these, blocking them breaks output
  • i2t.yaml / t2t.yaml: temperature 0.6→0.0 (greedy), added 128026 (</answer>) to stop_token_ids
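
A minimal sketch of the aligned prompt construction, assuming build_prompt takes the user text and an optional trigger tag (the authoritative template lives in prompt_utils.py):

```python
def build_prompt(user_text: str, trigger_tag: str = "") -> str:
    """Official HunyuanImage-3.0-Instruct chat template (sketch).

    The trigger tag goes immediately after "Assistant:", matching the
    official instruct format.
    """
    return f"\n\nUser: {user_text}\n\nAssistant:{trigger_tag}"
```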

Test Plan

GPU: 4 × H800 (80GB), TP4

pytest tests/e2e/offline_inference/test_hunyuanimage3_i2t.py -v -m advanced_model
pytest tests/e2e/offline_inference/test_hunyuanimage3_t2i.py -v -m advanced_model

Note: test files are in the follow-up PR. Tests were run against the full stack on the GPU server.


Accuracy Verification

The official HunyuanImage-3.0-Instruct model itself is non-deterministic across processes under bf16 multi-GPU inference. We established this baseline first to set realistic expectations for vLLM-Omni alignment.

Official HF model cross-process baseline

Same code, same device_map, two separate runs:

| Scenario | Result |
| --- | --- |
| Intra-process (two consecutive model.generate calls) | 466 tokens, 100% match |
| Cross-process (model reload, fixed device_map) | baseline 638 tokens vs. verify 458 tokens, first 34 match (~5% agreement) |

Root cause: bf16 NCCL all-reduce floating-point accumulation order is non-deterministic across processes; greedy argmax amplifies tiny numerical differences into token-level divergence.
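A toy illustration of the amplification (not the model code): with two near-tied logits, a tiny reduction-order difference in the accumulated values is enough to flip the greedy argmax, and every subsequent step then conditions on the diverged prefix.

```python
import torch

logits_a = torch.tensor([10.0000, 9.9999])          # run 1
logits_b = logits_a + torch.tensor([0.0, 0.0002])   # run 2: tiny accumulation skew

print(torch.argmax(logits_a).item())  # 0
print(torch.argmax(logits_b).item())  # 1 -- a different token; the sequences
                                      # diverge from here on
```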

vLLM-Omni vs official HF

I2T (Image-to-Text):

  • input_ids: exact match (6364 tokens)
  • Output: semantically correct, structurally coherent (accurately describes image content)
  • First 30+ output tokens match the official model

T2I AR (Text-to-Image, AR stage only):

  • Official: 908 tokens, vLLM-Omni: 920 tokens
  • First 121 tokens identical, then diverge (187/908 = 20.6% overall match)
  • This is significantly better than the official model's own cross-process reproducibility (~5%)

Alignment criteria

  1. input_ids exactly match the official model
  2. First 30+ output tokens match
  3. Output is semantically correct and structurally coherent
  4. Token-level exact match is not expected — this is inherent to bf16 multi-GPU inference, not a code bug

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@TaffyOfficial TaffyOfficial changed the title Feat/hunyuan image3 model 【wip】Feat/Add HunyuanImage-3.0-Instruct model support: Apr 13, 2026
@hsliuustc0106 (Collaborator)

WIP PR marked with the 【wip】 prefix. Preliminary comments only:

  1. Pre-commit and build checks not visible in status rollup. Ensure these pass before requesting full review.

  2. Test plan mentions 'test files are in follow-up PR'. For a feature PR adding model support, tests should be in the same PR unless there's a specific reason for splitting.

  3. 994 LOC across 21 files is substantial. Consider running L3 tests locally and adding results once WIP status is removed:

https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/test_guide/#l3-level--l4-level

Full review when WIP status removed and gates pass.

@TaffyOfficial TaffyOfficial changed the title 【wip】Feat/Add HunyuanImage-3.0-Instruct model support: Feat/Add HunyuanImage-3.0-Instruct model support: Apr 13, 2026
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-model branch 2 times, most recently from bbe22a2 to 98f58dc on April 13, 2026 at 11:49
@TaffyOfficial TaffyOfficial changed the title Feat/Add HunyuanImage-3.0-Instruct model support: 【wip】Feat/Add HunyuanImage-3.0-Instruct model support: Apr 15, 2026
@TaffyOfficial TaffyOfficial changed the title 【wip】Feat/Add HunyuanImage-3.0-Instruct model support: 【wip】Feat/Add HunyuanImage-3.0-Instruct ar part support: Apr 15, 2026
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-model branch from a2b8592 to 08b4274 on April 15, 2026 at 04:21
@TaffyOfficial TaffyOfficial changed the title 【wip】Feat/Add HunyuanImage-3.0-Instruct ar part support: Feat/Add HunyuanImage-3.0-Instruct ar part support: Apr 15, 2026
…, stage configs

Add custom sampler with logits processors for AR stage transitions.
Ports official _StageTransitionLogitsProcessor and
_ConditionalSliceVocabLogitsProcessor into sample() with
prefer_model_sampler=True, enabling sampling-based decoding
(temperature=0.6, top_k=1024, top_p=0.95) with correct
think→recaption→ratio stage transitions.

- hunyuan_image3.py: custom sample() with stage transition, ratio
  restriction, comprehension token blocking, ratio EOS forcing
- patch.py: extend is_mm_prefix_lm for bidirectional attention on
  image tokens (hunyuan_image_3_moe model type). Use __dict__ access
  for cached_property compat with vllm 0.19.0+ pydantic dataclasses
- Stage configs: hunyuan_image3_i2t.yaml (single LLM, TP4),
  hunyuan_image3_it2i.yaml (2-stage AR→DiT), hunyuan_image3_t2t.yaml
- stage_input_processors/hunyuan_image3.py: ar2diffusion() bridge
- Delete hunyuan_image3_moe.yaml (replaced by split per-task configs)
- Update test_hunyuanimage3_text2img.py to use hunyuan_image3_t2i.yaml

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-model branch 2 times, most recently from bbbae43 to 7efb959 on April 15, 2026 at 07:44
- build_prompt: add instruct template (\n\nUser: ...\n\nAssistant: )
- hunyuan_image3.py: unblock <answer>/</answer> tokens so the model can
  follow its natural generation pattern
- i2t/t2t YAML: temperature=0.0 for greedy decoding, add </answer>
  (128026) to stop_token_ids

Verified on 4xH800: input_ids match official baseline exactly (6364
tokens), greedy output is self-consistent within same process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-model branch from 7efb959 to 2b7e4ab on April 15, 2026 at 07:56
@hsliuustc0106 (Collaborator)
@nussejzz PTAL

@hsliuustc0106 (Collaborator) left a comment:

Missing hunyuan_image3_t2i.yaml -- the test change references a file that doesn't exist in this PR.

LOCAL_CLIP_PATH = "openai/clip-vit-base-patch32"
REPO_ROOT = Path(__file__).resolve().parents[3]
- STAGE_CONFIG_PATH = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "hunyuan_image3_moe.yaml"
+ STAGE_CONFIG_PATH = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "hunyuan_image3_t2i.yaml"
Collaborator:

hunyuan_image3_t2i.yaml doesn't exist in this PR. The deleted hunyuan_image3_moe.yaml is replaced by i2t/it2i/t2t configs, but there's no t2i config for pure text-to-image. This test will fail with FileNotFoundError.

Author (@TaffyOfficial), Apr 15, 2026:

This file has already been merged via #2712.

logits[req_idx].fill_(min_score)
logits[req_idx, max_id] = 0

def _clear_transition_state(self, req_idx: int) -> None:
Collaborator:

_clear_transition_state is defined but never called. With max_num_seqs=1 this is harmless (req_idx=0 gets reused), but it will leak entries if batching is ever enabled. Can you hook it into the request-finish path?

Author:

I've removed _transition_state and made the phase-transition logic stateless, so there's no per-request state to clean up when a request finishes. The next forced token is derived at each step from the decoded token history, as sketched below.
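
A stateless sketch of that derivation, with illustrative token IDs (the real trigger table uses the HunyuanImage tokenizer's special tokens):

```python
# Illustrative trigger -> forced-sequence table; real IDs come from the tokenizer.
FORCED_SEQUENCES: dict[int, list[int]] = {
    1001: [1002],  # </think>     -> <recaption>
    1003: [1004],  # </recaption> -> <boi>
}

def get_forced_token(decoded_tokens: list[int]) -> int | None:
    """Derive the next forced token purely from history; no per-request state.

    Scan backwards for the most recent trigger; if the tokens emitted since
    then still prefix-match its forced sequence, return the next entry.
    """
    for i in range(len(decoded_tokens) - 1, -1, -1):
        seq = FORCED_SEQUENCES.get(decoded_tokens[i])
        if seq is None:
            continue
        emitted = decoded_tokens[i + 1:]
        if emitted == seq[:len(emitted)] and len(emitted) < len(seq):
            return seq[len(emitted)]
        return None  # sequence complete, or history diverged: stop forcing
    return None
```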

]

self._sampler: Sampler | None = None
self._eos_token_id: int = 127957 # <|endoftext|>
Collaborator:

Hardcoded EOS token ID. Should this come from tokenizer.eos_token_id or the HF config? If the tokenizer changes this will silently break.

Author:

I switched this to tokenizer.eos_token_id

Comment thread: vllm_omni/patch.py (outdated)
_orig_cp = _OriginalModelConfig.__dict__.get("is_mm_prefix_lm")
if _orig_cp is not _patched_cp:
    # Our assignment above should have replaced it, but just in case
    pass
Collaborator:

Dead code -- _patched_cp was already assigned above, so this branch is never taken. Remove it.

Author:

Removed now.

TaffyOfficial added 2 commits April 15, 2026 17:18
…zer eos_token_id, hook _clear_transition_state

Signed-off-by: TaffyOfficial <2324465096@qq.com>
… devices, harden patch.py comments

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-model branch from 49002a7 to e1d7bab on April 15, 2026 at 10:11
@TaffyOfficial (Author)

@hsliuustc0106 updated now.

@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 15, 2026
@Kyr1e666

Nice work! When will this PR be merged?

@hsliuustc0106 hsliuustc0106 added the "ready label to trigger buildkite CI" label Apr 16, 2026
images = mm_data.get("images")
if images:
    pil_image = images[0] if isinstance(images, list) else images
if pil_image is not None:
Contributor:

Can we use else here directly?

Author:

This follows the same pattern as glm_image.py (L249-256). The multimodal data may arrive as "image" (a single PIL Image) or "images" (a list), depending on how the input was constructed; the fallback handles both formats. We can't use a simple else here because pil_image may come from either source, and the final guard covers both paths. A sketch of the logic follows below.
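
A sketch of the fallback, assuming mm_data may carry either key (mirroring the glm_image.py pattern):

```python
from PIL import Image

def extract_pil_image(mm_data: dict) -> Image.Image | None:
    """Accept both input shapes: "image" (single PIL Image) or "images" (list)."""
    pil_image = mm_data.get("image")           # single-image form
    if pil_image is None:
        images = mm_data.get("images")         # list form
        if images:
            pil_image = images[0] if isinstance(images, list) else images
    return pil_image  # the caller still guards against None
```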

…ne_args in it2i.yaml

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@hsliuustc0106 (Collaborator) left a comment:

Missing tests — regression tests for the core AR sampler logic (stage transitions, ratio restriction, comprehension blocking) should ship with this PR, not a follow-up.


for req_idx in range(logits.shape[0]):
    decoded_tokens: list[int] = (
        sampling_metadata.output_token_ids[req_idx] if req_idx < len(sampling_metadata.output_token_ids) else []
Collaborator:

sample() loops per-request over logits.shape[0] with in-place mutation. Fine with max_num_seqs: 1 (which all YAML configs use), but the method signature implies batch support it doesn't correctly handle. Add an assertion or document the constraint.

Author:

Added assert logits.shape[0] == 1 at the top of sample(). All stage configs enforce max_num_seqs: 1; this makes the constraint explicit and fails loudly if violated.

    or history has diverged from the expected forced sequence.
    """
    for i in range(len(decoded_tokens) - 1, -1, -1):
        trigger = decoded_tokens[i]
Collaborator:

_get_forced_token scans all decoded tokens backwards every step — O(n²) across decode steps. Acceptable now given short generation lengths, but track for optimization if sequence lengths grow.

Author:

Acknowledged. Generation lengths for this model are bounded (~900 tokens for T2I AR, ~2048 for I2T), so the scan cost is negligible in practice. Added a note in the docstring. If sequence lengths grow significantly, we can cache the last trigger position.

Comment thread: vllm_omni/patch.py
return True
model_type = getattr(self.hf_config, "model_type", "")
return model_type in _OMNI_MM_PREFIX_LM_MODELS

Collaborator:

If this __set_name__ dance fails silently (e.g., vLLM changes the descriptor), the model falls back to unpatched is_mm_prefix_lm — bidirectional attention breaks with no error. Add a sanity check at import time that the patch is actually active.

Author:

Added an import-time assertion that verifies the patched cached_property is actually installed on ModelConfig. If vLLM changes the descriptor, this will fail at import rather than silently falling back.
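
Roughly like the following, continuing the illustrative names from the patch sketch earlier; the exact check in patch.py may differ:

```python
# Fail fast at import time if the patch did not take effect.
installed = ModelConfig.__dict__.get("is_mm_prefix_lm")
assert installed is _patched_cp, (
    "is_mm_prefix_lm patch not active; vLLM may have changed the descriptor"
)
```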

@TaffyOfficial (Author)

Added tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_sampler.py with regression tests for all core sampler paths:

  • _get_forced_token stage transitions: trigger → forced sequence, partial completion, full completion, diverged history stops forcing, later trigger takes precedence
  • Comprehension mode blocking (I2T/T2T): image-generation tokens (<img_size_*>, ratio tokens) are masked to -inf, text tokens unaffected
  • Ratio restriction: after <img_size_*>, only ratio tokens (<img_ratio_0>–<img_ratio_32> + extras) retain their logits, all other vocab masked
  • Force EOS after ratio: once a ratio token is selected, only EOS is allowed — prevents the ratio token loop that was fixed earlier
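
For flavor, here is the shape of one such regression test, with a minimal stand-in for the sampler's forcing logic (IDs, names, and vocab size are illustrative, not the actual test file):

```python
import torch

RATIO_TOKENS = {2001, 2002}  # fake ratio-token IDs
EOS = 127957
VOCAB = 130000

def force_eos_after_ratio(logits: torch.Tensor, decoded: list[int]) -> torch.Tensor:
    """Stand-in for the sampler path under test."""
    if decoded and decoded[-1] in RATIO_TOKENS:
        masked = torch.full_like(logits, float("-inf"))
        masked[EOS] = 0.0
        return masked
    return logits

def test_force_eos_after_ratio_token():
    masked = force_eos_after_ratio(torch.zeros(VOCAB), decoded=[2001])
    assert masked[EOS].item() == 0.0            # EOS survives
    assert torch.isneginf(masked[:EOS]).all()   # everything else is masked
    assert torch.isneginf(masked[EOS + 1:]).all()
```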

… unit tests

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-model branch from 823d247 to fffba50 on April 16, 2026 at 11:27
@TaffyOfficial (Author)

@hsliuustc0106 updated now.

@hsliuustc0106 hsliuustc0106 merged commit c3ca5da into vllm-project:main Apr 16, 2026
8 checks passed
@Kyr1e666

Hi, can you give an example of HunyuanImage-3.0-Instruct IT2I inference with vllm_omni? Thank you!

@TaffyOfficial (Author)

> Hi, can you give an example of HunyuanImage-3.0-Instruct IT2I inference with vllm_omni? Thank you!

The AR-to-DiT connection hasn't been established yet. We need to wait for #2590 to be merged before the IT2I process can actually proceed.

lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
