Feat/Add HunyuanImage-3.0-Instruct AR part support #2713
hsliuustc0106 merged 6 commits into vllm-project:main
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
WIP PR marked with 【wip】 prefix. Preliminary comments only:
https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/test_guide/#l3-level--l4-level
Full review when WIP status removed and gates pass.
Force-pushed from bbe22a2 to 98f58dc
Force-pushed from a2b8592 to 08b4274
…, stage configs

Add custom sampler with logits processors for AR stage transitions. Ports official _StageTransitionLogitsProcessor and _ConditionalSliceVocabLogitsProcessor into sample() with prefer_model_sampler=True, enabling sampling-based decoding (temperature=0.6, top_k=1024, top_p=0.95) with correct think→recaption→ratio stage transitions.

- hunyuan_image3.py: custom sample() with stage transition, ratio restriction, comprehension token blocking, ratio EOS forcing
- patch.py: extend is_mm_prefix_lm for bidirectional attention on image tokens (hunyuan_image_3_moe model type). Use __dict__ access for cached_property compat with vllm 0.19.0+ pydantic dataclasses
- Stage configs: hunyuan_image3_i2t.yaml (single LLM, TP4), hunyuan_image3_it2i.yaml (2-stage AR→DiT), hunyuan_image3_t2t.yaml
- stage_input_processors/hunyuan_image3.py: ar2diffusion() bridge
- Delete hunyuan_image3_moe.yaml (replaced by split per-task configs)
- Update test_hunyuanimage3_text2img.py to use hunyuan_image3_t2i.yaml

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Force-pushed from bbbae43 to 7efb959
- build_prompt: add instruct template (\n\nUser: ...\n\nAssistant: )
- hunyuan_image3.py: unblock <answer>/</answer> tokens so model can follow its natural generation pattern
- i2t/t2t YAML: temperature=0.0 for greedy decoding, add </answer> (128026) to stop_token_ids

Verified on 4xH800: input_ids match official baseline exactly (6364 tokens), greedy output is self-consistent within same process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Force-pushed from 7efb959 to 2b7e4ab
@nussejzz PTAL
hsliuustc0106 left a comment
Missing hunyuan_image3_t2i.yaml -- the test change references a file that doesn't exist in this PR.
```diff
  LOCAL_CLIP_PATH = "openai/clip-vit-base-patch32"
  REPO_ROOT = Path(__file__).resolve().parents[3]
- STAGE_CONFIG_PATH = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "hunyuan_image3_moe.yaml"
+ STAGE_CONFIG_PATH = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "hunyuan_image3_t2i.yaml"
```
hunyuan_image3_t2i.yaml doesn't exist in this PR. The deleted hunyuan_image3_moe.yaml is replaced by i2t/it2i/t2t configs, but there's no t2i config for pure text-to-image. This test will fail with FileNotFoundError.
This file has already been merged into the library along with #2712.
```python
logits[req_idx].fill_(min_score)
logits[req_idx, max_id] = 0

def _clear_transition_state(self, req_idx: int) -> None:
```
_clear_transition_state is defined but never called. With max_num_seqs=1 this is harmless (req_idx=0 gets reused), but it will leak entries if batching is ever enabled. Can you hook it into the request-finish path?
I have now deleted the transition state and made the phase-transition logic stateless, so there is no need to clean up per-request state when a request completes. The next forced token is derived from the decoded token history at each step.
```python
]

self._sampler: Sampler | None = None
self._eos_token_id: int = 127957  # <|endoftext|>
```
Hardcoded EOS token ID. Should this come from tokenizer.eos_token_id or the HF config? If the tokenizer changes this will silently break.
I switched this to tokenizer.eos_token_id.
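For reference, a minimal sketch of the change, assuming the tokenizer is loaded from the published checkpoint (exact wiring in the merged code may differ):

```python
# Hypothetical sketch: derive EOS from the tokenizer instead of hardcoding 127957.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "tencent/HunyuanImage-3.0-Instruct", trust_remote_code=True
)
eos_token_id = tok.eos_token_id  # tracks the checkpoint's tokenizer config
assert eos_token_id is not None, "tokenizer must define an EOS token"
```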
```python
_orig_cp = _OriginalModelConfig.__dict__.get("is_mm_prefix_lm")
if _orig_cp is not _patched_cp:
    # Our assignment above should have replaced it, but just in case
    pass
```
Dead code -- _patched_cp was already assigned above, so this branch is never taken. Remove it.
…zer eos_token_id, hook _clear_transition_state
Signed-off-by: TaffyOfficial <2324465096@qq.com>
… devices, harden patch.py comments
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Force-pushed from 49002a7 to e1d7bab
@hsliuustc0106 updated now.
Nice work, when will this PR be merged?
```python
images = mm_data.get("images")
if images:
    pil_image = images[0] if isinstance(images, list) else images
if pil_image is not None:
```
Can we use else here directly?
This follows the same pattern as glm_image.py (L249-256). The multimodal data may arrive as "image" (single PIL Image) or "images" (list), depending on how the input was constructed. The fallback handles both formats. We can't use a simple else here because pil_image may come from either source, and the final guard covers both paths.
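A minimal sketch of that fallback, with a hypothetical helper name (the merged code follows glm_image.py rather than this exact function):

```python
from typing import Any
from PIL import Image

def extract_single_image(mm_data: dict[str, Any]) -> Image.Image | None:
    """Accept either mm_data["image"] (single PIL image) or mm_data["images"] (list)."""
    pil_image = mm_data.get("image")        # single-image form
    if pil_image is None:
        images = mm_data.get("images")      # list form
        if images:
            pil_image = images[0] if isinstance(images, list) else images
    # The final guard covers both sources, which is why a plain `else` doesn't fit.
    return pil_image
```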
…ne_args in it2i.yaml
Signed-off-by: TaffyOfficial <2324465096@qq.com>
hsliuustc0106 left a comment
Missing tests — regression tests for the core AR sampler logic (stage transitions, ratio restriction, comprehension blocking) should ship with this PR, not a follow-up.
```python
for req_idx in range(logits.shape[0]):
    decoded_tokens: list[int] = (
        sampling_metadata.output_token_ids[req_idx] if req_idx < len(sampling_metadata.output_token_ids) else []
```
sample() loops per-request over logits.shape[0] with in-place mutation. Fine with max_num_seqs: 1 (which all YAML configs use), but the method signature implies batch support it doesn't correctly handle. Add an assertion or document the constraint.
Added assert logits.shape[0] == 1 at the top of sample(). All stage configs enforce max_num_seqs: 1; this makes the constraint explicit and fails loudly if violated.
```python
    or history has diverged from the expected forced sequence.
    """
    for i in range(len(decoded_tokens) - 1, -1, -1):
        trigger = decoded_tokens[i]
```
_get_forced_token scans all decoded tokens backwards every step — O(n²) across decode steps. Acceptable now given short generation lengths, but track for optimization if sequence lengths grow.
Acknowledged. Generation lengths for this model are bounded (~900 tokens for T2I AR, ~2048 for I2T), so the scan cost is negligible in practice. Added a note in the docstring. If sequence lengths grow significantly we can cache the last trigger position.
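For illustration, a simplified sketch of the stateless backward scan under discussion; the trigger map and token IDs are placeholders, not the model's real vocabulary:

```python
# Hypothetical trigger -> forced-next-token map for the fixed AR stage sequence.
_FORCED_AFTER: dict[int, int] = {
    100: 101,  # e.g. </think>     -> <recaption>
    102: 103,  # e.g. </recaption> -> <boi>
}

def get_forced_token(decoded_tokens: list[int]) -> int | None:
    """Scan history backwards for the most recent stage trigger.

    O(len(decoded_tokens)) per step; generation for this model is short
    (~900 tokens for T2I AR, ~2048 for I2T), so the repeated scan is cheap.
    """
    for i in range(len(decoded_tokens) - 1, -1, -1):
        trigger = decoded_tokens[i]
        if trigger in _FORCED_AFTER:
            forced = _FORCED_AFTER[trigger]
            if forced in decoded_tokens[i + 1:]:
                return None  # transition already happened; nothing to force
            return forced
    return None  # no trigger yet; leave the logits untouched
```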
```python
    return True
model_type = getattr(self.hf_config, "model_type", "")
return model_type in _OMNI_MM_PREFIX_LM_MODELS
```
If this __set_name__ dance fails silently (e.g., vLLM changes the descriptor), the model falls back to unpatched is_mm_prefix_lm — bidirectional attention breaks with no error. Add a sanity check at import time that the patch is actually active.
Added an import-time assertion that verifies the patched cached_property is actually installed on ModelConfig. If vLLM changes the descriptor, this will fail at import rather than silently falling back.
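A rough sketch of what such an import-time check can look like, assuming the patch keeps a module-level reference to the replacement descriptor (all names here are illustrative, not the merged patch.py):

```python
from functools import cached_property
from vllm.config import ModelConfig

def _is_mm_prefix_lm(self) -> bool:
    # Simplified body; the real patch also preserves the original allowlist check.
    return getattr(self.hf_config, "model_type", "") == "hunyuan_image_3_moe"

_patched_cp = cached_property(_is_mm_prefix_lm)
_patched_cp.__set_name__(ModelConfig, "is_mm_prefix_lm")
setattr(ModelConfig, "is_mm_prefix_lm", _patched_cp)

# Import-time sanity check: fail loudly if the descriptor was not installed,
# instead of silently falling back to causal attention for image tokens.
assert ModelConfig.__dict__.get("is_mm_prefix_lm") is _patched_cp, (
    "is_mm_prefix_lm patch is not active on vllm ModelConfig"
)
```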
Added tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_sampler.py with regression tests for all core sampler paths.
… unit tests
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Force-pushed from 823d247 to fffba50
@hsliuustc0106 updated now.
Hi, can you give an example of hunyuan-image3-instruct IT2I inference with vllm_omni? Thank you!
The AR-to-DiT connection hasn't been established yet. We need to wait for #2590 to be merged before the IT2I pipeline can actually proceed.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Purpose
HunyuanImage-3.0-Instruct is a multimodal model by Tencent that supports image understanding (I2T), image editing (IT2I), and text-to-image generation (T2I). The initial model registration was added in #759, but only included the basic model files and a single T2I diffusion config. The AR side was missing critical runtime logic — sampling-based decoding produced empty outputs, image tokens used the wrong attention type, and there were no stage configs for I2T/IT2I/T2T pipelines.
This PR fills those gaps to make the AR side actually functional.
Depends on #2712 (rename fix).
Custom sampler with logits processors
The official HunyuanImage-3.0 AR model generates tokens in a fixed internal sequence:
<think> → </think> → <recaption> → </recaption> → <boi> → size token → ratio token. This is enforced by two logits processors inside the official generate_image():

- _StageTransitionLogitsProcessor: checks which phase the model is in and forces the next special token at phase boundaries (e.g. after </think>, force <recaption>)
- _ConditionalSliceVocabLogitsProcessor: after a size token is emitted, masks the vocabulary to only allow ratio tokens

Without these processors, sampling-based decoding (temperature=0.6, top_k=1024, top_p=0.95) breaks immediately — the model samples
</answer> or <|endoftext|> as the first token and outputs an empty string. Greedy decoding happened to work by luck but doesn't match official behavior.

This PR ports both processors into
HunyuanImage3ForConditionalGeneration.sample() with prefer_model_sampler=True, so vLLM-Omni's standard sampling pipeline calls our custom sampler before applying temperature/top_k/top_p. Also adds ratio-token EOS forcing: once a ratio token is selected, the next token is forced to EOS, matching the official behavior where all ratio tokens are final_stop_tokens.
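To make the two rules concrete, here is a heavily simplified, hypothetical sketch of what they do to the logits of a single request inside a custom sample(); the token IDs and ratio range are placeholders, not the real vocabulary:

```python
import torch

END_THINK, RECAPTION, SIZE_TOKEN = 100, 101, 200      # placeholder special-token IDs
RATIO_TOKEN_IDS = list(range(300, 332))                # placeholder ratio-token range
STAGE_TRANSITIONS = {END_THINK: RECAPTION}             # trigger -> forced next token

def apply_stage_rules(logits: torch.Tensor, decoded: list[int]) -> torch.Tensor:
    """Mutate 1-D logits so the fixed stage sequence is respected."""
    neg_inf = torch.finfo(logits.dtype).min
    if decoded and decoded[-1] in STAGE_TRANSITIONS:
        # Rule 1 (_StageTransitionLogitsProcessor): force the next special token.
        forced = STAGE_TRANSITIONS[decoded[-1]]
        logits.fill_(neg_inf)
        logits[forced] = 0
    elif decoded and decoded[-1] == SIZE_TOKEN:
        # Rule 2 (_ConditionalSliceVocabLogitsProcessor): allow only ratio tokens.
        keep = torch.tensor(RATIO_TOKEN_IDS, device=logits.device)
        mask = torch.full_like(logits, neg_inf)
        mask[keep] = logits[keep]
        logits.copy_(mask)
    return logits
```

Temperature/top_k/top_p are applied afterwards, which is what prefer_model_sampler=True arranges in the standard pipeline.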
Bidirectional attention for image tokens
HunyuanImage-3.0 uses bidirectional (non-causal) attention for image token positions — text tokens remain causal. This is controlled by
cond_token_attn_type: "joint_full" in the HF config. vLLM-Omni routes this through ModelConfig.is_mm_prefix_lm, but hunyuan_image_3_moe was not in the allowlist. This PR patches is_mm_prefix_lm to include it. Without this fix, image tokens only attend to preceding tokens, which degrades image understanding quality.

The patch also fixes a
cached_property.__get__ crash in vllm 0.19.0+ pydantic dataclasses by using __dict__ access instead of attribute access.
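A small sketch of the difference (using the upstream class only for illustration):

```python
from vllm.config import ModelConfig

# Reading the raw descriptor from the class __dict__ bypasses the descriptor
# protocol entirely, so cached_property.__get__ is never invoked:
orig_descriptor = ModelConfig.__dict__.get("is_mm_prefix_lm")

# Plain attribute access goes through the descriptor protocol and, per the
# description above, can crash on vllm 0.19.0+ pydantic dataclasses:
# orig_descriptor = ModelConfig.is_mm_prefix_lm
```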
Stage configs
- hunyuan_image3_i2t.yaml: Image-to-Text — single LLM stage, TP4, gpu_memory_utilization=0.95
- hunyuan_image3_t2t.yaml: Text-to-Text — single LLM stage for pure text generation

For IT2I, a test script has been added, but full end-to-end validation is not yet possible. The current DiT side does not fully consume or validate the AR-produced content path yet, so this PR only lands the bridging logic and test scaffolding for follow-up integration.
AR→DiT bridge for IT2I
stage_input_processors/hunyuan_image3.py provides ar2diffusion(): it takes the AR stage's output (latent token IDs + prompt text), decodes the token IDs back to continuous latent vectors via the AR model's embedding table, concatenates them with text embeddings from the tokenizer, and packages everything into the format the DiT stage expects.
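A schematic sketch of that bridging step (names and the return format are illustrative; the real processor follows vLLM-Omni's stage-input-processor interface):

```python
import torch

def ar2diffusion_sketch(
    latent_token_ids: torch.Tensor,       # AR-emitted latent token IDs, shape [n]
    prompt_token_ids: torch.Tensor,       # tokenizer output for the prompt text, shape [t]
    embedding_table: torch.nn.Embedding,  # the AR model's input embedding table
) -> dict:
    """Illustrative bridge: AR stage output -> conditioning tensors for the DiT stage."""
    # 1. Decode latent token IDs back to continuous vectors via the embedding table.
    latent_vectors = embedding_table(latent_token_ids)           # [n, d_model]
    # 2. Embed the prompt tokens the same way.
    text_embeds = embedding_table(prompt_token_ids)              # [t, d_model]
    # 3. Concatenate text and latent streams into one conditioning sequence.
    condition = torch.cat([text_embeds, latent_vectors], dim=0)  # [t + n, d_model]
    return {"condition": condition, "num_text_tokens": text_embeds.shape[0]}
```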
I2T output alignment with official HF model
Verified vLLM-Omni I2T output against the official
tencent/HunyuanImage-3.0-Instruct HF model (greedy decoding, 4×H800, bf16, SDPA):
- prompt_utils.py: aligned build_prompt with the official instruct template (\n\nUser: ... \n\nAssistant: format, trigger_tag after Assistant); see the sketch after this list
- hunyuan_image3.py: removed <answer>/</answer> from blocked_token_ids in comprehension mode — the model naturally generates these, and blocking them breaks output
- i2t.yaml/t2t.yaml: temperature 0.6 → 0.0 (greedy), added 128026 (</answer>) to stop_token_ids
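As referenced in the first bullet, a minimal reconstruction of that template from the description above (the default trigger tag is an assumption, not necessarily what build_prompt uses):

```python
def build_instruct_prompt(user_text: str, trigger_tag: str = "<think>") -> str:
    """Wrap user input in the instruct chat template, placing the trigger tag
    immediately after the Assistant turn marker."""
    return f"\n\nUser: {user_text}\n\nAssistant: {trigger_tag}"

# build_instruct_prompt("Describe this image.")
# -> "\n\nUser: Describe this image.\n\nAssistant: <think>"
```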
GPU: 4 × H800 (80GB), TP4
Note: test files are in the follow-up PR. Tests were run against the full stack on the GPU server.
Accuracy Verification
The official HunyuanImage-3.0-Instruct model itself is non-deterministic across processes under bf16 multi-GPU inference. We established this baseline first to set realistic expectations for vLLM-Omni alignment.
Official HF model cross-process baseline
Same code, same
device_map, two separate runs (each calling model.generate with the same device_map): the greedy outputs diverge. Root cause: bf16 NCCL all-reduce floating-point accumulation order is non-deterministic across processes; greedy argmax amplifies tiny numerical differences into token-level divergence.
vLLM-Omni vs official HF
I2T (Image-to-Text):
T2I AR (Text-to-Image, AR stage only):
Alignment criteria