[Bugfix][HunyuanImage3] Fix offline AR garbage output by switching to Instruct chat template #3243
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80e0237f31
    if init_timeout is not None:
        generate_params_kwargs["init_timeout"] = init_timeout
Forward init timeout when creating GEBench generate server
This branch sets init_timeout=1800 for multi-GPU GEBench runs, but that value is only stored in OmniServerParams and never reaches the actual OmniServer invocation in AccuracyServerConfig.generate_server (which only prepends --stage-init-timeout). In practice, those runs still use the CLI default --init-timeout (600s), so large HunyuanImage startup can still timeout despite this override, causing the new nightly path to fail intermittently.
    if devices_opt:
        num_devices = len([d for d in devices_opt.split(",") if d.strip()])
        extra_args = ["--tensor-parallel-size", str(num_devices)]
Parse extra server args independently of device overrides
--gebench-extra-server-args is only consumed inside the if devices_opt: block, so passing extra server flags without --gebench-devices is silently ignored. That makes the new option unreliable for single-GPU smoke runs (for example, users cannot pass required flags like --trust-remote-code unless they also set a device list), which is an unexpected behavioral regression for this CLI surface.
    for snap in sorted(os.listdir(snapshots)):
        candidate = os.path.join(snapshots, snap, "tokenizer.json")
        if os.path.isfile(candidate):
            tokenizer_file = candidate
            break
Select newest tokenizer snapshot in HF cache fallback
The fallback loader picks the first lexicographically sorted snapshot directory and stops, which can select an older cached revision when multiple snapshots exist. In that failure path, tokenizer/model revision mismatch can change token IDs and prompt formatting, reintroducing unstable or incorrect AR behavior; selecting the active/latest snapshot (e.g., by mtime or refs) avoids this stale-cache regression.
    }


    def build_prompt_tokens(
Can we define a common function to construct the prompt template and use it in AR, DiT and end2end.py? Currently the DiT path constructs the template with apply_chat_template.
On the remaining +16 chars vs HF baseline

For anyone wondering why this PR doesn't fully byte-align with HF:

What is aligned: the first 52 chars (~17–18 generated tokens) are byte-identical.

Where it diverges: at char 53, two equally-valid Chinese sentence continuations branch. Both branches describe the same image and cover all prompt elements.

Why it's unavoidable: after ~18 decode steps the two systems' hidden states differ by BF16-level noise. All four contributing factors are the same arithmetic op in a different reduction order. This is a fundamental property of tensor-parallel paged-KV inference.

What this PR delivers. Contract: functionally correct, structurally aligned, all known divergences documented.
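A minimal illustration of the reduction-order point above (my own sketch, not code from this branch): summing the same bf16 values with two different reduction orders can land on slightly different results.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32).to(torch.bfloat16)

# Same arithmetic op, two reduction orders: one flat sum vs a chunked
# (two-stage) sum, both carried out in bf16.
flat = x.sum()
chunked = x.view(64, 64).sum(dim=1).sum()

# The two results can differ in the last bits even though the inputs and
# the mathematical operation are identical.
print(flat.item(), chunked.item())
```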
…rmers>=5.x

Siglip2ImageProcessorFast in transformers>=5.0 returns pixel_values, pixel_attention_mask, and spatial_shapes as lists of tensors/tuples instead of a single batched tensor. The old code called .squeeze(0) directly on the list, causing AttributeError at MultiModalBudget initialization (get_dummy_mm_inputs path) and crashing startup.

Fix by stacking list elements into a tensor before squeezing:
- pixel_values / pixel_attention_mask: torch.stack(list, dim=0)
- spatial_shapes: torch.tensor(list, dtype=torch.long) since elements are tuples, not tensors

Tested on transformers 5.6.2: both FA and SDPA backends initialize and produce identical T2T output after this fix.

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
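A hedged sketch of the shape normalization this commit describes (function and variable names are illustrative, not the exact code in the processor):

```python
import torch

def batch_siglip2_outputs(outputs: dict) -> dict:
    """Normalize Siglip2ImageProcessorFast outputs across transformers versions."""
    pixel_values = outputs["pixel_values"]
    pixel_attention_mask = outputs["pixel_attention_mask"]
    spatial_shapes = outputs["spatial_shapes"]

    # transformers>=5.0 returns lists instead of a single batched tensor,
    # so calling .squeeze(0) on them raises AttributeError.
    if isinstance(pixel_values, list):
        pixel_values = torch.stack(pixel_values, dim=0)
    if isinstance(pixel_attention_mask, list):
        pixel_attention_mask = torch.stack(pixel_attention_mask, dim=0)
    if isinstance(spatial_shapes, list):
        # elements are tuples, not tensors, so build a fresh tensor
        spatial_shapes = torch.tensor(spatial_shapes, dtype=torch.long)

    # the original .squeeze(0) now operates on tensors, not lists
    return {
        "pixel_values": pixel_values.squeeze(0),
        "pixel_attention_mask": pixel_attention_mask.squeeze(0),
        "spatial_shapes": spatial_shapes.squeeze(0),
    }
```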
The pretrain-style format (system_prompt + raw user_prompt) used by
build_prompt() for task="t2t" leaves the model without an answer-start
signal. With temperature=0.0 greedy decoding it falls into garbage
repetition (e.g. "massive arches massive arches ..." ad infinitum).
Use the instruct chat format `<|startoftext|>User: {prompt}\n\nA: `
which matches what the official HF AR baseline (mode="gen_text",
sequence_template="instruct") emits via prepare_model_inputs(). With
that format vllm-omni T2T produces structured numbered output with
specific facts, comparable to the HF baseline.
Verified on remote 2x L20 with HunyuanImage-3.0-Instruct.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
The pretrain-style prompt format was producing garbage output for
greedy decoding across all chat-style tasks (T2T, I2T, IT2I, T2I):
- T2T (no trigger): "massive arches massive arches ..." (infinite loop)
- IT2I (<think>): repetitive "image_1 完整保留..." segments
Root cause: trigger_tag and user_prompt order were inverted vs. what
HunyuanImage3's tokenizer.apply_general_template emits for instruct
sequence_template. The model was trained to see:
<|startoftext|>{system?}\n\nUser: {<img>?}{user_prompt}\n\nA: {trigger?}
so the trigger (e.g. <think>) sits AFTER the assistant prefix and the
model continues from there. The previous build_prompt() concatenated
trigger BEFORE user_prompt, which placed the user instructions inside
the model's "thinking section" and broke greedy decoding.
Fix:
- T2T: bare instruct template, no system prompt (matches HF baseline)
- t2i_vanilla: keep pretrain mode (it is the only task designed for it)
- All others: instruct template with trigger after `\n\nA: `
Verified on remote 2x L20 with HunyuanImage-3.0-Instruct: IT2I greedy
output is now coherent <think> analysis covering all key elements
(matches HF AR baseline structure).
Signed-off-by: TaffyOfficial <2324465096@qq.com>
HunyuanImage3TokenizerFast.apply_general_template uses Assistant: as the bot role prefix in instruct sequence_template (verified by decoding HF prepare_model_inputs output with system_prompt=en_unified + image + bot_task=think: token 72803 = "Assistant"). Switch build_prompt() to use the full word so the AR prefill aligns with the official HF tokenization.

Also unify T2T to the same en_unified + Assistant: template (PR vllm-project#3107 reference implementation does the same; the previous T2T-specific branch was a workaround for an earlier prompt-format experiment).

Note: BPE merge across user_prompt/Assistant boundary still produces 1 merged token (e.g. "。\n\n" -> single id) where HF apply_chat_template keeps them separate. Full byte-identical alignment requires passing pre-tokenized prompt_token_ids — that path is supported by vllm-omni (OmniTokensPrompt) but not yet plumbed through build_prompt().

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Adds build_prompt_tokens() that mirrors HF apply_chat_template's segment-by-segment tokenization.

The previous build_prompt() returned a single string that the engine fed through tokenizer.encode() in one BPE pass, which merged tokens across segment boundaries (e.g. user_prompt ending in "。" + the conv separator "\n\n" -> single token id 3490 instead of HF's [1811, 271]). This shifted tokens at the user-text / Assistant: prefix boundary and made vllm-omni's input_ids drift from HF's by 1-2 tokens, causing greedy outputs to diverge after the very first generated token.

Loads the model's tokenizer in main(), encodes each conversation segment independently (system prompt, "\n\n", "User: ", <img> placeholder, user_prompt, "\n\nAssistant: ", trigger tag) and passes the resulting list[int] to omni.generate() via the existing prompt_token_ids dict path (OmniSingletonPrompt already supports list[int] / OmniTokensPrompt — no engine-side changes needed). t2i_vanilla still uses the pretrain whole-string path because that mode has no chat-template segments.

Verified on remote 2x L20: text portion of input_ids (first 1227 tokens, before the <img> placeholder) is now byte-identical to HF's prepare_model_inputs output. The trailer also matches: the previous "。\n\nAssistant: <think>" -> [3490, 32, 25, 220, 128023] becomes the HF-correct [1811, 271, 72803, 25, 220, 128023].

Note: build_prompt() is kept for backward compatibility but its docstring now warns about the BPE merge issue and points to build_prompt_tokens() as the replacement for HF-aligned inputs.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
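A rough sketch of the segment-by-segment idea (my simplification with an assumed signature, not the exact helper added in this commit):

```python
def build_prompt_tokens_sketch(tokenizer, user_prompt, system_prompt=None,
                               with_image=False, trigger="<think>"):
    # Conversation segments in the order described above; each segment is
    # encoded in isolation so BPE cannot merge tokens across boundaries
    # (e.g. a user prompt ending in "。" followed by "\n\n").
    segments = ["<|startoftext|>"]
    if system_prompt:
        segments += [system_prompt, "\n\n"]
    segments += ["User: "]
    if with_image:
        segments += ["<img>"]
    segments += [user_prompt, "\n\nAssistant: "]
    if trigger:
        segments += [trigger]

    token_ids = []
    for seg in segments:
        token_ids.extend(tokenizer.encode(seg, add_special_tokens=False))
    return token_ids
```

The resulting list[int] then goes to omni.generate() through the existing prompt_token_ids dict path, as described above.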
The image expansion uses <img> (128006) at the timestep slot while HF's apply_chat_template uses the literal <timestep> (128017). Naively swapping the placeholder breaks output (model hallucinates additional images) because HF's modeling forward calls `instantiate_continuous_tokens` to *scatter-replace* the embedding at the <timestep> position with `timestep_emb(0)` for cond images — the wte embedding of <timestep> is irrelevant at runtime.

vllm-omni's existing <img>-placeholder + multimodal-merger path already produces the same final hidden state at that position by shipping `timestep_emb(0)` at the head of `embed_multimodal()`'s combined_embeddings tensor. So the AR forward is numerically equivalent to HF; only the dumped input_ids differ at that one slot.

Switching to <timestep> would require either a second PromptReplacement targeting 128017, or letting `PromptUpdateDetails.select_token_id` take a list of embed_token_ids. Both are deeper engine-level changes; out of scope for this fix. Add explanatory comments in `_get_prompt_updates` and `embed_multimodal` so future readers don't re-discover this rabbit hole and don't break it with naive cleanups.

Verified on remote 2x L20: IT2I greedy output remains structurally correct (2167 chars, full <think> analysis covering all key elements, no image_2..N hallucinations).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Audit of vllm-omni's process_image() against HF's image_processor:

- resize/crop math: byte-for-byte identical to HF's `resize_and_crop` with crop_type="center". Same aspect-ratio preservation, same int(round(...)) ordering, same LANCZOS resampler, same crop region computation.
- VAE PIL->tensor: identical. Both use transforms.Compose([ToTensor, Normalize([0.5], [0.5])]) — fully equivalent.
- ViT processor: same Siglip2 processor class, but transformers version differs at runtime (vllm-omni venv = 5.6.2; HF baseline venv = 4.57.1). The Siglip2ImageProcessorFast normalization path changed between these versions, producing ~1 ULP differences in pixel values. This is a venv-pinning concern, not a code bug.
- dtype cast: vllm-omni casts vae_pixel_values to bf16 here; HF stores fp32 and casts inside the encoder forward. Tried delaying the cast to mirror HF, but vllm-omni's _vae_encode runs fp32 input through a bf16-weighted conv3d which raises a dtype mismatch (HF avoids this by an explicit cast at the encoder boundary that vllm-omni does not have). Keep the existing cast and document the divergence — fixing it requires plumbing a cast into _vae_encode, out of scope for this PR.

Net effect of this commit: comments only. No behavior change. The remaining numerical drift between vllm-omni and HF on image embeddings is bounded by the transformers version delta and the BF16 reduction-order noise floor; both are out of scope for code changes in this branch.

Verified on remote 2x L20: IT2I greedy output unchanged (2167 chars, structurally aligned with HF).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…essor

`HunyuanImage3Processor.process_image` previously cast `vae_pixel_values` to model dtype (bf16) right after VAE preprocessing. HF keeps these as fp32 in `build_cond_images` and only casts inside the VAE forward, which preserves fp32 precision through the multimodal_data dict.

Move the cast into `_vae_encode` (encoder boundary) and keep `vae_pixel_values` as fp32 in the processor. Verified pixel-level byte-identical with HF (fp32 mean=0.157296). Greedy IT2I output is unchanged (the VAE encoder's first conv casts to bf16 anyway, so the final latent is identical to before this fix), but this removes a ~7e-4 mean-abs-diff bf16 quantization error from `vae_pixel_values` and aligns the multimodal_data path with HF.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
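A minimal sketch of the cast-at-the-encoder-boundary idea (illustrative name and signature; the actual code lives in HunyuanImage3Processor / _vae_encode):

```python
import torch

def _vae_encode_sketch(vae, vae_pixel_values: torch.Tensor) -> torch.Tensor:
    # The processor now leaves vae_pixel_values in fp32; the cast to the
    # VAE's weight dtype happens only here, at the encoder boundary,
    # mirroring HF's build_cond_images behavior.
    weight_dtype = next(vae.parameters()).dtype
    return vae.encode(vae_pixel_values.to(weight_dtype))
```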
HF's `HunyuanTopKGate` runs the router in fp32: `wg` is constructed as
`nn.Linear(..., dtype=torch.float32)`, `hidden_states` is cast to fp32
before the matmul, the call is wrapped in
`with torch.autocast('cuda', enabled=False)`, and `easy_topk` does
`F.softmax` -> `torch.topk` -> divide by `clamp(weight_sums, min=1e-8)`,
all in fp32. Only the resulting topk weights are cast to bf16 for the
expert MLP combine.
vLLM's stock `HunYuanSparseMoeBlock` builds the gate as a default-dtype
(bf16) `ReplicatedLinear` and lets `SharedFusedMoE`'s `topk_softmax`
CUDA op consume bf16 logits. With 64 experts, top-k=8 per layer, and
32 MoE layers, bf16 quantization can flip top-k boundary decisions on
close routing scores -- wrong expert MLPs are applied, the resulting
hidden states diverge, the divergence cascades through the KV cache,
and the eventual decoded token differs from HF.
Add `HunyuanImage3SparseMoeBlock`, a subclass that mirrors the stock
block 1:1 except:
1. The router gate is `ReplicatedLinear(..., params_dtype=torch.float32)`,
so the `mlp.gate.wg.weight` checkpoint values (stored bf16) are
upcast into a fp32 parameter on load.
2. `forward()` casts hidden states to fp32 before the gate matmul,
does softmax / topk / clamp+divide renormalization in fp32, then
casts the topk weights back to model dtype, exactly mirroring HF's
`easy_topk` math.
3. The fp32-routed (topk_weights, topk_indices) are packed into the
`router_logits` slot and `SharedFusedMoE` is built with
`custom_routing_function=_hunyuan_image3_unpack_packed_topk`, so
the bf16 `topk_softmax` CUDA op is bypassed entirely.
`HunyuanImage3ForConditionalGeneration._patch_moe_blocks` walks the
already-built `model.layers`, pops each old experts' static-forward-
context registration, frees the old MoE block's GPU buffers (otherwise
the transient old+new allocation OOMs near the gpu_memory_utilization
cap on the 80B model with TP=2), then installs the new block. Must
run inside `__init__` so it takes effect before weight loading.
Verified end-to-end on a single greedy IT2I prompt
(`new year pet poster ...`):
- 32/32 MoE layers replaced (logged as
"Replaced 32 HunYuanSparseMoeBlock layers with
HunyuanImage3SparseMoeBlock (fp32 router matching HF reference)").
- Output deterministically diverged from the bf16-routed run, exactly
as expected from a routing-precision change.
- Removed one observed hallucination ("dog sticking out tongue") that
appeared in the bf16-routed output but not in HF's.
Does not byte-align with HF (PagedAttention vs contiguous KV cache and
sampler RNG path differences are independent architectural divergences
documented in `memory/hf_omni_alignment_method.md`), but closes the
single largest *fixable* deterministic precision gap remaining after
the prompt / preprocessing / image-pipeline alignment fixes.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
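A hedged sketch of the fp32 routing math being mirrored (just the gate + easy_topk portion; the real block also packs the results for the custom_routing_function and handles weight loading):

```python
import torch
import torch.nn.functional as F

def fp32_route(hidden_states: torch.Tensor,
               gate_weight_fp32: torch.Tensor,
               top_k: int = 8):
    # Gate matmul, softmax, top-k and renormalization all in fp32,
    # matching HF's HunyuanTopKGate / easy_topk.
    logits = hidden_states.float() @ gate_weight_fp32.t()
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_indices = torch.topk(probs, top_k, dim=-1)
    weight_sums = topk_weights.sum(dim=-1, keepdim=True)
    topk_weights = topk_weights / weight_sums.clamp(min=1e-8)
    # Only the final weights are cast back to model dtype for the
    # expert MLP combine.
    return topk_weights.to(hidden_states.dtype), topk_indices
```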
The i2t and t2t stage configs are `is_comprehension: true, final_output_type: text` -- AR-only text output, no DiT image generation. They should mirror HF's `bot_task="think"` which terminates the response at `</think>`.

The previous stop_token_ids `[127957, 128026]` (`<|endoftext|>`, `</answer>`) assumed the model would naturally stop at `</answer>` once `_StageTransitionLogitsProcessor` is gated off (it only fires in generation mode, not comprehension mode). In practice the instruct-tuned model continues into a `<recaption>` section out of trained habit and never emits `</answer>` (which is only meaningful after the full `<think>...</think><recaption>...</recaption><answer>` sequence the generation pipeline runs through).

Add `</think>` (128024) to the stop_token_ids for i2t and t2t. This makes greedy IT2I AR-only output align with HF's `bot_task="think"` baseline:

|           | HF baseline | omni before | omni after |
|-----------|------------:|------------:|-----------:|
| chars     |         466 |         811 |        482 |
| bytes     |        1354 |        2375 |       1416 |
| sections  |       think | think+recap |      think |
| gap to HF |           0 |        +345 |        +16 |

Length divergence collapses from +74% to +3.4%. The remaining +16 chars sits in BF16 reduction noise / sampler implementation differences (documented in `memory/hf_omni_alignment_method.md`) and cannot be closed without reimplementing vllm's attention.

`hunyuan_image3_it2i.yaml` is intentionally NOT changed: the IT2I pipeline (`is_comprehension: false`, AR -> DiT) needs the AR stage to emit the full `<think>...<answer><boi><img_size_*><img_ratio_*>` sequence so that DiT can decode the image latents. Stopping at `</think>` there would break image generation.

Update the existing comment in `HunyuanImage3ForConditionalGeneration.__init__` that incorrectly claimed comprehension mode would stop at `</answer>` or EOS, so future readers understand why we explicitly stop at `</think>` in the yaml.

Verified end-to-end: greedy IT2I AR-only output now ends cleanly at the analysis section, byte-for-byte structurally aligned with HF's `bot_task="think"` output (only differs in BF16-noise-driven per-token text divergence, no extra recaption section).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
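A quick way to double-check the token id used here (a sketch, assuming the HunyuanImage-3.0-Instruct tokenizer is available locally):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "tencent/HunyuanImage-3.0-Instruct", trust_remote_code=True
)
# Expect 128024 per the commit message; <|endoftext|> and </answer>
# resolve to the ids already present in the yaml stop list.
print(tok.convert_tokens_to_ids("</think>"))
```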
40ac16c to 07d8cf0
Signed-off-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
7756476 to 6978fd7
Add a regression test for this.
Move build_prompt and build_prompt_tokens out of the example script into vllm_omni/diffusion/models/hunyuan_image3/prompt_utils.py so the AR-prefill prompt template has a single source of truth that downstream callers can reuse.

The DiT pipeline keeps using TokenizerWrapper.apply_chat_template (which eagerly consumes JointImageInfo); prompt_utils targets the lighter client-side flow that uses an <img> placeholder + multi_modal_data.

README is updated to describe the actual instruct chat template (the previous "pretrain template" wording was stale relative to the post-fix behavior introduced earlier in this PR) and to point at the new module.

Addresses GH PR review comment requesting a common prompt-construction function shared across AR / DiT / end2end.py.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…m-project#3243)

Three layers of protection for the bug fixed in this PR:

1. Pure-logic structural tests (FakeTokenizer-based) verify:
   - The chat template framing (<|startoftext|> ... User: ... Assistant: ...)
   - Trigger tag (<think> / <recaption>) is appended AFTER `Assistant: ` (Part A regression: putting the trigger BEFORE user_prompt sends the model into a death-loop under greedy decoding).
   - <img> placeholder is positioned correctly for image-input tasks.
   - Each prompt segment is encoded in an isolated tokenizer.encode() call so cross-segment BPE merges cannot occur (the bug from commit 7bd429e).

2. AST-based wiring guard verifies that examples/.../end2end.py imports build_prompt_tokens from prompt_utils and does NOT redefine it locally. This protects the *delivery vector* of the original regression: the wrong template re-entered the example via a hand-rolled local builder that diverged from the canonical helper.

3. Real-tokenizer regression (skipped if HunyuanImage3 not in HF cache) asserts that segment-by-segment build_prompt_tokens produces a STRICTLY different id sequence than tokenizer.encode(build_prompt(...)) for a `。`-ending prompt. If a future "simplification" replaces segment encode with full-string encode, the BPE-merge-bypass behavior is gone and this test fires.

Verified on remote (2x L20X, transformers 4.57.1, HunyuanImage3-Instruct tokenizer in HF cache): 19/19 passed including the real-tokenizer test.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
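A condensed sketch of the layer-1 structural check (reusing the illustrative build_prompt_tokens_sketch helper from the earlier sketch in place of the real prompt_utils.build_prompt_tokens):

```python
class FakeTokenizer:
    """Records every encode() call; one synthetic id per segment."""

    def __init__(self):
        self.calls = []

    def encode(self, text, add_special_tokens=False):
        self.calls.append(text)
        return [len(self.calls)]


def test_trigger_comes_after_assistant_prefix():
    tok = FakeTokenizer()
    build_prompt_tokens_sketch(tok, "describe this image",
                               with_image=True, trigger="<think>")
    # Part A regression: the trigger must follow the assistant prefix,
    # and the <img> placeholder must precede the user prompt.
    assert tok.calls.index("\n\nAssistant: ") < tok.calls.index("<think>")
    assert tok.calls.index("<img>") < tok.calls.index("describe this image")
```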
down
CI ruff format check required collapsing short multi-line constructs (single-line assertions, single-line if conditions) onto one line. No semantic change; 19/19 tests still pass.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@hsliuustc0106 need review
Please fix CI
PR vllm-project#3232 [Rebase] Rebase to vllm 0.20.0 folded `SharedFusedMoE` into `FusedMoE` and dropped the `vllm.model_executor.layers.fused_moe.shared_fused_moe` submodule, which broke pytest collection for tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_sampler.py with `ModuleNotFoundError: No module named 'vllm.model_executor.layers.fused_moe.shared_fused_moe'` across all four CI suites on this branch (simple-unit-test, diffusion-cache-backend-test, cuda-unit-test-with-single-card, cuda-unit-test-with-multi-cards).

Mirrors the same minimal fix applied on cr/pr3107-rebased:

- Wrap the legacy import in try/except and fall back to `FusedMoE as SharedFusedMoE`. `FusedMoE` now accepts `shared_experts=` directly and the call sites only use `make_expert_params_mapping` and `__init__(shared_experts=..., ...)`, both present on `FusedMoE`.
- Drop `reduce_results=False` from the `SharedFusedMoE(...)` call — vllm 0.20 removed that parameter from `FusedMoE.__init__`.
- Drop the manual `(routed, shared)` tuple merge and `tensor_model_parallel_all_reduce` post-processing in `HunyuanImage3SparseMoeBlock.forward`. vllm 0.20+ `FusedMoE` merges shared-experts internally and runs the TP all-reduce inside its forward, so the result is the already-combined, already-reduced tensor.

Signed-off-by: zuiho <2324465096@qq.com>
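A minimal sketch of the compatibility import described in the first bullet (the actual change also adjusts the call sites):

```python
try:
    # vllm < 0.20: dedicated shared-experts MoE layer
    from vllm.model_executor.layers.fused_moe.shared_fused_moe import (
        SharedFusedMoE,
    )
except ImportError:
    # vllm >= 0.20: SharedFusedMoE was folded into FusedMoE, which now
    # accepts shared_experts= directly
    from vllm.model_executor.layers.fused_moe import FusedMoE as SharedFusedMoE
```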
Lint follow-up to c5089d1. The previous commit removed the call to `tensor_model_parallel_all_reduce` in `HunyuanImage3SparseMoeBlock.forward` (vllm 0.20+ FusedMoE runs the TP all-reduce internally) but left the symbol in the `from vllm.distributed import (...)` block, which `ruff` flags as unused (F401).

Signed-off-by: zuiho <2324465096@qq.com>
fix now
Address PR vllm-project#3107 review (Bounty-hunter / Gaohan123) requesting AR-output-format and DiT-output-accuracy regression tests. Layout mirrors PR vllm-project#2949's split (CPU unit test under tests/diffusion/..., GPU accuracy test under tests/e2e/accuracy/...).

CPU unit test: tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py
- test_ar_prefill_tokens_match_hf_apply_chat_template_for_it2i: asserts build_prompt_tokens (the AR-side prefill builder) is token-id-identical to HF tokenizer.apply_chat_template for the same (system, user_prompt, image) triple. Catches drift between the AR's input distribution and the model's training distribution -- the same failure mode PR vllm-project#3243 fixed for T2I.
- test_dit_condition_image_preprocessing_byte_matches_ar_processor: asserts the diffusion-side _resize_and_crop_center produces byte-identical pixels to the AR-side HunyuanImage3Processor._resize_and_crop on the canonical resize targets. Direct response to Bounty-hunter's PR vllm-project#3107 review.

Both tests gate on tencent/HunyuanImage-3.0-Instruct being in the local HF cache (no GPU/model weights required at runtime, just the tokenizer config + image processor).

GPU accuracy test: tests/e2e/accuracy/test_hunyuan_image3_it2i.py
- test_hunyuan_image3_it2i_matches_hf_reference_psnr_40: drives vllm-omni's offline IT2I path through Omni and runs the official HF reference via AutoModelForCausalLM.generate_image, compared via the shared assert_similarity helper at PSNR>=40 dB and SSIM>=0.92. Marked full_model + skipif<8 GPUs; the threshold follows PR vllm-project#2949's review discussion (40 dB gives slack for TP=2 NCCL drift while still catching prompt/image-preprocessing bugs).

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
… Instruct chat template (vllm-project#3243)

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
Signed-off-by: zuiho <2324465096@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
Summary
Multi-part fix for offline AR output of tencent/HunyuanImage-3.0-Instruct in vllm-omni:
Part A — Structural fix. The current main branch produces, for IT2I, no coherent <think> analysis (just "image_1" repeated × 6) and, for T2T, death-loop repetition. Both modes are essentially unusable.

Root cause: build_prompt() in examples/offline_inference/hunyuan_image3/end2end.py uses a "pretrain-style" concatenation that misplaces the <think> trigger relative to the user prompt, putting the user instructions inside the model's "thinking section". Under any decoding (greedy or sampling) the model collapses into degenerate fixed points.

Fix: switch to the instruct chat template with the trigger after the Assistant: prefix, matching what HF's apply_general_template(..., sequence_template="instruct") emits, plus a build_prompt_tokens() segment-by-segment tokenizer to bypass cross-segment BPE merges.

Part B — Precision alignment with HF reference (fp32 MoE router). After Part A the output is structurally correct, but a lower-amplitude divergence from HF remained. Investigation traced this to MoE routing precision.

hunyuan3.0_ins/modeling_hunyuan_image_3.py runs the router in fp32 on purpose: the gate wg is built as nn.Linear(..., dtype=torch.float32), hidden states are cast to fp32 before the matmul, the call is wrapped in torch.autocast('cuda', enabled=False), and easy_topk does softmax -> topk -> clamp renormalization entirely in fp32.

vLLM's stock HunYuanSparseMoeBlock builds the gate as a default-dtype (bf16) ReplicatedLinear and lets SharedFusedMoE's topk_softmax CUDA op consume bf16 logits. With 64 experts, top-k=8 per layer, and 32 MoE layers, bf16 quantization can flip top-k boundary decisions on close routing scores → wrong expert MLPs are applied, the resulting hidden states diverge, the divergence cascades through the KV cache, and the eventual decoded token differs from HF.

Fix: add HunyuanImage3SparseMoeBlock, a subclass + post-init module replacement that mirrors the stock block 1:1 except:

1. The router gate is ReplicatedLinear(..., params_dtype=torch.float32), so the mlp.gate.wg.weight checkpoint values (stored bf16) are upcast into a fp32 parameter on load.
2. forward() casts hidden states to fp32 before the gate matmul, does softmax / topk / clamp+divide renormalization in fp32, then casts the topk weights back to model dtype, exactly mirroring HF's easy_topk math.
3. The fp32-routed (topk_weights, topk_indices) are packed into the router_logits slot and SharedFusedMoE is built with custom_routing_function=_hunyuan_image3_unpack_packed_topk, so the bf16 topk_softmax CUDA op is bypassed entirely.

HunyuanImage3ForConditionalGeneration._patch_moe_blocks() walks the already-built model.layers, pops each old experts' static-forward-context registration, frees the old MoE block's GPU buffers (otherwise the transient old+new allocation OOMs near the gpu_memory_utilization cap on the 80B model with TP=2), then installs the new block. Runs inside __init__ so it takes effect before weight loading.

Implementation strategy: subclass + post-init module replacement, not monkey-patch. No vLLM upstream files are modified. The replacement is local to vllm_omni/model_executor/models/hunyuan_image3/hunyuan_image3.py, so other models that consume HunYuanSparseMoeBlock are unaffected.

Part C — Stop AR-only output at </think> for i2t/t2t. The i2t.yaml and t2t.yaml stage configs are is_comprehension: true, final_output_type: text (AR-only text, no DiT). They should mirror HF's bot_task="think", which terminates the response at </think>. The previous stop_token_ids: [127957, 128026] (<|endoftext|>, </answer>) assumed the model would naturally stop at </answer> once _StageTransitionLogitsProcessor is gated off (it only fires in generation mode, not comprehension mode). In practice the instruct-tuned model continues into a <recaption> section out of trained habit and never emits </answer>. Add </think> (128024) to the stop_token_ids for i2t and t2t.

it2i.yaml is intentionally not changed — the AR→DiT pipeline needs the AR stage to emit the full <think><answer><boi><img_size_*><img_ratio_*> sequence for DiT to decode image latents.

After this PR:
- AR-only output (is_comprehension=true): 482 chars, full <think> analysis terminating at </think>, structurally identical to HF baseline (no extra recaption section, no "tongue" hallucination)
- AR portion of the full IT2I pipeline (is_comprehension=false): 811 chars (think 482 + recap 329), structurally identical to HF model.generate_image(bot_task="think_recaption") AR portion (think 448 + recap 329 — exact same length as omni)

Verified on remote 2× L20 with tencent/HunyuanImage-3.0-Instruct and transformers==4.57.1.

Commits
- d71981e7
- d360569a: build_prompt instruct format (early)
- 27083f9c: Assistant:
- ea809348: A: → Assistant: to match HF tokenizer's instruct conv-roles output
- 7bd429ed: build_prompt_tokens() returns list[int]; bypass cross-segment BPE merges
- 3d415e17: <img> placeholder + injected timestep_emb(0) is embedding-equivalent to HF's <timestep> token
- 8a1a4af9
- 41d29432: process_image → _vae_encode boundary (vae_pixel_values fp32 byte-identical with HF, mean=0.157296)
- 31c2fa56: HunyuanImage3SparseMoeBlock: fp32 router matching HF (gate weight fp32 + fp32 softmax/topk + clamp(min=1e-8) renorm + bypasses bf16 topk_softmax CUDA op via custom_routing_function). 32/32 MoE layers replaced via _patch_moe_blocks().
- 07d8cf0d: </think> (128024) added to stop_token_ids in hunyuan_image3_i2t.yaml and hunyuan_image3_t2t.yaml so AR-only comprehension output terminates after the analysis section, matching HF's bot_task="think".

(History rewritten on 2026-04-30 to remove unrelated CI commits that got pulled in via merges; see force-push at hash 0413c2c2..07d8cf0d. The 10 commits above are the full PR scope: 4 files changed, +449 / -24 lines.)
Test plan
Reproduce on 2× L20 with tencent/HunyuanImage-3.0-Instruct, transformers==4.57.1, enforce_eager=true.

Test inputs:
- T2T: "Describe the Eiffel Tower in detail. What makes it architecturally significant?"
- IT2I: assets/demo_instruct_imgs/input_0_0.png + "新年宠物海报,Q版圆润..."

Mode 1: IT2I greedy AR-only (think only). HF reference runs bot_task="think"; omni stops at </think>. Both sides produce a full <think> analysis.

Mode 2: IT2I greedy think + recaption AR (this is the AR portion of model.generate_image(bot_task="think_recaption")). Omni runs the full IT2I pipeline (is_comprehension=false); the HF reference is the generate_image() AR portion.

The recaption sections being exactly the same length demonstrates that omni's _StageTransitionLogitsProcessor is operationally equivalent to HF generate_image()'s stage_transitions parameter. The remaining +34 chars sits in the <think> section and is the same class of BF16 reduction-noise drift as Mode 1's +16 chars.
Outputs
Full text outputs are saved at it2i_t2t_outputs/ for review (this includes the rejected test methodology — file E_HF_auto_recaption_*.txt — kept as a cautionary example with the README explaining why it's not comparable).

File F (HF official generate_image() AR portion, 777 chars)

File D (omni this PR, is_comprehension=false AR, 811 chars)
Side-by-side recaption comparison (independent third-party review)
The two recaption sections were given to a separate LLM (DeepSeek) for
neutral side-by-side review. Verbatim verdict:
In other words: omni's recaption is slightly more verbose / more
specific in three or four optional rendering details but covers the
exact same image-edit intent as HF's reference, and would produce a
visually equivalent image when fed to DiT.
Known limitations (intentionally not fixed in this PR)
The remaining +16/+34 chars gap to HF baseline sits in BF16 reduction noise across the few decoded tokens that follow slightly different code paths in vllm vs transformers. Confirmed not fixable at the prompt / preprocessing / routing layer:

- Prompt: build_prompt_tokens() (text input byte-identical to HF apply_chat_template, 1227 leading tokens verified)
- Gap unchanged between transformers 4.57.1 and 5.6.2 (same input, same yaml, same seed)
- vae_pixel_values fp32-byte-identical with HF (mean=0.157296 both sides) after 41d29432
- <timestep> slot: embedding-layer equivalent via injected timestep_emb(0) (single-token swap regression-tested: swapping the placeholder breaks output)
- MoE routing: clamp(min=1e-8) renormalization now matches HF exactly via subclass (31c2fa56)
- Stop tokens: </think> matching HF's bot_task="think" (07d8cf0d)
- _StageTransitionLogitsProcessor proven equivalent to HF generate_image()'s stage_transitions parameter (Mode 2 recaption length identical: 329 vs 329 chars)

Confirmed remaining differences sit in implementation layers vllm-omni overrides for performance:

- Attention: PagedAttention vs HF's contiguous KV cache (reduction order differs)
- Expert MLP: fused kernel compute (separate from the routing fixed in this PR) vs HF's python loop — BF16 reduction order in the weighted sum differs
- Sampler: vllm/v1/sample/sampler.py vs transformers' _sample() — different RNG primitives, different logits-processor ordering, different top_k implementation. Even with identical seeds, these produce non-aligned token sequences. This is an architectural property of vllm, not a regression — enforce_eager=true only disables compile/CUDAGraphs, not these.
- TP=2 tensor-parallel matmul + all-reduce vs single-GPU full-rank matmul

Reaching byte-identical alignment with HF would require replacing vllm-omni's forward implementation with HF's — defeats the purpose of vllm-omni. This PR provides functionally correct output with all key elements covered, structurally aligned with HF (terminates at </think> for AR-only mode, identical-length recaption section for think+recaption mode), and with the largest fixable deterministic precision gap (MoE router) closed.
What this PR does NOT claim
Earlier related PRs (#2713 testing first-30-tokens, #2986 smoke tests with a len > 0 assertion) accepted significant divergence as "BF16 / GPU non-determinism". This PR's investigation showed those prior tests bypassed the buggy build_prompt path entirely (using HF-derived input_ids directly) and used assertions too weak to detect death-loop output. We do not repeat that pattern. We acknowledge:

- the residual divergence from HF is real (BF16 / architectural, not removable here)
- AR-only text output deserves a dedicated regression check (no DiT to mask AR text quality) — recommend adding to CI

The PR's contract: functionally correct, structurally aligned, MoE routing precision-aligned with HF reference, AR-only output terminates at </think> matching HF, think+recaption AR portion produces an identical-length recaption section as HF reference, with greedy IT2I within +3.4% (think only) / +4.4% (think+recap) length of HF baseline. Not byte-identical (impossible without replacing vllm's attention / expert MLP / sampler).