[Bugfix][HunyuanImage3] Fix offline AR garbage output by switching to Instruct chat template #3243
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80e0237f31
    if init_timeout is not None:
        generate_params_kwargs["init_timeout"] = init_timeout
Forward init timeout when creating GEBench generate server
This branch sets init_timeout=1800 for multi-GPU GEBench runs, but that value is only stored in OmniServerParams and never reaches the actual OmniServer invocation in AccuracyServerConfig.generate_server (which only prepends --stage-init-timeout). In practice, those runs still use the CLI default --init-timeout (600s), so large HunyuanImage startup can still timeout despite this override, causing the new nightly path to fail intermittently.
    if devices_opt:
        num_devices = len([d for d in devices_opt.split(",") if d.strip()])
        extra_args = ["--tensor-parallel-size", str(num_devices)]
Parse extra server args independently of device overrides
--gebench-extra-server-args is only consumed inside the if devices_opt: block, so passing extra server flags without --gebench-devices is silently ignored. That makes the new option unreliable for single-GPU smoke runs (for example, users cannot pass required flags like --trust-remote-code unless they also set a device list), which is an unexpected behavioral regression for this CLI surface.
    for snap in sorted(os.listdir(snapshots)):
        candidate = os.path.join(snapshots, snap, "tokenizer.json")
        if os.path.isfile(candidate):
            tokenizer_file = candidate
            break
Select newest tokenizer snapshot in HF cache fallback
The fallback loader picks the first lexicographically sorted snapshot directory and stops, which can select an older cached revision when multiple snapshots exist. In that failure path, tokenizer/model revision mismatch can change token IDs and prompt formatting, reintroducing unstable or incorrect AR behavior; selecting the active/latest snapshot (e.g., by mtime or refs) avoids this stale-cache regression.
    }


    def build_prompt_tokens(
Can we define a common function to construct the prompt template and use it in AR, DiT and end2end.py? Currently the DiT path constructs the template with apply_chat_template.
On the remaining +16 chars vs HF baseline

For anyone wondering why this PR doesn't fully byte-align with HF:

What is aligned: the first 52 chars (~17–18 generated tokens) are byte-identical.

Where it diverges: at char 53, two equally-valid Chinese sentence continuations branch. Both branches describe the same image and cover all prompt elements.

Why it's unavoidable: after ~18 decode steps the two systems' hidden states differ by BF16-level noise. All four contributing factors are the same arithmetic op in a different reduction order. This is a fundamental property of tensor-parallel paged-KV inference.

What this PR delivers. Contract: functionally correct, structurally aligned, all known divergences documented.
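A minimal illustration of the reduction-order point above (my own sketch, not code from this branch): summing the same bf16 values with two different reduction orders can land on slightly different results.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32).to(torch.bfloat16)

# Same arithmetic op, two reduction orders: one flat sum vs a chunked
# (two-stage) sum, both carried out in bf16.
flat = x.sum()
chunked = x.view(64, 64).sum(dim=1).sum()

# The two results can differ in the last bits even though the inputs and
# the mathematical operation are identical.
print(flat.item(), chunked.item())
```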
…rmers>=5.x

Siglip2ImageProcessorFast in transformers>=5.0 returns pixel_values, pixel_attention_mask, and spatial_shapes as lists of tensors/tuples instead of a single batched tensor. The old code called .squeeze(0) directly on the list, causing AttributeError at MultiModalBudget initialization (get_dummy_mm_inputs path) and crashing startup.

Fix by stacking list elements into a tensor before squeezing:
- pixel_values / pixel_attention_mask: torch.stack(list, dim=0)
- spatial_shapes: torch.tensor(list, dtype=torch.long) since elements are tuples, not tensors

Tested on transformers 5.6.2: both FA and SDPA backends initialize and produce identical T2T output after this fix.

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
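A hedged sketch of the shape normalization this commit describes (function and variable names are illustrative, not the exact code in the processor):

```python
import torch

def batch_siglip2_outputs(outputs: dict) -> dict:
    """Normalize Siglip2ImageProcessorFast outputs across transformers versions."""
    pixel_values = outputs["pixel_values"]
    pixel_attention_mask = outputs["pixel_attention_mask"]
    spatial_shapes = outputs["spatial_shapes"]

    # transformers>=5.0 returns lists instead of a single batched tensor,
    # so calling .squeeze(0) on them raises AttributeError.
    if isinstance(pixel_values, list):
        pixel_values = torch.stack(pixel_values, dim=0)
    if isinstance(pixel_attention_mask, list):
        pixel_attention_mask = torch.stack(pixel_attention_mask, dim=0)
    if isinstance(spatial_shapes, list):
        # elements are tuples, not tensors, so build a fresh tensor
        spatial_shapes = torch.tensor(spatial_shapes, dtype=torch.long)

    # the original .squeeze(0) now operates on tensors, not lists
    return {
        "pixel_values": pixel_values.squeeze(0),
        "pixel_attention_mask": pixel_attention_mask.squeeze(0),
        "spatial_shapes": spatial_shapes.squeeze(0),
    }
```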
The pretrain-style format (system_prompt + raw user_prompt) used by
build_prompt() for task="t2t" leaves the model without an answer-start
signal. With temperature=0.0 greedy decoding it falls into garbage
repetition (e.g. "massive arches massive arches ..." ad infinitum).
Use the instruct chat format `<|startoftext|>User: {prompt}\n\nA: `
which matches what the official HF AR baseline (mode="gen_text",
sequence_template="instruct") emits via prepare_model_inputs(). With
that format vllm-omni T2T produces structured numbered output with
specific facts, comparable to the HF baseline.
Verified on remote 2x L20 with HunyuanImage-3.0-Instruct.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
The pretrain-style prompt format was producing garbage output for
greedy decoding across all chat-style tasks (T2T, I2T, IT2I, T2I):
- T2T (no trigger): "massive arches massive arches ..." (infinite loop)
- IT2I (<think>): repetitive "image_1 完整保留..." segments
Root cause: trigger_tag and user_prompt order were inverted vs. what
HunyuanImage3's tokenizer.apply_general_template emits for instruct
sequence_template. The model was trained to see:
<|startoftext|>{system?}\n\nUser: {<img>?}{user_prompt}\n\nA: {trigger?}
so the trigger (e.g. <think>) sits AFTER the assistant prefix and the
model continues from there. The previous build_prompt() concatenated
trigger BEFORE user_prompt, which placed the user instructions inside
the model's "thinking section" and broke greedy decoding.
Fix:
- T2T: bare instruct template, no system prompt (matches HF baseline)
- t2i_vanilla: keep pretrain mode (it is the only task designed for it)
- All others: instruct template with trigger after `\n\nA: `
Verified on remote 2x L20 with HunyuanImage-3.0-Instruct: IT2I greedy
output is now coherent <think> analysis covering all key elements
(matches HF AR baseline structure).
Signed-off-by: TaffyOfficial <2324465096@qq.com>
HunyuanImage3TokenizerFast.apply_general_template uses Assistant: as the bot role prefix in instruct sequence_template (verified by decoding HF prepare_model_inputs output with system_prompt=en_unified + image + bot_task=think: token 72803 = "Assistant"). Switch build_prompt() to use the full word so the AR prefill aligns with the official HF tokenization.

Also unify T2T to the same en_unified + Assistant: template (PR vllm-project#3107 reference implementation does the same; the previous T2T-specific branch was a workaround for an earlier prompt-format experiment).

Note: BPE merge across user_prompt/Assistant boundary still produces 1 merged token (e.g. "。\n\n" -> single id) where HF apply_chat_template keeps them separate. Full byte-identical alignment requires passing pre-tokenized prompt_token_ids — that path is supported by vllm-omni (OmniTokensPrompt) but not yet plumbed through build_prompt().

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Adds build_prompt_tokens() that mirrors HF apply_chat_template's segment-by-segment tokenization.

The previous build_prompt() returned a single string that the engine fed through tokenizer.encode() in one BPE pass, which merged tokens across segment boundaries (e.g. user_prompt ending in "。" + the conv separator "\n\n" -> single token id 3490 instead of HF's [1811, 271]). This shifted tokens at the user-text / Assistant: prefix boundary and made vllm-omni's input_ids drift from HF's by 1-2 tokens, causing greedy outputs to diverge after the very first generated token.

Loads the model's tokenizer in main(), encodes each conversation segment independently (system prompt, "\n\n", "User: ", <img> placeholder, user_prompt, "\n\nAssistant: ", trigger tag) and passes the resulting list[int] to omni.generate() via the existing prompt_token_ids dict path (OmniSingletonPrompt already supports list[int] / OmniTokensPrompt — no engine-side changes needed). t2i_vanilla still uses the pretrain whole-string path because that mode has no chat-template segments.

Verified on remote 2x L20: text portion of input_ids (first 1227 tokens, before the <img> placeholder) is now byte-identical to HF's prepare_model_inputs output. The trailer also matches: the previous "。\n\nAssistant: <think>" -> [3490, 32, 25, 220, 128023] becomes the HF-correct [1811, 271, 72803, 25, 220, 128023].

Note: build_prompt() is kept for backward compatibility but its docstring now warns about the BPE merge issue and points to build_prompt_tokens() as the replacement for HF-aligned inputs.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
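A rough sketch of the segment-by-segment idea (my simplification with an assumed signature, not the exact helper added in this commit):

```python
def build_prompt_tokens_sketch(tokenizer, user_prompt, system_prompt=None,
                               with_image=False, trigger="<think>"):
    # Conversation segments in the order described above; each segment is
    # encoded in isolation so BPE cannot merge tokens across boundaries
    # (e.g. a user prompt ending in "。" followed by "\n\n").
    segments = ["<|startoftext|>"]
    if system_prompt:
        segments += [system_prompt, "\n\n"]
    segments += ["User: "]
    if with_image:
        segments += ["<img>"]
    segments += [user_prompt, "\n\nAssistant: "]
    if trigger:
        segments += [trigger]

    token_ids = []
    for seg in segments:
        token_ids.extend(tokenizer.encode(seg, add_special_tokens=False))
    return token_ids
```

The resulting list[int] then goes to omni.generate() through the existing prompt_token_ids dict path, as described above.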
The image expansion uses <img> (128006) at the timestep slot while HF's apply_chat_template uses the literal <timestep> (128017). Naively swapping the placeholder breaks output (model hallucinates additional images) because HF's modeling forward calls `instantiate_continuous_tokens` to *scatter-replace* the embedding at the <timestep> position with `timestep_emb(0)` for cond images — the wte embedding of <timestep> is irrelevant at runtime.

vllm-omni's existing <img>-placeholder + multimodal-merger path already produces the same final hidden state at that position by shipping `timestep_emb(0)` at the head of `embed_multimodal()`'s combined_embeddings tensor. So the AR forward is numerically equivalent to HF; only the dumped input_ids differ at that one slot.

Switching to <timestep> would require either a second PromptReplacement targeting 128017, or letting `PromptUpdateDetails.select_token_id` take a list of embed_token_ids. Both are deeper engine-level changes; out of scope for this fix. Add explanatory comments in `_get_prompt_updates` and `embed_multimodal` so future readers don't re-discover this rabbit hole and don't break it with naive cleanups.

Verified on remote 2x L20: IT2I greedy output remains structurally correct (2167 chars, full <think> analysis covering all key elements, no image_2..N hallucinations).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Audit of vllm-omni's process_image() against HF's image_processor:

- resize/crop math: byte-for-byte identical to HF's `resize_and_crop` with crop_type="center". Same aspect-ratio preservation, same int(round(...)) ordering, same LANCZOS resampler, same crop region computation.
- VAE PIL->tensor: identical. Both use transforms.Compose([ToTensor, Normalize([0.5], [0.5])]) — fully equivalent.
- ViT processor: same Siglip2 processor class, but transformers version differs at runtime (vllm-omni venv = 5.6.2; HF baseline venv = 4.57.1). The Siglip2ImageProcessorFast normalization path changed between these versions, producing ~1 ULP differences in pixel values. This is a venv-pinning concern, not a code bug.
- dtype cast: vllm-omni casts vae_pixel_values to bf16 here; HF stores fp32 and casts inside the encoder forward. Tried delaying the cast to mirror HF, but vllm-omni's _vae_encode runs fp32 input through a bf16-weighted conv3d which raises a dtype mismatch (HF avoids this by an explicit cast at the encoder boundary that vllm-omni does not have). Keep the existing cast and document the divergence — fixing it requires plumbing a cast into _vae_encode, out of scope for this PR.

Net effect of this commit: comments only. No behavior change. The remaining numerical drift between vllm-omni and HF on image embeddings is bounded by the transformers version delta and the BF16 reduction-order noise floor; both are out of scope for code changes in this branch.

Verified on remote 2x L20: IT2I greedy output unchanged (2167 chars, structurally aligned with HF).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…essor

`HunyuanImage3Processor.process_image` previously cast `vae_pixel_values` to model dtype (bf16) right after VAE preprocessing. HF keeps these as fp32 in `build_cond_images` and only casts inside the VAE forward, which preserves fp32 precision through the multimodal_data dict.

Move the cast into `_vae_encode` (encoder boundary) and keep `vae_pixel_values` as fp32 in the processor. Verified pixel-level byte-identical with HF (fp32 mean=0.157296). Greedy IT2I output is unchanged (the VAE encoder's first conv casts to bf16 anyway, so the final latent is identical to before this fix), but this removes a ~7e-4 mean-abs-diff bf16 quantization error from `vae_pixel_values` and aligns the multimodal_data path with HF.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
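A minimal sketch of the cast-at-the-encoder-boundary idea (illustrative name and signature; the actual code lives in HunyuanImage3Processor / _vae_encode):

```python
import torch

def _vae_encode_sketch(vae, vae_pixel_values: torch.Tensor) -> torch.Tensor:
    # The processor now leaves vae_pixel_values in fp32; the cast to the
    # VAE's weight dtype happens only here, at the encoder boundary,
    # mirroring HF's build_cond_images behavior.
    weight_dtype = next(vae.parameters()).dtype
    return vae.encode(vae_pixel_values.to(weight_dtype))
```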
HF's `HunyuanTopKGate` runs the router in fp32: `wg` is constructed as
`nn.Linear(..., dtype=torch.float32)`, `hidden_states` is cast to fp32
before the matmul, the call is wrapped in
`with torch.autocast('cuda', enabled=False)`, and `easy_topk` does
`F.softmax` -> `torch.topk` -> divide by `clamp(weight_sums, min=1e-8)`,
all in fp32. Only the resulting topk weights are cast to bf16 for the
expert MLP combine.
vLLM's stock `HunYuanSparseMoeBlock` builds the gate as a default-dtype
(bf16) `ReplicatedLinear` and lets `SharedFusedMoE`'s `topk_softmax`
CUDA op consume bf16 logits. With 64 experts, top-k=8 per layer, and
32 MoE layers, bf16 quantization can flip top-k boundary decisions on
close routing scores -- wrong expert MLPs are applied, the resulting
hidden states diverge, the divergence cascades through the KV cache,
and the eventual decoded token differs from HF.
Add `HunyuanImage3SparseMoeBlock`, a subclass that mirrors the stock
block 1:1 except:
1. The router gate is `ReplicatedLinear(..., params_dtype=torch.float32)`,
so the `mlp.gate.wg.weight` checkpoint values (stored bf16) are
upcast into a fp32 parameter on load.
2. `forward()` casts hidden states to fp32 before the gate matmul,
does softmax / topk / clamp+divide renormalization in fp32, then
casts the topk weights back to model dtype, exactly mirroring HF's
`easy_topk` math.
3. The fp32-routed (topk_weights, topk_indices) are packed into the
`router_logits` slot and `SharedFusedMoE` is built with
`custom_routing_function=_hunyuan_image3_unpack_packed_topk`, so
the bf16 `topk_softmax` CUDA op is bypassed entirely.
`HunyuanImage3ForConditionalGeneration._patch_moe_blocks` walks the
already-built `model.layers`, pops each old experts' static-forward-
context registration, frees the old MoE block's GPU buffers (otherwise
the transient old+new allocation OOMs near the gpu_memory_utilization
cap on the 80B model with TP=2), then installs the new block. Must
run inside `__init__` so it takes effect before weight loading.
Verified end-to-end on a single greedy IT2I prompt
(`new year pet poster ...`):
- 32/32 MoE layers replaced (logged as
"Replaced 32 HunYuanSparseMoeBlock layers with
HunyuanImage3SparseMoeBlock (fp32 router matching HF reference)").
- Output deterministically diverged from the bf16-routed run, exactly
as expected from a routing-precision change.
- Removed one observed hallucination ("dog sticking out tongue") that
appeared in the bf16-routed output but not in HF's.
Does not byte-align with HF (PagedAttention vs contiguous KV cache and
sampler RNG path differences are independent architectural divergences
documented in `memory/hf_omni_alignment_method.md`), but closes the
single largest *fixable* deterministic precision gap remaining after
the prompt / preprocessing / image-pipeline alignment fixes.
Signed-off-by: TaffyOfficial <2324465096@qq.com>
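A hedged sketch of the fp32 routing math being mirrored (just the gate + easy_topk portion; the real block also packs the results for the custom_routing_function and handles weight loading):

```python
import torch
import torch.nn.functional as F

def fp32_route(hidden_states: torch.Tensor,
               gate_weight_fp32: torch.Tensor,
               top_k: int = 8):
    # Gate matmul, softmax, top-k and renormalization all in fp32,
    # matching HF's HunyuanTopKGate / easy_topk.
    logits = hidden_states.float() @ gate_weight_fp32.t()
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_indices = torch.topk(probs, top_k, dim=-1)
    weight_sums = topk_weights.sum(dim=-1, keepdim=True)
    topk_weights = topk_weights / weight_sums.clamp(min=1e-8)
    # Only the final weights are cast back to model dtype for the
    # expert MLP combine.
    return topk_weights.to(hidden_states.dtype), topk_indices
```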
The i2t and t2t stage configs are `is_comprehension: true, final_output_type: text` -- AR-only text output, no DiT image generation. They should mirror HF's `bot_task="think"` which terminates the response at `</think>`.

The previous stop_token_ids `[127957, 128026]` (`<|endoftext|>`, `</answer>`) assumed the model would naturally stop at `</answer>` once `_StageTransitionLogitsProcessor` is gated off (it only fires in generation mode, not comprehension mode). In practice the instruct-tuned model continues into a `<recaption>` section out of trained habit and never emits `</answer>` (which is only meaningful after the full `<think>...</think><recaption>...</recaption><answer>` sequence the generation pipeline runs through).

Add `</think>` (128024) to the stop_token_ids for i2t and t2t. This makes greedy IT2I AR-only output align with HF's `bot_task="think"` baseline:

|           | HF baseline | omni before | omni after |
|-----------|------------:|------------:|-----------:|
| chars     |         466 |         811 |        482 |
| bytes     |        1354 |        2375 |       1416 |
| sections  |       think | think+recap |      think |
| gap to HF |           0 |        +345 |        +16 |

Length divergence collapses from +74% to +3.4%. The remaining +16 chars sits in BF16 reduction noise / sampler implementation differences (documented in `memory/hf_omni_alignment_method.md`) and cannot be closed without reimplementing vllm's attention.

`hunyuan_image3_it2i.yaml` is intentionally NOT changed: the IT2I pipeline (`is_comprehension: false`, AR -> DiT) needs the AR stage to emit the full `<think>...<answer><boi><img_size_*><img_ratio_*>` sequence so that DiT can decode the image latents. Stopping at `</think>` there would break image generation.

Update the existing comment in `HunyuanImage3ForConditionalGeneration.__init__` that incorrectly claimed comprehension mode would stop at `</answer>` or EOS, so future readers understand why we explicitly stop at `</think>` in the yaml.

Verified end-to-end: greedy IT2I AR-only output now ends cleanly at the analysis section, byte-for-byte structurally aligned with HF's `bot_task="think"` output (only differs in BF16-noise-driven per-token text divergence, no extra recaption section).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
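A quick way to double-check the token id used here (a sketch, assuming the HunyuanImage-3.0-Instruct tokenizer is available locally):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "tencent/HunyuanImage-3.0-Instruct", trust_remote_code=True
)
# Expect 128024 per the commit message; <|endoftext|> and </answer>
# resolve to the ids already present in the yaml stop list.
print(tok.convert_tokens_to_ids("</think>"))
```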
40ac16c to 07d8cf0
Signed-off-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
7756476 to 6978fd7
Add a regression test for this.
Move build_prompt and build_prompt_tokens out of the example script into vllm_omni/diffusion/models/hunyuan_image3/prompt_utils.py so the AR-prefill prompt template has a single source of truth that downstream callers can reuse.

The DiT pipeline keeps using TokenizerWrapper.apply_chat_template (which eagerly consumes JointImageInfo); prompt_utils targets the lighter client-side flow that uses an <img> placeholder + multi_modal_data.

README is updated to describe the actual instruct chat template (the previous "pretrain template" wording was stale relative to the post-fix behavior introduced earlier in this PR) and to point at the new module.

Addresses GH PR review comment requesting a common prompt-construction function shared across AR / DiT / end2end.py.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…m-project#3243)

Three layers of protection for the bug fixed in this PR:

1. Pure-logic structural tests (FakeTokenizer-based) verify:
   - The chat template framing (<|startoftext|> ... User: ... Assistant: ...)
   - Trigger tag (<think> / <recaption>) is appended AFTER `Assistant: ` (Part A regression: putting the trigger BEFORE user_prompt sends the model into a death-loop under greedy decoding).
   - <img> placeholder is positioned correctly for image-input tasks.
   - Each prompt segment is encoded in an isolated tokenizer.encode() call so cross-segment BPE merges cannot occur (the bug from commit 7bd429e).

2. AST-based wiring guard verifies that examples/.../end2end.py imports build_prompt_tokens from prompt_utils and does NOT redefine it locally. This protects the *delivery vector* of the original regression: the wrong template re-entered the example via a hand-rolled local builder that diverged from the canonical helper.

3. Real-tokenizer regression (skipped if HunyuanImage3 not in HF cache) asserts that segment-by-segment build_prompt_tokens produces a STRICTLY different id sequence than tokenizer.encode(build_prompt(...)) for a `。`-ending prompt. If a future "simplification" replaces segment encode with full-string encode, the BPE-merge-bypass behavior is gone and this test fires.

Verified on remote (2x L20X, transformers 4.57.1, HunyuanImage3-Instruct tokenizer in HF cache): 19/19 passed including the real-tokenizer test.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
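A condensed sketch of the layer-1 structural check (reusing the illustrative build_prompt_tokens_sketch helper from the earlier sketch in place of the real prompt_utils.build_prompt_tokens):

```python
class FakeTokenizer:
    """Records every encode() call; one synthetic id per segment."""

    def __init__(self):
        self.calls = []

    def encode(self, text, add_special_tokens=False):
        self.calls.append(text)
        return [len(self.calls)]


def test_trigger_comes_after_assistant_prefix():
    tok = FakeTokenizer()
    build_prompt_tokens_sketch(tok, "describe this image",
                               with_image=True, trigger="<think>")
    # Part A regression: the trigger must follow the assistant prefix,
    # and the <img> placeholder must precede the user prompt.
    assert tok.calls.index("\n\nAssistant: ") < tok.calls.index("<think>")
    assert tok.calls.index("<img>") < tok.calls.index("describe this image")
```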
down
CI ruff format check required collapsing short multi-line constructs (single-line assertions, single-line if conditions) onto one line. No semantic change; 19/19 tests still pass.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@hsliuustc0106 need review
Please fix CI
PR vllm-project#3232 [Rebase] Rebase to vllm 0.20.0 folded `SharedFusedMoE` into `FusedMoE` and dropped the `vllm.model_executor.layers.fused_moe.shared_fused_moe` submodule, which broke pytest collection for tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_sampler.py with `ModuleNotFoundError: No module named 'vllm.model_executor.layers.fused_moe.shared_fused_moe'` across all four CI suites on this branch (simple-unit-test, diffusion-cache-backend-test, cuda-unit-test-with-single-card, cuda-unit-test-with-multi-cards).

Mirrors the same minimal fix applied on cr/pr3107-rebased:

- Wrap the legacy import in try/except and fall back to `FusedMoE as SharedFusedMoE`. `FusedMoE` now accepts `shared_experts=` directly and the call sites only use `make_expert_params_mapping` and `__init__(shared_experts=..., ...)`, both present on `FusedMoE`.
- Drop `reduce_results=False` from the `SharedFusedMoE(...)` call — vllm 0.20 removed that parameter from `FusedMoE.__init__`.
- Drop the manual `(routed, shared)` tuple merge and `tensor_model_parallel_all_reduce` post-processing in `HunyuanImage3SparseMoeBlock.forward`. vllm 0.20+ `FusedMoE` merges shared-experts internally and runs the TP all-reduce inside its forward, so the result is the already-combined, already-reduced tensor.

Signed-off-by: zuiho <2324465096@qq.com>
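A minimal sketch of the compatibility import described in the first bullet (the actual change also adjusts the call sites):

```python
try:
    # vllm < 0.20: dedicated shared-experts MoE layer
    from vllm.model_executor.layers.fused_moe.shared_fused_moe import (
        SharedFusedMoE,
    )
except ImportError:
    # vllm >= 0.20: SharedFusedMoE was folded into FusedMoE, which now
    # accepts shared_experts= directly
    from vllm.model_executor.layers.fused_moe import FusedMoE as SharedFusedMoE
```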
Lint follow-up to c5089d1. The previous commit removed the call to `tensor_model_parallel_all_reduce` in `HunyuanImage3SparseMoeBlock.forward` (vllm 0.20+ FusedMoE runs the TP all-reduce internally) but left the symbol in the `from vllm.distributed import (...)` block, which `ruff` flags as unused (F401).

Signed-off-by: zuiho <2324465096@qq.com>
fix now
Address PR vllm-project#3107 review (Bounty-hunter / Gaohan123) requesting AR-output-format and DiT-output-accuracy regression tests. Layout mirrors PR vllm-project#2949's split (CPU unit test under tests/diffusion/..., GPU accuracy test under tests/e2e/accuracy/...).

CPU unit test: tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py
- test_ar_prefill_tokens_match_hf_apply_chat_template_for_it2i: asserts build_prompt_tokens (the AR-side prefill builder) is token-id-identical to HF tokenizer.apply_chat_template for the same (system, user_prompt, image) triple. Catches drift between the AR's input distribution and the model's training distribution -- the same failure mode PR vllm-project#3243 fixed for T2I.
- test_dit_condition_image_preprocessing_byte_matches_ar_processor: asserts the diffusion-side _resize_and_crop_center produces byte-identical pixels to the AR-side HunyuanImage3Processor._resize_and_crop on the canonical resize targets. Direct response to Bounty-hunter's PR vllm-project#3107 review.

Both tests gate on tencent/HunyuanImage-3.0-Instruct being in the local HF cache (no GPU/model weights required at runtime, just the tokenizer config + image processor).

GPU accuracy test: tests/e2e/accuracy/test_hunyuan_image3_it2i.py
- test_hunyuan_image3_it2i_matches_hf_reference_psnr_40: drives vllm-omni's offline IT2I path through Omni and runs the official HF reference via AutoModelForCausalLM.generate_image, compared via the shared assert_similarity helper at PSNR>=40 dB and SSIM>=0.92. Marked full_model + skipif<8 GPUs; the threshold follows PR vllm-project#2949's review discussion (40 dB gives slack for TP=2 NCCL drift while still catching prompt/image-preprocessing bugs).

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
… Instruct chat template (vllm-project#3243)

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
Signed-off-by: zuiho <2324465096@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
Summary
Multi-part fix for offline AR output of tencent/HunyuanImage-3.0-Instruct in vllm-omni:
Part A — Structural fix. The current main branch produces, for IT2I, no coherent <think> analysis (just "image_1" repeated × 6) and, for T2T, death-loop repetition. Both modes are essentially unusable.

Root cause: build_prompt() in examples/offline_inference/hunyuan_image3/end2end.py uses a "pretrain-style" concatenation that misplaces the <think> trigger relative to the user prompt, putting the user instructions inside the model's "thinking section". Under any decoding (greedy or sampling) the model collapses into degenerate fixed points.

Fix: switch to the instruct chat template with the trigger after the Assistant: prefix, matching what HF's apply_general_template(..., sequence_template="instruct") emits, plus a build_prompt_tokens() segment-by-segment tokenizer to bypass cross-segment BPE merges.

Part B — Precision alignment with HF reference (fp32 MoE router). After Part A the output is structurally correct, but a lower-amplitude divergence from HF remained. Investigation traced this to MoE routing precision.

hunyuan3.0_ins/modeling_hunyuan_image_3.py runs the router in fp32 on purpose: the gate wg is built as nn.Linear(..., dtype=torch.float32), hidden states are cast to fp32 before the matmul, the call is wrapped in torch.autocast('cuda', enabled=False), and easy_topk does softmax -> topk -> clamp renormalization entirely in fp32.

vLLM's stock HunYuanSparseMoeBlock builds the gate as a default-dtype (bf16) ReplicatedLinear and lets SharedFusedMoE's topk_softmax CUDA op consume bf16 logits. With 64 experts, top-k=8 per layer, and 32 MoE layers, bf16 quantization can flip top-k boundary decisions on close routing scores → wrong expert MLPs are applied, the resulting hidden states diverge, the divergence cascades through the KV cache, and the eventual decoded token differs from HF.

Fix: add HunyuanImage3SparseMoeBlock, a subclass + post-init module replacement that mirrors the stock block 1:1 except:

1. The router gate is ReplicatedLinear(..., params_dtype=torch.float32), so the mlp.gate.wg.weight checkpoint values (stored bf16) are upcast into a fp32 parameter on load.
2. forward() casts hidden states to fp32 before the gate matmul, does softmax / topk / clamp+divide renormalization in fp32, then casts the topk weights back to model dtype, exactly mirroring HF's easy_topk math.
3. The fp32-routed (topk_weights, topk_indices) are packed into the router_logits slot and SharedFusedMoE is built with custom_routing_function=_hunyuan_image3_unpack_packed_topk, so the bf16 topk_softmax CUDA op is bypassed entirely.

HunyuanImage3ForConditionalGeneration._patch_moe_blocks() walks the already-built model.layers, pops each old experts' static-forward-context registration, frees the old MoE block's GPU buffers (otherwise the transient old+new allocation OOMs near the gpu_memory_utilization cap on the 80B model with TP=2), then installs the new block. Runs inside __init__ so it takes effect before weight loading.

Implementation strategy: subclass + post-init module replacement, not monkey-patch. No vLLM upstream files are modified. The replacement is local to vllm_omni/model_executor/models/hunyuan_image3/hunyuan_image3.py, so other models that consume HunYuanSparseMoeBlock are unaffected.

Part C — Stop AR-only output at </think> for i2t/t2t. The i2t.yaml and t2t.yaml stage configs are is_comprehension: true, final_output_type: text (AR-only text, no DiT). They should mirror HF's bot_task="think", which terminates the response at </think>. The previous stop_token_ids: [127957, 128026] (<|endoftext|>, </answer>) assumed the model would naturally stop at </answer> once _StageTransitionLogitsProcessor is gated off (it only fires in generation mode, not comprehension mode). In practice the instruct-tuned model continues into a <recaption> section out of trained habit and never emits </answer>. Add </think> (128024) to the stop_token_ids for i2t and t2t.

it2i.yaml is intentionally not changed — the AR→DiT pipeline needs the AR stage to emit the full <think><answer><boi><img_size_*><img_ratio_*> sequence for DiT to decode image latents.

After this PR:
- AR-only output (is_comprehension=true): 482 chars, full <think> analysis terminating at </think>, structurally identical to HF baseline (no extra recaption section, no "tongue" hallucination)
- AR portion of the full IT2I pipeline (is_comprehension=false): 811 chars (think 482 + recap 329), structurally identical to HF model.generate_image(bot_task="think_recaption") AR portion (think 448 + recap 329 — exact same length as omni)

Verified on remote 2× L20 with tencent/HunyuanImage-3.0-Instruct and transformers==4.57.1.

Commits
- d71981e7
- d360569a: build_prompt instruct format (early)
- 27083f9c: Assistant:
- ea809348: A: → Assistant: to match HF tokenizer's instruct conv-roles output
- 7bd429ed: build_prompt_tokens() returns list[int]; bypass cross-segment BPE merges
- 3d415e17: <img> placeholder + injected timestep_emb(0) is embedding-equivalent to HF's <timestep> token
- 8a1a4af9
- 41d29432: process_image → _vae_encode boundary (vae_pixel_values fp32 byte-identical with HF, mean=0.157296)
- 31c2fa56: HunyuanImage3SparseMoeBlock: fp32 router matching HF (gate weight fp32 + fp32 softmax/topk + clamp(min=1e-8) renorm + bypasses bf16 topk_softmax CUDA op via custom_routing_function). 32/32 MoE layers replaced via _patch_moe_blocks().
- 07d8cf0d: </think> (128024) added to stop_token_ids in hunyuan_image3_i2t.yaml and hunyuan_image3_t2t.yaml so AR-only comprehension output terminates after the analysis section, matching HF's bot_task="think".

(History rewritten on 2026-04-30 to remove unrelated CI commits that got pulled in via merges; see force-push at hash 0413c2c2..07d8cf0d. The 10 commits above are the full PR scope: 4 files changed, +449 / -24 lines.)
Test plan
Reproduce on 2× L20 with tencent/HunyuanImage-3.0-Instruct, transformers==4.57.1, enforce_eager=true.

Test inputs:
- T2T: "Describe the Eiffel Tower in detail. What makes it architecturally significant?"
- IT2I: assets/demo_instruct_imgs/input_0_0.png + "新年宠物海报,Q版圆润..."

Mode 1: IT2I greedy AR-only (think only). HF reference runs bot_task="think"; omni stops at </think>. Both sides produce a full <think> analysis.

Mode 2: IT2I greedy think + recaption AR (this is the AR portion of model.generate_image(bot_task="think_recaption")). Omni runs the full IT2I pipeline (is_comprehension=false); the HF reference is the generate_image() AR portion.

The recaption sections being exactly the same length demonstrates that omni's _StageTransitionLogitsProcessor is operationally equivalent to HF generate_image()'s stage_transitions parameter. The remaining +34 chars sits in the <think> section and is the same class of BF16 reduction-noise drift as Mode 1's +16 chars.
Outputs
Full text outputs are saved at it2i_t2t_outputs/ for review (this includes the rejected test methodology — file E_HF_auto_recaption_*.txt — kept as a cautionary example with the README explaining why it's not comparable).

File F (HF official generate_image() AR portion, 777 chars)

File D (omni this PR, is_comprehension=false AR, 811 chars)
Side-by-side recaption comparison (independent third-party review)
The two recaption sections were given to a separate LLM (DeepSeek) for
neutral side-by-side review. Verbatim verdict:
In other words: omni's recaption is slightly more verbose / more
specific in three or four optional rendering details but covers the
exact same image-edit intent as HF's reference, and would produce a
visually equivalent image when fed to DiT.
Known limitations (intentionally not fixed in this PR)
The remaining +16/+34 chars gap to HF baseline sits in BF16 reduction noise across the few decoded tokens that follow slightly different code paths in vllm vs transformers. Confirmed not fixable at the prompt / preprocessing / routing layer:

- Prompt: build_prompt_tokens() (text input byte-identical to HF apply_chat_template, 1227 leading tokens verified)
- Gap unchanged between transformers 4.57.1 and 5.6.2 (same input, same yaml, same seed)
- vae_pixel_values fp32-byte-identical with HF (mean=0.157296 both sides) after 41d29432
- <timestep> slot: embedding-layer equivalent via injected timestep_emb(0) (single-token swap regression-tested: swapping the placeholder breaks output)
- MoE routing: clamp(min=1e-8) renormalization now matches HF exactly via subclass (31c2fa56)
- Stop tokens: </think> matching HF's bot_task="think" (07d8cf0d)
- _StageTransitionLogitsProcessor proven equivalent to HF generate_image()'s stage_transitions parameter (Mode 2 recaption length identical: 329 vs 329 chars)

Confirmed remaining differences sit in implementation layers vllm-omni overrides for performance:

- Attention: PagedAttention vs HF's contiguous KV cache (reduction order differs)
- Expert MLP: fused kernel compute (separate from the routing fixed in this PR) vs HF's python loop — BF16 reduction order in the weighted sum differs
- Sampler: vllm/v1/sample/sampler.py vs transformers' _sample() — different RNG primitives, different logits-processor ordering, different top_k implementation. Even with identical seeds, these produce non-aligned token sequences. This is an architectural property of vllm, not a regression — enforce_eager=true only disables compile/CUDAGraphs, not these.
- TP=2 tensor-parallel matmul + all-reduce vs single-GPU full-rank matmul

Reaching byte-identical alignment with HF would require replacing vllm-omni's forward implementation with HF's — defeats the purpose of vllm-omni. This PR provides functionally correct output with all key elements covered, structurally aligned with HF (terminates at </think> for AR-only mode, identical-length recaption section for think+recaption mode), and with the largest fixable deterministic precision gap (MoE router) closed.
What this PR does NOT claim
Earlier related PRs (#2713 testing first-30-tokens, #2986 smoke tests with a len > 0 assertion) accepted significant divergence as "BF16 / GPU non-determinism". This PR's investigation showed those prior tests bypassed the buggy build_prompt path entirely (using HF-derived input_ids directly) and used assertions too weak to detect death-loop output. We do not repeat that pattern. We acknowledge:

- the residual divergence from HF is real (BF16 / architectural, not removable here)
- AR-only text output deserves a dedicated regression check (no DiT to mask AR text quality) — recommend adding to CI

The PR's contract: functionally correct, structurally aligned, MoE routing precision-aligned with HF reference, AR-only output terminates at </think> matching HF, think+recaption AR portion produces an identical-length recaption section as HF reference, with greedy IT2I within +3.4% (think only) / +4.4% (think+recap) length of HF baseline. Not byte-identical (impossible without replacing vllm's attention / expert MLP / sampler).