[Bugfix][HunyuanImage3] Fix offline AR garbage output by switching to Instruct chat template#3243

Merged

Gaohan123 merged 17 commits into vllm-project:main from TaffyOfficial:feature/hunyuan-t2t-sdpa-fa on May 5, 2026
Conversation

@TaffyOfficial (Contributor) commented Apr 29, 2026

Summary

Multi-part fix for offline AR output of tencent/HunyuanImage-3.0-Instruct
in vllm-omni:

Part A — Structural fix. The current main branch produces:

  • IT2I greedy: 2267 chars with no <think> analysis, just "image_1 × 6"
    death-loop repetition
  • IT2I sampling: 8188 chars of "吐舌头吐舌头吐舌头..." complete garbage

Both modes are essentially unusable. Root cause: build_prompt() in
examples/offline_inference/hunyuan_image3/end2end.py uses a
"pretrain-style" concatenation that misplaces the <think> trigger
relative to the user prompt, putting user instructions inside the
model's "thinking section". Under any decoding (greedy or sampling)
the model collapses into degenerate fixed points. Fix: switch to the
instruct chat template with the trigger after the Assistant: prefix,
matching what HF's apply_general_template(..., sequence_template="instruct")
emits, plus a build_prompt_tokens() segment-by-segment tokenizer to
bypass cross-segment BPE merges.
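A minimal sketch of the segment-by-segment idea (the helper name matches the PR; the segment strings follow the instruct template described in the commits below, and the exact signature is illustrative):

def build_prompt_tokens(tokenizer, system_prompt: str, user_prompt: str,
                        trigger: str = "<think>") -> list[int]:
    # Encode each chat-template segment in isolation so BPE cannot merge
    # tokens across segment boundaries (e.g. a user prompt ending in "。"
    # followed by "\n\n" collapsing into a single token id).
    segments = ["<|startoftext|>", system_prompt, "\n\n", "User: ",
                user_prompt, "\n\nAssistant: ", trigger]
    token_ids: list[int] = []
    for seg in segments:
        token_ids += tokenizer.encode(seg, add_special_tokens=False)
    return token_ids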

Part B — Precision alignment with HF reference (fp32 MoE router).
After Part A the output is structurally correct, but a lower-amplitude
divergence from HF remained. Investigation traced this to MoE routing
precision.

hunyuan3.0_ins/modeling_hunyuan_image_3.py runs the router in fp32
on purpose:

# HunyuanTopKGate.__init__ (line 1102)
self.wg = nn.Linear(hidden_size, num_experts, bias=False, dtype=torch.float32)

# HunyuanTopKGate.forward (line 1114)
if self.wg.weight.dtype == torch.float32:
    hidden_states = hidden_states.float()

# HunyuanMoE.forward (line 1204) — explicit defense against AMP cast
with torch.autocast('cuda', enabled=False):
    topk_weights, topk_idx = self.gate(hidden_states, topk_impl='easy')

# HunyuanTopKGate.easy_topk (line 1132)
gates = F.softmax(logits, dim=1)                  # fp32
topk_weight_1, idx = torch.topk(gates, moe_topk)  # fp32
weight_sums = clamp(topk_weight_1.sum(...), min=1e-8)
topk_weight = topk_weight_1 / weight_sums         # fp32
# only AFTER the routing decision is fixed, cast back to bf16:
topk_weight = topk_weight.to(hidden_states.dtype)

vLLM's stock HunYuanSparseMoeBlock builds the gate as a default-dtype
(bf16) ReplicatedLinear and lets SharedFusedMoE's topk_softmax
CUDA op consume bf16 logits. With 64 experts, top-k=8 per layer, and
32 MoE layers, bf16 quantization can flip top-k boundary decisions on
close routing scores → wrong expert MLPs are applied, the resulting
hidden states diverge, the divergence cascades through the KV cache,
and the eventual decoded token differs from HF.

Fix: add HunyuanImage3SparseMoeBlock, a subclass + post-init module
replacement that mirrors the stock block 1:1 except:

  1. Router gate is ReplicatedLinear(..., params_dtype=torch.float32),
    so the mlp.gate.wg.weight checkpoint values (stored bf16) are
    upcast into a fp32 parameter on load.
  2. forward() casts hidden states to fp32 before the gate matmul,
    does softmax / topk / clamp+divide renormalization in fp32, then
    casts the topk weights back to model dtype, exactly mirroring HF's
    easy_topk math.
  3. fp32-routed (topk_weights, topk_indices) are packed into the
    router_logits slot and SharedFusedMoE is built with
    custom_routing_function=_hunyuan_image3_unpack_packed_topk, so
    the bf16 topk_softmax CUDA op is bypassed entirely.
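For concreteness, a minimal sketch of the fp32 routing math in items 1–2 (mirrors HF's easy_topk; function and argument names are illustrative, not the block's actual API):

import torch
import torch.nn.functional as F

def route_fp32(hidden_states, gate_weight_fp32, moe_topk=8):
    # Route in fp32 so bf16 noise cannot flip close top-k boundaries.
    logits = hidden_states.float() @ gate_weight_fp32.t()         # fp32 matmul
    gates = F.softmax(logits, dim=-1)                             # fp32
    topk_weights, topk_idx = torch.topk(gates, moe_topk, dim=-1)  # fp32
    denom = torch.clamp(topk_weights.sum(dim=-1, keepdim=True), min=1e-8)
    topk_weights = topk_weights / denom                           # fp32 renorm
    # Only after the routing decision is fixed, cast back to model dtype.
    return topk_weights.to(hidden_states.dtype), topk_idx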

HunyuanImage3ForConditionalGeneration._patch_moe_blocks() walks the
already-built model.layers, pops each old experts module's
static-forward-context registration, frees the old MoE block's GPU
buffers (otherwise the transient old+new allocation OOMs near the
gpu_memory_utilization cap on the 80B model with TP=2), then installs
the new block. It runs inside __init__ so it takes effect before
weight loading.
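In outline, the replacement pass looks roughly like this (a hedged sketch; the real method takes no factory argument and also handles the static-forward-context bookkeeping):

import gc
import torch

def patch_moe_blocks(model, make_block):
    # Swap each layer's MoE block in place; drop the old block's
    # reference immediately so old and new never coexist at peak memory.
    for layer in model.layers:
        old_block = layer.mlp
        layer.mlp = make_block(old_block)
        del old_block
    gc.collect()
    torch.cuda.empty_cache()  # free the old blocks' GPU buffers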

Implementation strategy: subclass + post-init module replacement, not
a monkey-patch. No vLLM upstream files are modified. The replacement is
local to vllm_omni/model_executor/models/hunyuan_image3/hunyuan_image3.py,
so other models that consume HunYuanSparseMoeBlock are unaffected.

Part C — Stop AR-only output at </think> for i2t/t2t. The
i2t.yaml and t2t.yaml stage configs set is_comprehension: true and
final_output_type: text (AR-only text, no DiT). They should mirror
HF's bot_task="think", which terminates the response at </think>.
The previous stop_token_ids: [127957, 128026] (<|endoftext|>,
</answer>) assumed the model would naturally stop at </answer>
once _StageTransitionLogitsProcessor is gated off (it only fires in
generation mode, not comprehension mode). In practice the
instruct-tuned model continues into a <recaption> section out of
trained habit and never emits </answer>. Add </think>
(128024) to the stop_token_ids for i2t and t2t. it2i.yaml is
intentionally not changed — the AR→DiT pipeline needs the AR
stage to emit the full <think><answer><boi><img_size_*><img_ratio_*>
sequence for DiT to decode image latents.
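A quick sanity check of those token ids (hedged: assumes the tokenizer is available locally and maps these special tokens to single ids):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "tencent/HunyuanImage-3.0-Instruct", trust_remote_code=True)
for token, expected in [("<|endoftext|>", 127957), ("</answer>", 128026),
                        ("</think>", 128024)]:
    assert tok.convert_tokens_to_ids(token) == expected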

After this PR:

  • IT2I greedy AR-only (is_comprehension=true): 482 chars, full
    <think> analysis terminating at </think>, structurally identical
    to HF baseline (no extra recaption section, no "tongue" hallucination)
  • IT2I greedy think+recaption AR (is_comprehension=false, AR portion
    of full IT2I pipeline): 811 chars (think 482 + recap 329),
    structurally identical to HF model.generate_image(bot_task="think_recaption")
    AR portion (think 448 + recap 329 — recaption exactly the same length as omni's)
  • IT2I sampling: 2751 chars, full think + recaption-style content
  • T2T greedy: numbered structured output matching HF baseline structure

Verified on remote 2× L20 with tencent/HunyuanImage-3.0-Instruct and
transformers==4.57.1.

Commits

| commit | what |
|---|---|
| d71981e7 | Siglip2 transformers≥5.x list compat |
| d360569a | T2T build_prompt instruct format (early) |
| 27083f9c | Unify all chat tasks under instruct template, trigger after Assistant: |
| ea809348 | A: → Assistant: to match HF tokenizer's instruct conv-roles output |
| 7bd429ed | New build_prompt_tokens() returns list[int]; bypasses cross-segment BPE merges |
| 3d415e17 | docs: timestep slot uses <img> placeholder + injected timestep_emb(0) is embedding-equivalent to HF's <timestep> token |
| 8a1a4af9 | docs: image preprocessing already aligned with HF (resize/crop math + VAE normalize identical) |
| 41d29432 | Defer VAE pixel dtype cast from process_image to _vae_encode boundary (vae_pixel_values fp32 byte-identical with HF, mean=0.157296) |
| 31c2fa56 | HunyuanImage3SparseMoeBlock: fp32 router matching HF (gate weight fp32 + fp32 softmax/topk + clamp(min=1e-8) renorm; bypasses bf16 topk_softmax CUDA op via custom_routing_function); 32/32 MoE layers replaced via _patch_moe_blocks() |
| 07d8cf0d | Add </think> (128024) to stop_token_ids in hunyuan_image3_i2t.yaml and hunyuan_image3_t2t.yaml so AR-only comprehension output terminates after the analysis section, matching HF's bot_task="think" |

(History rewritten on 2026-04-30 to remove unrelated CI commits that
got pulled in via merges; see force-push at hash 0413c2c2..07d8cf0d.
The 10 commits above are the full PR scope: 4 files changed, +449 / -24 lines.)

Test plan

Reproduce on 2× L20 with tencent/HunyuanImage-3.0-Instruct,
transformers==4.57.1, enforce_eager=true.

Test inputs:

  • T2T: "Describe the Eiffel Tower in detail. What makes it architecturally significant?"
  • IT2I: assets/demo_instruct_imgs/input_0_0.png + "新年宠物海报,Q版圆润..."

Mode 1: IT2I greedy AR-only (think only)

|                | main branch                   | This PR               | HF baseline (bot_task="think") | gap         |
|----------------|-------------------------------|-----------------------|--------------------------------|-------------|
| chars          | 2267 (death loop, no <think>) | 482                   | 466                            | +16 / +3.4% |
| structure      | "image_1 × 6" repeats         | full <think> analysis | full <think> analysis          | match       |
| hallucinations | "image_1" loop                | none                  | none                           | match       |

Mode 2: IT2I greedy think + recaption AR (this is the AR portion of model.generate_image(bot_task="think_recaption"))

|                        | omni this PR (is_comprehension=false) | HF official generate_image() AR | gap                                        |
|------------------------|---------------------------------------|---------------------------------|--------------------------------------------|
| total chars            | 811 (think 482 + recap 329)           | 777 (think 448 + recap 329)     | +34 / +4.4%                                |
| recaption length       | 329 chars                             | 329 chars                       | identical                                  |
| think first divergence | char 64                               | char 64                         | byte-identical first 64 chars / ~21 tokens |

The recaption sections being exactly the same length demonstrates
that omni's _StageTransitionLogitsProcessor is operationally
equivalent to HF generate_image()'s stage_transitions parameter.
The remaining +34 chars sit in the <think> section and are the same
class of BF16 reduction-noise drift as Mode 1's +16 chars.

Outputs

Full text outputs are saved at it2i_t2t_outputs/ for review (this
includes the rejected test methodology — file E_HF_auto_recaption_*.txt
— kept as a cautionary example with the README explaining why it's not
comparable).

File F (HF official generate_image() AR portion, 777 chars)

Full text:

用户希望将一张可爱的金毛幼犬照片改造成一张充满节日氛围的新年宠物海报。这张参考图展示了一只坐在木质地板上的金毛幼犬,背景是户外的花丛。原始指令非常具体,要求添加特定的标题文字、改变背景、调整构图以及应用特定的艺术风格。首先,我需要分析文字部分的添加,主标题"新年快乐汪"需要采用Q版圆润且可爱的字体,副标题"HAPPY NEW YEAR"则应作为补充。其次,背景需要从户外的自然环境彻底切换到室内的房间门口,这涉及到场景的完全重构。构图上,指令提到了鱼眼镜头和近景特写,这意味着画面中心的小狗头部会因为透视效果而显得更加圆润和突出,而背景中的门框和室内陈设会呈现出向四周弯曲的视觉特征。小狗本身的配饰也需要改变,从原来的粉色项圈换成更具节日气息的红色围巾和红色毛线帽。最后,整体风格要模拟宝丽莱相纸的质感,带有胶片颗粒感和复古的写实主义风格。在改写指令时,我需要将这些抽象的风格描述转化为具体的视觉细节,确保生成的图像能够准确体现这些变化,同时保持小狗原本那种歪头微笑的可爱神态。</think><recaption>将参考图中的金毛幼犬制作成一张复古风格的新年宠物海报。首先,将背景从户外的木质地板和花丛替换为室内的房间门口,背景中应包含白色的门框和室内模糊的家具陈设。对画面应用鱼眼镜头效果,使中心的小狗头部显得更加圆润可爱,且背景呈现出向边缘弯曲的透视感。在画面顶部添加大号的Q版圆润艺术字标题"新年快乐汪",并在其下方添加较小的英文副标题"HAPPY NEW YEAR"。给小狗戴上一顶红色的针织毛线帽,并在脖子上围上一条厚实的红色围巾,同时保留它原本歪头微笑、吐着舌头的可爱表情。最后,为整张图片添加宝丽莱相纸的白色边框,并赋予其胶片摄影的质感,包括细腻的打印颗粒感和轻微的复古色调,使画面呈现出一种超写实的怀旧艺术感。</recaption>

File D (omni this PR is_comprehension=false AR, 811 chars)

Full text:

用户希望将一张可爱的金毛幼犬照片改造成一张充满节日氛围的新年宠物海报。这张参考图展示了一只坐在木质地板上的金毛幼犬,背景是户外的蒲公英花丛。原始指令非常具体,要求添加特定的标题文字、改变背景、调整构图以及应用特定的艺术风格。首先,我需要分析文字部分的添加,主标题"新年快乐汪"需要采用Q版圆润且可爱的字体,副标题"HAPPY NEW YEAR"则应作为补充。其次,背景需要从户外的自然景观彻底切换到室内的房间门口,这涉及到场景的完全重构。构图上,指令提到了鱼眼镜头和近景特写,这意味着画面中心的小狗头部会因为透视效果而显得更加圆润和突出,而背景中的门框和室内陈设会呈现出向四周弯曲的视觉特征。在主体细节方面,小狗需要佩戴红色的毛线帽和红色的围巾,这不仅增加了节日感,也呼应了标题中的红色元素。最后,整体风格被定义为宝丽莱相纸、胶片摄影和复古感,这意味着图像需要具备明显的颗粒感、柔和的色彩饱和度以及相纸特有的白色边框。综合这些要求,我需要构建一个详细的指令,指导模型如何从参考图出发,通过改变背景、添加配饰、调整构图和滤镜风格,最终生成一张符合所有描述的新年海报。请基于参考图中的金毛幼犬,将其改造成一张复古胶片风格的新年宠物海报。首先,将背景从户外的木质地板和蒲公英花丛替换为室内的房间门口场景,背景中应包含白色的门框和室内模糊的家具陈设。在构图上,采用鱼眼镜头效果,使画面中心的小狗头部呈现出圆润的特写感,身体比例相应缩小。为小狗添加节日配饰:在头顶戴一顶红色的针织毛线帽,脖子上围一条厚实的红色针织围巾。在图像上方添加两行文字,第一行是主标题"新年快乐汪",使用圆润、带有描边的可爱艺术字体;第二行是副标题"HAPPY NEW YEAR",字体稍小,位于主标题下方。最后,为整张图片应用宝丽莱相纸的视觉风格,包括宽大的白色相纸边框、明显的胶片颗粒感、柔和的复古色调以及轻微的暗角效果,确保小狗的绒毛细节依然清晰可见。

Side-by-side recaption comparison (independent third-party review)

The two recaption sections were given to a separate LLM (DeepSeek) for
a neutral side-by-side review. Its verdict (translated from the Chinese original):

The core creative intent and final result of the two descriptions
differ very little; they are essentially describing the same poster.
Comparing details, the first passage (omni D) adds the following
explicit requirements over the second (HF F):

  • Body proportion: the first explicitly says "the body proportion is
    correspondingly reduced"; the second only mentions the rounded head
    and the curved background.
  • Text outline: the first asks for a main title in "a cute art font
    with an outline"; the second only says "large Q-style rounded art
    lettering".
  • Vignette: the first explicitly mentions "a slight vignette effect";
    the second does not call it out separately.
  • Fur detail: the first specifically requires "ensure the puppy's fur
    detail remains clearly visible"; the second only says, generically,
    "a hyper-realistic nostalgic artistic feel".

Beyond that, the background (indoor doorway + white door frame +
blurred furniture), the fisheye-lens effect, the red knit hat + scarf,
the head-tilted tongue-out smile, the two title lines ("新年快乐汪" +
"HAPPY NEW YEAR"), and the Polaroid white frame + film grain + vintage
tones are all fully consistent between the two.

Conclusion: generating from the second description might lose the text
outline, the slight vignette, and the maximal fur sharpness, but the
overall look would be nearly the same. They can be treated as two
wordings of the same request, with little real difference.

In other words: omni's recaption is slightly more verbose and more
specific in four optional rendering details, but covers the exact same
image-edit intent as HF's reference and would produce a visually
equivalent image when fed to DiT.

Known limitations (intentionally not fixed in this PR)

The remaining +16/+34 chars gap to HF baseline sits in BF16 reduction
noise across the few decoded tokens that follow slightly different
code paths in vllm vs transformers. The gap is confirmed not to come
from the prompt / preprocessing / routing layers, all of which are now
verified aligned:

  • ✅ Prompt format — fixed
  • ✅ BPE cross-segment merge — fixed by build_prompt_tokens() (text
    input byte-identical to HF apply_chat_template, 1227 leading
    tokens verified)
  • ✅ transformers version — verified omni output is byte-identical
    between 4.57.1 and 5.6.2 (same input, same yaml, same seed)
  • ✅ Pixel preprocessing — vae_pixel_values fp32-byte-identical with
    HF (mean=0.157296 both sides) after 41d29432
  • ✅ Image token routing (<timestep> slot) — embedding-layer
    equivalent via injected timestep_emb(0) (single-token swap
    regression-tested: swapping placeholder breaks output)
  • ✅ Attention backend — FA = SDPA byte-identical for greedy
  • ✅ MoE routing precision — fp32 gate + fp32 softmax/topk +
    clamp(min=1e-8) renormalization now matches HF exactly via subclass
    (31c2fa56)
  • ✅ AR-only output termination — stops at </think> matching HF's
    bot_task="think" (07d8cf0d)
  • ✅ Stage-transition logic — omni _StageTransitionLogitsProcessor
    proven equivalent to HF generate_image()'s stage_transitions
    parameter (Mode 2 recaption length identical: 329 vs 329 chars)

The confirmed remaining differences sit in implementation layers that
vllm-omni overrides for performance:

  • vllm's PagedAttention KV cache vs HF's contiguous cache (BF16
    reduction order differs)
  • vllm-omni's Triton fused MoE expert MLP (the per-expert MLP
    compute, separate from the routing fixed in this PR) vs HF's python
    loop — BF16 reduction order in the weighted sum differs
  • vllm's Sampler (vllm/v1/sample/sampler.py) vs transformers'
    _sample() — different RNG primitives, different logits-processor
    ordering, different top_k implementation. Even with identical seeds,
    these produce non-aligned token sequences. This is an architectural
    property of vllm, not a regression — enforce_eager=true only
    disables compile/CUDAGraphs, not these.
  • TP=2 all-reduce reduction order
  • TP-sharded fp32 gate matmul (a different reduction tree than HF's
    single-GPU full-rank matmul)

Reaching byte-identical alignment with HF would require replacing
vllm-omni's forward implementation with HF's, which would defeat the
purpose of vllm-omni. This PR provides functionally correct output with all key
elements covered, structurally aligned with HF (terminates at
</think> for AR-only mode, identical-length recaption section for
think+recaption mode), and with the largest fixable deterministic
precision gap (MoE router) closed.

What this PR does NOT claim

Earlier related PRs (#2713 testing first-30-tokens, #2986 smoke tests
with len > 0 assertion) accepted significant divergence as
"BF16 / GPU non-determinism". This PR's investigation showed those
prior tests bypassed the buggy build_prompt path entirely (using
HF-derived input_ids directly) and used assertions too weak to detect
death-loop output. We do not repeat that pattern. We acknowledge:

  1. Per-token logit divergence under vllm forward exists (architectural,
    not removable here)
  2. Sampling outputs are not seed-portable across vllm and transformers
  3. T2T greedy is the strongest canary for prompt-format regressions
    (no DiT to mask AR text quality) — recommend adding to CI

The PR's contract: functionally correct, structurally aligned, MoE
routing precision-aligned with the HF reference, AR-only output
terminating at </think> to match HF, and a think+recaption AR portion
that produces a recaption section identical in length to HF's, with
greedy IT2I within +3.4% (think only) / +4.4% (think+recap) of the HF
baseline length.
Not byte-identical (impossible without replacing vllm's attention /
expert MLP / sampler).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 80e0237f31


Comment thread tests/e2e/accuracy/conftest.py Outdated
Comment on lines +301 to +302
if init_timeout is not None:
generate_params_kwargs["init_timeout"] = init_timeout

P1: Forward init timeout when creating GEBench generate server

This branch sets init_timeout=1800 for multi-GPU GEBench runs, but that value is only stored in OmniServerParams and never reaches the actual OmniServer invocation in AccuracyServerConfig.generate_server (which only prepends --stage-init-timeout). In practice, those runs still use the CLI default --init-timeout (600s), so large HunyuanImage startup can still timeout despite this override, causing the new nightly path to fail intermittently.
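A hedged sketch of the forwarding the comment asks for (field and flag names are assumed from the review text, not the actual conftest API):

def build_server_args(params):
    # Forward the per-run override so it reaches the OmniServer CLI
    # instead of silently falling back to the 600s default.
    args = ["--stage-init-timeout", str(params.stage_init_timeout)]
    if params.init_timeout is not None:
        args += ["--init-timeout", str(params.init_timeout)]
    return args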


Comment thread tests/e2e/accuracy/conftest.py Outdated
Comment on lines +332 to +334
if devices_opt:
num_devices = len([d for d in devices_opt.split(",") if d.strip()])
extra_args = ["--tensor-parallel-size", str(num_devices)]

P2: Parse extra server args independently of device overrides

--gebench-extra-server-args is only consumed inside the if devices_opt: block, so passing extra server flags without --gebench-devices is silently ignored. That makes the new option unreliable for single-GPU smoke runs (for example, users cannot pass required flags like --trust-remote-code unless they also set a device list), which is an unexpected behavioral regression for this CLI surface.
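One possible shape of the fix (a sketch; variable names follow the quoted snippet):

import shlex

def build_extra_args(devices_opt, extra_server_args_opt):
    extra_args = []
    if devices_opt:
        num_devices = len([d for d in devices_opt.split(",") if d.strip()])
        extra_args += ["--tensor-parallel-size", str(num_devices)]
    # Consume --gebench-extra-server-args unconditionally so single-GPU
    # runs can still pass flags like --trust-remote-code.
    if extra_server_args_opt:
        extra_args += shlex.split(extra_server_args_opt)
    return extra_args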


Comment on lines +63 to +67
for snap in sorted(os.listdir(snapshots)):
candidate = os.path.join(snapshots, snap, "tokenizer.json")
if os.path.isfile(candidate):
tokenizer_file = candidate
break

P2: Select newest tokenizer snapshot in HF cache fallback

The fallback loader picks the first lexicographically sorted snapshot directory and stops, which can select an older cached revision when multiple snapshots exist. In that failure path, tokenizer/model revision mismatch can change token IDs and prompt formatting, reintroducing unstable or incorrect AR behavior; selecting the active/latest snapshot (e.g., by mtime or refs) avoids this stale-cache regression.
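A sketch of the suggested mtime-based selection (hypothetical helper; the real fix might instead resolve the active revision via refs):

import os

def find_tokenizer_file(snapshots):
    # Prefer the most recently modified snapshot so a stale cached
    # revision is not picked ahead of the active one.
    dirs = [os.path.join(snapshots, s) for s in os.listdir(snapshots)]
    for snap in sorted(filter(os.path.isdir, dirs),
                       key=os.path.getmtime, reverse=True):
        candidate = os.path.join(snap, "tokenizer.json")
        if os.path.isfile(candidate):
            return candidate
    return None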


}


def build_prompt_tokens(
Contributor

Can we define a common function to construct the prompt template and use it in AR, DiT, and end2end.py? Currently the DiT constructs the template with apply_chat_template.

Contributor Author

added

@TaffyOfficial
Contributor Author

On the remaining +16 chars vs HF baseline

For anyone wondering why this PR doesn't fully byte-align with HF
(482 chars vs HF 466 chars): this is not a remaining bug, it's
unavoidable BF16 reduction-noise drift inherent to vllm's
tensor-parallel + paged-KV execution model.

What is aligned

The first 52 chars (~17–18 generated tokens) are byte-identical with
HF, proving the prompt path / preprocessing / fp32 MoE routing are now
all correctly aligned through prefill and the first ~18 decode steps.

Where it diverges

At char 53, two equally-valid Chinese sentence continuations branch:

  • HF chose: 、背景有白色蒲公英的小狗,它正对着镜头开心地笑着…
  • omni chose: 的金毛幼犬,背景是户外的蒲公英花丛…

Both branches describe the same image and cover all prompt elements
(red knit hat + scarf, fisheye lens, polaroid frame, vintage filter).
Total length differs by +16 chars.

Why it's unavoidable

After ~18 decode steps the two systems' hidden states differ by
~1e-5..1e-3 BF16 ULPs — small in absolute terms but enough to flip
top-1 on a close logit margin. Once one greedy token diverges, the
rest of the sequence does too. Sources:

| source                               | HF                                       | vllm                                            |
|--------------------------------------|------------------------------------------|-------------------------------------------------|
| KV cache                             | contiguous tensor                        | PagedAttention (16-token blocks)                |
| Multi-GPU                            | layer-wise split, each matmul on one GPU | tensor-parallel: sharded matmul + all_reduce    |
| MoE expert MLP combine               | python loop, serial accumulate           | Triton fused kernel, different accumulate order |
| Gate matmul reduction (already fp32) | full-rank single-GPU                     | TP-sharded, different reduction tree            |

All four are the same arithmetic op in a different reduction order.
BF16 addition is non-associative ((a+b)+c ≠ a+(b+c) at ~1e-3
magnitude), so each layer × each token contributes a tiny noise term.
Cumulated over 32 layers + 18 decode steps, it crosses the threshold
to flip a top-1 token.
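A standalone illustration of that non-associativity (not from the PR; plain PyTorch, and note bf16 has a 7-bit mantissa, so the ulp at 1024 is 8):

import torch

a = torch.tensor(1024.0, dtype=torch.bfloat16)
b = torch.tensor(3.0, dtype=torch.bfloat16)
c = torch.tensor(3.0, dtype=torch.bfloat16)
print((a + b) + c)  # 1024.0: each +3 rounds back down to 1024
print(a + (b + c))  # 1032.0: the exact +6 crosses the rounding midpoint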

This is a fundamental property of tensor-parallel paged-KV inference,
not specific to HunyuanImage3. Closing the +16 chars gap would
require replacing vllm's attention / fused-MoE expert MLP / sampler
with HF's reference implementations — defeats the purpose of using
vllm (which is 5–10× faster because of these choices).

What this PR delivers

| gap                                             | status                                       |
|-------------------------------------------------|----------------------------------------------|
| Death-loop garbage output (wrong prompt format) | ✅ fixed (Part A)                            |
| BPE cross-segment merge                         | ✅ fixed (42c2f349)                          |
| Pixel bf16-quant noise pre-VAE                  | ✅ fixed (80cbaa3f)                          |
| BF16 router top-k flips                         | ✅ fixed (0413c2c2, fp32 router matching HF) |
| AR-only output not stopping at </think>         | ✅ fixed (40ac16cc)                          |
| BF16 reduction-noise drift over decode steps    | ❌ architectural, not fixable here           |

Contract: functionally correct, structurally aligned, all known
fixable precision gaps closed. Per-token byte-identical output with
HF transformers is not achievable under vllm's execution model and is
not promised.

TaffyOfficial added 10 commits April 30, 2026 12:05
…rmers>=5.x

Siglip2ImageProcessorFast in transformers>=5.0 returns pixel_values,
pixel_attention_mask, and spatial_shapes as lists of tensors/tuples
instead of a single batched tensor. The old code called .squeeze(0)
directly on the list, causing AttributeError at MultiModalBudget
initialization (get_dummy_mm_inputs path) and crashing startup.

Fix by stacking list elements into a tensor before squeezing:
- pixel_values / pixel_attention_mask: torch.stack(list, dim=0)
- spatial_shapes: torch.tensor(list, dtype=torch.long) since elements
  are tuples, not tensors
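A minimal sketch of that shim (key names from the commit message; the exact call site in vllm-omni is assumed):

import torch

def normalize_siglip2_outputs(out: dict) -> dict:
    # transformers >= 5.0 returns lists here instead of batched tensors.
    if isinstance(out["pixel_values"], list):
        out["pixel_values"] = torch.stack(out["pixel_values"], dim=0)
        out["pixel_attention_mask"] = torch.stack(out["pixel_attention_mask"], dim=0)
        # spatial_shapes elements are tuples, not tensors.
        out["spatial_shapes"] = torch.tensor(out["spatial_shapes"], dtype=torch.long)
    return out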

Tested on transformers 5.6.2: both FA and SDPA backends initialize
and produce identical T2T output after this fix.

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
The pretrain-style format (system_prompt + raw user_prompt) used by
build_prompt() for task="t2t" leaves the model without an answer-start
signal. With temperature=0.0 greedy decoding it falls into garbage
repetition (e.g. "massive arches massive arches ..." ad infinitum).

Use the instruct chat format `<|startoftext|>User: {prompt}\n\nA: `
which matches what the official HF AR baseline (mode="gen_text",
sequence_template="instruct") emits via prepare_model_inputs(). With
that format vllm-omni T2T produces structured numbered output with
specific facts, comparable to the HF baseline.

Verified on remote 2x L20 with HunyuanImage-3.0-Instruct.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
The pretrain-style prompt format was producing garbage output for
greedy decoding across all chat-style tasks (T2T, I2T, IT2I, T2I):

- T2T (no trigger):   "massive arches massive arches ..." (infinite loop)
- IT2I (<think>):     repetitive "image_1 完整保留..." segments

Root cause: trigger_tag and user_prompt order were inverted vs. what
HunyuanImage3's tokenizer.apply_general_template emits for instruct
sequence_template. The model was trained to see:

    <|startoftext|>{system?}\n\nUser: {<img>?}{user_prompt}\n\nA: {trigger?}

so the trigger (e.g. <think>) sits AFTER the assistant prefix and the
model continues from there. The previous build_prompt() concatenated
trigger BEFORE user_prompt, which placed the user instructions inside
the model's "thinking section" and broke greedy decoding.

Fix:
- T2T: bare instruct template, no system prompt (matches HF baseline)
- t2i_vanilla: keep pretrain mode (it is the only task designed for it)
- All others: instruct template with trigger after `\n\nA: `

Verified on remote 2x L20 with HunyuanImage-3.0-Instruct: IT2I greedy
output is now coherent <think> analysis covering all key elements
(matches HF AR baseline structure).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
HunyuanImage3TokenizerFast.apply_general_template uses Assistant: as
the bot role prefix in instruct sequence_template (verified by
decoding HF prepare_model_inputs output with system_prompt=en_unified
+ image + bot_task=think: token 72803 = "Assistant"). Switch
build_prompt() to use the full word so the AR prefill aligns with the
official HF tokenization.

Also unify T2T to the same en_unified + Assistant: template (PR vllm-project#3107
reference implementation does the same; the previous T2T-specific
branch was a workaround for an earlier prompt-format experiment).

Note: BPE merge across user_prompt/Assistant boundary still produces
1 merged token (e.g. "。\n\n" -> single id) where HF apply_chat_template
keeps them separate. Full byte-identical alignment requires passing
pre-tokenized prompt_token_ids — that path is supported by vllm-omni
(OmniTokensPrompt) but not yet plumbed through build_prompt().

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Adds build_prompt_tokens() that mirrors HF apply_chat_template's
segment-by-segment tokenization. The previous build_prompt() returned a
single string that the engine fed through tokenizer.encode() in one BPE
pass, which merged tokens across segment boundaries (e.g. user_prompt
ending in "。" + the conv separator "\n\n" -> single token id 3490
instead of HF's [1811, 271]). This shifted tokens at the user-text /
Assistant: prefix boundary and made vllm-omni's input_ids drift from
HF's by 1-2 tokens, causing greedy outputs to diverge after the very
first generated token.

Loads the model's tokenizer in main(), encodes each conversation
segment independently (system prompt, "\n\n", "User: ", <img>
placeholder, user_prompt, "\n\nAssistant: ", trigger tag) and passes
the resulting list[int] to omni.generate() via the existing
prompt_token_ids dict path (OmniSingletonPrompt already supports
list[int] / OmniTokensPrompt — no engine-side changes needed).

t2i_vanilla still uses the pretrain whole-string path because that
mode has no chat-template segments.

Verified on remote 2x L20: text portion of input_ids (first 1227
tokens, before the <img> placeholder) is now byte-identical to HF's
prepare_model_inputs output. The trailer also matches: the previous
"。\n\nAssistant: <think>" -> [3490, 32, 25, 220, 128023] becomes the
HF-correct [1811, 271, 72803, 25, 220, 128023].

Note: build_prompt() is kept for backward compatibility but its
docstring now warns about the BPE merge issue and points to
build_prompt_tokens() as the replacement for HF-aligned inputs.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
The image expansion uses <img> (128006) at the timestep slot while HF's
apply_chat_template uses the literal <timestep> (128017). Naively
swapping the placeholder breaks output (model hallucinates additional
images) because HF's modeling forward calls
`instantiate_continuous_tokens` to *scatter-replace* the embedding at
the <timestep> position with `timestep_emb(0)` for cond images — the
wte embedding of <timestep> is irrelevant at runtime.

vllm-omni's existing <img>-placeholder + multimodal-merger path
already produces the same final hidden state at that position by
shipping `timestep_emb(0)` at the head of `embed_multimodal()`'s
combined_embeddings tensor. So the AR forward is numerically
equivalent to HF; only the dumped input_ids differ at that one slot.

Switching to <timestep> would require either a second PromptReplacement
targeting 128017, or letting `PromptUpdateDetails.select_token_id` take
a list of embed_token_ids. Both are deeper engine-level changes; out
of scope for this fix. Add explanatory comments in
`_get_prompt_updates` and `embed_multimodal` so future readers don't
re-discover this rabbit hole and don't break it with naive cleanups.

Verified on remote 2x L20: IT2I greedy output remains structurally
correct (2167 chars, full <think> analysis covering all key elements,
no image_2..N hallucinations).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
Audit of vllm-omni's process_image() against HF's image_processor:

- resize/crop math: byte-for-byte identical to HF's `resize_and_crop`
  with crop_type="center". Same aspect-ratio preservation, same
  int(round(...)) ordering, same LANCZOS resampler, same crop region
  computation.
- VAE PIL->tensor: identical. Both use transforms.Compose([ToTensor,
  Normalize([0.5], [0.5])]) — fully equivalent.
- ViT processor: same Siglip2 processor class, but transformers
  version differs at runtime (vllm-omni venv = 5.6.2; HF baseline
  venv = 4.57.1). The Siglip2ImageProcessorFast normalization path
  changed between these versions, producing ~1 ULP differences in
  pixel values. This is a venv-pinning concern, not a code bug.
- dtype cast: vllm-omni casts vae_pixel_values to bf16 here; HF stores
  fp32 and casts inside the encoder forward. Tried delaying the cast
  to mirror HF, but vllm-omni's _vae_encode runs fp32 input through a
  bf16-weighted conv3d which raises a dtype mismatch (HF avoids this
  by an explicit cast at the encoder boundary that vllm-omni does not
  have). Keep the existing cast and document the divergence — fixing
  it requires plumbing a cast into _vae_encode, out of scope for this
  PR.

Net effect of this commit: comments only. No behavior change. The
remaining numerical drift between vllm-omni and HF on image
embeddings is bounded by the transformers version delta and the BF16
reduction-order noise floor; both are out of scope for code changes
in this branch.

Verified on remote 2x L20: IT2I greedy output unchanged (2167 chars,
structurally aligned with HF).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…essor

`HunyuanImage3Processor.process_image` previously cast `vae_pixel_values`
to model dtype (bf16) right after VAE preprocessing. HF keeps these as
fp32 in `build_cond_images` and only casts inside the VAE forward, which
preserves fp32 precision through the multimodal_data dict.

Move the cast into `_vae_encode` (encoder boundary) and keep
`vae_pixel_values` as fp32 in the processor. Verified pixel-level
byte-identical with HF (fp32 mean=0.157296). Greedy IT2I output is
unchanged (the VAE encoder's first conv casts to bf16 anyway, so the
final latent is identical to before this fix), but this removes a
~7e-4 mean-abs-diff bf16 quantization error from `vae_pixel_values`
and aligns the multimodal_data path with HF.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
HF's `HunyuanTopKGate` runs the router in fp32: `wg` is constructed as
`nn.Linear(..., dtype=torch.float32)`, `hidden_states` is cast to fp32
before the matmul, the call is wrapped in
`with torch.autocast('cuda', enabled=False)`, and `easy_topk` does
`F.softmax` -> `torch.topk` -> divide by `clamp(weight_sums, min=1e-8)`,
all in fp32. Only the resulting topk weights are cast to bf16 for the
expert MLP combine.

vLLM's stock `HunYuanSparseMoeBlock` builds the gate as a default-dtype
(bf16) `ReplicatedLinear` and lets `SharedFusedMoE`'s `topk_softmax`
CUDA op consume bf16 logits. With 64 experts, top-k=8 per layer, and
32 MoE layers, bf16 quantization can flip top-k boundary decisions on
close routing scores -- wrong expert MLPs are applied, the resulting
hidden states diverge, the divergence cascades through the KV cache,
and the eventual decoded token differs from HF.

Add `HunyuanImage3SparseMoeBlock`, a subclass that mirrors the stock
block 1:1 except:

1. The router gate is `ReplicatedLinear(..., params_dtype=torch.float32)`,
   so the `mlp.gate.wg.weight` checkpoint values (stored bf16) are
   upcast into a fp32 parameter on load.
2. `forward()` casts hidden states to fp32 before the gate matmul,
   does softmax / topk / clamp+divide renormalization in fp32, then
   casts the topk weights back to model dtype, exactly mirroring HF's
   `easy_topk` math.
3. The fp32-routed (topk_weights, topk_indices) are packed into the
   `router_logits` slot and `SharedFusedMoE` is built with
   `custom_routing_function=_hunyuan_image3_unpack_packed_topk`, so
   the bf16 `topk_softmax` CUDA op is bypassed entirely.

`HunyuanImage3ForConditionalGeneration._patch_moe_blocks` walks the
already-built `model.layers`, pops each old experts' static-forward-
context registration, frees the old MoE block's GPU buffers (otherwise
the transient old+new allocation OOMs near the gpu_memory_utilization
cap on the 80B model with TP=2), then installs the new block. Must
run inside `__init__` so it takes effect before weight loading.

Verified end-to-end on a single greedy IT2I prompt
(`new year pet poster ...`):
- 32/32 MoE layers replaced (logged as
  "Replaced 32 HunYuanSparseMoeBlock layers with
   HunyuanImage3SparseMoeBlock (fp32 router matching HF reference)").
- Output deterministically diverged from the bf16-routed run, exactly
  as expected from a routing-precision change.
- Removed one observed hallucination ("dog sticking out tongue") that
  appeared in the bf16-routed output but not in HF's.

Does not byte-align with HF (PagedAttention vs contiguous KV cache and
sampler RNG path differences are independent architectural divergences
documented in `memory/hf_omni_alignment_method.md`), but closes the
single largest *fixable* deterministic precision gap remaining after
the prompt / preprocessing / image-pipeline alignment fixes.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
The i2t and t2t stage configs are `is_comprehension: true,
final_output_type: text` -- AR-only text output, no DiT image
generation. They should mirror HF's `bot_task="think"` which
terminates the response at `</think>`.

The previous stop_token_ids `[127957, 128026]` (`<|endoftext|>`,
`</answer>`) assumed the model would naturally stop at `</answer>`
once `_StageTransitionLogitsProcessor` is gated off (it only fires
in generation mode, not comprehension mode). In practice the
instruct-tuned model continues into a `<recaption>` section out of
trained habit and never emits `</answer>` (which is only meaningful
after the full `<think>...</think><recaption>...</recaption><answer>`
sequence the generation pipeline runs through).

Add `</think>` (128024) to the stop_token_ids for i2t and t2t. This
makes greedy IT2I AR-only output align with HF's `bot_task="think"`
baseline:

|             | HF baseline | omni before | omni after |
|-------------|------------:|------------:|-----------:|
| chars       |         466 |         811 |        482 |
| bytes       |        1354 |        2375 |       1416 |
| sections    |       think | think+recap |      think |
| gap to HF   |           0 |        +345 |        +16 |

Length divergence collapses from +74% to +3.4%. The remaining +16
chars sits in BF16 reduction noise / sampler implementation differences
(documented in `memory/hf_omni_alignment_method.md`) and cannot be
closed without reimplementing vllm's attention.

`hunyuan_image3_it2i.yaml` is intentionally NOT changed: the IT2I
pipeline (`is_comprehension: false`, AR -> DiT) needs the AR stage
to emit the full `<think>...<answer><boi><img_size_*><img_ratio_*>`
sequence so that DiT can decode the image latents. Stopping at
`</think>` there would break image generation.

Update the existing comment in `HunyuanImage3ForConditionalGeneration.__init__`
that incorrectly claimed comprehension mode would stop at `</answer>`
or EOS, so future readers understand why we explicitly stop at
`</think>` in the yaml.

Verified end-to-end: greedy IT2I AR-only output now ends cleanly at
the analysis section, byte-for-byte structurally aligned with HF's
`bot_task="think"` output (only differs in BF16-noise-driven
per-token text divergence, no extra recaption section).

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feature/hunyuan-t2t-sdpa-fa branch from 40ac16c to 07d8cf0 Compare April 30, 2026 04:07
Signed-off-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
@TaffyOfficial TaffyOfficial force-pushed the feature/hunyuan-t2t-sdpa-fa branch from 7756476 to 6978fd7 Compare April 30, 2026 05:54
@TaffyOfficial TaffyOfficial changed the title [WIP][Bugfix][HunyuanImage3] Fix offline AR garbage output by switching to Instruct chat template [Bugfix][HunyuanImage3] Fix offline AR garbage output by switching to Instruct chat template Apr 30, 2026
@hsliuustc0106
Collaborator

add regression test for this

@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 30, 2026
TaffyOfficial added 2 commits April 30, 2026 16:09
Move build_prompt and build_prompt_tokens out of the example script into
vllm_omni/diffusion/models/hunyuan_image3/prompt_utils.py so the AR-prefill
prompt template has a single source of truth that downstream callers can
reuse. The DiT pipeline keeps using TokenizerWrapper.apply_chat_template
(which eagerly consumes JointImageInfo); prompt_utils targets the lighter
client-side flow that uses an <img> placeholder + multi_modal_data.

README is updated to describe the actual instruct chat template (the
previous "pretrain template" wording was stale relative to the post-fix
behavior introduced earlier in this PR) and to point at the new module.

Addresses GH PR review comment requesting a common prompt-construction
function shared across AR / DiT / end2end.py.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
…m-project#3243)

Three layers of protection for the bug fixed in this PR:

1. Pure-logic structural tests (FakeTokenizer-based) verify:
   - The chat template framing (<|startoftext|> ... User: ... Assistant: ...)
   - Trigger tag (<think> / <recaption>) is appended AFTER `Assistant: `
     (Part A regression: putting the trigger BEFORE user_prompt sends the
     model into a death-loop under greedy decoding).
   - <img> placeholder is positioned correctly for image-input tasks.
   - Each prompt segment is encoded in an isolated tokenizer.encode() call
     so cross-segment BPE merges cannot occur (the bug from commit 7bd429e).

2. AST-based wiring guard verifies that examples/.../end2end.py imports
   build_prompt_tokens from prompt_utils and does NOT redefine it locally.
   This protects the *delivery vector* of the original regression: the
   wrong template re-entered the example via a hand-rolled local builder
   that diverged from the canonical helper.

3. Real-tokenizer regression (skipped if HunyuanImage3 not in HF cache)
   asserts that segment-by-segment build_prompt_tokens produces a STRICTLY
   different id sequence than tokenizer.encode(build_prompt(...)) for a
   `。`-ending prompt. If a future "simplification" replaces segment encode
   with full-string encode, the BPE-merge-bypass behavior is gone and this
   test fires.

Verified on remote (2x L20X, transformers 4.57.1, HunyuanImage3-Instruct
tokenizer in HF cache): 19/19 passed including the real-tokenizer test.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial
Contributor Author

add regression test for this

down

CI ruff format check required collapsing short multi-line constructs
(single-line assertions, single-line if conditions) onto one line.
No semantic change; 19/19 tests still pass.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial
Contributor Author

@hsliuustc0106 need review

@Gaohan123 Gaohan123 added the `high priority` (high priority issue, needs to be done asap) and `ready` (label to trigger buildkite CI) labels Apr 30, 2026
@Gaohan123
Collaborator

Please fix CI

zuiho added 2 commits May 5, 2026 21:11
PR vllm-project#3232 ([Rebase] Rebase to vllm 0.20.0) folded `SharedFusedMoE`
into `FusedMoE` and dropped the `vllm.model_executor.layers.fused_moe.shared_fused_moe`
submodule, which broke pytest collection for
tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_sampler.py with
`ModuleNotFoundError: No module named
'vllm.model_executor.layers.fused_moe.shared_fused_moe'` across all four CI
suites on this branch (simple-unit-test, diffusion-cache-backend-test,
cuda-unit-test-with-single-card, cuda-unit-test-with-multi-cards).

Mirrors the same minimal fix applied on cr/pr3107-rebased:

- Wrap the legacy import in try/except and fall back to
  `FusedMoE as SharedFusedMoE`. `FusedMoE` now accepts `shared_experts=`
  directly and the call sites only use `make_expert_params_mapping` and
  `__init__(shared_experts=..., ...)`, both present on `FusedMoE`.
- Drop `reduce_results=False` from the `SharedFusedMoE(...)` call —
  vllm 0.20 removed that parameter from `FusedMoE.__init__`.
- Drop the manual `(routed, shared)` tuple merge and
  `tensor_model_parallel_all_reduce` post-processing in
  `HunyuanImage3SparseMoeBlock.forward`. vllm 0.20+ `FusedMoE` merges
  shared-experts internally and runs the TP all-reduce inside its
  forward, so the result is the already-combined, already-reduced
  tensor.
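The fallback described in the first bullet, roughly (module paths from the commit message):

try:
    # vllm < 0.20
    from vllm.model_executor.layers.fused_moe.shared_fused_moe import SharedFusedMoE
except ImportError:
    # vllm >= 0.20: SharedFusedMoE was folded into FusedMoE, which accepts
    # shared_experts= directly and still exposes make_expert_params_mapping.
    from vllm.model_executor.layers.fused_moe import FusedMoE as SharedFusedMoE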

Signed-off-by: zuiho <2324465096@qq.com>
Lint follow-up to c5089d1. The previous commit removed the call to
`tensor_model_parallel_all_reduce` in `HunyuanImage3SparseMoeBlock.forward`
(vllm 0.20+ FusedMoE runs the TP all-reduce internally) but left the
symbol in the `from vllm.distributed import (...)` block, which `ruff`
flags as unused (F401).

Signed-off-by: zuiho <2324465096@qq.com>
@TaffyOfficial
Contributor Author

Please fix CI

fix now

Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM. Thanks

@Gaohan123 Gaohan123 merged commit 44cde33 into vllm-project:main May 5, 2026
8 checks passed
TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
…m-project#3243)

TaffyOfficial pushed a commit to skf-1999/vllm-omni that referenced this pull request May 6, 2026
Address PR vllm-project#3107 review (Bounty-hunter / Gaohan123) requesting
AR-output-format and DiT-output-accuracy regression tests. Layout
mirrors PR vllm-project#2949's split (CPU unit test under tests/diffusion/...,
GPU accuracy test under tests/e2e/accuracy/...).

CPU unit test
  tests/diffusion/models/hunyuan_image3/test_hunyuan_image3_it2i_ar_format.py
  - test_ar_prefill_tokens_match_hf_apply_chat_template_for_it2i:
    asserts build_prompt_tokens (the AR-side prefill builder) is
    token-id-identical to HF tokenizer.apply_chat_template for the
    same (system, user_prompt, image) triple. Catches drift between
    the AR's input distribution and the model's training distribution
    -- the same failure mode PR vllm-project#3243 fixed for T2I.
  - test_dit_condition_image_preprocessing_byte_matches_ar_processor:
    asserts the diffusion-side _resize_and_crop_center produces
    byte-identical pixels to the AR-side
    HunyuanImage3Processor._resize_and_crop on the canonical resize
    targets. Direct response to Bounty-hunter's PR vllm-project#3107 review.

Both tests gate on tencent/HunyuanImage-3.0-Instruct being in the local
HF cache (no GPU/model weights required at runtime, just the tokenizer
config + image processor).

GPU accuracy test
  tests/e2e/accuracy/test_hunyuan_image3_it2i.py
  - test_hunyuan_image3_it2i_matches_hf_reference_psnr_40:
    drives vllm-omni's offline IT2I path through Omni and runs the
    official HF reference via AutoModelForCausalLM.generate_image,
    compared via the shared assert_similarity helper at PSNR>=40 dB
    and SSIM>=0.92. Marked full_model + skipif<8 GPUs; the threshold
    follows PR vllm-project#2949's review discussion (40 dB gives slack for TP=2
    NCCL drift while still catching prompt/image-preprocessing bugs).

Signed-off-by: zuiho-kai <wu15922848573@outlook.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
… Instruct chat template (vllm-project#3243)

Signed-off-by: zuiho <wu15922848573@outlook.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>
Signed-off-by: zuiho <2324465096@qq.com>
Co-authored-by: TaffyOfficial <2324465096@qq.com>
Co-authored-by: zuiho-kai <31877877+zuiho-kai@users.noreply.github.com>