mtmd, llama : Update HunyuanVL vision-language model support by ManaEstras · Pull Request #22037 · ggml-org/llama.cpp

ManaEstras · 2026-04-17T09:01:04Z

Overview

Update support for HunyuanVL vision-language model.

Model Architecture

HunyuanVL consists of:

Vision encoder: ViT with PatchMerger projector
Text decoder: Hunyuan-dense architecture with M-RoPE (multi-dimensional rotary position embedding)

Key differences from other VL models (and the previous HunyuanOCR):

Uses XD-RoPE (extended dimensional RoPE) with rope_dimension_sections for separate temporal/height/width position encoding
Image tokens follow a special layout: <BOI> + patch rows separated by newlines + <EOI>
Position embedding interpolation uses explicit scale factors via ggml_interpolate_sf()
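
The image-token layout listed above can be pictured with a small sketch. It assumes that a newline token follows every patch row, including the last one (the description does not say either way), and the token strings and the function name are placeholders for illustration, not identifiers from this PR.

```python
def hunyuanvl_image_layout(n_rows: int, n_cols: int) -> list[str]:
    """Build a symbolic token sequence: <BOI>, patch rows, <EOI>.

    Assumption (not confirmed by the PR): each patch row, including the
    last one, is followed by a single newline token.
    """
    tokens = ["<BOI>"]
    for r in range(n_rows):
        # One placeholder token per patch in this row.
        tokens.extend(f"p{r},{c}" for c in range(n_cols))
        # Row separator.
        tokens.append("<NL>")
    tokens.append("<EOI>")
    return tokens


layout = hunyuanvl_image_layout(n_rows=2, n_cols=3)
print(layout)
print(len(layout))  # 2 * (3 + 1) + 2 = 10 under the assumption above
```

Under this assumption an image with `n_rows` by `n_cols` merged patches occupies `n_rows * (n_cols + 1) + 2` token slots, which is why the decoder-side position logic has to know about the layout and not just the patch count.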

Changes

llama:

Add LLM_ARCH_HUNYUAN_VL architecture
Add rope.scaling.alpha parameter support
Add rope_dimension_sections for M-RoPE configuration

mtmd:

Add PROJECTOR_TYPE_HUNYUANVL
Add M-RoPE position encoding for image tokens
Add mtmd_decode_use_mrope_hunyuanvl() API

convert:

Add HunyuanVLVisionModel and HunyuanVLTextModel GGUF export

Tests:

Add HunyuanVL entry in tools/mtmd/tests.sh

Dependencies

Depends on PR #22036 (ggml_interpolate_sf) for position embedding interpolation.

Testing

ctest -L main passes
Smoke test in tools/mtmd/tests.sh
Manual inference test with HunyuanVL-4B

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI assisted with code review and documentation drafting. All code was human-authored and manually reviewed.

ggml-gh-bot · 2026-04-17T09:05:47Z

Hi @ManaEstras, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

ngxson

I think you misread my previous comments in #22029. Please confirm before making changes, to make sure you understand them correctly.

wendadawen · 2026-04-20T06:59:06Z

Pushed updated version:

mtmd-helper: now matches master; HunyuanVL layout is inside mtmd_image_tokens_get_decoder_pos (using pos_0 from #22082, "mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change)").
convert: dropped hardcoded tensor remap, uses tensor_mapping.py.
ggml_interpolate_sf: removed.

wendadawen · 2026-04-20T10:28:15Z

clip: implement HunyuanVL position embedding resize on CPU.

hunyuanocr.cpp: HUNYUANVL declares pos_embd as graph input
clip.cpp: CPU bilinear resize + set_input_f32 in HUNYUANVL case

ngxson

I think this PR is not very far from mergeable; I just have some comments regarding the maintainability of the code. Hope you can update it accordingly. Thanks in advance!

- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh

wendadawen · 2026-04-21T05:19:57Z

Addressed all four points and rebased onto latest master:

Removed image_idx from the struct as a standalone "always 0" field. It's now populated by a counter in mtmd_tokenizer so it reflects the image's position in the prompt (XD-RoPE dim-3).
n_newline/n_boi/n_eoi collapsed: HunyuanVL layout is expressed via a new MTMD_POS_TYPE_HUNYUANVL enum value on top of your mtmd_pos_type.
mtmd_image_tokens_get_decoder_pos / get_n_pos now use switch (pos) style with a HUNYUANVL case.
Kept the local CPU bilinear for position_embeddings — it's an n_embd-channel float weight tensor with (target+0.1)/n_grid sampling, so reusing resize_bilinear(3-ch uint8, corner-aligned) would require generalizing that helper.
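
To make the last point concrete, here is a minimal sketch of a bilinear resize over an n_embd-channel float grid using the `(target + 0.1) / n_grid` scale factor mentioned above. The PR's helper is not shown in this thread, so everything else is an assumption: the half-pixel source-coordinate mapping, the clamping at the borders, the square grid, the row-major `[y][x][channel]` layout, and the function name are illustrative choices, not the code in clip.cpp.

```python
def resize_pos_embd_bilinear(src, n_grid, target, n_embd):
    """Resize an n_grid x n_grid x n_embd float grid to target x target x n_embd.

    src is a flat list in row-major [y][x][channel] order (assumed layout).
    """
    # Scale factor quoted in the comment above.
    scale = (target + 0.1) / n_grid

    def taps(d):
        # Assumed half-pixel mapping from destination index to source coordinate,
        # clamped so that both taps stay inside the source grid.
        s = max((d + 0.5) / scale - 0.5, 0.0)
        i0 = min(int(s), n_grid - 1)
        i1 = min(i0 + 1, n_grid - 1)
        frac = min(s - i0, 1.0)
        return i0, i1, frac

    dst = [0.0] * (target * target * n_embd)
    for y in range(target):
        y0, y1, fy = taps(y)
        for x in range(target):
            x0, x1, fx = taps(x)
            for c in range(n_embd):
                v00 = src[(y0 * n_grid + x0) * n_embd + c]
                v01 = src[(y0 * n_grid + x1) * n_embd + c]
                v10 = src[(y1 * n_grid + x0) * n_embd + c]
                v11 = src[(y1 * n_grid + x1) * n_embd + c]
                top = v00 * (1 - fx) + v01 * fx
                bot = v10 * (1 - fx) + v11 * fx
                dst[(y * target + x) * n_embd + c] = top * (1 - fy) + bot * fy
    return dst


# 2x2 grid with one channel whose value equals the x index.
grid = [0.0, 1.0,
        0.0, 1.0]
out = resize_pos_embd_bilinear(grid, n_grid=2, target=3, n_embd=1)
print(out)
```

The sketch also shows why an image-oriented helper working on 3-channel uint8 pixels with corner-aligned sampling would not be a drop-in replacement: the channel count, the element type, and the coordinate mapping all differ.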

ngxson

Some small comments before we can merge.

Btw, could you test with hunyuan-ocr (both conversion + inference) to see if it still works?

ngxson · 2026-04-21T09:09:20Z

    YOUTUVL = "youtuvl"
    NEMOTRON_V2_VL = "nemotron_v2_vl"
    HUNYUANOCR     = "hunyuanocr"
+    HUNYUANVL      = "hunyuanvl_merger"


Suggested change

-    HUNYUANVL      = "hunyuanvl_merger"
+    HUNYUANVL      = "hunyuanvl"

remove _merger to make it shorter

ngxson · 2026-04-21T09:10:19Z

    { PROJECTOR_TYPE_KIMIK25,   "kimik25"},
    { PROJECTOR_TYPE_NEMOTRON_V2_VL, "nemotron_v2_vl"},
    { PROJECTOR_TYPE_HUNYUANOCR, "hunyuanocr"},
+    { PROJECTOR_TYPE_HUNYUANVL,  "hunyuanvl_merger"},


Suggested change

-    { PROJECTOR_TYPE_HUNYUANVL,  "hunyuanvl_merger"},
+    { PROJECTOR_TYPE_HUNYUANVL,  "hunyuanvl"},

ngxson · 2026-04-21T09:15:25Z

+    // HunyuanVL-specific layout state (only meaningful when pos == MTMD_POS_TYPE_HUNYUANVL)
+    uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt


Suggested change

-    // HunyuanVL-specific layout state (only meaningful when pos == MTMD_POS_TYPE_HUNYUANVL)
-    uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt
+    uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt (used by pos == MTMD_POS_TYPE_HUNYUANVL)

ngxson · 2026-04-21T09:17:05Z

                raise ValueError(f"Unprocessed experts: {experts}")


-@ModelBase.register("HunYuanDenseV1ForCausalLM", "HunYuanVLForConditionalGeneration")


Will this break the conversion of hunyuan-ocr?

Hopefully handled here, but needs testing:

llama.cpp/convert_hf_to_gguf.py

Lines 12071 to 12073 in 0a5a97c

    @ModelBase.register("HunYuanVLForConditionalGeneration")
    class HunyuanVLTextModel(HunYuanModel):
        model_arch = gguf.MODEL_ARCH.HUNYUAN_VL

@ngxson @CISC Addressed, tested locally: both HunyuanOCR and HunyuanVL convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0). The only difference between OCR and VL is the projection dim (vision_config.out_hidden_size: 1024 for OCR)
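
The concern in this thread comes down to which class an architecture string resolves to once it moves from one `register` decorator to another. The toy registry below is a sketch of that mechanism only: `_REGISTRY`, `register`, `lookup`, the `Toy*` classes, and the string values of `model_arch` are hypothetical and are not the `ModelBase` implementation in convert_hf_to_gguf.py.

```python
# Hypothetical registry: architecture name from config.json -> converter class.
_REGISTRY: dict[str, type] = {}


def register(*names: str):
    """Decorator that maps each architecture name to the decorated class."""
    def wrap(cls: type) -> type:
        for name in names:
            _REGISTRY[name] = cls
        return cls
    return wrap


@register("HunYuanDenseV1ForCausalLM")
class ToyDenseModel:
    model_arch = "HUNYUAN_DENSE"  # placeholder string, not the gguf enum


# After the change, the VL architecture name is registered only here.
@register("HunYuanVLForConditionalGeneration")
class ToyVLTextModel(ToyDenseModel):
    model_arch = "HUNYUAN_VL"  # placeholder string, not the gguf enum


def lookup(arch: str) -> type:
    return _REGISTRY[arch]


print(lookup("HunYuanVLForConditionalGeneration").__name__)
print(lookup("HunYuanDenseV1ForCausalLM").model_arch)
```

In a registry of this shape, every checkpoint that declares `HunYuanVLForConditionalGeneration` is routed to the new class, so any model that previously relied on the old registration has to be re-tested, which is what the conversion and inference check above did for HunyuanOCR.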

- Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).

ngxson · 2026-04-21T18:42:10Z

Final thing before merging: @wendadawen, can you look at the failed workflow(s)? For example: https://github.com/ggml-org/llama.cpp/actions/runs/24727215703/job/72373607602?pr=22037

It seems there is a problem with code style.

wendadawen · 2026-04-22T02:26:39Z

@ngxson Pushed a fix for the indentation error.

- convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.
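
The type-check fix described in this commit message follows a common pattern: when a subclass `__init__` only forwards `*args, **kwargs`, a static checker cannot tell what type the first argument has, so naming and annotating it restores the inference. The sketch below illustrates the pattern with hypothetical `ToyBase` and `ToyVLTextModel` classes and a stand-in `load_hparams`; it is not the code from convert_hf_to_gguf.py.

```python
from pathlib import Path
from typing import Any


class ToyBase:
    def __init__(self, dir_model: Path, *args: Any, **kwargs: Any) -> None:
        self.dir_model = dir_model

    @staticmethod
    def load_hparams(dir_model: Path) -> dict[str, Any]:
        # Stand-in for reading the model config; no file access in this sketch.
        return {"model_dir": dir_model.name}


class ToyVLTextModel(ToyBase):
    # Naming the parameter and annotating it as Path lets a type checker
    # verify the load_hparams call below instead of seeing an unknown value.
    def __init__(self, dir_model: Path, *args: Any, **kwargs: Any) -> None:
        super().__init__(dir_model, *args, **kwargs)
        self.hparams = self.load_hparams(dir_model)


model = ToyVLTextModel(Path("models/HunyuanVL-4B"))
print(model.hparams)
```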

…g#22037)

* mtmd, llama : add HunyuanVL vision-language model support

- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh

* fix: fix HunyuanVL XD-RoPE h/w section order

* fix: Remove redundant code

* convert : fix HunyuanOCR / HunyuanVL conversion

- Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).

* clip : fix -Werror=misleading-indentation in bilinear resize

* fix CI: convert_hf_to_gguf type check error

- convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.

---------

Co-authored-by: wendadawen <wendadawen@tencent.com>

ManaEstras requested review from a team, CISC and ggerganov as code owners April 17, 2026 09:01

github-actions Bot added the model (Model specific), examples, and python (python script changes) labels Apr 17, 2026

ngxson requested changes Apr 19, 2026

Comment thread tools/mtmd/mtmd-helper.cpp Outdated

Comment thread convert_hf_to_gguf.py Outdated

Comment thread tools/mtmd/models/hunyuanocr.cpp Outdated

ngxson mentioned this pull request Apr 19, 2026

mtmd, llama, ggml : Update HunyuanVL support #22029

Closed

3 tasks

wendadawen force-pushed the pr2-hunyuanvl branch from 641db31 to 3eef9d4 Compare April 20, 2026 06:56

ngxson reviewed Apr 20, 2026

Comment thread tools/mtmd/mtmd.cpp Outdated

Comment thread tools/mtmd/mtmd.cpp Outdated

Comment thread tools/mtmd/mtmd.cpp Outdated

Comment thread tools/mtmd/clip.cpp

wendadawen force-pushed the pr2-hunyuanvl branch from 965e868 to 89cd9b9 Compare April 21, 2026 05:14

CISC approved these changes Apr 21, 2026

Comment thread src/llama-model.cpp Outdated

wendadawen added 2 commits April 21, 2026 14:52

fix: fix HunyuanVL XD-RoPE h/w section order

741d7a7

fix: Remove redundant code

0a5a97c

ngxson reviewed Apr 21, 2026

convert : fix HunyuanOCR / HunyuanVL conversion

f829e27

- Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).

ngxson approved these changes Apr 21, 2026

clip : fix -Werror=misleading-indentation in bilinear resize

22a70c6

fix CI: convert_hf_to_gguf type check error

96946e0

- convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.

ngxson approved these changes Apr 22, 2026

ngxson merged commit 7bfe60f into ggml-org:master Apr 22, 2026
52 of 53 checks passed
