mtmd, llama : Update HunyuanVL vision-language model support #22037

Merged: ngxson merged 6 commits into ggml-org:master from ManaEstras:pr2-hunyuanvl on Apr 22, 2026

@ManaEstras
Contributor

Overview

Update support for HunyuanVL vision-language model.

Model Architecture

HunyuanVL consists of:

  • Vision encoder: ViT with PatchMerger projector
  • Text decoder: Hunyuan-dense architecture with M-RoPE (multi-dimensional rotary position embedding)

Key differences from other VL models (and the previous HunyuanOCR):

  • Uses XD-RoPE (extended dimensional RoPE) with rope_dimension_sections for separate temporal/height/width position encoding
  • Image tokens follow a special layout: <BOI> + patch rows separated by newlines + <EOI>
  • Position embedding interpolation uses explicit scale factors via ggml_interpolate_sf()
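
The image token layout described above can be sketched as follows. This is only an illustration under stated assumptions: the labels `<NL>` and `p[row,col]`, the function name `image_token_layout`, and the `trailing_newline` switch are hypothetical, and the PR description does not say whether the last patch row is also followed by a newline, so both variants are shown.

```python
def image_token_layout(n_rows: int, n_cols: int, trailing_newline: bool = False) -> list[str]:
    """Return symbolic labels for one image: <BOI>, patch rows, <EOI>."""
    if n_rows <= 0 or n_cols <= 0:
        raise ValueError("grid dimensions must be positive")
    tokens = ["<BOI>"]
    for row in range(n_rows):
        tokens.extend(f"p[{row},{col}]" for col in range(n_cols))
        # A newline separates consecutive rows. Whether the last row also
        # gets one is an assumption controlled by `trailing_newline`.
        if row < n_rows - 1 or trailing_newline:
            tokens.append("<NL>")
    tokens.append("<EOI>")
    return tokens


layout = image_token_layout(2, 3)
print(" ".join(layout))
print(len(layout))
```

For an `n_rows` x `n_cols` patch grid this gives `n_rows * n_cols + (n_rows - 1) + 2` tokens without a trailing newline, and one more with it.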

Changes

llama:

  • Add LLM_ARCH_HUNYUAN_VL architecture
  • Add rope.scaling.alpha parameter support
  • Add rope_dimension_sections for M-RoPE configuration
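
As a rough illustration of what rope_dimension_sections expresses, the sketch below maps each rotary pair to the position component that drives it. The section widths, the component order, and the helper names (`section_of_pair`, `positions_per_pair`) are assumptions made for the example, not values or code taken from this PR; a later commit in this PR fixes the h/w section order, so the real order matters.

```python
def section_of_pair(pair_idx: int, sections: list[int]) -> int:
    """Return the index of the position component used by rotary pair `pair_idx`."""
    total = sum(sections)
    if total <= 0:
        raise ValueError("sections must sum to a positive value")
    sector = pair_idx % total  # pairs beyond the total wrap around
    upper = 0
    for component, width in enumerate(sections):
        upper += width
        if sector < upper:
            return component
    raise AssertionError("unreachable")


def positions_per_pair(pos: tuple, sections: list[int], n_pairs: int) -> list[int]:
    """Pick, for every rotary pair, the position value of its component."""
    return [pos[section_of_pair(i, sections)] for i in range(n_pairs)]


# Illustrative values only: 4 components, the last one unused (width 0).
sections = [2, 3, 3, 0]
pos = (7, 1, 4, 0)
print(positions_per_pair(pos, sections, 8))
```

Each component owns a contiguous block of rotary pairs, so a token can carry a different position value per axis instead of a single scalar position.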

mtmd:

  • Add PROJECTOR_TYPE_HUNYUANVL
  • Add M-RoPE position encoding for image tokens
  • Add mtmd_decode_use_mrope_hunyuanvl() API

convert:

  • Add HunyuanVLVisionModel and HunyuanVLTextModel GGUF export

Tests:

  • Add HunyuanVL entry in tools/mtmd/tests.sh

Dependencies

Depends on PR #22036 (ggml_interpolate_sf) for position embedding interpolation.

Testing

  • ctest -L main passes
  • Smoke test in tools/mtmd/tests.sh
  • Manual inference test with HunyuanVL-4B

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI assisted with code review and documentation drafting. All code was human-authored and manually reviewed.

@ManaEstras ManaEstras requested review from a team, CISC and ggerganov as code owners April 17, 2026 09:01
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 17, 2026

Hi @ManaEstras, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions Bot added the labels model (Model specific), examples, and python (python script changes) on Apr 17, 2026
Contributor

@ngxson ngxson left a comment

I think you misread my previous comments on #22029. Please confirm your understanding before making changes.

Comment thread tools/mtmd/mtmd-helper.cpp Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread tools/mtmd/models/hunyuanocr.cpp Outdated
@wendadawen
Contributor

Pushed updated version:

@wendadawen
Contributor

clip: implement HunyuanVL position embedding resize on CPU.

  • hunyuanocr.cpp: HUNYUANVL declares pos_embd as graph input
  • clip.cpp: CPU bilinear resize + set_input_f32 in HUNYUANVL case

Contributor

@ngxson ngxson left a comment

I think this PR is not far from mergeable. I just have some comments regarding the maintainability of the code. I hope you can update it accordingly. Thanks in advance!

Comment thread tools/mtmd/mtmd.cpp Outdated
Comment thread tools/mtmd/mtmd.cpp Outdated
Comment thread tools/mtmd/mtmd.cpp Outdated
Comment thread tools/mtmd/clip.cpp
- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh
@wendadawen
Contributor

Addressed all four points and rebased onto latest master:

  • Removed image_idx from the struct as a standalone "always 0" field. It's now populated by a counter in mtmd_tokenizer so it reflects the image's position in the prompt (XD-RoPE dim-3).
  • n_newline/n_boi/n_eoi collapsed: HunyuanVL layout is expressed via a new MTMD_POS_TYPE_HUNYUANVL enum value on top of your mtmd_pos_type.
  • mtmd_image_tokens_get_decoder_pos / get_n_pos now use switch (pos) style with a HUNYUANVL case.
  • Kept the local CPU bilinear for position_embeddings — it's an n_embd-channel float weight tensor with (target+0.1)/n_grid sampling, so reusing resize_bilinear(3-ch uint8, corner-aligned) would require generalizing that helper.
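
A CPU bilinear resize of a multi-channel float grid with an explicit scale factor might look like the sketch below. Only the `(target + 0.1) / n_grid` scale comes from the comment above; the half-pixel source-coordinate convention, the nested-list data layout, and the name `resize_pos_embd` are assumptions for illustration, not the code in clip.cpp.

```python
import math


def resize_pos_embd(grid, out_h, out_w, scale_h, scale_w):
    """Bilinearly resample grid[y][x] (a list of n_embd floats per cell)."""
    in_h, in_w = len(grid), len(grid[0])
    n_embd = len(grid[0][0])
    out = []
    for oy in range(out_h):
        # Half-pixel source coordinate, clamped at the top edge (assumed convention).
        sy = max((oy + 0.5) / scale_h - 0.5, 0.0)
        y0 = min(int(math.floor(sy)), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for ox in range(out_w):
            sx = max((ox + 0.5) / scale_w - 0.5, 0.0)
            x0 = min(int(math.floor(sx)), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            # Blend the four neighbours independently for every channel.
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][c] + wx * grid[y0][x1][c])
                + wy * ((1 - wx) * grid[y1][x0][c] + wx * grid[y1][x1][c])
                for c in range(n_embd)
            ])
        out.append(row)
    return out


n_grid, target = 2, 3
scale = (target + 0.1) / n_grid  # scale factor form quoted in the comment above
src = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2x2 grid, one channel
dst = resize_pos_embd(src, target, target, scale, scale)
print(len(dst), len(dst[0]), len(dst[0][0]))
```

Unlike a 3-channel uint8 image resize, every cell here carries n_embd float channels, which is the reason given above for keeping a separate helper.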

Comment thread src/llama-model.cpp Outdated
Contributor

@ngxson ngxson left a comment

Some small comments before we can merge.

Btw, could you test with hunyuan-ocr (both conversion + inference) to see if it still works?

Comment thread gguf-py/gguf/constants.py Outdated
YOUTUVL = "youtuvl"
NEMOTRON_V2_VL = "nemotron_v2_vl"
HUNYUANOCR = "hunyuanocr"
HUNYUANVL = "hunyuanvl_merger"
Contributor

Suggested change
- HUNYUANVL = "hunyuanvl_merger"
+ HUNYUANVL = "hunyuanvl"

remove _merger to make it shorter

Comment thread tools/mtmd/clip-impl.h Outdated
{ PROJECTOR_TYPE_KIMIK25, "kimik25"},
{ PROJECTOR_TYPE_NEMOTRON_V2_VL, "nemotron_v2_vl"},
{ PROJECTOR_TYPE_HUNYUANOCR, "hunyuanocr"},
{ PROJECTOR_TYPE_HUNYUANVL, "hunyuanvl_merger"},
Contributor

Suggested change
- { PROJECTOR_TYPE_HUNYUANVL, "hunyuanvl_merger"},
+ { PROJECTOR_TYPE_HUNYUANVL, "hunyuanvl"},

Comment thread tools/mtmd/mtmd.cpp Outdated
Comment on lines +47 to +48
// HunyuanVL-specific layout state (only meaningful when pos == MTMD_POS_TYPE_HUNYUANVL)
uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt
Contributor

Suggested change
- // HunyuanVL-specific layout state (only meaningful when pos == MTMD_POS_TYPE_HUNYUANVL)
- uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt
+ uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt (used by pos == MTMD_POS_TYPE_HUNYUANVL)
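
The "0-based position of this image among image chunks" semantics can be sketched with a simple counter. The `Chunk` type, its `kind` labels, and `assign_image_indices` are hypothetical stand-ins for illustration; they are not the mtmd structures.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str           # "text" or "image" (hypothetical labels)
    image_idx: int = 0  # only meaningful for image chunks


def assign_image_indices(chunks: list[Chunk]) -> None:
    """Number image chunks 0, 1, 2, ... in prompt order; text chunks are skipped."""
    next_idx = 0
    for chunk in chunks:
        if chunk.kind == "image":
            chunk.image_idx = next_idx
            next_idx += 1


prompt = [Chunk("text"), Chunk("image"), Chunk("text"), Chunk("image")]
assign_image_indices(prompt)
print([c.image_idx for c in prompt if c.kind == "image"])
```

Text chunks do not advance the counter, so the index reflects only the order of images in the prompt.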

Comment thread convert_hf_to_gguf.py
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register("HunYuanDenseV1ForCausalLM", "HunYuanVLForConditionalGeneration")
Contributor

Will this break the conversion of hunyuan-ocr?

Member

Hopefully handled here, but needs testing:

llama.cpp/convert_hf_to_gguf.py

Lines 12071 to 12073 in 0a5a97c

@ModelBase.register("HunYuanVLForConditionalGeneration")
class HunyuanVLTextModel(HunYuanModel):
model_arch = gguf.MODEL_ARCH.HUNYUAN_VL

Contributor

@ngxson @CISC Addressed and tested locally: both HunyuanOCR and HunyuanVL convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0). The only difference between OCR and VL is the projection dim (vision_config.out_hidden_size: 1024 for OCR).

 - Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).
@ngxson
Contributor

ngxson commented Apr 21, 2026

Final thing before merging: @wendadawen, can you look at the failed workflow(s)? For example: https://github.com/ggml-org/llama.cpp/actions/runs/24727215703/job/72373607602?pr=22037

Seems like there is a problem with code style

@wendadawen
Contributor

@ngxson Pushed a fix for the indentation error.

 - convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.
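
The typing fix described above can be sketched as follows. The classes and the `load_hparams` stand-in are hypothetical simplifications, not the code in convert_hf_to_gguf.py; the point is only that naming `dir_model: Path` in the signature gives a static checker a concrete type to work with, whereas a value pulled out of `*args` or `**kwargs` would be seen as unknown.

```python
from pathlib import Path
from typing import Any


class BaseModel:
    def __init__(self, dir_model: Path, *args: Any, **kwargs: Any) -> None:
        self.dir_model = dir_model


def load_hparams(dir_model: Path) -> dict[str, Any]:
    """Hypothetical stand-in: report which config file would be read."""
    return {"config_path": str(dir_model / "config.json")}


class VLTextModel(BaseModel):
    # Declaring dir_model explicitly (instead of only *args/**kwargs) lets a
    # type checker see a Path here rather than an unknown or optional value.
    def __init__(self, dir_model: Path, *args: Any, **kwargs: Any) -> None:
        super().__init__(dir_model, *args, **kwargs)
        self.hparams = load_hparams(dir_model)


model = VLTextModel(Path("models") / "example")
print(model.hparams["config_path"])
```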
@ngxson ngxson merged commit 7bfe60f into ggml-org:master Apr 22, 2026
52 of 53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
…g#22037)

* mtmd, llama : add HunyuanVL vision-language model support

- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh

* fix: fix HunyuanVL XD-RoPE h/w section order

* fix: Remove redundant code

* convert : fix HunyuanOCR / HunyuanVL conversion
 - Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).

* clip : fix -Werror=misleading-indentation in bilinear resize

* fix CI: convert_hf_to_gguf type check error
 - convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.

---------

Co-authored-by: wendadawen <wendadawen@tencent.com>
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

4 participants