mtmd, llama, ggml : Update HunyuanVL support#22029

Closed
ManaEstras wants to merge 5 commits into ggml-org:master from ManaEstras:hyvl

Conversation

@ManaEstras
Contributor

Overview

Update support for HunyuanVL vision-language model.

This PR includes:

ggml

  • New ggml_interpolate_sf() API for explicit scale factor interpolation (needed for HunyuanVL's (H+0.1)/n_grid position embedding scaling)
  • New GGML_SCALE_FLAG_CUSTOM_SF flag
  • Fix nearest interpolation out-of-bounds access in CPU/CUDA/Metal/SYCL/Vulkan/OpenCL backends
  • Add test_interpolate_sf test cases (38 tests covering various modes and edge cases)
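The PR does not describe how the out-of-bounds access in nearest interpolation arose. As an illustration only, one way it can happen is when the output size is chosen independently of the scale factor, so the last output index maps past the end of the source. The sketch below (a hypothetical helper, not the ggml code) clamps the source index:

```python
def nearest_src_index(dst: int, scale: float, n_in: int) -> int:
    """Map an output index to a source index for nearest-neighbour scaling.

    Hypothetical helper: the clamp keeps the result inside [0, n_in - 1]
    even when dst / scale lands past the last source element.
    """
    return min(int(dst / scale), n_in - 1)


n_in, n_out, scale = 4, 8, 1.7  # output size not equal to floor(n_in * scale)
unclamped = [int(i / scale) for i in range(n_out)]
clamped = [nearest_src_index(i, scale, n_in) for i in range(n_out)]
print(max(unclamped), max(clamped))
```

Without the clamp, the last output element would read source index 4 of a 4-element row.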

llama

  • New HUNYUAN_VL architecture
  • Support for rope.scaling.alpha and rope_dimension_sections
  • M-RoPE support for HunyuanVL text model

mtmd

  • New PROJECTOR_TYPE_HUNYUANVL projector type
  • Special image token layout handling (BOI + rows with newlines + EOI)
  • New set_position_mrope_hunyuanvl() and mtmd_decode_use_mrope_hunyuanvl() APIs
  • Add HunyuanVL smoke test in tests.sh

convert

  • Add HunyuanVLVisionModel (mmproj) and HunyuanVLTextModel export support

Testing

  • ctest -L main passed
  • test-backend-ops passed (38 interpolate_sf tests on CPU and Metal)
  • tools/mtmd/tests.sh smoke test added

Additional information

HunyuanVL uses a special position embedding interpolation that differs from standard models - it requires explicit scale factors (H+0.1)/n_grid instead of the simple H/n_grid ratio. This necessitated the new ggml_interpolate_sf() API.
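A minimal sketch of why the explicit scale factor matters. It assumes the half-pixel coordinate convention that PyTorch applies to bilinear interpolation with a user-supplied scale factor; the PR text does not spell out the formula, and `source_coords` is a hypothetical helper rather than a ggml function. Both scale choices produce H output samples, but they sample the source grid at slightly different positions:

```python
import math


def source_coords(n_in: int, scale: float) -> list[float]:
    """Source coordinate sampled by each output index (half-pixel centres).

    Assumed convention: n_out = floor(n_in * scale) and
    src = (dst + 0.5) / scale - 0.5.
    """
    n_out = math.floor(n_in * scale)
    return [(i + 0.5) / scale - 0.5 for i in range(n_out)]


n_grid, h = 4, 8
ratio_coords = source_coords(n_grid, h / n_grid)       # plain H / n_grid
sf_coords = source_coords(n_grid, (h + 0.1) / n_grid)  # (H + 0.1) / n_grid

for i, (a, b) in enumerate(zip(ratio_coords, sf_coords)):
    print(i, round(a, 4), round(b, 4))
```

Under this assumption the output length is unchanged, because floor(H + 0.1) is still H, while every sampling position shifts slightly toward the origin.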

The image token layout is also non-standard: instead of a simple nx * ny grid, HunyuanVL uses BOI + (patch rows with newline separators) + EOI, which required special handling in mtmd.
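The layout can be sketched as follows. The token labels are placeholders, and the sketch assumes one newline token after every patch row, including the last; the PR description does not confirm that detail.

```python
def hunyuanvl_layout(nx: int, ny: int) -> list[str]:
    """Build an illustrative token sequence for an nx-by-ny patch grid.

    Assumption: BOI, then ny rows of nx patch tokens each followed by a
    newline token, then EOI. Labels are placeholders, not real token names.
    """
    tokens = ["<boi>"]
    for row in range(ny):
        tokens.extend(f"patch[{row},{col}]" for col in range(nx))
        tokens.append("<newline>")
    tokens.append("<eoi>")
    return tokens


layout = hunyuanvl_layout(nx=3, ny=2)
print(len(layout), layout)
```

Under that assumption the sequence holds nx * ny + ny + 2 tokens rather than nx * ny, which is why a plain grid count is not enough here.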

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI was used for code review, documentation drafting, and test case suggestions. All code was written and reviewed by human contributors.

Add support for HunyuanVL vision-language model.

ggml changes:
- add ggml_interpolate_sf() for explicit scale factor interpolation
- add GGML_SCALE_FLAG_CUSTOM_SF flag
- fix nearest interpolation out-of-bounds access in all backends
- add test_interpolate_sf test cases (38 tests)

llama changes:
- add HUNYUAN_VL architecture
- add rope.scaling.alpha and rope_dimension_sections support
- add M-RoPE support for HunyuanVL

mtmd changes:
- add PROJECTOR_TYPE_HUNYUANVL
- add special image token layout (BOI + rows with newlines + EOI)
- add set_position_mrope_hunyuanvl() for HunyuanVL M-RoPE
- add mtmd_decode_use_mrope_hunyuanvl() API

convert changes:
- add HunyuanVLVisionModel (mmproj) export
- add HunyuanVLTextModel export
@ManaEstras ManaEstras requested review from a team, CISC and ggerganov as code owners April 17, 2026 04:51
@github-actions github-actions Bot added the model, testing, Nvidia GPU, Vulkan, examples, python, ggml, SYCL, Apple Metal and OpenCL labels Apr 17, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 17, 2026

Hi @ManaEstras, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Contributor

@ngxson ngxson left a comment

A general note that if you want to accelerate merging this PR:

  1. Do not add a new ggml op, follow the recommendation in my comment
  2. If you still have a very good reason to add a new op, you should only add CPU support in this PR

The more changes you include here, the longer it takes for other reviewers to approve, and the slower it can be merged.

Comment thread convert_hf_to_gguf.py Outdated
Comment on lines +12050 to +12072
if name.startswith("vit.perceive."):
    suffix = name[len("vit.perceive."):]
    if suffix.startswith("proj."):
        # proj.0.weight -> mm.0.weight, proj.2.weight -> mm.2.weight
        new_name = "mm." + suffix[len("proj."):]
    elif suffix.startswith("mlp."):
        # mlp.weight -> mm.model.fc.weight
        new_name = "mm.model.fc." + suffix[len("mlp."):]
    elif suffix.startswith("before_rms."):
        # before_rms.weight -> mm.pre_norm.weight
        new_name = "mm.pre_norm." + suffix[len("before_rms."):]
    elif suffix.startswith("after_rms."):
        # after_rms.weight -> mm.post_norm.weight
        new_name = "mm.post_norm." + suffix[len("after_rms."):]
    elif suffix == "image_newline":
        new_name = "v.image_newline"
    elif suffix == "image_sep":
        new_name = "v.view_seperator"
    else:
        # image_begin, image_end -> mm.image_begin, mm.image_end
        new_name = "mm." + suffix
    yield (new_name, data_torch)
    return
Contributor

please use proper tensor mapping like all other models

Comment thread tools/mtmd/mtmd-helper.cpp Outdated
}
}

void set_position_mrope_hunyuanvl(llama_pos pos_0, int nx, int ny, llama_seq_id seq_id, int image_count = 0) {
Contributor

remove any changes in mtmd-helper, implement it in mtmd_image_tokens_get_decoder_pos instead

Comment thread tools/mtmd/mtmd.cpp Outdated
uint32_t ny; // number of tokens in y direction
bool use_mrope_pos = false; // use M-RoPE position counting (the whole image is 1 temporal position)
uint32_t n_tokens() const { return nx * ny; }
uint32_t n_tokens_total = 0;
Contributor

it should be uint32_t n_boi, the number of BOI tokens

please cherry-pick the logic from ngxson#100

Comment thread tools/mtmd/mtmd.h Outdated

// whether the current model uses HunyuanVL-style M-RoPE
// (token layout differs from standard 2D grid: BOI + rows-with-newlines + EOI)
MTMD_API bool mtmd_decode_use_mrope_hunyuanvl(mtmd_context * ctx);
Contributor

remove this API, use decoder_pos, see ngxson#100 (same idea)

Comment on lines +23 to +26
pos_patch = ggml_interpolate_sf(ctx0, pos_patch, pw, ph, n_embd, 1,
                                GGML_SCALE_MODE_BILINEAR,
                                (float)(pw + 0.1f) / n_grid,
                                (float)(ph + 0.1f) / n_grid);
Contributor

IMO the new op is quite hacky (though note that I'm not the one who can approve a new op). It's better to simply resize the embedding on the CPU and set the resized tensor as a graph input.

@ManaEstras
Contributor Author

Thx @ngxson, let me clean up some of the code and commit. I've opened two other PRs with the same modifications as this one, so please ignore those PRs for now.

@ngxson
Contributor

ngxson commented Apr 17, 2026

@ManaEstras To clarify a bit: let's not change anything in GGML at this time, to avoid putting too much stress on backend maintainers (I'm not a backend maintainer btw, so I can only help you with the multimodal part).

My idea is that you can either:

  1. (Recommended) Call ggml_backend_tensor_get() to get the tensor data, resize it inside clip_image_batch_encode, and set the resized version as input data on the graph via set_input_f32()
  2. Or use ggml_custom_4d, but it's not very well documented, so you may lose time debugging it

Alternatively, you can check whether the same functionality can be implemented with the existing ggml_interpolate, plus ggml_view to crop the output. But I could be wrong about how it works.
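The recommended option amounts to doing the interpolation on the host and feeding the result in as plain input data. The sketch below shows only the resize step, in pure Python for readability. `resize_bilinear` is a hypothetical helper, the nested-list layout (rows, columns, embedding vector) is an assumption, and the half-pixel coordinate convention is assumed rather than taken from the PR; the real code would be C++ operating on the buffer read back with ggml_backend_tensor_get().

```python
import math


def resize_bilinear(grid, out_h, out_w, sf_h, sf_w):
    """Bilinearly resize grid[y][x] (each cell a list of floats).

    Assumed convention: src = (dst + 0.5) / scale - 0.5, clamped to the grid.
    """
    in_h, in_w = len(grid), len(grid[0])

    def taps(dst, scale, n_in):
        # Two neighbouring source indices and the weight of the upper one.
        x = max((dst + 0.5) / scale - 0.5, 0.0)
        lo = min(math.floor(x), n_in - 1)
        hi = min(lo + 1, n_in - 1)
        weight = x - lo if hi != lo else 0.0
        return lo, hi, weight

    out = []
    for y in range(out_h):
        y0, y1, wy = taps(y, sf_h, in_h)
        row = []
        for x in range(out_w):
            x0, x1, wx = taps(x, sf_w, in_w)
            row.append([
                (a * (1 - wx) + b * wx) * (1 - wy)
                + (c * (1 - wx) + d * wx) * wy
                for a, b, c, d in zip(grid[y0][x0], grid[y0][x1],
                                      grid[y1][x0], grid[y1][x1])
            ])
        out.append(row)
    return out


n_grid, ph, pw = 2, 4, 4
pos_embd = [[[0.0], [1.0]], [[2.0], [3.0]]]  # toy 2x2 grid, n_embd = 1
resized = resize_bilinear(pos_embd, ph, pw,
                          (ph + 0.1) / n_grid, (pw + 0.1) / n_grid)
print(resized[0][0], resized[ph - 1][pw - 1])
```

Because the resize runs once per image on a small grid, doing it on the CPU avoids adding a new op that every backend would then have to implement.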

@wendadawen
Contributor

@ngxson Thanks for the review comments. Here's what I've addressed:

  1. Tensor mapping — Replaced manual if-elif remapping with standard tensor_mapping.py.
  2. XD-RoPE — Replaced the old M-RoPE approach entirely with a fresh decoder_pos based XD-RoPE implementation in mtmd-helper. No longer uses the previous M-RoPE logic.
  3. ggml_interpolate_sf — Kept CPU backend only, reverted other backend changes.

@ngxson
Contributor

ngxson commented Apr 19, 2026

@wendadawen seems like you misunderstood my comments, please refer to #22037

Btw, there are 2 similar PRs and I don't know which one you are working on

@ManaEstras ManaEstras closed this Apr 20, 2026


3 participants