mtmd, llama, ggml : Update HunyuanVL support #22029
Add support for HunyuanVL vision-language model.

ggml changes:
- add ggml_interpolate_sf() for explicit scale factor interpolation
- add GGML_SCALE_FLAG_CUSTOM_SF flag
- fix nearest interpolation out-of-bounds access in all backends
- add test_interpolate_sf test cases (38 tests)

llama changes:
- add HUNYUAN_VL architecture
- add rope.scaling.alpha and rope_dimension_sections support
- add M-RoPE support for HunyuanVL

mtmd changes:
- add PROJECTOR_TYPE_HUNYUANVL
- add special image token layout (BOI + rows with newlines + EOI)
- add set_position_mrope_hunyuanvl() for HunyuanVL M-RoPE
- add mtmd_decode_use_mrope_hunyuanvl() API

convert changes:
- add HunyuanVLVisionModel (mmproj) export
- add HunyuanVLTextModel export
Hi @ManaEstras, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
A general note that if you want to accelerate merging this PR:
- Do not add a new ggml op, follow the recommendation in my comment
- If you still have a very good reason to add a new op, you should only add CPU support in this PR
The more changes you include here, the longer it takes for other reviewers to approve, and the slower it can be merged.
if name.startswith("vit.perceive."):
    suffix = name[len("vit.perceive."):]
    if suffix.startswith("proj."):
        # proj.0.weight -> mm.0.weight, proj.2.weight -> mm.2.weight
        new_name = "mm." + suffix[len("proj."):]
    elif suffix.startswith("mlp."):
        # mlp.weight -> mm.model.fc.weight
        new_name = "mm.model.fc." + suffix[len("mlp."):]
    elif suffix.startswith("before_rms."):
        # before_rms.weight -> mm.pre_norm.weight
        new_name = "mm.pre_norm." + suffix[len("before_rms."):]
    elif suffix.startswith("after_rms."):
        # after_rms.weight -> mm.post_norm.weight
        new_name = "mm.post_norm." + suffix[len("after_rms."):]
    elif suffix == "image_newline":
        new_name = "v.image_newline"
    elif suffix == "image_sep":
        new_name = "v.view_seperator"
    else:
        # image_begin, image_end -> mm.image_begin, mm.image_end
        new_name = "mm." + suffix
    yield (new_name, data_torch)
    return
please use proper tensor mapping like all other models
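To illustrate the general idea behind this request, here is a minimal table-driven sketch of the same renames. It is not the project's actual mapping machinery: `PREFIX_MAP`, `EXACT_MAP`, and `map_perceive_name` are hypothetical names introduced only for this example, and the source and target strings are copied from the snippet under review.

```python
from typing import Optional

# Hypothetical lookup tables; NOT the real gguf-py mapping API.
# Every source/target string below comes from the reviewed snippet.
PREFIX_MAP = {
    "proj.": "mm.",
    "mlp.": "mm.model.fc.",
    "before_rms.": "mm.pre_norm.",
    "after_rms.": "mm.post_norm.",
}
EXACT_MAP = {
    "image_newline": "v.image_newline",
    "image_sep": "v.view_seperator",  # spelling kept exactly as in the snippet
}


def map_perceive_name(name: str) -> Optional[str]:
    """Return the renamed tensor, or None if the name is not under vit.perceive."""
    root = "vit.perceive."
    if not name.startswith(root):
        return None
    suffix = name[len(root):]
    if suffix in EXACT_MAP:
        return EXACT_MAP[suffix]
    for src, dst in PREFIX_MAP.items():
        if suffix.startswith(src):
            return dst + suffix[len(src):]
    # Fallback mirrors the snippet's else-branch (image_begin, image_end).
    return "mm." + suffix
```

Keeping the renames in data rather than in an if/elif chain makes them easier to audit in one place, which is presumably part of what the reviewer is asking for.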
    }
}

void set_position_mrope_hunyuanvl(llama_pos pos_0, int nx, int ny, llama_seq_id seq_id, int image_count = 0) {
remove any changes in mtmd-helper, implement it in mtmd_image_tokens_get_decoder_pos instead
uint32_t ny; // number of tokens in y direction
bool use_mrope_pos = false; // use M-RoPE position counting (the whole image is 1 temporal position)
uint32_t n_tokens() const { return nx * ny; }
uint32_t n_tokens_total = 0;
it should be uint32_t n_boi, the number of BOI tokens
please cherry-pick the logic from ngxson#100
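For context on why the total differs from `nx * ny`: the PR describes the layout as BOI + rows with newlines + EOI. The sketch below counts tokens under that description. It assumes exactly one BOI, one EOI, and one newline token after every row; the thread does not confirm these counts, and `hunyuanvl_layout` is a hypothetical helper, not code from the PR.

```python
from typing import List


def hunyuanvl_layout(nx: int, ny: int) -> List[str]:
    """Build a symbolic token sequence for an nx-by-ny patch grid.

    Assumption (not confirmed in the thread): one BOI, one newline
    after each of the ny rows, and one EOI.
    """
    tokens = ["BOI"]
    for _ in range(ny):
        tokens.extend(["PATCH"] * nx)  # nx patch tokens per row
        tokens.append("NEWLINE")       # assumed row separator
    tokens.append("EOI")
    return tokens


layout = hunyuanvl_layout(nx=3, ny=2)
n_patches = layout.count("PATCH")      # equals nx * ny
n_extra = len(layout) - n_patches      # BOI + newlines + EOI
```

Under these assumptions the total is `ny * (nx + 1) + 2`, so only the extra tokens need separate bookkeeping on top of `nx * ny`.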
// whether the current model uses HunyuanVL-style M-RoPE
// (token layout differs from standard 2D grid: BOI + rows-with-newlines + EOI)
MTMD_API bool mtmd_decode_use_mrope_hunyuanvl(mtmd_context * ctx);
remove this API, use decoder_pos, see ngxson#100 (same idea)
pos_patch = ggml_interpolate_sf(ctx0, pos_patch, pw, ph, n_embd, 1,
                                GGML_SCALE_MODE_BILINEAR,
                                (float)(pw + 0.1f) / n_grid,
                                (float)(ph + 0.1f) / n_grid);
IMO the new op is quite hacky (though an important note is that I'm not the one who can give approval for a new op). It's better to simply resize the embedding on CPU and set the resized embedding as a graph input.
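As a rough illustration of what a CPU-side resize with an explicit scale factor could look like, here is a pure-Python bilinear sketch for a single channel. It assumes the half-pixel-center convention (source coordinate `(dst + 0.5) / sf - 0.5`, clamped to the grid); the thread does not state which convention the HunyuanVL reference uses, and `resize_bilinear_sf` is a hypothetical helper, not an existing ggml or mtmd function.

```python
import math
from typing import List, Tuple


def resize_bilinear_sf(src: List[List[float]], out_h: int, out_w: int,
                       sf_h: float, sf_w: float) -> List[List[float]]:
    """Bilinear resize of one 2D channel using explicit scale factors.

    Assumption: half-pixel centers, coordinates clamped to the source grid.
    """
    in_h, in_w = len(src), len(src[0])

    def taps(dst: int, sf: float, n: int) -> Tuple[int, int, float]:
        # Map the output index back to a (clamped) source coordinate.
        x = min(max((dst + 0.5) / sf - 0.5, 0.0), float(n - 1))
        i0 = int(math.floor(x))
        i1 = min(i0 + 1, n - 1)
        return i0, i1, x - i0

    out = []
    for oy in range(out_h):
        y0, y1, fy = taps(oy, sf_h, in_h)
        row = []
        for ox in range(out_w):
            x0, x1, fx = taps(ox, sf_w, in_w)
            top = src[y0][x0] * (1.0 - fx) + src[y0][x1] * fx
            bot = src[y1][x0] * (1.0 - fx) + src[y1][x1] * fx
            row.append(top * (1.0 - fy) + bot * fy)
        out.append(row)
    return out


# Example: a 4x4 grid resized to 5x5 with the (size + 0.1) / n_grid factors
# that appear in the snippet above.
n_grid, pw, ph = 4, 5, 5
grid = [[float(r * n_grid + c) for c in range(n_grid)] for r in range(n_grid)]
resized = resize_bilinear_sf(grid, ph, pw, (ph + 0.1) / n_grid, (pw + 0.1) / n_grid)
```

A real position embedding has `n_embd` channels, so the same weights would be applied per channel; the resized result could then be uploaded as a graph input, as suggested above.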
Thx @ngxson, let me clean up some of the code and commit. I've opened two other PRs with the same modifications as this one, so please ignore those PRs for now.
@ManaEstras To clarify a bit, what I mean is: let's not change anything in GGML at this time, to avoid putting too much stress on backend maintainers (I'm not a backend maintainer btw, so I can only help you on the multimodal part). My idea is that you can either:
An alternative method is that you can also see if you can implement the same functionality with the existing
@ngxson Thanks for the review comments. Here's what I've addressed:
@wendadawen seems like you misunderstood my comments; please refer to #22037. Btw, there are 2 similar PRs and I don't know which one you are working on.
Overview
Update support for HunyuanVL vision-language model.
This PR includes:

ggml
- ggml_interpolate_sf() API for explicit scale factor interpolation (needed for HunyuanVL's (H+0.1)/n_grid position embedding scaling)
- GGML_SCALE_FLAG_CUSTOM_SF flag
- test_interpolate_sf test cases (38 tests covering various modes and edge cases)

llama
- HUNYUAN_VL architecture
- rope.scaling.alpha and rope_dimension_sections

mtmd
- PROJECTOR_TYPE_HUNYUANVL projector type
- set_position_mrope_hunyuanvl() and mtmd_decode_use_mrope_hunyuanvl() APIs

convert
- HunyuanVLVisionModel (mmproj) and HunyuanVLTextModel export support

Testing
- ctest -L main passed
- test-backend-ops passed (38 interpolate_sf tests on CPU and Metal)
- tools/mtmd/tests.sh smoke test added

Additional information
HunyuanVL uses a special position embedding interpolation that differs from standard models: it requires explicit scale factors (H+0.1)/n_grid instead of the simple H/n_grid ratio. This necessitated the new ggml_interpolate_sf() API.
The image token layout is also non-standard: instead of a simple nx * ny grid, HunyuanVL uses BOI + (patch rows with newline separators) + EOI, which required special handling in mtmd.

Requirements