mtmd: fix patch_size initialized to random value in audio models by ngxson · Pull Request #17128 · ggml-org/llama.cpp

ngxson · 2025-11-09T18:51:30Z

Tests are OK:

[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M

CISC

Can't you just have default values in clip_hparams though?

ngxson · 2025-11-09T19:01:50Z

Can't you just have default values in clip_hparams though?

Either way is possible, but I think it's better to avoid communicating patch_size should have a default value. It's technically should not have a default value. And in the case of audio model, it's more like a dummy value.

CISC · 2025-11-09T19:06:13Z

Can't you just have default values in clip_hparams though?

Either way is possible, but I think it's better to avoid communicating patch_size should have a default value. It's technically should not have a default value. And in the case of audio model, it's more like a dummy value.

What about at least initializing it to 0 so it'll crash if one forgets for a new modality?

henk717 · 2025-11-09T19:07:39Z

Initializing to 0 breaks the current audio model implementation. It was found because my system initialized it as 0 already resulting in audio models failing to launch at random moments.

CISC · 2025-11-09T19:08:39Z

Initializing to 0 breaks the current audio model implementation. It was found because my system initialized it as 0 already resulting in audio models failing to launch at random moments.

I meant in clip_hparams (while still setting it to 1 for audio).

ngxson · 2025-11-09T21:05:22Z

What about at least initializing it to 0 so it'll crash if one forgets for a new modality?

yes and even better, I think there should be default value for other hparams too (added in 4ac2c20)

This should prevent forgetting to initialize one of the params in general.

LostRuins · 2025-11-10T02:13:44Z

also there are multiple GGML_ASSERT scattered throughout the code that can trip if patch size is an unexpected value even if unused. but this approach should work.

* model : Granite docling + Idefics3 preprocessing (SmolVLM) (ggml-org#16206) * feat: Add granite-docling conversion using trillion pretokenizer Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add granite-docling vocab pre enum Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use granite-docling pre Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add clip_is_idefics3 Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow multi-token boundary sequences for image templating Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tiling support for idefices3 in clip.cpp This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Partial support for full templating for idefics3 in mtmd There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Fully working image preprocessing for idefics3 w/ resize and slicing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the longest side instead of size * scale_factor For Granite Docling, these come out to the same value, but that was just a conicidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Allow batch encoding and remove clip_is_idefics3 Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove unnecessary conditionals for empty token vectors Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use image_manipulation util Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * add test model --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # convert_hf_to_gguf_update.py # gguf-py/gguf/constants.py # gguf-py/gguf/gguf_writer.py # src/llama-vocab.cpp # src/llama-vocab.h * mtmd : support home-cooked Mistral Small Omni (ggml-org#14928) * model : add LightOnOCR-1B model (ggml-org#16764) * model : add LightOnOCR-1B model * add test # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py * mtmd : fix idefics3 preprocessing (ggml-org#16806) * mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite * model: Add support for CogVLM model (ggml-org#15002) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. * Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> # Conflicts: # convert_hf_to_gguf.py # examples/mtmd/clip.cpp # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama-model.h * mtmd: refactor preprocessing + support max/min pixels (ggml-org#16878) * mtmd: refactor preprocessing + support max/min pixels * fix mlp type * implement mix/max pixels * improve hparams * better image preproc for qwen * fix * fix out of bound composite * fix (2) * fix token calculation * get_merge_kernel_size() * fix llama4 and lfm2 * gonna fix them all * use simple resize for qwen * qwen: increase min tokens * no resize if dst size == src size * restore to initial min/max tokens value for qwen # Conflicts: # examples/mtmd/clip.cpp * clip : use FA (ggml-org#16837) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * model: add Janus Pro for image understanding (ggml-org#16906) * Add support for Janus Pro * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address reviewer suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add JANUS_PRO constant * Update clip model handling Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Refactor JANUS_PRO handling in clip.cpp Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * em whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> # Conflicts: # convert_hf_to_gguf.py # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py * mtmd: pad mask for qwen2.5vl (ggml-org#16954) * mtmd: pad mask for qwen2.5vl * improve * mtmd: add --image-min/max-tokens (ggml-org#16921) * mtmd: improve struct initialization (ggml-org#16981) * mtmd: allow QwenVL to process larger image by default (ggml-org#17020) * Disable flash attention * mtmd : fix embedding size for image input (ggml-org#17123) * mtmd: fix patch_size initialized to random value in audio models (ggml-org#17128) * mtmd: fix patch_size initialized to random value in audio models * add default hparams * add llama_model_n_embd_inp * Fix load qwen3 vl Change batch size * Add description * Fix cli build error --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: firecoperana <firecoperana>

…l-org#17128) * mtmd: fix patch_size initialized to random value in audio models * add default hparams

mtmd: fix patch_size initialized to random value in audio models

1fffbd3

CISC approved these changes Nov 9, 2025

View reviewed changes

github-actions bot added the examples label Nov 9, 2025

DajanaV mentioned this pull request Nov 9, 2025

UPSTREAM PR #17128: mtmd: fix patch_size initialized to random value in audio models auroralabs-loci/llama.cpp#150

Closed

add default hparams

4ac2c20

LostRuins mentioned this pull request Nov 10, 2025

CUDA division by zero error when loading Qwen2.5-Omni multimodal projector LostRuins/koboldcpp#1663

Closed

LostRuins approved these changes Nov 10, 2025

View reviewed changes

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Nov 10, 2025

merge ggml-org#17128

df6e303

ngxson merged commit 4b13a68 into ggml-org:master Nov 10, 2025
71 checks passed

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

mtmd: fix patch_size initialized to random value in audio models (ggm…

e3a5af9

…l-org#17128) * mtmd: fix patch_size initialized to random value in audio models * add default hparams

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mtmd: fix patch_size initialized to random value in audio models#17128

mtmd: fix patch_size initialized to random value in audio models#17128
ngxson merged 2 commits intoggml-org:masterfrom
ngxson:xsn/fix_audio_patch_size_zero

ngxson commented Nov 9, 2025 •

edited

Loading

Uh oh!

CISC left a comment

Uh oh!

ngxson commented Nov 9, 2025

Uh oh!

CISC commented Nov 9, 2025

Uh oh!

henk717 commented Nov 9, 2025

Uh oh!

CISC commented Nov 9, 2025

Uh oh!

ngxson commented Nov 9, 2025 •

edited

Loading

Uh oh!

LostRuins commented Nov 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ngxson commented Nov 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CISC left a comment

Choose a reason for hiding this comment

Uh oh!

ngxson commented Nov 9, 2025

Uh oh!

CISC commented Nov 9, 2025

Uh oh!

henk717 commented Nov 9, 2025

Uh oh!

CISC commented Nov 9, 2025

Uh oh!

ngxson commented Nov 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LostRuins commented Nov 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ngxson commented Nov 9, 2025 •

edited

Loading

ngxson commented Nov 9, 2025 •

edited

Loading