mtmd: port gemma4uv/gemma4ua support — fixes Gemma 4 12B vision (#163) by TheTom · Pull Request #168 · TheTom/llama-cpp-turboquant

TheTom · 2026-06-05T01:40:36Z

Summary

Fixes #163. The fork could not load the new Gemma 4 12B mmproj: loading failed with "load_hparams: unknown projector type: gemma4uv" because the fork lacked the gemma4uv (unified, encoder-less vision) and gemma4ua (unified audio) projectors. The SIGFPE in the original report is the symptom seen on upstream; on this fork the model never got that far.

This cherry-picks the upstream fix ggml-org#24077 ("mtmd, model: allow skip build_vit()", commit a731805) into the fork.

Conflicts resolved

The cherry-pick was a clean 3-way merge except for 3 non-mtmd files (converter + vocab). Notably, the normalizer_lowercase lines that appeared in the diff context are pre-existing upstream code (from an earlier commit the fork hasn't synced), not part of ggml-org#24077. I dropped them so the port carries only ggml-org#24077's actual changes (suppress_tokens, the gemma4uv/ua graphs, projector enums, skip-build_vit).

Testing — verified end to end on CUDA (GB10, sm_121)

Downloaded ggml-org/gemma-4-12B-it-GGUF (Q4_K_M + mmproj-Q8_0) and ran llama-mtmd-cli on a real photo:

$ llama-mtmd-cli -m gemma-4-12B-it-Q4_K_M.gguf --mmproj mmproj-gemma-4-12B-it-Q8_0.gguf \
    --image cat.jpg -p "Describe this image in one sentence." --jinja
> A close-up, slightly blurry shot of a tabby cat's face with large dark eyes and a pink nose.

mmproj loads (gemma4uv projector recognized, no SIGFPE)
image encodes, model describes it accurately
Builds clean on CUDA (sm_121) and Metal, with 0 errors on both
Text-only path unaffected

Supersedes #166 (the standalone d_head guard is included here via ggml-org#24077's clip.cpp hunk).
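
For readers unfamiliar with the d_head guard, here is a minimal Python sketch of the idea, not the actual clip.cpp hunk. It assumes d_head is computed as n_embd / n_head and kq_scale as 1/sqrt(d_head), and that an encoder-less projector such as gemma4uv reports n_head == 0, which in C++ integer arithmetic would raise SIGFPE. The function name, the (0, 0.0) fallback, and the example dimensions are all hypothetical.

```python
import math


def attention_head_params(n_embd: int, n_head: int) -> tuple[int, float]:
    """Return (d_head, kq_scale), skipping the division when it would fault.

    Hypothetical helper: the real guard lives in clip.cpp and may choose
    different fallback values.
    """
    # Guard: with n_head == 0 the integer division below would divide by zero.
    d_head = n_embd // n_head if n_head > 0 else 0
    if d_head == 0:
        # Assumption: no attention heads means there is nothing to scale.
        return 0, 0.0
    kq_scale = 1.0 / math.sqrt(d_head)
    return d_head, kq_scale


# Illustrative dimensions only, not taken from the Gemma 4 hparams.
with_encoder = attention_head_params(1152, 16)
encoder_less = attention_head_params(1152, 0)
```

The point of the sketch is only that the zero check happens before the division, so a model without a ViT encoder can load without ever evaluating n_embd / 0.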

Credit: @guarismo for the report, upstream ggml-org#24077.

…p limit

The Metal im2col kernel launches KH*KW threads per threadgroup (one per kernel element). For large conv kernels, e.g. the Gemma 4 unified vision (gemma4uv) patch embedding, KH*KW exceeds the Apple GPU 1024-thread cap and the kernel hits a runtime GGML_ASSERT instead of producing a result.

Guard supports_op so an oversized im2col is declined; the backend scheduler then runs that one op on CPU while the rest of the graph stays on the GPU.

Fixes Gemma 4 12B vision on the Metal backend (verified end-to-end: loads mmproj + describes an image correctly on an M5 Max).

TheTom · 2026-06-05T02:06:58Z

Updated — the branch now fully supports Gemma 4 12B vision on both backends.

Commit stack (6):

1. mtmd, model: add Gemma 4 "unified" variant (ggml-org/llama.cpp#24077): base gemma4uv/gemma4ua support
2. fix(mtmd): floating point exception in new Gemma 4 12B model (ggml-org/llama.cpp#24088): d_head/kq_scale FPE guard
3. mtmd: enable non-causal vision for gemma 4 unified (ggml-org/llama.cpp#24082): non-causal vision attention
4. fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org/llama.cpp#24091): audio projector embedding size
5. Fix Gemma 4 Unified conversion (ggml-org/llama.cpp#24118): converter fix
6. ggml-metal: fall back to CPU for im2col when KH*KW exceeds threadgroup limit (new, fork-local)

Metal needed an extra fix. The gemma4uv patch-embedding conv decomposes to im2col, and the Metal im2col kernel launches KH*KW threads per threadgroup. For Gemma 4's large patch conv, KH*KW exceeds the Apple GPU 1024-thread cap, so it hit a runtime GGML_ASSERT(KH*KW <= max_threads) (this assert exists upstream too, with no upstream fix). The guard makes Metal decline the oversized im2col so the scheduler runs that one op on CPU; the rest of the graph stays on GPU.
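
The decline-and-fall-back behaviour described above can be modelled with a small Python sketch. This is a toy model, not ggml code: the Op class, metal_supports_op, and schedule are hypothetical names, the kernel sizes are made up for illustration (the real gemma4uv patch size is not stated here), and every op other than im2col is simply assumed to be supported by Metal.

```python
from dataclasses import dataclass

# Thread cap cited in the description above.
APPLE_GPU_MAX_THREADS_PER_THREADGROUP = 1024


@dataclass(frozen=True)
class Op:
    name: str
    kind: str
    kh: int = 0  # kernel height, only meaningful for im2col
    kw: int = 0  # kernel width, only meaningful for im2col


def metal_supports_op(op: Op, max_threads: int = APPLE_GPU_MAX_THREADS_PER_THREADGROUP) -> bool:
    """Decline im2col when one thread per kernel element would exceed the cap."""
    if op.kind == "im2col":
        return op.kh * op.kw <= max_threads
    return True  # assumption: all other ops in this toy graph run on Metal


def schedule(graph: list[Op]) -> dict[str, str]:
    """Assign each op to Metal if supported, otherwise fall back to CPU."""
    return {op.name: "Metal" if metal_supports_op(op) else "CPU" for op in graph}


# Hypothetical graph: one oversized patch conv, one matmul, one small conv.
graph = [
    Op("patch_embed_im2col", "im2col", kh=48, kw=48),  # 2304 threads > 1024
    Op("patch_embed_matmul", "mul_mat"),
    Op("small_im2col", "im2col", kh=3, kw=3),          # 9 threads, fits
]
placement = schedule(graph)
```

Only the oversized im2col lands on the CPU in this model; the other ops keep their GPU placement, which mirrors the split the description reports.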

Verified end-to-end on both:

CUDA (GB10, sm_121): loads mmproj + describes a cat image correctly
Metal (M5 Max): same result, EXIT 0, accurate description (the "CLIP graph uses unsupported operators" warning confirms the im2col→CPU split)

Builds clean on both backends, text-only path unaffected.

mtmd, model: allow skip build_vit() (ggml-org#24077)

27d622f

* add model * nits (cherry picked from commit a731805)

TheTom mentioned this pull request Jun 5, 2026

fix(mtmd): guard clip d_head/kq_scale against n_head==0 (Gemma 4 12B SIGFPE) #166

Closed

github-actions Bot added examples python model labels Jun 5, 2026

abetlen and others added 5 commits June 4, 2026 20:44

mtmd: fix Gemma 4 unified FPE (ggml-org#24088)

f3b990f

(cherry picked from commit 94a220c)

mtmd: enable non-causal vision for gemma 4 unified (ggml-org#24082)

882dbdf

(cherry picked from commit c8d6a00)

fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091)

f69717f

* mtmd: handle Gemma 4 audio projector embedding size * rm projection_dim from clip_n_mmproj_embd --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> (cherry picked from commit e3ba22d)

convert: Fix Gemma 4 Unified conversion (ggml-org#24118)

ab013ee

* Fix Gemma 4 Unified conversion * Set audio hidden size to audio_embed_dim (cherry picked from commit e802356)

github-actions Bot added ggml Apple Metal labels Jun 5, 2026

TheTom merged commit a62320c into feature/turboquant-kv-cache Jun 5, 2026
32 of 53 checks passed

TheTom merged 6 commits into feature/turboquant-kv-cache from feat/gemma4uv

4 participants
