mtmd: port gemma4uv/gemma4ua support — fixes Gemma 4 12B vision (#163) #168
Merged
* add model
* nits

(cherry picked from commit a731805)
(cherry picked from commit 94a220c)
(cherry picked from commit c8d6a00)
)
* mtmd: handle Gemma 4 audio projector embedding size
* rm projection_dim from clip_n_mmproj_embd

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
(cherry picked from commit e3ba22d)
* Fix Gemma 4 Unified conversion
* Set audio hidden size to audio_embed_dim

(cherry picked from commit e802356)
…p limit

The Metal im2col kernel launches KH*KW threads per threadgroup (one per kernel element). For large conv kernels, e.g. the Gemma 4 unified vision (gemma4uv) patch embedding, KH*KW exceeds the Apple GPU 1024-thread cap and the kernel hits a runtime GGML_ASSERT instead of producing a result.

Guard supports_op so an oversized im2col is declined; the backend scheduler then runs that one op on CPU while the rest of the graph stays on the GPU.

Fixes Gemma 4 12B vision on the Metal backend (verified end-to-end: loads mmproj + describes an image correctly on an M5 Max).
Updated — the branch now fully supports Gemma 4 12B vision on both backends. Commit stack (6):
Metal needed an extra fix. The gemma4uv patch-embedding conv decomposes to im2col, and the Metal im2col kernel launches KH*KW threads per threadgroup. For Gemma 4's large patch conv, KH*KW exceeds the Apple GPU 1024-thread cap, so it hit a runtime GGML_ASSERT.

Verified end-to-end on both backends. Builds clean on both, and the text-only path is unaffected.
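The guard described above can be sketched in a few lines. This is a minimal illustration, not the actual ggml Metal code: the `Im2colShape` struct, `metal_supports_im2col`, `pick_backend`, and the hard-coded constant are all hypothetical names chosen for this example, and a real backend would more likely query the compute pipeline for its thread limit than hard-code 1024.

```cpp
#include <cstdint>

// Hypothetical stand-in for the im2col kernel shape; not the ggml tensor type.
struct Im2colShape {
    int64_t kh; // kernel height (KH)
    int64_t kw; // kernel width  (KW)
};

// The 1024-thread cap cited in this PR, hard-coded here as an assumption.
constexpr int64_t kMaxThreadsPerThreadgroup = 1024;

// True when one thread per kernel element (KH*KW) fits in a single threadgroup.
constexpr bool metal_supports_im2col(const Im2colShape & s) {
    return s.kh > 0 && s.kw > 0 && s.kh * s.kw <= kMaxThreadsPerThreadgroup;
}

enum class Backend { GPU, CPU };

// Mimics the scheduler fallback: an op the GPU backend declines runs on CPU.
constexpr Backend pick_backend(const Im2colShape & s) {
    return metal_supports_im2col(s) ? Backend::GPU : Backend::CPU;
}
```

With these assumptions, a 3x3 kernel (9 threads) stays on the GPU, while an illustrative 48x48 kernel (2304 threads) is declined and falls back to CPU. Declining in the support check rather than asserting at dispatch time is what lets the rest of the graph keep running on the GPU.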
Summary
Fixes #163. The fork could not load the new Gemma 4 12B mmproj: it errored with "load_hparams: unknown projector type: gemma4uv" because the fork lacked the gemma4uv (unified, encoder-less vision) and gemma4ua (unified audio) projectors. The SIGFPE in the original report is the upstream symptom; on this fork the model never got that far.

This cherry-picks the upstream fix ggml-org#24077 ("mtmd, model: allow skip build_vit()", commit a731805) into the fork.
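To illustrate why a missing projector entry fails at load time rather than later, here is a minimal sketch of a name-to-enum lookup. It is an assumption about the general shape of such a check, not the fork's clip.cpp code: `ProjectorType` and `parse_projector_type` are hypothetical names, and only the two projector strings and the error wording come from this PR.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical enum covering only the two projectors this PR adds.
enum class ProjectorType { GEMMA4UV, GEMMA4UA };

// Maps the projector name stored in the mmproj file to an enum value.
// A name missing from the table is rejected before any graph is built.
inline ProjectorType parse_projector_type(const std::string & name) {
    static const std::unordered_map<std::string, ProjectorType> known = {
        {"gemma4uv", ProjectorType::GEMMA4UV}, // unified, encoder-less vision
        {"gemma4ua", ProjectorType::GEMMA4UA}, // unified audio
    };
    const auto it = known.find(name);
    if (it == known.end()) {
        throw std::runtime_error("unknown projector type: " + name);
    }
    return it->second;
}
```

Under this sketch, a build whose table lacks "gemma4uv" throws during hyperparameter loading, which matches the observation that the fork never reached the point where the upstream SIGFPE occurs.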
Conflicts resolved
Clean 3-way except 3 non-mtmd files (converter + vocab). Notably:
normalizer_lowercase lines that appeared in the diff context are pre-existing upstream code (from an earlier commit the fork hasn't synced), not part of ggml-org#24077. They were dropped so the port only carries ggml-org#24077's actual changes (suppress_tokens, the gemma4uv/ua graphs, projector enums, skip-build_vit).

Testing: verified end to end on CUDA (GB10, sm_121)
Downloaded ggml-org/gemma-4-12B-it-GGUF (Q4_K_M + mmproj-Q8_0) and ran llama-mtmd-cli on a real photo.

Supersedes #166 (the standalone d_head guard is included here via ggml-org#24077's clip.cpp hunk).

Credit: @guarismo for the report, upstream ggml-org#24077.