Skip to content

mtmd : build_vit GQA support#23782

Merged
ngxson merged 1 commit into
ggml-org:masterfrom
sfallah:sf/clip-vit-gqa
May 28, 2026
Merged

mtmd : build_vit GQA support#23782
ngxson merged 1 commit into
ggml-org:masterfrom
sfallah:sf/clip-vit-gqa

Conversation

@sfallah
Copy link
Copy Markdown
Contributor

@sfallah sfallah commented May 27, 2026

Overview

This PR adds GQA support to the shared build_vit helper.

build_vit now reads hparams.n_head_kv and uses it for the K/V reshape. If n_head_kv is unset or zero, it falls back to n_head, preserving the existing MHA behavior.

Additional information

This closes a gap for GQA ViT models. Any GGUF that sets clip.vision.attention.head_count_kv can now work through the shared build_vit path instead of needing model-specific handling.

Existing callers are not affected. Current GGUFs that use build_vit do not set clip.vision.attention.head_count_kv, so n_head_kv defaults to n_head and the generated graph remains unchanged for those models.

A safety guard was added for the fused-QKV branch: it asserts that n_head_kv == n_head. That layout cannot represent GQA without splitting the fused tensor at a non-n_embd boundary.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI assistance was used to draft the code changes under my direction, and to review the PR description. I reviewed and tested every change, and I take responsibility for the full contents of this PR.

Copy link
Copy Markdown
Contributor

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code change looks good, but AI-generated comments look excessive

I don't get why we have to add 4 lines of comment just to explain a trivial line of code

Comment thread tools/mtmd/clip.cpp Outdated
// "clip.vision.attention.head_count_kv" (see KEY_N_HEAD_KV). It is 0 for
// every existing build_vit caller (those GGUFs do not set the key), so the
// fallback to n_head preserves MHA behaviour by default.
const int n_kv_head = hparams.n_head_kv > 0 ? hparams.n_head_kv : n_head;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the best way is to default n_head_kv to n_head then allow the loader to override it with gguf value. that's the exact behavior of n_head_kv inside llama.cpp

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about AI generated comment.
I have now tidied it up.
n_head_kv defaults to n_head in load_hparams as you suggested.

removed AI-generated comment
@sfallah sfallah force-pushed the sf/clip-vit-gqa branch from f061928 to f88d2dc Compare May 28, 2026 07:47
@sfallah sfallah marked this pull request as ready for review May 28, 2026 07:51
@sfallah sfallah requested a review from a team as a code owner May 28, 2026 07:51
@sfallah sfallah requested a review from ngxson May 28, 2026 07:57
@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented May 28, 2026

@ggml-org/maintainers can I have the 2nd approval? thanks!

@ngxson ngxson merged commit 0b56d28 into ggml-org:master May 28, 2026
26 of 27 checks passed
adrianhoehne pushed a commit to adrianhoehne/llama.cpp that referenced this pull request May 28, 2026
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 28, 2026
* origin/master: (32 commits)
hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (ggml-org#23835)
mtmd-debug: add color and rainbow mode (ggml-org#23829)
mtmd: fix gemma 4 projector pre_norm (ggml-org#23822)
opencl: move backend info printing into its own function (ggml-org#23702)
ci : run ui publish on ubuntu-slim (ggml-org#23818)
ui: fix audio and video modality detection (ggml-org#23756)
ci : releases use Github-hosted builds for the UI (ggml-org#23823)
app : improve help output (ggml-org#23805)
mtmd: n_head_kv defaults to n_head (ggml-org#23782)
mtmd: fix gemma 4 audio rms norm eps (ggml-org#23815)
ci : change Vulkan builds to Release to reduce ccache (ggml-org#23820)
arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (ggml-org#23167)
test-llama-archs: fix table format [no release] (ggml-org#23810)
ggml: auto apply iGPU flag CUDA/HIP if integrated device (ggml-org#23007)
mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (ggml-org#23729)
CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (ggml-org#23227)
server: minor tweaks to use more cpp features (ggml-org#23785)
hexagon: minor refresh for HMX FA and MM (ggml-org#23796)
vulkan: fast path for walsh-hadamard transform (ggml-org#23687)
chat : add Granite 4.1 chat template (ggml-org#23518)
...
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants