mtmd: build_vit batching by sfallah · Pull Request #24352 · ggml-org/llama.cpp

sfallah · 2026-06-09T10:30:32Z

Overview

This PR introduces an optional batch dimension in build_vit, so a
caller can encode several same-size inputs (image tiles, frames) in one graph.
Existing models are unaffected: a 2D [n_embd, n_pos] input is the B == 1
case and follows the same path as before.

Changes

build_vit takes inp as either [n_embd, n_pos] or [n_embd, n_pos, B].
The body runs on the flattened 2D tensor [n_embd, n_pos * B]; the batch
dimension only reappears in self-attention, as 4D [d_head, n_head, n_pos, B]
Q/K/V views. The output is restored to [n_embd, n_pos, B].
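
The shape bookkeeping described above can be sketched without ggml. This is an illustration only, not the code in this PR: the helper names (vit_flatten, vit_qkv_view, vit_restore, offset_3d, offset_2d) are hypothetical, and the sketch assumes a contiguous tensor in ggml's layout, where ne[0] is the fastest-varying dimension and unused trailing dimensions are 1.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// ggml-style shape: index 0 is the fastest-varying (contiguous) dimension,
// and unused trailing dimensions are 1.
using shape4 = std::array<int64_t, 4>;

// Hypothetical helper: batch size B of the input.
// A 2D [n_embd, n_pos] input has a third dimension of 1, i.e. B == 1.
inline int64_t vit_batch_size(const shape4 & inp) {
    return inp[2];
}

// Hypothetical helper: shape the layer body works on, [n_embd, n_pos * B].
inline shape4 vit_flatten(const shape4 & inp) {
    return { inp[0], inp[1] * inp[2], 1, 1 };
}

// Hypothetical helper: shape of the Q/K/V views used in self-attention,
// [d_head, n_head, n_pos, B], derived from a flattened [n_embd, n_pos * B].
inline shape4 vit_qkv_view(const shape4 & flat, int64_t n_head, int64_t B) {
    return { flat[0] / n_head, n_head, flat[1] / B, B };
}

// Hypothetical helper: restore the output shape to [n_embd, n_pos, B].
inline shape4 vit_restore(const shape4 & flat, int64_t B) {
    return { flat[0], flat[1] / B, B, 1 };
}

// Linear element offset of (i_embd, i_pos, i_b) in a contiguous 3D tensor.
inline int64_t offset_3d(const shape4 & s, int64_t i_embd, int64_t i_pos, int64_t i_b) {
    return i_embd + s[0] * (i_pos + s[1] * i_b);
}

// Linear element offset of (i_embd, i_col) in the flattened 2D tensor.
inline int64_t offset_2d(const shape4 & flat, int64_t i_embd, int64_t i_col) {
    return i_embd + flat[0] * i_col;
}
```

For contiguous data, position i_pos of batch entry i_b lands in column i_pos + n_pos * i_b of the flattened tensor, so flattening and restoring are pure reshapes and per-position operations need no knowledge of B. Only attention has to keep batch entries apart, which is what the 4D views are for.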

First consumer: DeepSeek-OCR multi-tile encoding (#24300, stacked on this).

Testing

Built llama-mtmd-cli; DeepSeek-OCR single-view still matches (the B == 1 path).
Ran tools/mtmd/tests.sh big; every test that passes on master also passes here.
The huge variant was not tested.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

ngxson · 2026-06-09T10:33:28Z

can you run ./tools/mtmd/tests.sh and report the results here?

note: granite is known to be broken

sfallah · 2026-06-09T10:39:35Z

> can you run ./tools/mtmd/tests.sh and report the results here?
>
> note: granite is known to be broken

I already have. Two tests fail when I run tools/mtmd/tests.sh big,
the exact same two that fail on master (my base):

[vision] FAIL: ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] FAIL: ggml-org/HunyuanVL-4B-GGUF:Q8_0

sfallah · 2026-06-09T11:18:03Z

@ngxson
FYI: ggml-org/HunyuanVL-4B-GGUF:Q8_0 fails because it doesn't exist on HF hub.

ngxson

yes that's expected

ngxson · 2026-06-09T14:07:41Z

            std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos,
            const build_vit_opts & opts
        ) {
+    // batch dim: inp is [n_embd, n_pos] (B==1) or [n_embd, n_pos, B] (multi-tile encode)


Note that batching is not just for multi-tile encoding; it should eventually allow batching multiple images of the same size. That will be important for video processing, where we need to process multiple images in the same pass.

I will fix this comment along with my refactoring to add the proper architecture for doing so.
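
One way a caller might prepare same-size batches, for example for video frames, is to group inputs by resolution first. The sketch below is an assumption for illustration only and is not the architecture referred to in the comment above: image_size and group_by_size are hypothetical names, not mtmd types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical image descriptor (width and height in pixels); not an mtmd type.
struct image_size {
    int nx;
    int ny;
};

// Hypothetical helper: group image indices by (nx, ny).
// Every group holds images of identical size, so its members could be
// stacked along the batch dimension B and encoded in one pass.
inline std::map<std::pair<int, int>, std::vector<size_t>>
group_by_size(const std::vector<image_size> & imgs) {
    std::map<std::pair<int, int>, std::vector<size_t>> groups;
    for (size_t i = 0; i < imgs.size(); ++i) {
        groups[std::make_pair(imgs[i].nx, imgs[i].ny)].push_back(i);
    }
    return groups;
}
```

Each group preserves the original input order of its members, so encoded outputs can be mapped back to their source indices afterwards.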

mtmd: build_vit batching

ca3bd23

sfallah requested a review from a team as a code owner June 9, 2026 10:30

github-actions Bot added the examples label Jun 9, 2026

ngxson approved these changes Jun 9, 2026

CISC approved these changes Jun 9, 2026

ngxson reviewed Jun 9, 2026

ngxson merged commit 49f3542 into ggml-org:master Jun 9, 2026
24 of 25 checks passed

ngxson mentioned this pull request Jun 9, 2026

mtmd: add batching API #24384

Draft

6 tasks

ngxson merged 1 commit into ggml-org:master from sfallah:sf/build-vit-batching
