
mtmd: build_vit batching #24352

Merged: ngxson merged 1 commit into ggml-org:master from sfallah:sf/build-vit-batching on Jun 9, 2026
sfallah commented Jun 9, 2026

Contributor

Overview

This PR introduces an optional batch dimension in build_vit, so a
caller can encode several same-size inputs (image tiles, frames) in one graph.
Existing models are unaffected: with a 2D [n_embd, n_pos] input (B == 1),
nothing changes.

Changes

  • build_vit takes inp as either [n_embd, n_pos] or [n_embd, n_pos, B].
  • The body runs on the flattened 2D tensor [n_embd, n_pos * B]; the batch
    dimension reappears only in self-attention, as 4D [d_head, n_head, n_pos, B]
    Q/K/V views. The output is restored to [n_embd, n_pos, B].
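
The shape bookkeeping described in the list above can be sketched without ggml. This is a minimal illustration, not the code from clip.cpp: the `Shape` alias and all helper names are hypothetical, and it assumes a contiguous layout with the first dimension varying fastest (as in ggml's `ne[0]`) and `d_head = n_embd / n_head`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical shape type for illustration only (not a ggml type).
// Dimension 0 is the fastest-varying one; unused dimensions are 1.
using Shape = std::array<int64_t, 4>;

// [n_embd, n_pos, B] -> [n_embd, n_pos * B]: what the ViT body operates on.
inline Shape flatten_batch(const Shape & inp) {
    return { inp[0], inp[1] * inp[2], 1, 1 };
}

// [n_embd, n_pos * B] -> [d_head, n_head, n_pos, B]: per-head Q/K/V view,
// assuming n_embd is divisible by n_head.
inline Shape attn_view(const Shape & flat, int64_t n_head, int64_t B) {
    assert(flat[0] % n_head == 0 && flat[1] % B == 0);
    return { flat[0] / n_head, n_head, flat[1] / B, B };
}

// [n_embd, n_pos * B] -> [n_embd, n_pos, B]: restore the batch on output.
inline Shape restore_batch(const Shape & flat, int64_t B) {
    assert(flat[1] % B == 0);
    return { flat[0], flat[1] / B, B, 1 };
}

// Linear offset of element (e, p, b) in a contiguous [n_embd, n_pos, B] tensor.
inline int64_t offset_3d(const Shape & s, int64_t e, int64_t p, int64_t b) {
    return e + s[0] * (p + s[1] * b);
}

// Linear offset of element (e, row) in a contiguous [n_embd, n_pos * B] tensor.
inline int64_t offset_2d(const Shape & s, int64_t e, int64_t row) {
    return e + s[0] * row;
}
```

Because position `p` of batch entry `b` lands on row `p + n_pos * b` of the flattened tensor, `offset_3d` and `offset_2d` agree for a contiguous tensor, so under these assumptions flattening and restoring need no data movement. A 2D input is the case `B == 1`, for which `flatten_batch` leaves the shape unchanged.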

First consumer: DeepSeek-OCR multi-tile encoding (#24300, stacked on this).

Testing

Built llama-mtmd-cli; DeepSeek-OCR single-view still matches (the B == 1 path).
Ran tools/mtmd/tests.sh big;
every test that passes on master also passes here.
The huge variant was not tested.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

sfallah requested a review from a team as a code owner June 9, 2026 10:30
ngxson commented Jun 9, 2026

Collaborator

can you run ./tools/mtmd/tests.sh and report the results here?

note: granite is known to be broken

sfallah commented Jun 9, 2026

Contributor Author

> can you run ./tools/mtmd/tests.sh and report the results here?
>
> note: granite is known to be broken

I already have. Two tests fail when I run tools/mtmd/tests.sh big,
the exact same two that fail on master (my base):

[vision] FAIL: ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] FAIL: ggml-org/HunyuanVL-4B-GGUF:Q8_0

sfallah commented Jun 9, 2026

Contributor Author

@ngxson
FYI: ggml-org/HunyuanVL-4B-GGUF:Q8_0 fails because it doesn't exist on the HF hub.

ngxson left a comment

Collaborator

yes that's expected

Comment thread tools/mtmd/clip.cpp
std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos,
const build_vit_opts & opts
) {
// batch dim: inp is [n_embd, n_pos] (B==1) or [n_embd, n_pos, B] (multi-tile encode)

Collaborator

Note that batching is not just for multi-tile encoding; it should eventually also allow batching multiple images of the same size. That will be important for video processing, where we need to process multiple images in the same pass.

I will fix this comment along with my refactoring, which adds the proper architecture for doing so.
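
As a rough illustration of the "multiple images of the same size" case, the sketch below groups inputs by resolution so that each group could be stacked along a batch dimension B and encoded in one pass. It is not the planned architecture: `ImageDesc` and `group_by_size` are hypothetical names that do not exist in mtmd or clip.cpp, and how a group would actually be stacked and passed to build_vit is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical image descriptor; not a type from mtmd or clip.cpp.
struct ImageDesc {
    int width;
    int height;
};

// Groups image indices by (width, height). Each group holds same-size images
// and is therefore a candidate for one batched encode; the original input
// order is preserved within a group.
inline std::map<std::pair<int, int>, std::vector<std::size_t>>
group_by_size(const std::vector<ImageDesc> & images) {
    std::map<std::pair<int, int>, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < images.size(); ++i) {
        assert(images[i].width > 0 && images[i].height > 0);
        groups[std::make_pair(images[i].width, images[i].height)].push_back(i);
    }
    return groups;
}
```

For video, where every frame usually shares one resolution, this would yield a single group containing all frames, which matches the single-pass use case described in the comment.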

ngxson merged commit 49f3542 into ggml-org:master on Jun 9, 2026
24 of 25 checks passed
ngxson mentioned this pull request Jun 9, 2026
6 tasks