mtmd: build_vit batching #24352
can you run note: granite is known to be broken |
I have already, there are two that are failing when I run
@ngxson
    std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos,
    const build_vit_opts & opts
) {
    // batch dim: inp is [n_embd, n_pos] (B==1) or [n_embd, n_pos, B] (multi-tile encode)
Note that batching is not just for multi-tile encoding; it should eventually allow batching multiple images of the same size. That will be important for video processing, where we need to process multiple images in the same pass.
I will fix this comment along with my refactoring to add the proper architecture for doing so
Overview
This PR introduces an optional batch dimension in build_vit, so a caller can encode several same-size inputs (image tiles, frames) in one graph.
No change for existing models: for a 2D [n_embd, n_pos] input (B == 1), nothing changes.
Changes
build_vit takes inp as [n_embd, n_pos] or [n_embd, n_pos, B].
Internally the input is flattened to [n_embd, n_pos * B]; the batch only reappears in self-attention as 4D [d_head, n_head, n_pos, B] Q/K/V views.
Output is restored to [n_embd, n_pos, B].
First consumer: DeepSeek-OCR multi-tile encoding (#24300, stacked on this).
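The shape handling described above can be illustrated without ggml at all. The following is a minimal sketch, assuming contiguous storage with the first dimension varying fastest (the ggml convention) and n_embd == d_head * n_head. The names Shape3, offset_3d, offset_flat, offset_4d and views_agree are hypothetical helpers made up for this illustration; they are not part of mtmd or ggml, and this is not the code from the PR. It only shows why the 3D input, the flattened 2D view and the 4D attention view can all alias the same buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape holder for illustration only (not an mtmd/ggml type).
struct Shape3 {
    size_t n_embd;
    size_t n_pos;
    size_t n_batch;
};

// Offset of element (e, p, b) in a contiguous [n_embd, n_pos, B] buffer,
// first dimension varying fastest.
size_t offset_3d(const Shape3 & s, size_t e, size_t p, size_t b) {
    return e + s.n_embd * (p + s.n_pos * b);
}

// Offset of element (e, row) in the flattened [n_embd, n_pos * B] view.
size_t offset_flat(const Shape3 & s, size_t e, size_t row) {
    return e + s.n_embd * row;
}

// Offset of element (d, h, p, b) in the 4D [d_head, n_head, n_pos, B] view,
// assuming n_embd == d_head * n_head.
size_t offset_4d(const Shape3 & s, size_t d_head,
                 size_t d, size_t h, size_t p, size_t b) {
    assert(s.n_embd % d_head == 0);
    return d + d_head * h + s.n_embd * (p + s.n_pos * b);
}

// Returns true if all three views address the same element for every index,
// i.e. switching between them needs no data movement.
bool views_agree(const Shape3 & s, size_t d_head) {
    const size_t n_head = s.n_embd / d_head;
    for (size_t b = 0; b < s.n_batch; ++b) {
        for (size_t p = 0; p < s.n_pos; ++p) {
            for (size_t h = 0; h < n_head; ++h) {
                for (size_t d = 0; d < d_head; ++d) {
                    const size_t e   = d + d_head * h;   // embedding index
                    const size_t row = p + s.n_pos * b;  // flattened position
                    const size_t o3  = offset_3d(s, e, p, b);
                    if (o3 != offset_flat(s, e, row))            return false;
                    if (o3 != offset_4d(s, d_head, d, h, p, b))  return false;
                }
            }
        }
    }
    return true;
}
```

For example, calling views_agree(Shape3{8, 5, 3}, 4) checks a batch of three inputs, and views_agree(Shape3{8, 5, 1}, 4) checks the B == 1 case, where the flattened view is simply the original [n_embd, n_pos] layout.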
Testing
Built llama-mtmd-cli; DeepSeek-OCR single-view still matches (the B == 1 path).
Ran tools/mtmd/tests.sh big; all tests that pass on master pass here too.
The huge variant is not tested.
Requirements