mtmd: DeepSeek-OCR multi-tile dynamic resolution batched encoding by sfallah · Pull Request #24300 · ggml-org/llama.cpp

sfallah · 2026-06-08T11:13:25Z

Overview

This PR adds batched image encoding for both DeepSeek-OCR (DSOCR) versions and unifies their encoding and preprocessing. It builds on the existing DeepSeek-OCR v1/v2 support.

Instead of encoding each tile iteratively, all tiles are now encoded in a single graph execution. build_vit is extended with a batch dimension.
The v1 and v2 preprocessors are merged into one multi-tile dynamic-resolution preprocessor.

Main changes

added batched image encoding for both DeepSeek-OCR and DeepSeek-OCR-2
added a batch dimension to build_vit; the batch dim is flattened everywhere except self-attention, which uses 4D [d_head, n_head, n_pos, B] Q/K/V views
added encode_deepseekocr: the tile grid is encoded as one batch, then the global view is encoded and appended
unified the v1 and v2 image preprocessors into one multi-tile dynamic-resolution preprocessor (global view + local tile grid)
made clip_n_output_tokens grid-aware for the tiled token counts
added reserve_dsocr_max_tiles to pre-reserve the worst-case tile batch at warmup
pinned the SAM layer-norm epsilon to 1e-6 and the v1 ViT body to 1e-5 (matches the HF reference)

Implementation notes

The batch dimension B = grid_x * grid_y is derived inside build_vit. The body runs on a flattened 2D [n_embd, n_pos * B] tensor; the batch only re-appears in the 4D self-attention views. For B == 1 (all non-DSOCR models and the single global view) the graph is unchanged.
All tiles are encoded as a single batch; the global view is encoded separately and appended, in [tiles..., global] order.
The SAM stage runs its layer-norms at 1e-6 while the v1 CLIP/ViT body runs at 1e-5, following the HF reference.
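A minimal sketch of why the flattened 2D body and the 4D attention views can share the same memory, assuming a contiguous layout where the first dimension varies fastest (the ggml convention). The helper names `offset_2d`, `offset_4d`, and `check_views_alias` are hypothetical and only illustrate the index arithmetic; they are not part of the PR.

```python
def offset_2d(e, col, n_embd):
    """Linear offset of element (e, col) in a contiguous [n_embd, n_pos * B] tensor."""
    return e + n_embd * col


def offset_4d(d, h, p, b, d_head, n_head, n_pos):
    """Linear offset of element (d, h, p, b) in a contiguous [d_head, n_head, n_pos, B] tensor."""
    return d + d_head * (h + n_head * (p + n_pos * b))


def check_views_alias(d_head, n_head, n_pos, batch):
    """Return True if both shapes address the same element at every index."""
    n_embd = d_head * n_head
    for b in range(batch):
        for p in range(n_pos):
            for h in range(n_head):
                for d in range(d_head):
                    e = d + d_head * h      # channel index in the 2D body
                    col = p + n_pos * b     # flattened (position, tile) column
                    if offset_2d(e, col, n_embd) != offset_4d(d, h, p, b, d_head, n_head, n_pos):
                        return False
    return True
```

Because the two offsets agree everywhere, the 4D shape is a pure reinterpretation of the 2D buffer. Only self-attention needs it, so that each tile attends over its own `n_pos` positions rather than over all `n_pos * B` columns.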

Testing

extended tools/mtmd/tests/test-deepseek-ocr.py with multi-tile (dynamic-resolution) cases for both versions
tests.sh passes for all models (the huge variant is not tested)
manually verified with llama-mtmd-cli and llama-server

Caveats

The v1 1280 "large" single-view mode is removed. v1 now always uses a 1024 global view plus dynamic 640px tiles; the fixed 1024/1280 single-view selection is superseded by dynamic tiling.
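The PR text does not spell out how the tile grid is chosen. As an illustration only, one common approach picks the grid whose aspect ratio is closest to the image's, under a cap on the tile count. The function `pick_grid` and the `max_tiles` parameter below are assumptions made for this sketch, not the PR's implementation.

```python
def pick_grid(width, height, max_tiles):
    """Pick (grid_x, grid_y) with grid_x * grid_y <= max_tiles whose
    aspect ratio is closest to width / height.

    Hypothetical selection rule; ties keep the first candidate found,
    which is the one with fewer rows.
    """
    target = width / height
    best = (1, 1)
    best_diff = abs(target - 1.0)
    for gy in range(1, max_tiles + 1):
        for gx in range(1, max_tiles // gy + 1):
            diff = abs(target - gx / gy)
            if diff < best_diff:
                best, best_diff = (gx, gy), diff
    return best


def batch_size(grid):
    """Number of local tiles encoded together: B = grid_x * grid_y."""
    return grid[0] * grid[1]
```

Under this rule, a 1280x640 image maps to a 2x1 grid of 640px tiles, so two tiles are encoded as one batch and the 1024 global view is encoded separately.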

@ngxson

I'm aware you might prefer not to support batched encoding because of the memory overhead. The overhead is capped for DSOCR models (worst case reserved at warmup), and all other models are unchanged (B = 1).
I have prepared a sequential (non-batched) alternative (see sf/deepseek-ocr-mul-tile-dyn-res) that I consider to be a hack. That is why I followed this path more seriously.
This can be split into a separate build_vit batching PR and this one, if you'd prefer.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

ngxson · 2026-06-08T11:20:33Z

unfortunately I don't think we can accept this change as-is: it's a large change and you didn't open any discussion before implementing it. as a consequence, it already conflicts with recent changes from #21858

ngxson · 2026-06-08T11:23:49Z

+//
+// v1 weaves newlines onto the grid in-graph;
+// v2 just concatenates the per-tile query tokens.
+static bool encode_deepseekocr(clip_ctx * ctx_clip,


nope, it's too hacky this way

instead of hacking to benefit only deepseek-ocr, we should discuss & make a proper batching API that can benefit all models

sfallah · 2026-06-08T13:13:13Z

> unfortunately I don't think we can accept this change as-is: it's a large change and you didn't open any discussion before implementing it. as a consequence, it already conflicts with recent changes from #21858

@ngxson
sorry, my bad!
Honestly this is my way of opening the discussion.

It wasn't my aim to actually introduce batching in build_vit. My aim was/is to finish the DeepSeek-OCR support in llama.cpp in the best possible way.

As always I will be grateful if you tell me how we can proceed.
BTW: this will most probably be the last thing that I will implement for DSOCR.

ngxson · 2026-06-08T14:04:52Z

> As always I will be grateful if you tell me how we can proceed.

the more important thing for now is to plan: coding it is easy, but it will be hard to design an API that is both (1) easy to adopt in downstream code and (2) more importantly, model-agnostic.

in theory, batching can be enabled on any multimodal encoder (image and audio input included), not just deepseek-ocr. the simplest way of batching is to allow multiple inputs of the same size to be batched along the 4th dim. dynamic resolution (i.e. 2D rope, m-rope) can also be supported with a mask, but that's quite complicated so it can be skipped for now. so point (2) is not very difficult to achieve.

point (1) is a different story. models can have text between 2 images, which makes it tricky to keep track of image positions. a new set of APIs (both core API and helpers) will need to be added to handle this case, but I don't have a good picture of it yet.

either way, I don't think this PR is useful as-is, as it's built mostly around the quirks of deepseek-ocr (for example, no special tokens between 2 images). if we manage to enable model-agnostic batching, then deepseek-ocr will automatically have it, as it's just a sub-case of llava-uhd.
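To make the two points above concrete, here is a rough sketch of same-size batching that still preserves each image's position among interleaved text chunks. The chunk representation and the names `plan_batches`, `encode_in_order`, and `encode_batch` are hypothetical; this is not the mtmd API, only an illustration of the bookkeeping involved.

```python
from collections import defaultdict


def plan_batches(chunks):
    """Group image chunks by size, remembering where each one sits.

    chunks: list of ("text", str) or ("image", (width, height)) tuples.
    Returns {size: [chunk indices]} so that same-sized images can share
    one batched encoder call.
    """
    groups = defaultdict(list)
    for idx, (kind, payload) in enumerate(chunks):
        if kind == "image":
            groups[payload].append(idx)
    return dict(groups)


def encode_in_order(chunks, encode_batch):
    """Encode each size group in one call, then scatter the results back.

    encode_batch(size, count) stands in for a batched encoder and must
    return `count` embeddings in input order.
    """
    out = [payload if kind == "text" else None for kind, payload in chunks]
    for size, idxs in plan_batches(chunks).items():
        embeddings = encode_batch(size, len(idxs))
        for idx, emb in zip(idxs, embeddings):
            out[idx] = emb
    return out
```

The index lists are what keep text and image chunks aligned after batching. Inputs of different sizes simply end up in separate batches, which sidesteps the masking needed for true dynamic resolution.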

ngxson · 2026-06-09T23:09:17Z

I'm replacing this one with #24384

Still WIP, but the public API already works with video input (e.g. multiple input images of the same size), so it should be trivial to adapt it to the new 4th dim in build_vit()

sfallah · 2026-06-10T06:36:03Z

@ngxson
Beyond batch encoding, this PR also included some DSOCR-specific logic.

added batched image encoding for both DeepSeek-OCR and DeepSeek-OCR-2
The v1 and v2 preprocessors are merged into one multi-tile dynamic-resolution preprocessor.
The SAM stage runs its layer-norms at 1e-6 while the v1 CLIP/ViT body runs at 1e-5, following the HF reference.
The DSOCR regression test is also extended here to include multi-tile test cases for both versions.

Beyond that, something special about DeepSeek-OCR v1 is that newline tensors are appended/concatenated per tile-grid row. This is why a proper DSOCR v1 dynamic-resolution multi-tile implementation couldn't have been done cleanly without batch encoding.

I will wait for #24384 to introduce the batching API, and then adapt the DSOCR implementation according to the new API design.
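The v1 newline handling mentioned above can be sketched as follows. This assumes the tiles are stitched into one large token grid and a single newline token is appended at the end of each stitched row; that layout is an assumption made for illustration, and `weave_rows` and `NEWLINE` are hypothetical names.

```python
NEWLINE = "NL"  # placeholder for the learned newline embedding


def weave_rows(tiles, grid_x, grid_y, side):
    """Interleave per-tile tokens row by row and append a newline per row.

    tiles: grid_x * grid_y tiles in row-major order, each a side x side
    list of token rows. Returns the flat token sequence.
    """
    out = []
    for gy in range(grid_y):
        for r in range(side):
            # One stitched row takes row r from every tile in this grid row.
            for gx in range(grid_x):
                out.extend(tiles[gy * grid_x + gx][r])
            out.append(NEWLINE)
    return out
```

Each stitched row draws tokens from every tile in the same grid row, so no row can be emitted until all of those tiles have been encoded. That is consistent with the point that v1 is awkward to implement with strictly sequential per-tile encoding.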

mtmd: DeepSeek-OCR multi-tile dynamic resolution batched encoding

e1248f1

sfallah requested a review from a team as a code owner June 8, 2026 11:13

github-actions bot added the examples and python labels Jun 8, 2026

ngxson reviewed Jun 8, 2026


sfallah mentioned this pull request Jun 9, 2026

mtmd: build_vit batching #24352

Merged

ngxson mentioned this pull request Jun 9, 2026

mtmd: add batching API #24384

Draft

6 tasks

ngxson closed this Jun 9, 2026

sfallah wants to merge 1 commit into ggml-org:master from sfallah:sf/dsocr-mul-tile-batched-encode
