mtmd: DeepSeek-OCR multi-tile dynamic resolution batched encoding#24300

Closed
sfallah wants to merge 1 commit into ggml-org:master from sfallah:sf/dsocr-mul-tile-batched-encode

Conversation

sfallah commented Jun 8, 2026

Contributor

Overview

This PR adds batched image encoding for both DeepSeek-OCR (DSOCR) versions and unifies their encoding and preprocessing. It builds on the existing DeepSeek-OCR v1/v2 support.

Instead of encoding each tile iteratively, all tiles are now encoded in a single graph execution. build_vit is extended with a batch dimension.
The v1 and v2 preprocessors are merged into one multi-tile dynamic-resolution preprocessor.

Main changes

  • added batched image encoding for both DeepSeek-OCR and DeepSeek-OCR-2
  • added a batch dimension to build_vit; the batch dim is flattened everywhere except self-attention, which uses 4D [d_head, n_head, n_pos, B] Q/K/V views
  • added encode_deepseekocr: the tile grid is encoded as one batch, then the global view is encoded and appended
  • unified the v1 and v2 image preprocessors into one multi-tile dynamic-resolution preprocessor (global view + local tile grid)
  • made clip_n_output_tokens grid-aware for the tiled token counts
  • added reserve_dsocr_max_tiles to pre-reserve the worst-case tile batch at warmup
  • pinned the SAM layer-norm epsilon to 1e-6 and the v1 ViT body to 1e-5 (matches the HF reference)

Implementation notes

  • The batch dimension B = grid_x * grid_y is derived inside build_vit. The body runs on a flattened 2D [n_embd, n_pos * B] tensor; the batch only re-appears in the 4D self-attention views. For B == 1 (all non-DSOCR models and the single global view) the graph is unchanged.
  • All tiles are encoded as a single batch; the global view is encoded separately and appended, in [tiles..., global] order.
  • The SAM stage runs its layer-norms at 1e-6 while the v1 CLIP/ViT body runs at 1e-5, following the HF reference.
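
The layout described in these notes can be illustrated with plain shape arithmetic. The following is a minimal sketch, not code from this PR: `vit_batch_shapes`, `make_vit_batch_shapes`, and `flat_column` are hypothetical names, and the tile-major column order assumes the flattened tensor is a contiguous reshape of `[n_embd, n_pos, B]`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of llama.cpp or ggml.
struct vit_batch_shapes {
    std::array<int64_t, 2> body; // flattened body tensor: [n_embd, n_pos * B]
    std::array<int64_t, 4> attn; // self-attention view:   [d_head, n_head, n_pos, B]
};

// Derive both shapes from the tile grid, with B = grid_x * grid_y.
inline vit_batch_shapes make_vit_batch_shapes(int64_t n_embd, int64_t n_head, int64_t n_pos,
                                              int64_t grid_x, int64_t grid_y) {
    assert(n_head > 0 && n_embd % n_head == 0);
    const int64_t B      = grid_x * grid_y;
    const int64_t d_head = n_embd / n_head;

    vit_batch_shapes s;
    s.body = {n_embd, n_pos * B};
    s.attn = {d_head, n_head, n_pos, B};
    return s;
}

// Column of token `pos` of tile `b` in the flattened body tensor.
// Assumption: tiles are stored one after another (tile-major order).
inline int64_t flat_column(int64_t pos, int64_t b, int64_t n_pos) {
    assert(pos >= 0 && pos < n_pos && b >= 0);
    return b * n_pos + pos;
}
```

With a 1x1 grid, `B` is 1 and the attention view has the same number of elements as the unbatched tensor, which matches the note that the graph is unchanged for non-DSOCR models.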

Testing

  • extended tools/mtmd/tests/test-deepseek-ocr.py with multi-tile (dynamic-resolution) cases for both versions
  • tests.sh passes for all models (the huge variant is not tested)
  • manually verified with llama-mtmd-cli and llama-server

Caveats

  • The v1 1280 "large" single-view mode is removed. v1 now always uses a 1024 global view plus dynamic 640px tiles; the fixed 1024/1280 single-view selection is superseded by dynamic tiling.

@ngxson

  • I'm aware you might prefer not to support batched encoding because of the memory overhead. The overhead is capped for DSOCR models (worst case reserved at warmup), and all other models are unchanged (B = 1).
  • I have prepared a sequential (non-batched) alternative (see sf/deepseek-ocr-mul-tile-dyn-res) that I consider to be a hack. That is why I followed this path more seriously.
  • This can be split into a separate build_vit batching PR and this one, if you'd prefer.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

@sfallah sfallah requested a review from a team as a code owner June 8, 2026 11:13

ngxson commented Jun 8, 2026

Collaborator

unfortunately I don't think we can accept this change as-is. it's a large change and you didn't open any discussion before implementing it. as a consequence, it already conflicts with recent changes from #21858

@github-actions github-actions Bot added examples python python script changes labels Jun 8, 2026
Comment thread tools/mtmd/mtmd.cpp
//
// v1 weaves newlines onto the grid in-graph;
// v2 just concatenates the per-tile query tokens.
static bool encode_deepseekocr(clip_ctx * ctx_clip,

Collaborator

nope, it's too hacky this way

instead of hacking to benefit only deepseek-ocr, we should discuss & make a proper batching API that can benefit all models

sfallah commented Jun 8, 2026

Contributor Author

unfortunately I don't think we can accept this change as-is. it's a large change and you didn't open any discussion before implementing it. as a consequence, it already conflicts with recent changes from #21858

@ngxson
sorry, my bad!
Honestly this is my way of opening the discussion.

It wasn't my aim to actually introduce batching in build_vit. My aim was/is to finish the DeepSeek-OCR support in llama.cpp in the best possible way.

As always I will be grateful if you tell me how we can proceed.
BTW: this will most probably be the last thing that I will implement for DSOCR.

ngxson commented Jun 8, 2026

Collaborator

As always I will be grateful if you tell me how we can proceed.

the more important thing for now is to plan: coding it is easy, but it will be hard to design an API that is both (1) easy to adopt in downstream code and (2) more importantly, model-agnostic.

in theory, batching can be enabled on any multimodal encoder (image and audio input included), not just deepseek-ocr. the simplest way to batch is to allow multiple inputs of the same size to be batched along the 4th dim. dynamic resolution (i.e. 2D rope, m-rope) can also be supported with a mask, but that's quite complicated so it can be skipped for now. so point (2) is not very difficult to achieve.
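
The "same size" batching rule mentioned here can be sketched as a grouping step that runs before encoding. This is only an illustration under stated assumptions: `input_size` and `group_by_size` are hypothetical names, not mtmd or clip APIs, and the sketch leaves out the mask-based dynamic-resolution case entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical input descriptor for illustration; not an mtmd/clip type.
struct input_size {
    int width;
    int height;
};

// Group input indices by (width, height). Each group holds inputs of identical
// size, so it could be stacked along a 4th (batch) dimension and encoded once.
// Groups are returned in order of first appearance.
inline std::vector<std::vector<size_t>> group_by_size(const std::vector<input_size> & inputs) {
    std::map<std::pair<int, int>, size_t> group_of; // size -> index into `groups`
    std::vector<std::vector<size_t>> groups;

    for (size_t i = 0; i < inputs.size(); ++i) {
        assert(inputs[i].width > 0 && inputs[i].height > 0);
        const auto key = std::make_pair(inputs[i].width, inputs[i].height);
        auto it = group_of.find(key);
        if (it == group_of.end()) {
            it = group_of.emplace(key, groups.size()).first;
            groups.emplace_back();
        }
        groups[it->second].push_back(i); // keep the original index
    }
    return groups;
}
```

Keeping the original indices in each group is what would let a caller put the encoded outputs back in prompt order afterwards.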

point (1) is a different story. models can have text between 2 images, which makes it tricky to keep track of image positions. a new set of APIs (both core API and helpers) will need to be added to handle this case, but I don't have a good picture of it yet.

either way, I don't think this PR is useful as-is, as it's built mostly around the quirks of deepseek-ocr (for example, no special tokens between 2 images). if we manage to enable model-agnostic batching, then deepseek-ocr will automatically have it, as it's just a sub-case of llava-uhd.

@ngxson ngxson mentioned this pull request Jun 9, 2026
6 tasks
ngxson commented Jun 9, 2026

Collaborator

I'm replacing this one with #24384

Still WIP, but the public API already works with video input (e.g. multiple input images of the same size), so it should be trivial to adapt it to the new 4th dim in build_vit()

@ngxson ngxson closed this Jun 9, 2026
sfallah commented Jun 10, 2026

Contributor Author

@ngxson
Beyond batch encoding, this PR also included some DSOCR-specific logic.

  • added batched image encoding for both DeepSeek-OCR and DeepSeek-OCR-2
  • The v1 and v2 preprocessors are merged into one multi-tile dynamic-resolution preprocessor.
  • The SAM stage runs its layer-norms at 1e-6 while the v1 CLIP/ViT body runs at 1e-5, following the HF reference.
  • The DSOCR regression test is also extended here to include multi-tile test cases for both versions.

Beyond that, something special about DeepSeek-OCR v1 is that newline tensors are appended/concatenated per tile-grid row. This is why a proper DSOCR v1 dynamic-resolution multi-tile implementation couldn't have been done cleanly without batch encoding.
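
To illustrate the per-row newline placement described above, here is a deliberately simplified sketch. It works on one entry per tile rather than on real embeddings, and the `NEWLINE` marker and `weave_row_newlines` are hypothetical names; the actual granularity of the v1 newline tensors is not spelled out in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker standing in for the v1 newline tensor.
constexpr int NEWLINE = -1;

// Emit tile indices in row-major order and append one NEWLINE marker after
// each tile-grid row. A row can only be closed once all of its tiles exist.
inline std::vector<int> weave_row_newlines(int grid_x, int grid_y) {
    assert(grid_x > 0 && grid_y > 0);
    std::vector<int> out;
    out.reserve(static_cast<size_t>(grid_y) * static_cast<size_t>(grid_x + 1));

    for (int y = 0; y < grid_y; ++y) {
        for (int x = 0; x < grid_x; ++x) {
            out.push_back(y * grid_x + x); // tile index within the grid
        }
        out.push_back(NEWLINE); // end of this tile-grid row
    }
    return out;
}
```

For a 2x2 grid this yields tiles 0 and 1, a newline, tiles 2 and 3, and a final newline, which shows why the whole row must be available before its newline can be placed.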

I will wait for #24384 to introduce the batching API, and then adapt the DSOCR implementation according to the new API design.
