mtmd: Add DeepSeekOCR 2 Support by sfallah · Pull Request #20975 · ggml-org/llama.cpp

sfallah · 2026-03-25T07:09:38Z

Overview

This PR adds support for DeepSeek-OCR-2 deepseek-ai/DeepSeek-OCR-2.

GGUF model files for testing this PR are available at sabafallah/DeepSeek-OCR-2-GGUF.

The implementation adds the DeepSeek-OCR-2 vision path that reuses SAM encoder (from DeepSeek-OCR v1), a new Qwen2-based vision encoder, and multi-tile dynamic-resolution image preprocessing.

The preprocessing includes multi-tile dynamic-resolution, with a 1024 global view and grid of tiles with 768 tile views.
The image resize/padding behavior is Pillow-based to match the original deepseek-ai implementation.

Main changes

added DeepSeek-OCR-2 model support in mtmd
reused the SAM implementation from DeepSeek-OCR where possible
added the Qwen2 vision encoder
added multi-tile dynamic-resolution preprocessing
implemented the DeepSeek-OCR-2 attention mask handling for the Qwen2 encoder
updated mtmd_encode to handle multi-tile and global view token counts diff
added converter support for DeepSeek-OCR-2 vision weights
extended tools/mtmd/tests/test-deepseek-ocr.py to cover both DeepSeek-OCR and DeepSeek-OCR-2

Implementation notes

The SAM encoder is shared with DeepSeek-OCR through the existing build_sam.
The Qwen2 attention mask is prepared CPU-side.
The dynamic-resolution preprocessing follows the InternVL-style behavior used by the reference implementation.

Testing

Extended tools/mtmd/tests/test-deepseek-ocr.py to cover both v1 and v2.
DeepSeek-OCR-2 result: CER 0.6944 / chrF 23.01
DeepSeek-OCR-2 HF reference: CER 0.7761 / chrF 28.70
Manually verified with llama-mtmd-cli.
Manually verified with llama-server.

Caveats

The DeepSeek-OCR-2 test gate is intentionally loose (cer_tol = 0.12). The current test image is low quality, and the HF reference itself has a high CER on this sample.
The DRY sampler is used as an approximation for the HF no_repeat_ngram_size behavior.
For DeepSeek-OCR-2, llama-server currently requires --chat-template deepseek-ocr --no-jinja. This is due to the template handling for DeepSeek-OCR-2.

Running llama-server:

build/bin/llama-server \
-m gguf_models/deepseek-ai/deepseek-ocr-2-bf16.gguf \
--mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-2-bf16.gguf \
--chat-template deepseek-ocr --no-jinja \
--temp 0 \
--flash-attn off \
--no-warmup \
-n 2048 \
--dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 \
--dry-penalty-last-n -1 --dry-sequence-breaker none

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

Sign in to view

ngxson · 2026-03-25T18:53:00Z

you may also need to enable mtmd_decode_use_non_causal for this model, so that text is causal while image tokens are non-causal

ngxson · 2026-03-25T19:04:34Z

hmm, please ignore what I said earlier, it seems like their paper is poorly written and the v2 is not easier than v1.

basically it's a 3-step pipeline: SAM encode the image into embeddings, Qwen take embedding in and generate yet another embeddings, then finally they are fed into the main deepseek model

~~if my understanding is correct, that means the prompt will now be processed by vision encoder, which make things a whole lot more complicated~~

ngxson · 2026-03-25T19:14:27Z

small correction, seems like what they refers to "query" on the diagram is not the text prompts, but simply a set of pre-trained tokens (fixed size), so that won't be too complicated to implement

the only thing that need to be provided correctly to the cgraph is the attention mask that will correctly mask the image tokens as non-causal and the query as causal, it should not be complicated

and finally, because SAM is the same between 2 models it is recommended that you extract SAM as a function and inherit it in DS-OCR v2, example:

struct clip_graph_deepseekocr : clip_graph {
    clip_graph_deepseekocr(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph(ctx, img) {}
    ggml_cgraph * build() override;
    ggml_cgraph * build_sam(); // build the SAM model
};

struct clip_graph_deepseekocr2 : clip_graph_deepseekocr { // inherit
    clip_graph_deepseekocr2(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph(ctx, img) {}
    ggml_cgraph * build() override; // can directly reuse build_sam() from base class, no need to duplicate the code
};

- drop redundant ggml_cpy ops in both deepseekocr versions build - drop no-op ggml_cont in build_sam - assert num_image_tokens deepseekocr2 - view_seperator as (1, n_embd) at conversion (for both versions) - drop redundant ggml_reshape_2d

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

ngxson · 2026-05-29T13:29:33Z

@sfallah can you check locally if test-llama-archs passes? the CI fails on that test but I'm not sure if it has something to do with this PR, since the test only runs the text model

sfallah · 2026-05-29T13:46:39Z

@sfallah can you check locally if test-llama-archs passes? the CI fails on that test but I'm not sure if it has something to do with this PR, since the test only runs the text model

@ngxson

All tests pass locally.
As far as I can tell, this test is also failing on master.
DSOCR-2 is currently excluded from test-llama-archs. In fact, I have already prepared a small follow-up PR that adds DeepSeek-OCR, since you have already approved this PR.

* origin/master: vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826) graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864) server: remove obsolete scripts (ggml-org#23870) ci : update macos release to use macos-26 runner (ggml-org#23878) download: add option to skip_download (ggml-org#23059) mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975) CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530) server: bump timeout to 3600s (ggml-org#23842) model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346) llama: use f16 mask for FA to save VRAM (ggml-org#23764) sync : ggml ggml : bump version to 0.13.1 (ggml/1523) ngram-mod : Add missing include (ggml-org#23857) llama: add llm_graph_input_mtp (ggml-org#23643) app : move licences to llama-app (ggml-org#23824) cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825) meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)

* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution * introduced clip_image_f32::add_viewsep * address PR review - drop redundant ggml_cpy ops in both deepseekocr versions build - drop no-op ggml_cont in build_sam - assert num_image_tokens deepseekocr2 - view_seperator as (1, n_embd) at conversion (for both versions) - drop redundant ggml_reshape_2d * Update tools/mtmd/models/deepseekocr2.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

sfallah mentioned this pull request Mar 25, 2026

mtmd: Add DeepSeekOCR Support #17400

Merged

ngxson reviewed Mar 25, 2026

View reviewed changes

Comment thread tools/mtmd/models/deepseekocr2.cpp

This comment was marked as outdated.

Sign in to view

sfallah mentioned this pull request May 19, 2026

mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor #23345

Merged

sfallah force-pushed the sf/deepseek-ocr-2 branch 2 times, most recently from 206af81 to 4fc448c Compare May 22, 2026 18:32

github-actions Bot added examples python python script changes labels May 22, 2026

sfallah force-pushed the sf/deepseek-ocr-2 branch 2 times, most recently from f0cb3bb to 11ec07a Compare May 27, 2026 07:26

sfallah marked this pull request as ready for review May 27, 2026 09:01

sfallah requested review from a team and CISC as code owners May 27, 2026 09:01

sfallah requested a review from ngxson May 27, 2026 11:26

ngxson reviewed May 27, 2026

View reviewed changes

Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated

CISC reviewed May 27, 2026

View reviewed changes

Comment thread conversion/base.py Outdated

mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

851e55e

sfallah force-pushed the sf/deepseek-ocr-2 branch from 1b7bba1 to 851e55e Compare May 28, 2026 14:59

ngxson reviewed May 28, 2026

View reviewed changes

Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated

ngxson reviewed May 28, 2026

View reviewed changes

Comment thread tools/mtmd/clip.cpp Outdated

ngxson mentioned this pull request May 28, 2026

model: Granite4 Vision #23545

Merged

4 tasks

introduced clip_image_f32::add_viewsep

19a16fc

CISC reviewed May 28, 2026

View reviewed changes

Comment thread conversion/deepseek.py

ngxson reviewed May 28, 2026

View reviewed changes

Comment thread tools/mtmd/models/deepseekocr2.cpp

Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated

Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated

address PR review

6613331

- drop redundant ggml_cpy ops in both deepseekocr versions build - drop no-op ggml_cont in build_sam - assert num_image_tokens deepseekocr2 - view_seperator as (1, n_embd) at conversion (for both versions) - drop redundant ggml_reshape_2d

sfallah force-pushed the sf/deepseek-ocr-2 branch from 71c4b0c to 6613331 Compare May 29, 2026 07:24

sfallah requested review from CISC and ngxson May 29, 2026 07:30

ngxson approved these changes May 29, 2026

View reviewed changes

Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated

Update tools/mtmd/models/deepseekocr2.cpp

f3b4ca9

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

CISC approved these changes May 29, 2026

View reviewed changes

ngxson merged commit da3f990 into ggml-org:master May 29, 2026
16 of 30 checks passed

Schopenhauer-loves-Hegel mentioned this pull request Jun 4, 2026

Add DeepSeek-OCR-2 model support ollama/ollama#16503

Open

6 tasks

Milor123 mentioned this pull request Jun 6, 2026

I integrated DeepSeek-OCR-2 via GGUF + llama.cpp — 3 hours → 17 minutes oomol-lab/pdf-craft#365

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mtmd: Add DeepSeekOCR 2 Support#20975

mtmd: Add DeepSeekOCR 2 Support#20975
ngxson merged 4 commits into
ggml-org:masterfrom
sfallah:sf/deepseek-ocr-2

sfallah commented Mar 25, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

ngxson commented Mar 25, 2026

Uh oh!

ngxson commented Mar 25, 2026 •

edited

Loading

Uh oh!

ngxson commented Mar 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ngxson commented May 29, 2026

Uh oh!

sfallah commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sfallah commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Main changes

Implementation notes

Testing

Caveats

Requirements

Uh oh!

This comment was marked as outdated.

Uh oh!

ngxson commented Mar 25, 2026

Uh oh!

ngxson commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Mar 25, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ngxson commented May 29, 2026

Uh oh!

sfallah commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sfallah commented Mar 25, 2026 •

edited

Loading

ngxson commented Mar 25, 2026 •

edited

Loading