mtmd: Add DeepSeekOCR 2 Support #20975

Merged
ngxson merged 4 commits into ggml-org:master from sfallah:sf/deepseek-ocr-2 on May 29, 2026

@sfallah sfallah commented Mar 25, 2026

Overview

This PR adds support for DeepSeek-OCR-2 (deepseek-ai/DeepSeek-OCR-2).

GGUF model files for testing this PR are available at sabafallah/DeepSeek-OCR-2-GGUF.

The implementation adds the DeepSeek-OCR-2 vision path, which combines the SAM encoder reused from DeepSeek-OCR v1, a new Qwen2-based vision encoder, and multi-tile dynamic-resolution image preprocessing.

The dynamic-resolution preprocessing produces a 1024 global view plus a grid of 768 tile views.
The image resize/padding behavior is Pillow-based to match the original deepseek-ai implementation.
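
The grid selection behind this kind of preprocessing can be sketched as follows. This is a minimal illustration, not the code from this PR: `pick_tile_grid` and its `min_tiles`/`max_tiles` limits are hypothetical, and the tie-break is simplified to "first candidate wins", whereas InternVL-style implementations also take the image area into account.

```cpp
#include <cmath>
#include <limits>
#include <utility>

// Hypothetical helper (not from the PR): choose a (cols, rows) tile grid whose
// aspect ratio is closest to the image's. Assumes width and height are positive.
std::pair<int, int> pick_tile_grid(int width, int height, int min_tiles, int max_tiles) {
    const double aspect = static_cast<double>(width) / height;
    std::pair<int, int> best{1, 1};
    double best_diff = std::numeric_limits<double>::infinity();
    for (int n = min_tiles; n <= max_tiles; ++n) {
        for (int cols = 1; cols <= n; ++cols) {
            if (n % cols != 0) continue;            // only exact cols x rows grids
            const int rows = n / cols;
            const double diff = std::fabs(aspect - static_cast<double>(cols) / rows);
            if (diff < best_diff) {                 // strict '<': earlier (smaller) grid wins ties
                best_diff = diff;
                best = {cols, rows};
            }
        }
    }
    return best;
}
```

With assumed limits of 2 to 6 tiles, a 1536x768 image maps to a 2x1 grid and a square image to 2x2; each tile would then be resized to the tile resolution and encoded alongside the global view.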

Main changes

  • added DeepSeek-OCR-2 model support in mtmd
  • reused the SAM implementation from DeepSeek-OCR where possible
  • added the Qwen2 vision encoder
  • added multi-tile dynamic-resolution preprocessing
  • implemented the DeepSeek-OCR-2 attention mask handling for the Qwen2 encoder
  • updated mtmd_encode to handle the differing token counts of tile views and the global view
  • added converter support for DeepSeek-OCR-2 vision weights
  • extended tools/mtmd/tests/test-deepseek-ocr.py to cover both DeepSeek-OCR and DeepSeek-OCR-2

Implementation notes

  • The SAM encoder is shared with DeepSeek-OCR through the existing build_sam.
  • The Qwen2 attention mask is prepared CPU-side.
  • The dynamic-resolution preprocessing follows the InternVL-style behavior used by the reference implementation.
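
A CPU-side mask of the kind mentioned above might look like the sketch below. It rests on assumptions that are not confirmed by this PR: the sequence is laid out as image tokens followed by a fixed set of query tokens, image tokens attend to each other bidirectionally and never to queries, queries attend to all image tokens and causally to each other, and the mask is additive (0 = visible, -inf = masked). `build_mixed_mask` is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: row-major (n x n) additive attention mask, n = n_img + n_query.
// Rows are attending positions, columns are attended positions.
std::vector<float> build_mixed_mask(std::size_t n_img, std::size_t n_query) {
    const std::size_t n = n_img + n_query;
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(n * n, neg_inf);       // start fully masked
    for (std::size_t q = 0; q < n; ++q) {
        for (std::size_t k = 0; k < n; ++k) {
            const bool key_is_image = k < n_img;
            const bool row_is_query = q >= n_img;
            // Image keys are visible to every row (non-causal part);
            // query keys are visible only to query rows at or after them (causal part).
            if (key_is_image || (row_is_query && k <= q)) {
                mask[q * n + k] = 0.0f;
            }
        }
    }
    return mask;
}
```

The resulting buffer would be copied into the graph's mask input tensor before compute; how the PR actually lays out and uploads the mask should be checked against tools/mtmd/models/deepseekocr2.cpp.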

Testing

  • Extended tools/mtmd/tests/test-deepseek-ocr.py to cover both v1 and v2.
  • DeepSeek-OCR-2 result: CER 0.6944 / chrF 23.01
  • DeepSeek-OCR-2 HF reference: CER 0.7761 / chrF 28.70
  • Manually verified with llama-mtmd-cli.
  • Manually verified with llama-server.
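
For readers unfamiliar with the CER metric quoted above (lower is better), here is a minimal sketch of the usual definition: Levenshtein edit distance divided by the reference length. It is not the code from test-deepseek-ocr.py; the function names are illustrative, and it compares bytes rather than Unicode code points, which a real OCR test would likely need to handle.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance over bytes, using two rolling DP rows.
std::size_t edit_distance(const std::string & a, const std::string & b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;   // "" -> b[0..j)
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;                                            // a[0..i) -> ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Character error rate: edits needed to turn the hypothesis into the reference,
// normalized by the reference length. Can exceed 1.0 for long hypotheses.
double cer(const std::string & reference, const std::string & hypothesis) {
    if (reference.empty()) return hypothesis.empty() ? 0.0 : 1.0;
    return static_cast<double>(edit_distance(reference, hypothesis)) / reference.size();
}
```

For example, `cer("kitten", "sitting")` is 3 edits over 6 reference characters, i.e. 0.5.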

Caveats

  • The DeepSeek-OCR-2 test gate is intentionally loose (cer_tol = 0.12). The current test image is low quality, and the HF reference itself has a high CER on this sample.

  • The DRY sampler is used as an approximation for the HF no_repeat_ngram_size behavior.

  • For DeepSeek-OCR-2, llama-server currently requires --chat-template deepseek-ocr --no-jinja. This is due to the template handling for DeepSeek-OCR-2.

  • Running llama-server:

    build/bin/llama-server \
    -m gguf_models/deepseek-ai/deepseek-ocr-2-bf16.gguf \
    --mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-2-bf16.gguf \
    --chat-template deepseek-ocr --no-jinja \
    --temp 0 \
    --flash-attn off \
    --no-warmup \
    -n 2048 \
    --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 \
    --dry-penalty-last-n -1 --dry-sequence-breaker none
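
For context on the DRY caveat above, the hard constraint that `no_repeat_ngram_size` imposes can be sketched as follows: any token that would complete an n-gram already present in the history is banned outright. This is an illustrative reimplementation of the general idea, not HF or llama.cpp code, and `banned_next_tokens` is a hypothetical name. DRY instead applies a graded penalty to repeated sequences, which is why it is only an approximation.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch: tokens that may not come next if no n-gram is allowed to repeat.
std::set<int> banned_next_tokens(const std::vector<int> & tokens, std::size_t n) {
    std::set<int> banned;
    if (n == 0 || tokens.size() < n) return banned;   // no complete n-gram in history yet
    const std::size_t k = n - 1;                      // length of the prefix to match
    const std::size_t m = tokens.size();
    for (std::size_t i = 0; i + n <= m; ++i) {
        // Does the n-gram starting at i begin with the last k tokens of the history?
        if (std::equal(tokens.begin() + i, tokens.begin() + i + k, tokens.end() - k)) {
            banned.insert(tokens[i + k]);             // its final token would repeat the n-gram
        }
    }
    return banned;
}
```

With history `{1, 2, 3, 1, 2}` and `n = 3`, the trailing prefix `(1, 2)` was previously followed by `3`, so `3` is the only banned token.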

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - I used AI assistance for code review, debugging, implementation checks, and testing. I have reviewed the submitted changes and take responsibility for the full contents of this PR.

ngxson commented Mar 25, 2026

you may also need to enable mtmd_decode_use_non_causal for this model, so that text is causal while image tokens are non-causal

ngxson commented Mar 25, 2026

hmm, please ignore what I said earlier, it seems like their paper is poorly written and the v2 is not easier than v1.

image

basically it's a 3-step pipeline: SAM encodes the image into embeddings, Qwen takes those embeddings and generates another set of embeddings, and finally these are fed into the main deepseek model

if my understanding is correct, that means the prompt will now be processed by the vision encoder, which makes things a whole lot more complicated

ngxson commented Mar 25, 2026

small correction, it seems like what they refer to as "query" on the diagram is not the text prompt, but simply a fixed-size set of pre-trained tokens, so that won't be too complicated to implement

the only thing that needs to be provided correctly to the cgraph is the attention mask, which must mask the image tokens as non-causal and the query tokens as causal; it should not be complicated

and finally, because SAM is the same between the 2 models, it is recommended that you extract SAM as a function and inherit it in DS-OCR v2, example:

struct clip_graph_deepseekocr : clip_graph {
    clip_graph_deepseekocr(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph(ctx, img) {}
    ggml_cgraph * build() override;
    ggml_cgraph * build_sam(); // build the SAM model
};

struct clip_graph_deepseekocr2 : clip_graph_deepseekocr { // inherit
    clip_graph_deepseekocr2(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph_deepseekocr(ctx, img) {} // must initialize the direct base class
    ggml_cgraph * build() override; // can directly reuse build_sam() from base class, no need to duplicate the code
};

@sfallah sfallah force-pushed the sf/deepseek-ocr-2 branch 2 times, most recently from 206af81 to 4fc448c Compare May 22, 2026 18:32
@github-actions github-actions Bot added the examples and python (python script changes) labels May 22, 2026
@sfallah sfallah force-pushed the sf/deepseek-ocr-2 branch 2 times, most recently from f0cb3bb to 11ec07a Compare May 27, 2026 07:26
@sfallah sfallah marked this pull request as ready for review May 27, 2026 09:01
@sfallah sfallah requested review from a team and CISC as code owners May 27, 2026 09:01
@sfallah sfallah requested a review from ngxson May 27, 2026 11:26
Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated
Comment thread conversion/base.py Outdated
@sfallah sfallah force-pushed the sf/deepseek-ocr-2 branch from 1b7bba1 to 851e55e Compare May 28, 2026 14:59
Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated
Comment thread tools/mtmd/clip.cpp Outdated
@ngxson ngxson mentioned this pull request May 28, 2026
Comment thread conversion/deepseek.py
Comment thread tools/mtmd/models/deepseekocr2.cpp
Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated
Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated
- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d
@sfallah sfallah force-pushed the sf/deepseek-ocr-2 branch from 71c4b0c to 6613331 Compare May 29, 2026 07:24
@sfallah sfallah requested review from CISC and ngxson May 29, 2026 07:30
Comment thread tools/mtmd/models/deepseekocr2.cpp Outdated
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
ngxson commented May 29, 2026

@sfallah can you check locally if test-llama-archs passes? the CI fails on that test but I'm not sure if it has something to do with this PR, since the test only runs the text model

sfallah commented May 29, 2026

> @sfallah can you check locally if test-llama-archs passes? the CI fails on that test but I'm not sure if it has something to do with this PR, since the test only runs the text model

@ngxson

All tests pass locally.
As far as I can tell, this test is also failing on master.
DSOCR-2 is currently excluded from test-llama-archs. In fact, I have already prepared a small follow-up PR that adds DeepSeek-OCR, since you have already approved this PR.

@ngxson ngxson merged commit da3f990 into ggml-org:master May 29, 2026
16 of 30 checks passed
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

* introduced clip_image_f32::add_viewsep

* address PR review

- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d

* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>