mtmd: Add DeepSeekOCR 2 Support #20975
you may also need to enable
small correction: it seems that what they refer to as "query" in the diagram is not the text prompt, but simply a fixed-size set of pre-trained tokens, so that won't be too complicated to implement.
the only thing that needs to be provided correctly to the cgraph is the attention mask, which must mask the image tokens as non-causal and the query as causal. it should not be complicated.
and finally, because SAM is the same between the 2 models, it is recommended that you extract SAM as a function and inherit it in DS-OCR v2. example:
struct clip_graph_deepseekocr : clip_graph {
    clip_graph_deepseekocr(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph(ctx, img) {}
    ggml_cgraph * build() override;
    ggml_cgraph * build_sam(); // build the SAM model
};
struct clip_graph_deepseekocr2 : clip_graph_deepseekocr { // inherit
    // a constructor can only initialize its direct base, so forward to clip_graph_deepseekocr
    clip_graph_deepseekocr2(clip_ctx * ctx, const clip_image_f32 & img) : clip_graph_deepseekocr(ctx, img) {}
    ggml_cgraph * build() override; // can directly reuse build_sam() from base class, no need to duplicate the code
};
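To make the masking requirement from the comment above concrete, here is a minimal sketch of such a mixed mask. It is not code from this PR: `build_mixed_mask` is a hypothetical helper, and the layout (image tokens first, then query tokens, one row per attending token) plus the rule that image tokens never attend to query tokens are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of llama.cpp.
// Builds an additive attention mask for a sequence laid out as
// [n_img image tokens | n_query query tokens] (assumed layout).
// mask[i * n + j] is 0.0f when token i may attend to token j,
// and -INFINITY when that pair is masked out.
std::vector<float> build_mixed_mask(size_t n_img, size_t n_query) {
    const size_t n = n_img + n_query;
    std::vector<float> mask(n * n, -INFINITY);
    for (size_t i = 0; i < n; ++i) {
        const bool row_is_img = i < n_img;
        for (size_t j = 0; j < n; ++j) {
            const bool col_is_img = j < n_img;
            bool allowed;
            if (row_is_img) {
                // image tokens: full (non-causal) attention among image tokens
                allowed = col_is_img;
            } else {
                // query tokens: see every image token, and queries causally
                allowed = col_is_img || j <= i;
            }
            if (allowed) {
                mask[i * n + j] = 0.0f;
            }
        }
    }
    return mask;
}
```

With `n_img = 2` and `n_query = 2`, the first query row can see both image tokens and itself but not the second query token. In a real graph the buffer would have to be copied into whatever mask tensor the attention op expects, which this sketch does not cover.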
- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
@sfallah can you check locally if
All tests pass locally.
* origin/master:
  vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
  graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
  server: remove obsolete scripts (ggml-org#23870)
  ci : update macos release to use macos-26 runner (ggml-org#23878)
  download: add option to skip_download (ggml-org#23059)
  mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
  CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
  server: bump timeout to 3600s (ggml-org#23842)
  model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
  llama: use f16 mask for FA to save VRAM (ggml-org#23764)
  sync : ggml
  ggml : bump version to 0.13.1 (ggml/1523)
  ngram-mod : Add missing include (ggml-org#23857)
  llama: add llm_graph_input_mtp (ggml-org#23643)
  app : move licences to llama-app (ggml-org#23824)
  cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
  meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution
* introduced clip_image_f32::add_viewsep
* address PR review
  - drop redundant ggml_cpy ops in both deepseekocr versions build
  - drop no-op ggml_cont in build_sam
  - assert num_image_tokens deepseekocr2
  - view_seperator as (1, n_embd) at conversion (for both versions)
  - drop redundant ggml_reshape_2d
* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

Overview
This PR adds support for DeepSeek-OCR-2 (deepseek-ai/DeepSeek-OCR-2).
GGUF model files for testing this PR are available at sabafallah/DeepSeek-OCR-2-GGUF.
The implementation adds the DeepSeek-OCR-2 vision path, which reuses the SAM encoder from DeepSeek-OCR v1 and adds a new Qwen2-based vision encoder and multi-tile dynamic-resolution image preprocessing.
The preprocessing produces a 1024 global view plus a grid of 768 tile views.
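As a rough illustration of how a tile grid could be chosen for a given image, the sketch below picks the grid whose aspect ratio is closest to the image's. This is a common approach in dynamic-resolution vision preprocessing, but whether this PR selects its grid exactly this way is an assumption; `pick_tile_grid`, `min_tiles` and `max_tiles` are hypothetical names that do not come from the PR.

```cpp
#include <cmath>
#include <utility>

// Hypothetical helper, not taken from the PR.
// Returns (cols, rows) of the tile grid whose aspect ratio is closest to the
// image's, among all grids with min_tiles <= cols * rows <= max_tiles.
// Ties keep the first candidate found; returns (1, 1) if no grid qualifies.
std::pair<int, int> pick_tile_grid(int img_w, int img_h, int min_tiles, int max_tiles) {
    const double aspect = static_cast<double>(img_w) / img_h;
    std::pair<int, int> best{1, 1};
    double best_diff = INFINITY;
    for (int cols = 1; cols <= max_tiles; ++cols) {
        for (int rows = 1; rows <= max_tiles; ++rows) {
            const int n = cols * rows;
            if (n < min_tiles || n > max_tiles) {
                continue; // grid has too few or too many tiles
            }
            const double diff = std::fabs(aspect - static_cast<double>(cols) / rows);
            if (diff < best_diff) {
                best_diff = diff;
                best = {cols, rows};
            }
        }
    }
    return best;
}
```

For example, a 2000x1000 image with a limit of 2 to 6 tiles yields a 2x1 grid. Each tile of that grid would then be rendered at the 768 tile size, alongside the single 1024 global view.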
The image resize/padding behavior is Pillow-based to match the original deepseek-ai implementation.
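A minimal sketch of the resize-then-pad geometry follows: scale the image to fit inside a square target while keeping its aspect ratio, then center it. It is written in the spirit of Pillow's pad-style helpers but is not a bit-exact reproduction of Pillow (rounding in particular may differ), and `pad_plan` and `plan_resize_and_pad` are hypothetical names, not identifiers from the PR. A square target is also an assumption.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical types and helper for illustration only.
struct pad_plan {
    int new_w; // width of the resized image
    int new_h; // height of the resized image
    int off_x; // left offset of the resized image inside the target
    int off_y; // top offset of the resized image inside the target
};

// Fits a src_w x src_h image inside a target x target square,
// preserving aspect ratio and centering the result.
pad_plan plan_resize_and_pad(int src_w, int src_h, int target) {
    pad_plan p{target, target, 0, 0};
    if (src_w > src_h) {
        // wide image: width fills the target, height shrinks
        p.new_h = static_cast<int>(std::lround(static_cast<double>(src_h) / src_w * target));
    } else if (src_h > src_w) {
        // tall image: height fills the target, width shrinks
        p.new_w = static_cast<int>(std::lround(static_cast<double>(src_w) / src_h * target));
    }
    p.new_w = std::max(1, p.new_w); // never collapse to zero pixels
    p.new_h = std::max(1, p.new_h);
    p.off_x = (target - p.new_w) / 2;
    p.off_y = (target - p.new_h) / 2;
    return p;
}
```

A 2048x1024 image fitted to a 1024 target is resized to 1024x512 and placed 256 pixels from the top, with the remaining rows filled by padding.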
Main changes
- mtmd_encode to handle multi-tile and global view token counts diff
- tools/mtmd/tests/test-deepseek-ocr.py to cover both DeepSeek-OCR and DeepSeek-OCR-2
Implementation notes
- build_sam.
Testing
- tools/mtmd/tests/test-deepseek-ocr.py to cover both v1 and v2.
- llama-mtmd-cli.
- llama-server.
Caveats
- The DeepSeek-OCR-2 test gate is intentionally loose (cer_tol = 0.12). The current test image is low quality, and the HF reference itself has a high CER on this sample.
- The DRY sampler is used as an approximation for the HF no_repeat_ngram_size behavior.
- For DeepSeek-OCR-2, llama-server currently requires --chat-template deepseek-ocr --no-jinja. This is due to the template handling for DeepSeek-OCR-2.
Running llama-server:
Requirements