model: Add PaddleOCR-VL model support #18825

Merged: ngxson merged 20 commits into ggml-org:master from megemini:paddleocr-vl on Feb 19, 2026
Conversation

@megemini (Contributor) commented Jan 14, 2026

Add PaddleOCR-VL model support.

Tested with some images:

  1. A receipt (image: 1640.jpeg)

with command:

./build/bin/llama-cli -m /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
  --mmproj /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
  --color on \
  --image /home/shun/Pictures/1640.jpeg \
  --prompt "OCR:"
[screenshot: OCR output]
  2. A table (image: paddleocr.jpg)

with command:

./build/bin/llama-cli -m /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
  --mmproj /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
  --color on \
  --image /home/shun/Pictures/paddleocr.jpg \
  --prompt "Table Recognition:"
[screenshot: table recognition output]

The output can be formatted:

[screenshot: formatted table]

p.s. Thanks to @ngxson #16701

Update

The model was converted with these commands:

python3 convert_hf_to_gguf.py /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL \
  --outfile /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
  --outtype bf16 \
  --verbose
python3 convert_hf_to_gguf.py /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL \
  --mmproj \
  --outfile /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
  --outtype bf16 \
  --verbose

Or it can be downloaded from: https://modelscope.cn/models/megemini/PaddleOCR-VL-GGUF/summary

@ngxson (Collaborator) left a comment:

this PR looks a bit suspicious, please explicitly specify if you're using AI to generate it

Comment on lines 287 to 288:

    MAX_PIXELS = "clip.vision.max_pixels"
    MIN_PIXELS = "clip.vision.min_pixels"
Collaborator:

Please use the correct naming from #18719

Contributor Author:

done ~

Collaborator:

what's the difference between this and qwen2vl.cpp? seems like 100% identical

@megemini (Contributor Author)

this PR looks a bit suspicious, please explicitly specify if you're using AI to generate it

@ngxson

First, thanks for your review. But no, absolutely not!

The initial commit was based on your PR: #16701.

However, as you mentioned, the model generated hallucinated text, likely because the projector was incorrect.

Therefore, I tried to compare the code with the implementation from huggingface/transformers#42178, which adds PaddleOCR-VL support to transformers.

To be frank, I’ve only been working with the llama.cpp source code for about two weeks, so I DO USE AI to help me understand the code. However, since llama.cpp is a C++ project, the AI isn't very helpful, so I'm forced to work on it all by myself!

PaddleOCR-VL is almost identical to ERNIE4.5, but there are two main differences from your PR:

  • PaddleOCR-VL uses mrope instead of rope, which requires rope_sections.
  • The positions tensor is different from ERNIE4.5 or Qwen2VL.

llama.cpp/tools/mtmd/clip.cpp, lines 3529 to 3550 at 9d5a701:

    case PROJECTOR_TYPE_PADDLEOCR:
        {
            const int merge_ratio = hparams.n_merge;
            const int pw = image_size_width  / patch_size;
            const int ph = image_size_height / patch_size;
            // four M-RoPE position rows per patch: (y, x, y, x)
            std::vector<int> positions(n_pos * 4);
            int ptr = 0;
            // walk the patch grid in merge_ratio-sized steps; the dy/dx
            // inner loops cover each step, so patches are visited in raster order
            for (int y = 0; y < ph; y += merge_ratio) {
                for (int dy = 0; dy < 2; dy++) {
                    for (int x = 0; x < pw; x += merge_ratio) {
                        for (int dx = 0; dx < 2; dx++) {
                            positions[                  ptr] = y + dy;
                            positions[    num_patches + ptr] = x + dx;
                            positions[2 * num_patches + ptr] = y + dy;
                            positions[3 * num_patches + ptr] = x + dx;
                            ptr++;
                        }
                    }
                }
            }
            set_input_i32("positions", positions);
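For example, with pw = 4 and merge_ratio = 2, the patches are visited in plain raster order: (0,0), (0,1), (0,2), (0,3), (1,0), and so on, and each patch contributes its (y, x, y, x) coordinates to the four position rows that M-RoPE consumes.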

Additionally, ERNIE4.5 had a minor issue: add_prefix_space from tokenizer_config.json needs to be honored during conversion.

    def set_vocab(self):
        self._set_vocab_sentencepiece()

        tokenizer_config_file = self.dir_model / 'tokenizer_config.json'
        if tokenizer_config_file.is_file():
            with open(tokenizer_config_file, "r", encoding="utf-8") as f:
                tokenizer_config_json = json.load(f)
                if "add_prefix_space" in tokenizer_config_json:
                    self.gguf_writer.add_add_space_prefix(tokenizer_config_json["add_prefix_space"])

Otherwise, the PaddleOCR-VL prompt is changed from OCR: to _OCR: (with a leading space), which makes PaddleOCR-VL error-prone: it was trained on OCR:, and the generative model is only 0.3B, which may not be smart enough to recover.

what's the difference between this and qwen2vl.cpp? seems like 100% identical

The src/models/paddleocr.cpp is based on src/models/ernie4-5.cpp, but uses ggml_rope_multi instead of ggml_rope_ext.
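For reference, the call-site difference looks roughly like this (a sketch following llama.cpp conventions; the exact arguments in the PR may differ, and `sections` here stands in for the converted rope_sections hparam):

    // ernie4-5.cpp style: standard rotary embedding
    Qcur = ggml_rope_ext(
            ctx0, Qcur, inp_pos, rope_factors,
            n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
            ext_factor, attn_factor, beta_fast, beta_slow);

    // paddleocr.cpp style: multi-section rotary embedding (M-RoPE);
    // `sections` splits the rotary dimensions across the temporal/height/
    // width position rows, which is why rope_sections must be converted
    int sections[4] = { /* hparams.rope_sections */ };
    Qcur = ggml_rope_multi(
            ctx0, Qcur, inp_pos, rope_factors,
            n_rot, sections, GGML_ROPE_TYPE_MROPE, n_ctx_orig, freq_base, freq_scale,
            ext_factor, attn_factor, beta_fast, beta_slow);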

src/models/paddleocr.cpp, src/models/ernie4-5.cpp, and src/models/qwen2vl.cpp are almost identical, except for the attention output bias:

    // ernie4-5.cpp / paddleocr.cpp: no attention output bias
    cur = build_attn(inp_attn,
            model.layers[il].wo, NULL,
            Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f / sqrtf(float(n_embd_head)), il);

    // qwen2vl.cpp: passes the attention output bias
    cur = build_attn(inp_attn,
            model.layers[il].wo, model.layers[il].bo,
            Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f / sqrtf(float(n_embd_head)), il);

ERNIE4.5 and PaddleOCR-VL do not require model.layers[il].bo, although the two calls behave the same when model.layers[il].bo is NULL.

I don't get why you need to make this overcomplicated; what's wrong with my code in #16701, where I just reuse the same Ernie4_5Model arch?

As I mentioned above, PaddleOCR-VL requires mrope and rope_sections:

llama.cpp/src/llama-model.cpp, lines 8318 to 8320 at 9d5a701:

    case LLM_ARCH_QWEN2VL:
    case LLM_ARCH_PADDLEOCR:
        return LLAMA_ROPE_TYPE_MROPE;

@ngxson (Collaborator) commented Jan 15, 2026

Thanks for the explanation, that's useful. I think it should be put into the PR description to make reviewing easier.

Indeed, it seems like the model is just qwen2vl under the hood, minus a small difference in the tokenizer. At least for the language model, it can just reuse the same qwen2vl arch if the bo tensor is not present in the weights.
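A minimal sketch of that reuse, assuming llama.cpp's optional-tensor convention (illustrative, not the final diff):

    // in the qwen2vl tensor-loading path: mark the attention output bias
    // as optional, so checkpoints without it (ERNIE4.5 / PaddleOCR-VL style)
    // still load; build_attn already tolerates a null bias
    layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i),
                             {n_embd}, TENSOR_NOT_REQUIRED);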

I'll come back to this after #18719 is merged.

@ngxson ngxson self-assigned this Jan 15, 2026
@megemini (Contributor Author)

Update 2026-01-29

PaddleOCR-VL-1.5 released.

Model exported: https://modelscope.cn/models/megemini/PaddleOCR-VL-1.5-GGUF/summary

Tested with a new type of image:

(image: a seal)

Command and result:

[screenshot: command and recognition output]

Compared with the ground truth:

[screenshot: ground truth]

@ngxson (Collaborator) commented Feb 7, 2026

The chat template is broken; it should be fixed via #19419.

@IIIIIllllIIIIIlllll

I encountered an error after building with the latest code. Could you please tell me how to resolve it? Thank you very much.

2026-02-08 16:20:05.648 - warmup: flash attention is enabled
2026-02-08 16:20:05.648 - srv    load_model: loaded multimodal model, '/home/mark/Models/Test/PaddleOCR-VL-1.5-BF16/mmproj.gguf'
2026-02-08 16:20:05.648 - srv    load_model: initializing slots, n_slots = 1
2026-02-08 16:20:05.660 - no implementations specified for speculative decoding
2026-02-08 16:20:05.660 - slot   load_model: id  0 | task -1 | speculative decoding context not initialized
2026-02-08 16:20:05.660 - slot   load_model: id  0 | task -1 | new slot, n_ctx = 32768
2026-02-08 16:20:05.660 - srv    load_model: prompt cache is enabled, size limit: 8192 MiB
2026-02-08 16:20:05.660 - srv    load_model: use `--cache-ram 0` to disable the prompt cache
2026-02-08 16:20:05.660 - srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
2026-02-08 16:20:05.661 - srv          init: init: chat template parsing error: 
2026-02-08 16:20:05.661 - ------------
2026-02-08 16:20:05.661 - While executing For at line 34, column 13 in source:
2026-02-08 16:20:05.661 - ...age["role"] == "system" -%}↵        {%- for content in message["content"] -%}↵  ...
2026-02-08 16:20:05.661 -                                            ^
2026-02-08 16:20:05.661 - Error: Expected iterable or object type in for loop: got String
2026-02-08 16:20:05.661 - srv          init: init: please consider disabling jinja via --no-jinja, or use a custom chat template via --chat-template
2026-02-08 16:20:05.661 - srv          init: init: for example: --no-jinja --chat-template chatml
2026-02-08 16:20:05.661 - srv    operator(): operator(): cleaning up before exit...
2026-02-08 16:20:05.662 - main: exiting due to model loading error

@megemini (Contributor Author) commented Feb 8, 2026

I encountered an error after building with the latest code. Could you please tell me how to resolve it? Thank you very much.

Wait until #19419 is merged, or use the template below:

{%- if not add_generation_prompt is defined -%}
    {%- set add_generation_prompt = true -%}
{%- endif -%}
{%- if not cls_token is defined -%}
    {%- set cls_token = "<|begin_of_sentence|>" -%}
{%- endif -%}
{%- if not eos_token is defined -%}
    {%- set eos_token = "</s>" -%}
{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
    {%- if message["role"] == "user" -%}
        {{- "User: " -}}

      {%- if message["content"] is string -%}
        {{- message["content"] }}
      {%- else -%}

        {%- for content in message["content"] -%}
            {%- if content["type"] == "image" -%}
                {{ "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" }}
            {%- endif -%}
        {%- endfor -%}
        {%- for content in message["content"] -%}
            {%- if content["type"] == "text" -%}
                {{ content["text"] }}
            {%- endif -%}
        {%- endfor -%}

      {%- endif -%}


        {{ "\n" -}}
    {%- elif message["role"] == "assistant" -%}
        {{- "Assistant: " -}}

      {%- if message["content"] is string -%}
        {{- message["content"] }}
      {%- else -%}

        {%- for content in message["content"] -%}
            {%- if content["type"] == "text" -%}
                {{ content["text"] }}
            {%- endif -%}
        {%- endfor -%}

      {%- endif -%}


        {{ eos_token -}}
    {%- elif message["role"] == "system" -%}

      {%- if message["content"] is string -%}
        {{- message["content"] }}
      {%- else -%}

        {%- for content in message["content"] -%}
            {%- if content["type"] == "text" -%}
                {{ content["text"] + "\n" }}
            {%- endif -%}
        {%- endfor -%}

      {%- endif -%}

    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{- "Assistant: " -}}
{%- endif -%}

with command:

./build/bin/llama-cli -m /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
  --mmproj /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
  --color on \
  --image /home/shun/Pictures/1640.jpeg \
  --prompt "OCR:" \
  --chat-template-file /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL/chat_template_llama.jinja

@zhang-prog

@ngxson Is there any documentation for specific models in llama.cpp (like the PaddleOCR-VL vLLM recipes)? If there is, we'll document how to use the model there; otherwise, we'll put it in the PaddleOCR-VL docs.

@zhang-prog

@ngxson @megemini
We have added documentation for llama-server to the PaddleOCR-VL documentation.

[image: documentation screenshot]

@lilydjwg

@zhang-prog I have a simple question: to parse an image from start to result, is CUDA still needed to run the whole process, given that llama.cpp has a Vulkan backend?

@ngxson (Collaborator) commented Feb 10, 2026

The chat template fix has just been merged; I'll give it a try soon. Hopefully things work this time and I'll be able to merge this.

@ro99 commented Feb 19, 2026

Hello @zhang-prog @ngxson, I can see that PP-DocLayoutV3 is an intermediate step in the inference pipeline.

Will PP-DocLayoutV3 work with this, or does it have to be executed separately (e.g. with transformers)?

megemini and others added 2 commits February 19, 2026 19:59
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@megemini (Contributor Author)

Hello @zhang-prog @ngxson, I can see that PP-DocLayoutV3 is an intermediate step in the inference pipeline.

Will PP-DocLayoutV3 work with this, or does it have to be executed separately (e.g. with transformers)?

PP-DocLayoutV3 is not included in llama.cpp.

@ngxson ngxson merged commit 237958d into ggml-org:master Feb 19, 2026
78 of 79 checks passed
@jiabochao

the result:

[screenshot] can be formatted: [screenshot]

@megemini Hi, I have a question: how is the LLM Spotting format converted to this document? Are there any libraries or tools that can convert this format to DOCX or Markdown?

@IIIIIllllIIIIIlllll

I'm encountering this error when using the latest version (b8115):

    tokenize: error: number of bitmaps (1) does not match number of markers (0)

It seems to work only with the previous chat template. Is this expected? The chat template:

(the same chat template as quoted in full above)

@megemini (Contributor Author)

@megemini Hi, I have a question: how is the LLM Spotting format converted to this document? Are there any libraries or tools that can convert this format to DOCX or Markdown?

https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/algorithm/PaddleOCR-VL/PaddleOCR-VL-1.5.en.md#core-features

Text spotting (text-line localization and recognition) with the prompt Spotting: gives the result as recognized text lines:

[image: Spotting: output]

With the prompt OCR:, the result is reorganized:

[image: OCR: output]

I think this is the main difference.

As for converting between formats, you can refer to https://github.com/PaddlePaddle/PaddleX/blob/release/3.4/paddlex/inference/pipelines/paddleocr_vl/pipeline.py

@megemini (Contributor Author)

I'm encountering this error when using the latest version (b8115): tokenize: error: number of bitmaps (1) does not match number of markers (0). It seems to work only with the previous chat template. Is this expected?

b8115 https://github.com/ggml-org/llama.cpp/releases/download/b8115/llama-b8115-bin-ubuntu-x64.tar.gz works fine without --chat-template-file

[screenshot: llama-cli output]

@IIIIIllllIIIIIlllll

b8115 https://github.com/ggml-org/llama.cpp/releases/download/b8115/llama-b8115-bin-ubuntu-x64.tar.gz works fine without --chat-template-file

[screenshot]

Oh, sorry, I'm using llama-server:
C:\Users\Mark\App\llama-server-v0.4.2-b8112-windows-vulkan\llamacpp\win-vulkan\llama-server.exe -m C:\Users\Mark\Models\PaddleOCR-VL-1.5\PaddleOCR-VL-1.5.gguf --port 8083 --mmproj C:\Users\Mark\Models\PaddleOCR-VL-1.5\mmproj-PaddleOCR-VL-1.5.gguf --ctx-size 252144 --flash-attn on --no-mmap --temp 0 --top-p 0.95 --top-k 40 --min-p 0.05 --presence-penalty 0.0 --repeat-penalty 1.0 --frequency-penalty 0.0 --batch-size 2048 --ubatch-size 2048

@megemini (Contributor Author)

Oh, sorry, I'm using llama-server. (command quoted above)

The easiest way to fix this issue is to add <__media__> before the prompt, e.g. change OCR: to <__media__>OCR: when using llama-server, and do NOT use --chat-template-file, because there is a slight difference between the PaddleOCR-VL and PaddleOCR-VL-1.5 templates.

@ngxson I found that, with the llama-server command, the prompt of body_parsed after oaicompat_chat_params_parse should be something like "<|begin_of_sentence|>User: <__media__>Formula Recognition:\nAssistant:\n" rather than "<|begin_of_sentence|>User: Formula Recognition:\nAssistant:\n", which means <__media__> is not being added properly.

#19419 changed common_chat_msgs_to_json_oaicompat to render_message_to_json; should oaicompat_chat_params_parse in server-common also be updated?
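For context, this is roughly why the bitmap/marker mismatch appears (a sketch based on mtmd's public marker API; the server-side plumbing above is the part in question):

    // each occurrence of the media marker in the prompt text must pair with
    // exactly one supplied image, otherwise tokenization fails with
    // "number of bitmaps (...) does not match number of markers (...)"
    const char * marker = mtmd_default_marker();   // returns "<__media__>"
    // prompt "OCR:"            -> 0 markers, 1 image -> error
    // prompt "<__media__>OCR:" -> 1 marker,  1 image -> ok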

@0xFS0CIETY commented Feb 20, 2026

Why do I get an error in llama-server while the CLI works fine? Did I do something wrong?

[screenshots: error output]

@megemini (Contributor Author)

Why do I get an error in llama-server while the CLI works fine? Did I do something wrong?

Use <__media__>OCR: instead of OCR:; same issue as #18825 (comment).

@0xFS0CIETY

Thanks, it works!

[screenshot]

@IIIIIllllIIIIIlllll

Thank you very much! However, as shown in the image, I'm unsure whether my usage is correct. I used llama-server, and the prompt was '<__media__>Spotting:'. In any case, the recognized mathematical formulas all contained errors.

[images: input and result]

@megemini (Contributor Author)

Thank you very much! However, as shown in the image, I'm unsure whether my usage is correct. I used llama-server, and the prompt was '<__media__>Spotting:'. In any case, the recognized mathematical formulas all contained errors.

I think the best way to process this test image is to first preprocess it with PP-DocLayoutV3 and then recognize the text and formula parts separately. Otherwise, you should not expect results that are 100% consistent with running the full PaddleOCR pipeline with the PaddleOCR-VL model, especially for the formula in the middle of the picture.

p.s. @IIIIIllllIIIIIlllll @0xFS0CIETY the <__media__>XXX: prefix is just a temporary workaround, not a feature.

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* support PaddleOCR-VL

* clip: update PaddleOCR model loader parameters to prevent OOM during warmup

* [update] add paddleocr vl text model instead of ernie4.5

* [update] restore change of minicpmv

* [update] format

* [update] format

* [update] positions and patch merge permute

* [update] mtmd_decode_use_mrope for paddleocr

* [update] image min/max pixels

* [update] remove set_limit_image_tokens

* upate: preprocess without padding

* clean up

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Labels: examples, model (Model specific), python (python script changes)
