model: Add PaddleOCR-VL model support #18825
Conversation
ngxson
left a comment
this PR looks a bit suspicious, please explicitly specify if you're using AI to generate it
gguf-py/gguf/constants.py
Outdated
MAX_PIXELS = "clip.vision.max_pixels"
MIN_PIXELS = "clip.vision.min_pixels"
what's the difference between this and qwen2vl.cpp? it seems 100% identical
First, thanks for your review, but NO! Absolutely not! The initial commit was based on your PR #16701. However, as you mentioned, the code looked nearly identical to qwen2vl.cpp, so I compared it against the implementation from huggingface/transformers#42178, which adds PaddleOCR-VL support to transformers. To be frank, I've only been working with the llama.cpp source code for about two weeks, so I DO use AI to help me understand the code. However, since llama.cpp is a C++ project, the AI isn't very helpful, so I'm forced to do most of the work by myself! PaddleOCR-VL is almost identical to ERNIE4.5, but there are two main differences from your PR:
(See lines 3529 to 3550 at 9d5a701.) Additionally, ERNIE4.5 had a minor issue in convert_hf_to_gguf.py (lines 3728 to 3736 at 9d5a701). Also, the PaddleOCR-VL prompt is changed from
While qwen2vl requires it (src/models/qwen2vl.cpp, lines 68 to 70 at 36f0132), ERNIE4.5 and PaddleOCR-VL do not (src/models/paddleocr.cpp, lines 73 to 75 at 9d5a701).
As I mentioned above, PaddleOCR-VL requires it (see lines 8318 to 8320 at 9d5a701).
Thanks for the explanation, that's useful. I think it should be put into the PR description to make reviewing easier. Indeed, it seems the model is just qwen2vl under the hood, minus a small difference in the tokenizer. At least for the language model, it could reuse the same qwen2vl arch when the bo tensor is not present in the weights. I'll come back to this after #18719 is merged.
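The reuse idea above boils down to an arch-selection check at load time. A minimal sketch of that idea in Python, with hypothetical tensor and arch names ("attn_output.bias" stands in for the distinguishing bo tensor; this is not the actual llama.cpp loader code):

```python
def pick_text_arch(tensor_names):
    """Reuse the qwen2vl graph when the extra bias tensor is absent.

    tensor_names: list of tensor names found in the GGUF weights.
    "attn_output.bias" is a placeholder for the distinguishing tensor.
    """
    has_bo = any(n.endswith("attn_output.bias") for n in tensor_names)
    return "paddleocr-vl" if has_bo else "qwen2vl"
```

With a weight list like `["blk.0.attn_q.weight"]` this falls back to `"qwen2vl"`; the real decision in llama.cpp would of course live in the C++ model loader.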
Update 2026-01-29: PaddleOCR-VL-1.5 released. Model exported: https://modelscope.cn/models/megemini/PaddleOCR-VL-1.5-GGUF/summary. Test with a new type of image:
with command and result:
Compare with the ground truth (gt):
the chat template is broken, it should be fixed via #19419
I encountered an error after building with the latest code. Could you please tell me how to resolve it? Thank you very much.
Wait until #19419 is merged, or use the template below:

```jinja
{%- if not add_generation_prompt is defined -%}
{%- set add_generation_prompt = true -%}
{%- endif -%}
{%- if not cls_token is defined -%}
{%- set cls_token = "<|begin_of_sentence|>" -%}
{%- endif -%}
{%- if not eos_token is defined -%}
{%- set eos_token = "</s>" -%}
{%- endif -%}
{{- cls_token -}}
{%- for message in messages -%}
{%- if message["role"] == "user" -%}
{{- "User: " -}}
{%- if message["content"] is string -%}
{{- message["content"] }}
{%- else -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "image" -%}
{{ "<|IMAGE_START|><|IMAGE_PLACEHOLDER|><|IMAGE_END|>" }}
{%- endif -%}
{%- endfor -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] }}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{ "\n" -}}
{%- elif message["role"] == "assistant" -%}
{{- "Assistant: " -}}
{%- if message["content"] is string -%}
{{- message["content"] }}
{%- else -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] }}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{ eos_token -}}
{%- elif message["role"] == "system" -%}
{%- if message["content"] is string -%}
{{- message["content"] }}
{%- else -%}
{%- for content in message["content"] -%}
{%- if content["type"] == "text" -%}
{{ content["text"] + "\n" }}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- "Assistant: " -}}
{%- endif -%}
```
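As a rough illustration of what the template above produces, here is a hedged pure-Python sketch for the simple case where every message content is a plain string (not the actual Jinja renderer; image parts and list contents are ignored):

```python
def render(messages, add_generation_prompt=True,
           cls_token="<|begin_of_sentence|>", eos_token="</s>"):
    # Mirror the Jinja template above for string-only message contents.
    out = cls_token
    for m in messages:
        if m["role"] == "user":
            out += "User: " + m["content"] + "\n"
        elif m["role"] == "assistant":
            out += "Assistant: " + m["content"] + eos_token
        elif m["role"] == "system":
            out += m["content"]
    if add_generation_prompt:
        out += "Assistant: "
    return out
```

For a single user turn `"OCR:"` this yields the prompt prefix `<|begin_of_sentence|>User: OCR:` followed by a newline and the `Assistant: ` generation prompt.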
with command:

```shell
./build/bin/llama-cli -m /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
    --mmproj /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
    --color on \
    --image /home/shun/Pictures/1640.jpeg \
    --prompt "OCR:" \
    --chat-template-file /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL/chat_template_llama.jinja
```
@ngxson Is there any documentation for specific models in llama.cpp (like the PaddleOCR-VL vLLM recipes)? If there is, we'll document how to use the model there; otherwise, we'll put it in the PaddleOCR-VL docs.
@ngxson @megemini
@zhang-prog I have a simple question: to parse an image from start to result, is CUDA still needed to run the whole process, even though llama.cpp has a Vulkan backend?
The chat template fix has just been merged, I'll give it a try soon. Hopefully things work this time and I'll be able to merge this.
hello @zhang-prog @ngxson, I can see that PP-DocLayoutV3 is an intermediate step in the inference pipeline. Will PP-DocLayoutV3 work with this? Or does it have to be executed separately (e.g. with transformers)?
PP-DocLayoutV3 is not included in llama.cpp.
@megemini Hi, I have a question: how is the LLM Spotting format converted to this document? Are there any libraries or tools that can convert this format to DOCX or Markdown? |
I'm encountering this error when using the latest version (b8115):
with prompt
I think this is the main difference. As for format conversion, you can refer to https://github.com/PaddlePaddle/PaddleX/blob/release/3.4/paddlex/inference/pipelines/paddleocr_vl/pipeline.py
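llama.cpp itself does not ship such a converter; the linked PaddleX pipeline defines the real output format. As a purely hypothetical sketch, mapping a list of recognized blocks to Markdown could look like this (the block schema here is invented for illustration, not the actual PaddleOCR-VL schema):

```python
def blocks_to_markdown(blocks):
    # blocks: list of {"type": ..., "text": ...} dicts, an invented
    # intermediate format standing in for the real pipeline output.
    lines = []
    for b in blocks:
        if b["type"] == "title":
            lines.append("# " + b["text"])
        elif b["type"] == "formula":
            lines.append("$$" + b["text"] + "$$")
        else:
            lines.append(b["text"])
    return "\n\n".join(lines)
```

A DOCX target would follow the same pattern with a library such as python-docx instead of string concatenation.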
b8115 (https://github.com/ggml-org/llama.cpp/releases/download/b8115/llama-b8115-bin-ubuntu-x64.tar.gz) works fine without
oh, sorry, I'm using llama-server.
The easiest way to fix this issue is to add it. @ngxson I found that this changed with #19419.
use
I think the best way to process this test image is to first preprocess it using PP-DocLayoutV3, and then recognize the text and formula parts separately. Otherwise, you should not expect results that are 100% consistent with running PaddleOCR with the PaddleOCR-VL model, especially for the formula in the middle of the picture. p.s. @IIIIIllllIIIIIlllll @0xFS0CIETY the
* support PaddleOCR-VL
* clip: update PaddleOCR model loader parameters to prevent OOM during warmup
* [update] add paddleocr vl text model instead of ernie4.5
* [update] restore change of minicpmv
* [update] format
* [update] format
* [update] positions and patch merge permute
* [update] mtmd_decode_use_mrope for paddleocr
* [update] image min/max pixels
* [update] remove set_limit_image_tokens
* upate: preprocess without padding
* clean up
* Update convert_hf_to_gguf.py
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>













Add PaddleOCR-VL model support.
Test with some images:
with command:

```shell
./build/bin/llama-cli -m /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
    --mmproj /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
    --color on \
    --image /home/shun/Pictures/1640.jpeg \
    --prompt "OCR:"
```

with command:

```shell
./build/bin/llama-cli -m /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF.gguf \
    --mmproj /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL-GGUF-mmproj.gguf \
    --color on \
    --image /home/shun/Pictures/paddleocr.jpg \
    --prompt "Table Recognition:"
```

can be formatted:
p.s. Thanks to @ngxson #16701
Update
The model was converted with the following commands:
Or it can be downloaded from: https://modelscope.cn/models/megemini/PaddleOCR-VL-GGUF/summary