
[Model] Add PaddleOCR-VL Model Support #42178

Merged

zucchini-nlp merged 24 commits into huggingface:main from zhang-prog:feat/paddleocr_vl on Dec 11, 2025

Conversation


@zhang-prog zhang-prog commented Nov 13, 2025

What does this PR do?

This PR adds PaddleOCR-VL model to Hugging Face Transformers from PaddleOCR.

Relevant Links:

PaddleOCR
https://huggingface.co/PaddlePaddle/PaddleOCR-VL

Usage

Use a pipeline

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="PaddlePaddle/PaddleOCR-VL", dtype="bfloat16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo2.jpg"},
            {"type": "text", "text": "OCR:"},
        ]
    }
]
result = pipe(text=messages)
print(result)

Load model directly

from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL")
model = AutoModelForImageTextToText.from_pretrained("PaddlePaddle/PaddleOCR-VL", dtype="bfloat16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo2.jpg"},
            {"type": "text", "text": "OCR:"},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)

@zucchini-nlp zucchini-nlp self-requested a review November 13, 2025 09:07
@zucchini-nlp zucchini-nlp left a comment

hey @zhang-prog , thanks for the PR! Great model to have in transformers!

The main thing to fix first is the naming, it should clearly include "PaddlePaddleOCR" and follow the usual pattern depending on the modality. The config format also isn’t right; it needs to be fully nested, with text and vision configs inside. Additionally there are no tests or docs, several files are missing. You can run transformers add-new-model-like which would generate a placeholder with the necessary files. I also left some smaller comments here and there. Let me know if you hit any issues

@zhang-prog

@zucchini-nlp
We have refactored the code to address the issues you mentioned in your comments.
Please review the code again when you have time.
Thank you for your efforts!!!

@zucchini-nlp zucchini-nlp left a comment

@zhang-prog thanks for iterating!

There are a couple of major comments that were not addressed and could block merging.

  1. The model doesn't seem to support batched inference in its current state. We need to enable batching before merging if possible; it should not be hard, given that the image tower is quite similar to existing models
  2. We also need tests to make sure everything actually works, plus a documentation page. These files are usually auto-prefilled as empty files when you run transformers add-new-model-like
  3. Let the modular copy automatically when possible. There are a few more modules that can be copied from similar models; if you struggle to find a similar model, you can try out a modular detector

@zhang-prog zhang-prog commented Nov 26, 2025

> @zhang-prog thanks for iterating!
>
> There are a couple major comments which were not addressed and can be a blocker for merging.
>
>   1. The model seems to not support batched inference in current state. We need to enable batching before merging if possible. Should not be hard I think given that the image tower is quite similar to existing models
>   2. We also need tests to make sure everything actually works and a documentation page. These files are usually auto-prefilled with empty files when you run transformers add-new-model-like
>   3. Let the modular copy automatically when possible. I think there are a few more modules which can be copied from similar models. If you struggle with finding a similar model, you can try out a modular detector

@zucchini-nlp Thank you for your valuable insights! We’ve carefully addressed all comments and responded to your overall recommendations.

  1. We support bs > 1, like this:
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL")
model = AutoModelForImageTextToText.from_pretrained("PaddlePaddle/PaddleOCR-VL", dtype="bfloat16")
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo.jpg"},
            {"type": "text", "text": "OCR:"},
        ]
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo2.jpg"},
            {"type": "text", "text": "OCR:"},
        ]
    }
]
batch_messages = [messages1, messages2]
inputs = processor.apply_chat_template(
    batch_messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
    padding_side="left",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
result = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(result)
  2. We still have some issues to discuss. I replied to your comment and will generate the final version of the document once it’s completed.

  3. We also added the PaddleOCRVisionConfig and PaddleOCRTextConfig into modular.

Thank you for your efforts. ❤️
PTAL.
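The per-sample trimming in the batched snippet above can be illustrated with a plain-Python sketch (the token ids below are invented for illustration): with left padding, every row of input_ids has the same length, so slicing off len(in_ids) tokens from each generated row keeps only the newly generated part.

```python
# Toy sketch of the generated-id trimming above; ids are made up, 0 marks left padding.
input_ids = [[0, 0, 5, 6], [1, 2, 3, 4]]            # left-padded prompts
generated = [[0, 0, 5, 6, 9, 9], [1, 2, 3, 4, 7]]   # prompts + newly generated tokens
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated)]
print(trimmed)  # [[9, 9], [7]]
```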

@zhang-prog

@zucchini-nlp How do I properly add documentation pages and unit tests? I tried to use transformers add-new-model-like, which generates the new modular_xxx.py files, but this process might not be the right approach.

@zhang-prog

@zucchini-nlp PTAL. Thanks❤️

@zucchini-nlp

Sorry, taking a look. Got lost in my notifications

@zucchini-nlp zucchini-nlp left a comment

Nice, only a few comments and replied to your questions above

For the docs and the tests, they need to be in docs/source/en/model_doc and in the tests folder. You can take a look at a recently merged model for an example: https://github.com/huggingface/transformers/pull/41112/files#diff-857421affc3c877bca95377cbb6adb3a8374b149fcbdcc6c759ea0408fa96897

@zhang-prog

@zucchini-nlp

PTAL.❤️

_checkpoint_conversion_mapping and ignore_keys_at_rope_validation need to be discussed.

I am working on the docs and tests.

@zucchini-nlp

Great, looking good already. We can keep the conversion mapping as is, no issue for us! There are also a few unresolved comments from past iterations, if you can take a look.

Ping me when the docs/tests are added and the CI shows ✅

@zhang-prog zhang-prog commented Dec 5, 2025

Don't merge. Working.....

@zhang-prog

@zucchini-nlp
Why did it crash here?

[screenshot: CI worker crash]

In my environment, the test passed:

[screenshot: tests passing locally]

@zucchini-nlp

@zhang-prog a worker crash means the tests might be using too much RAM. I see that the image sizes in the tests are quite high (600x400). Let's make the dummy inputs and model as tiny as possible

Reviewing now
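As a hypothetical sketch of what "as tiny as possible" might mean here (the field names follow common transformers test conventions and are assumptions, not taken from this PR's actual config):

```python
# Hypothetical tiny test-config values; field names are assumptions, not from the PR.
tiny_vision_kwargs = {
    "hidden_size": 32,
    "intermediate_size": 64,
    "num_hidden_layers": 2,
    "num_attention_heads": 4,
    "image_size": 30,   # small dummy images instead of 600x400
    "patch_size": 15,
}
tiny_text_kwargs = {
    "hidden_size": 32,
    "intermediate_size": 64,
    "num_hidden_layers": 2,
    "num_attention_heads": 4,
    "max_position_embeddings": 128,
}
# Sanity checks: the head dimension and patch grid must divide evenly.
assert tiny_text_kwargs["hidden_size"] % tiny_text_kwargs["num_attention_heads"] == 0
assert tiny_vision_kwargs["image_size"] % tiny_vision_kwargs["patch_size"] == 0
```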

### Usage tips

> [!IMPORTANT]
> We currently recommend using the [PaddleOCR official method for inference](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html), as it is faster and supports page-level document parsing.
Member

curious if we plan to support page-level document parsing in transformers in the future. Let us know if you need help with it

Contributor Author

This is one of our goals as well. We aim to resolve this, but we anticipate some engineering challenges, such as managing the sequential logic between the two models, which is quite complex.

In fact, we plan to submit a PR for PP-DocLayoutV2 soon and hope you can help review it.

"expert_layer_offset",
"expert_layer_period",
],
"PaddleOCRTextConfig": ["tie_word_embeddings"],
Member

interesting, tie_word_embeddings is a universal attribute and I wouldn't expect CI to complain. Will check 👁️

Member

Ah, I see. It has a comment saying # Allow if the default value in the configuration class is different from the one in PreTrainedConfig, and there are more models skipping tie_word_embeddings explicitly

@zhang-prog

@zucchini-nlp

Now, CI shows ✅
PTAL. ❤️

@zucchini-nlp

run-slow: paddleocr_vl

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/paddleocr_vl"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • paddleocr_vl:
    tests/models/paddleocr_vl/test_modeling_paddleocr_vl.py::PaddleOCRVLModelTest::test_flex_attention_with_grads
    tests/models/paddleocr_vl/test_modeling_paddleocr_vl.py::PaddleOCRVLIntegrationTest::test_small_model_integration_test
    tests/models/paddleocr_vl/test_modeling_paddleocr_vl.py::PaddleOCRVLIntegrationTest::test_small_model_integration_test_batch

@zucchini-nlp zucchini-nlp left a comment

Thanks a lot, huge work! The last bit will be adjusting the slow tests; the comment above shows the failing cases.

I will ping core maintainers in the meanwhile for review

@zhang-prog

zhang-prog commented Dec 11, 2025

@zucchini-nlp fixed, please try the slow tests again. btw, can some conversations be marked as resolved?

@zucchini-nlp

run-slow: paddleocr_vl

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/paddleocr_vl"]
quantizations: []

@ArthurZucker ArthurZucker left a comment

Very nice! 🤗

("ovis2", "Qwen2TokenizerFast" if is_tokenizers_available() else None),
("owlv2", "CLIPTokenizerFast" if is_tokenizers_available() else None),
("owlvit", "CLIPTokenizerFast" if is_tokenizers_available() else None),
("paddleocr_vl", "LlamaTokenizer" if is_tokenizers_available() else None),
Collaborator

Suggested change
("paddleocr_vl", "LlamaTokenizer" if is_tokenizers_available() else None),
("paddleocr_vl", "TokenizersBackend" if is_tokenizers_available() else None),

From looking at the tokenizer.json on the Hub, it's not a llama!

"normalizer": {
    "type": "Sequence",
    "normalizers": [
      {
        "type": "Replace",
        "pattern": {
          "String": " "
        },
        "content": "▁"
      }
    ]
  },
  "pre_tokenizer": null,

while llama would initialize:

        self._tokenizer.normalizer = None
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
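The difference is easy to see with a stdlib sketch: the hub tokenizer.json applies a Replace normalizer that turns every literal space into "▁" (U+2581) before tokenization, whereas the Llama fast tokenizer sets no normalizer and uses a Metaspace pre-tokenizer instead.

```python
# Stdlib sketch of the Replace normalizer from the hub tokenizer.json:
# every space becomes the metaspace character "▁" (U+2581).
text = "Hello OCR world"
normalized = text.replace(" ", "\u2581")
print(normalized)  # Hello▁OCR▁world
```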

Contributor Author

Done

rope_deltas (`torch.LongTensor` of shape `(batch_size, )`, *optional*):
The rope index difference between sequence length and multimodal rope.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
Collaborator

this in general should not be needed

Contributor Author

Removed

@github-actions

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • paddleocr_vl:
    tests/models/paddleocr_vl/test_modeling_paddleocr_vl.py::PaddleOCRVLModelTest::test_flex_attention_with_grads

@zhang-prog

@zucchini-nlp uh, OOM on test_flex_attention_with_grads again. How can I get this test to pass by changing the parameters?

@zucchini-nlp

@zhang-prog the test changes the hidden sizes of the model to be a multiple of 16, iirc; that is probably why it is OOM'ing. If the model causes OOM and can't be made smaller, we can skip the test imo

@zhang-prog

@zucchini-nlp ok, I've reduced hidden_size, intermediate_size, and num_attention_heads. Maybe this attempt will pass the test. Let's try again!

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, paddleocr_vl

@zucchini-nlp

run-slow: paddleocr_vl

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/paddleocr_vl"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@zucchini-nlp

Great, CI is green now and we can merge 🚀

@zucchini-nlp zucchini-nlp merged commit 8c84144 into huggingface:main Dec 11, 2025
26 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* init

* refactor

* update

* update

* fix unresolved problems

* fix how position_ids work with flash_attn_2

* add tests and fix code

* add model_doc

* update model_doc

* fix ci

* update docstring

* add tests

* update

* add **kwargs

* update

* update

* update

* reduce max_position_embeddings in tests

* update


4 participants