[Model] Add PaddleOCR-VL Model Support#42178
Conversation
zucchini-nlp
left a comment
Hey @zhang-prog, thanks for the PR! Great model to have in transformers!
The main thing to fix first is the naming: it should clearly include "PaddlePaddleOCR" and follow the usual pattern for the modality. The config format also isn't right; it needs to be fully nested, with the text and vision configs inside. Additionally, there are no tests or docs, and several files are missing. You can run `transformers add-new-model-like`, which generates a placeholder with the necessary files. I also left some smaller comments here and there. Let me know if you hit any issues.
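For illustration, a minimal framework-free sketch of the fully nested layout described here (class and attribute names are hypothetical, not the final PaddleOCR-VL API): the top-level config owns dedicated text and vision sub-configs instead of flat attributes.

```python
# Hypothetical sketch of a fully nested multimodal config (plain Python,
# no transformers dependency; names are illustrative only).
class VisionConfig:
    def __init__(self, hidden_size=768, num_hidden_layers=12):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers

class TextConfig:
    def __init__(self, hidden_size=1024, vocab_size=32000):
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size

class PaddleOCRVLConfig:
    """Composite config: sub-configs are built from nested dicts."""
    def __init__(self, vision_config=None, text_config=None):
        self.vision_config = VisionConfig(**(vision_config or {}))
        self.text_config = TextConfig(**(text_config or {}))

# Nested dicts map onto the sub-configs, keeping the top level clean.
cfg = PaddleOCRVLConfig(vision_config={"hidden_size": 64})
```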
zucchini-nlp
left a comment
@zhang-prog thanks for iterating!
There are a couple of major comments which were not addressed and could block merging.
- The model seems to not support batched inference in its current state. We need to enable batching before merging if possible. It should not be hard, I think, given that the image tower is quite similar to existing models.
- We also need tests to make sure everything actually works, and a documentation page. These files are usually auto-prefilled as empty files when you run `transformers add-new-model-like`.
- Let the modular file copy automatically when possible. I think there are a few more modules which can be copied from similar models. If you struggle with finding a similar model, you can try out a modular detector.
@zucchini-nlp Thank you for your valuable insights! We’ve carefully addressed all comments and responded to your overall recommendations.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL")
model = AutoModelForImageTextToText.from_pretrained("PaddlePaddle/PaddleOCR-VL", dtype="bfloat16")

messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo.jpg"},
            {"type": "text", "text": "OCR:"},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo2.jpg"},
            {"type": "text", "text": "OCR:"},
        ],
    }
]
batch_messages = [messages1, messages2]

# Left-pad so every prompt ends at the same position, as generation requires.
inputs = processor.apply_chat_template(
    batch_messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
    padding_side="left",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
result = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(result)
```
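The trimming step above relies on left padding: `generate` returns prompt plus completion, and with left padding each prompt occupies exactly the first `len(in_ids)` positions of its row. A tiny framework-free illustration (toy token ids, not real model output):

```python
# Toy token ids illustrating the prompt-trimming step (no torch needed).
PAD = 0
# Left-padded prompts: real tokens are flush with the right edge of each row.
prompts = [[PAD, PAD, 5, 6], [1, 2, 3, 4]]
# generate() output is prompt + newly generated tokens, row by row.
generated = [[PAD, PAD, 5, 6, 9, 9], [1, 2, 3, 4, 7, 8]]
# Same slicing as in the snippet above: drop the prompt, keep the new tokens.
trimmed = [out[len(inp):] for inp, out in zip(prompts, generated)]
print(trimmed)  # [[9, 9], [7, 8]]
```

With right padding, the new tokens would sit after a variable number of pads, and the fixed-length slice would no longer separate prompt from completion.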
Thank you for your efforts. ❤️

@zucchini-nlp How do I properly add documentation pages and unit tests? I tried to use

@zucchini-nlp PTAL. Thanks ❤️

Sorry, taking a look. Got lost in my notifications.
zucchini-nlp
left a comment
Nice, only a few comments, and I replied to your questions above.
For the docs and the tests, they need to be in docs/source/en/model_doc and in the tests folder. You can take a look at a recently merged model for an example: https://github.com/huggingface/transformers/pull/41112/files#diff-857421affc3c877bca95377cbb6adb3a8374b149fcbdcc6c759ea0408fa96897
PTAL. ❤️
I am working on the docs and tests.

Great, looking good already. We can keep the conversion mapping as is, no issue for us! There are also a few unresolved comments from the past iterations, if you can take a look. Ping me when the docs/tests are added and the CI shows ✅.

Don't merge. Working.....

@zucchini-nlp
In my environment, the test passed:

@zhang-prog "worker crashed" means that the tests might be using too much RAM. I see that the image sizes in tests are quite high, 600x400 images. Let's make the dummy inputs and model as tiny as possible. Reviewing now.
> ### Usage tips
>
> > [!IMPORTANT]
> > We currently recommend using the [PaddleOCR official method for inference](https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html), as it is faster and supports page-level document parsing.
Curious if we plan to support page-level document parsing in transformers in the future. Let us know if you need help with it.
This is one of our goals as well. We aim to resolve this issue, but we anticipate encountering some engineering challenges, such as the need to manage the sequential logic between the two models, which is quite complex.
In fact, we plan to submit a PR for PP-DocLayoutV2 soon and hope you can help review it.
```python
        "expert_layer_offset",
        "expert_layer_period",
    ],
    "PaddleOCRTextConfig": ["tie_word_embeddings"],
```
Interesting, `tie_word_embeddings` is a universal attribute and I wouldn't expect CI to complain. Will check 👁️
Ah, I see. It has a comment that says `# Allow if the default value in the configuration class is different from the one in PreTrainedConfig`, and there are more models skipping `tie_word_embeddings` explicitly.
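As a toy illustration of the rule quoted in that comment (hypothetical helper names; the real check lives in transformers' config consistency tests): a config may skip the common-attribute check for an attribute only when its own default differs from the `PretrainedConfig` default.

```python
# Hypothetical, simplified version of the quoted CI rule.
PRETRAINED_CONFIG_DEFAULTS = {"tie_word_embeddings": True}

def skip_allowed(attr, model_default):
    # Allowed only if the model's default differs from the base default.
    return model_default != PRETRAINED_CONFIG_DEFAULTS[attr]

# PaddleOCRTextConfig defaults tie_word_embeddings differently, so the
# explicit skip entry shown above is accepted by the check.
ok = skip_allowed("tie_word_embeddings", False)
```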
Now, CI shows ✅

run-slow: paddleocr_vl

This comment contains models: ["models/paddleocr_vl"]

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI Results
Model CI Report: ❌ Failed tests

@zucchini-nlp fixed, please try the slow tests again. BTW, can some conversations be marked as resolved?

run-slow: paddleocr_vl

This comment contains models: ["models/paddleocr_vl"]
```python
        ("ovis2", "Qwen2TokenizerFast" if is_tokenizers_available() else None),
        ("owlv2", "CLIPTokenizerFast" if is_tokenizers_available() else None),
        ("owlvit", "CLIPTokenizerFast" if is_tokenizers_available() else None),
        ("paddleocr_vl", "LlamaTokenizer" if is_tokenizers_available() else None),
```
```diff
-        ("paddleocr_vl", "LlamaTokenizer" if is_tokenizers_available() else None),
+        ("paddleocr_vl", "TokenizersBackend" if is_tokenizers_available() else None),
```
From looking at the tokenizer.json on the hub, it's not a Llama!
```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": "▁"
    }
  ]
},
"pre_tokenizer": null,
```
while llama would initialize:
```python
self._tokenizer.normalizer = None
self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
)
```

```python
        rope_deltas (`torch.LongTensor` of shape `(batch_size, )`, *optional*):
            The rope index difference between sequence length and multimodal rope.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
```
This, in general, should not be needed.
CI Results
Model CI Report: ❌ Failed tests

@zucchini-nlp uh, OOM on
@zhang-prog the test changes the hidden sizes of the model to be a multiple of 16 IIRC; prob that is the reason it is OOM'ing. If the model causes OOM and can't be made smaller, we can skip the test imo.
@zucchini-nlp ok, I've reduced the value of

[For maintainers] Suggested jobs to run (before merge): run-slow: auto, paddleocr_vl

run-slow: paddleocr_vl

This comment contains models: ["models/paddleocr_vl"]

CI Results: ✅ No failing test specific to this PR 🎉!
Great, CI is green now and we can merge 🚀
* init
* refactor
* update
* update
* fix unresolved problems
* fix how position_ids work with flash_attn_2
* add tests and fix code
* add model_doc
* update model_doc
* fix ci
* update docstring
* add tests
* update
* add **kwargs
* update
* update
* update
* reduce max_position_embeddings in tests
* update


What does this PR do?
This PR adds the PaddleOCR-VL model to Hugging Face Transformers, from PaddleOCR.
Relevant Links:
PaddleOCR
https://huggingface.co/PaddlePaddle/PaddleOCR-VL
Usage
Use a pipeline
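The pipeline snippet itself was not captured above; a hedged sketch of the standard image-text-to-text pipeline usage, assuming the `PaddlePaddle/PaddleOCR-VL` checkpoint from this PR (running it downloads the model weights):

```python
from transformers import pipeline

# Image-text-to-text pipeline; fetches the PaddleOCR-VL checkpoint on first use.
pipe = pipeline("image-text-to-text", model="PaddlePaddle/PaddleOCR-VL")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/ocr_demo.jpg"},
            {"type": "text", "text": "OCR:"},
        ],
    }
]
out = pipe(text=messages, max_new_tokens=256)
print(out[0]["generated_text"])
```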
Load model directly
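The direct-loading snippet was also elided; it mirrors the opening lines of the batched example earlier in this thread (checkpoint name from this PR, weights downloaded on first use):

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load processor and model directly from the hub checkpoint.
processor = AutoProcessor.from_pretrained("PaddlePaddle/PaddleOCR-VL")
model = AutoModelForImageTextToText.from_pretrained("PaddlePaddle/PaddleOCR-VL", dtype="bfloat16")
```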