【Hackathon 9th ERNIE Tutorial No.1】A New Way to Fine-tune PaddleOCR-VL -- Prompt and Information Extraction #1196

Merged
jzhang533 merged 7 commits into PaddlePaddle:master from megemini:paddleocr_vl_prompt
Dec 16, 2025
Conversation

@megemini
Contributor

A New Way to Fine-tune PaddleOCR-VL -- Prompt and Information Extraction

Related: PaddlePaddle/Paddle#74776

https://aistudio.baidu.com/projectdetail/9857242

@paddle-bot

paddle-bot bot commented Dec 10, 2025

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@jzhang533 (Collaborator) left a comment

Nicely done, Shun!~
The fine-tuned model can be uploaded to Hugging Face.
A spot has already been reserved here: https://huggingface.co/ERNIE-Community


Here we take recognizing and extracting the information in an invoice as an example:

**Before fine-tuning**
Collaborator:

It would be even better if this comparison could be shown visually with images. Reading the raw JSON output is a bit tiring.


As mentioned earlier, we can treat PaddleOCR-VL as a VLM, so we can let a more capable VLM `teach` PaddleOCR-VL to recognize `购买方名称` (buyer name) and `销售方名称` (seller name).

The data can be generated with the `ernie-4.5-turbo-vl-preview` model; see the script `paddleocr_vl/tools/extract_ner/extract_ner.py`.
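The actual generation logic lives in `paddleocr_vl/tools/extract_ner/extract_ner.py`; as a rough, hypothetical sketch of the idea (the request shape, field handling, and helper name here are illustrative and not taken from that script), one can build an OpenAI-style chat payload asking the stronger VLM to return the target fields as strict JSON:

```python
import json

def build_extraction_request(image_url: str, fields: list[str]) -> dict:
    """Build an OpenAI-style chat payload asking a stronger VLM to
    extract the given invoice fields and answer with strict JSON.
    (Illustrative sketch; not the format used by extract_ner.py.)"""
    schema = {field: "" for field in fields}  # empty-valued JSON schema as a hint
    return {
        "model": "ernie-4.5-turbo-vl-preview",  # model name taken from the thread
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {
                        "type": "text",
                        "text": "Extract the following fields from the image; "
                                "output JSON only: "
                                + json.dumps(schema, ensure_ascii=False),
                    },
                ],
            }
        ],
    }

payload = build_extraction_request(
    "https://example.com/invoice.jpg",  # hypothetical URL
    ["购买方名称", "销售方名称"],
)
print(json.dumps(payload, ensure_ascii=False)[:120])
```

The VLM's JSON answers can then be paired with the source images to form the SFT dataset.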
Collaborator:

Here I think it is enough to focus on the dataset format. How to generate the synthetic data can go into an appendix at the end, so the article stays focused on the SFT model itself.

- The `text` of the `mask` part is not just `OCR:`; it also contains the fields to be extracted afterwards
- The `text` of the `no_mask` part is complete `JSON`-formatted information rather than plain text

> Note: some articles on fine-tuning PaddleOCR-VL mention `Completion-Only Training`, i.e. only caring about the `completion` (the `no_mask` part) while leaving the `prompt` (the `mask` part) unchanged. This article, however, needs `Full-Sequence Training`: the focus is on fine-tuning the `prompt`, and the `completion` must change its generation behavior according to the `prompt`.
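A minimal, hypothetical sketch of one training record in the `mask` / `no_mask` style described above (the exact key names in ERNIEKIT's data format may differ; the field values are made up):

```python
import json

# One record: a masked prompt segment and an unmasked completion segment.
# Key names ("tag", "text") and values are illustrative, not ERNIEKIT's exact schema.
record = [
    {   # prompt part: masked, excluded from the loss
        "tag": "mask",
        "text": "OCR:" + json.dumps(
            {"购买方名称": {}, "销售方名称": {}}, ensure_ascii=False),
    },
    {   # completion part: full JSON, the tokens that drive the loss
        "tag": "no_mask",
        "text": json.dumps(
            {"购买方名称": {"名称": "某某公司"}, "销售方名称": {"名称": "某某酒店"}},
            ensure_ascii=False),
    },
]
print(json.dumps(record, ensure_ascii=False))
```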
Collaborator:

This should still be completion-only training, because the prompt part above ("text": "OCR:{\"发票名称\": \"\"}") is also masked out and does not participate in the loss computation.

FYI : https://huggingface.co/docs/trl/en/sft_trainer#train-on-completion-only
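A minimal sketch of what completion-only masking means in practice, assuming the usual convention that labels set to -100 are ignored by the cross-entropy loss (token ids below are made up):

```python
# Prompt tokens get label -100, which PyTorch-style cross-entropy losses
# ignore, so only the completion tokens contribute to the loss.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len positions."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

ids = [101, 102, 103, 201, 202]        # 3 prompt tokens + 2 completion tokens
print(mask_prompt_labels(ids, 3))      # → [-100, -100, -100, 201, 202]
```

Note that masking the prompt out of the loss does not stop the model from conditioning on it: the prompt tokens still flow through the forward pass, so different prompts still produce different completions.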

@megemini (Contributor, Author) commented Dec 11, 2025:

I fine-tuned with ERNIEKIT, and run_ocr_vl_sft_16k.yaml has no `completion_only_loss` parameter ~

Does ERNIEKIT mask out the prompt automatically? If the prompt does not count towards the loss, how does the model tell different prompts apart?

I misunderstood earlier: ERNIE masks this part out by tagging it as mask ~ so it is indeed completion-only training ~

Note two points:

- `use_layout_detection=False`: skip the layout model and feed the image directly into `PaddleOCR-VL-0.9B`
- `prompt_label="OCR:{}"`: use our fine-tuned `prompt` here, expecting the model to output complete JSON-formatted information
Collaborator:

Actually, I think an example of how to use the VLM part of the model directly would be enough here, since the fine-tuned model basically has nothing to do with the first stage anymore.

Contributor (Author):

I also think calling the VLM part directly should work. The problem is that neither ERNIEKIT nor PaddleOCR documents how to do that 😂

Collaborator:

You can run inference directly with the transformers library, or with an inference engine such as FastDeploy, sglang, or vllm; any of them should work.

Contributor (Author):

PaddlePaddle/FastDeploy#5525

FD can load the original PaddlePaddle/PaddleOCR-VL model ~ but it cannot load the fine-tuned model ~

Contributor (Author):

Also, do Paddle models need a conversion step before they can be used with transformers or llama-type tooling? What tool is used for that nowadays?

Collaborator:

I see you saved in safetensors format, so it should run with transformers, vllm, and sglang without any conversion.
It should not be the case that FD cannot run the fine-tuned model. What exactly is the error?

Contributor (Author):

An issue has been filed for FD: PaddlePaddle/FastDeploy#5525

Running the fine-tuned model locally, I used the two approaches mentioned in huggingface/transformers#42178:

```python
# Approach 1: the high-level pipeline API
# from transformers import pipeline

# pipe = pipeline(
#     "image-text-to-text",
#     model="/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt",
#     dtype="bfloat16")
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "image", "url": "https://ai-studio-static-online.cdn.bcebos.com/dc31c334d4664ca4955aa47d8e202a53a276fd0aab0840b09abe953fe51207d0"},
#             {"type": "text", "text": "OCR:{}"},
#         ]
#     }
# ]
# result = pipe(text=messages)
# print(result)


# Approach 2: AutoProcessor + AutoModelForImageTextToText
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt")
model = AutoModelForImageTextToText.from_pretrained("/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt", dtype="bfloat16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://ai-studio-static-online.cdn.bcebos.com/dc31c334d4664ca4955aa47d8e202a53a276fd0aab0840b09abe953fe51207d0"},
            {"type": "text", "text": "OCR:{}"},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)
```

Error:

```
(venv310)  ✘ shun@shun-B660M-Pro-RS  ~/workspace/Projects/erniekit_paddleocr_vl_ner   master ±✚  python paddleocr_vl_transformers.py
The repository /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt contains custom code which must be executed to correctly load the model. You can inspect the repository content at /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt .
 You can inspect the repository content at https://hf.co//media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
  File "/home/shun/workspace/Projects/erniekit_paddleocr_vl_ner/paddleocr_vl_transformers.py", line 22, in <module>
    processor = AutoProcessor.from_pretrained("/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt")
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/models/auto/processing_auto.py", line 382, in from_pretrained
    processor_class = get_class_from_dynamic_module(
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/dynamic_module_utils.py", line 572, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/dynamic_module_utils.py", line 390, in get_cached_module_file
    resolved_module_file = cached_file(
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/utils/hub.py", line 276, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/utils/hub.py", line 377, in cached_files
    raise OSError(
OSError: /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt does not appear to have a file named processing_ppocrvl.py. Checkout 'https://huggingface.co//media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt/tree/main' for available files.
```

A file seems to be missing ~ I can see this file in the hf repo, but the directory saved after fine-tuning does not have it ~

Collaborator:

@megemini These files are indeed missing after fine-tuning at the moment. For now you can pull the hf repo first and replace its safetensors with your fine-tuned safetensors; then it will run.

Contributor (Author):

Got it, thanks~ :)

```
{'res': {'input_path': '/home/aistudio/paddleocr_vl/data/test.jpg', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': False, 'use_chart_recognition': False, 'format_block_content': False}, 'parsing_res_list': [{'block_label': 'OCR:{"购买方名称": {}, "销售方名称": {}}', 'block_content': '{"购买方名称": {"名称": "中青旅联科(杭州)公关顾问有限公司", "统一社会信用代码": "91330105MA2H2DUJ92"}, "销售方名称": {"名称": "杭州万力酒店管理有限公司", "统一社会信用代码": "91330106MA2B1C4UXN"}}', 'block_bbox': [0, 0, 1260, 838]}]}}
```

As you can see, the model can basically follow our instructions and extract the corresponding information.
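Since `block_content` carries the extracted fields as a JSON string, a small helper can parse them out of the `predict` result; the sample data and the helper name below are a sketch, abridged from the output above:

```python
import json

# Abridged sample of the result dict shown above (only the nested keys we need).
res = {'res': {'parsing_res_list': [{
    'block_label': 'OCR:{"购买方名称": {}, "销售方名称": {}}',
    'block_content': '{"购买方名称": {"名称": "中青旅联科(杭州)公关顾问有限公司"}, '
                     '"销售方名称": {"名称": "杭州万力酒店管理有限公司"}}',
}]}}

def extract_fields(res: dict) -> dict:
    """Parse the JSON string in the first block's block_content."""
    block = res["res"]["parsing_res_list"][0]
    return json.loads(block["block_content"])

fields = extract_fields(res)
print(fields["购买方名称"]["名称"])  # → 中青旅联科(杭州)公关顾问有限公司
```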
Collaborator:

Do you still have the energy to add some simple metric comparisons? For example, the accuracy improvement on a validation set before and after training.

Contributor (Author):

That does not really seem comparable ~ the goal of the fine-tuning here is to control the output format.

Before fine-tuning, the output is a markdown table:

[image: markdown-table output before fine-tuning]

After fine-tuning, the output is JSON:

[image: JSON output after fine-tuning]

@megemini (Contributor, Author)

Update 20251211

Updated the data section and removed the explanation of the completion-only training part ~

I will download the model first and see whether it can be uploaded to aistudio ~ uploading to hf takes some tricks and is too much hassle; we will see who can help upload it later 😂

@megemini (Contributor, Author) commented Dec 12, 2025

@jzhang533 @zhang-prog

Thanks, transformers inference works now 👍️👍️👍️

[image: transformers inference result]

The only thing is that I have just 6 GB of VRAM here, so I used quantization (though it is still quite slow 😅):

```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch

path = "/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL"
processor = AutoProcessor.from_pretrained(path, local_files_only=True, use_fast=True)

# 4-bit quantization config to drastically reduce VRAM usage
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForImageTextToText.from_pretrained(
    path,
    quantization_config=quantization_config,
    # device_map="auto",
    local_files_only=True
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://ai-studio-static-online.cdn.bcebos.com/dc31c334d4664ca4955aa47d8e202a53a276fd0aab0840b09abe953fe51207d0"},
            {"type": "text", "text": "OCR:{\"发票日期\": \"\"}"},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)
```
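As an aside, the slicing inside the `decode` call above (`outputs[0][inputs["input_ids"].shape[-1]:-1]`) keeps only the newly generated tokens: it drops the echoed prompt at the front and the trailing end-of-sequence token. A plain-list illustration with made-up token ids:

```python
# generate() returns the prompt tokens followed by the new tokens, ending in EOS.
prompt_ids = [11, 12, 13]              # stands in for inputs["input_ids"][0]
generated = prompt_ids + [21, 22, 2]   # echoed prompt, new tokens, EOS (2)

# Slice off the prompt (first len(prompt_ids) positions) and the final EOS.
new_tokens = generated[len(prompt_ids):-1]
print(new_tokens)  # → [21, 22]
```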


@megemini megemini requested a review from jzhang533 December 15, 2025 15:15
@jzhang533 (Collaborator) left a comment

Cool~

@jzhang533 jzhang533 merged commit e89c9e1 into PaddlePaddle:master Dec 16, 2025
1 check passed