【Hackathon 9th ERNIE Tutorial No.1】A New Way to Fine-tune PaddleOCR-VL -- Prompt and Information Extraction #1196

Merged
jzhang533 merged 7 commits into PaddlePaddle:master from megemini:paddleocr_vl_prompt
Dec 16, 2025
Conversation

@megemini
Contributor

A New Way to Fine-tune PaddleOCR-VL -- Prompt and Information Extraction

Related: PaddlePaddle/Paddle#74776

https://aistudio.baidu.com/projectdetail/9857242

@paddle-bot

paddle-bot bot commented Dec 10, 2025

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@jzhang533 (Collaborator) left a comment

Nicely done, Shun!~
The fine-tuned model can be uploaded to Hugging Face.
A spot has already been reserved here: https://huggingface.co/ERNIE-Community


Here we take recognizing and extracting the information in an invoice as an example:

**Before fine-tuning**
Collaborator:

It would be even better if this comparison could be shown visually with images. Reading the raw JSON output is a bit tiring.


As mentioned earlier, we can treat PaddleOCR-VL as a VLM, so we can let a more capable VLM `teach` PaddleOCR-VL to recognize `购买方名称` (buyer name) and `销售方名称` (seller name).

The data can be generated with the `ernie-4.5-turbo-vl-preview` model; see the script `paddleocr_vl/tools/extract_ner/extract_ner.py`.
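The actual generation logic lives in `paddleocr_vl/tools/extract_ner/extract_ner.py`; as a rough, hypothetical sketch of the idea (the request shape, field handling, and helper name here are illustrative and not taken from that script), one can build an OpenAI-style chat payload asking the stronger VLM to return the target fields as strict JSON:

```python
import json

def build_extraction_request(image_url: str, fields: list[str]) -> dict:
    """Build an OpenAI-style chat payload asking a stronger VLM to
    extract the given invoice fields and answer with strict JSON.
    (Illustrative sketch; not the format used by extract_ner.py.)"""
    schema = {field: "" for field in fields}  # empty-valued JSON schema as a hint
    return {
        "model": "ernie-4.5-turbo-vl-preview",  # model name taken from the thread
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {
                        "type": "text",
                        "text": "Extract the following fields from the image; "
                                "output JSON only: "
                                + json.dumps(schema, ensure_ascii=False),
                    },
                ],
            }
        ],
    }

payload = build_extraction_request(
    "https://example.com/invoice.jpg",  # hypothetical URL
    ["购买方名称", "销售方名称"],
)
print(json.dumps(payload, ensure_ascii=False)[:120])
```

The VLM's JSON answers can then be paired with the source images to form the SFT dataset.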
Collaborator:

Here I think it is enough to focus on the dataset format. How to generate the synthetic data can go into an appendix at the end, so the article stays focused on the SFT model itself.

- The `text` of the `mask` part is not just `OCR:`; it also contains the fields to be extracted afterwards
- The `text` of the `no_mask` part is complete `JSON`-formatted information rather than plain text

> Note: some articles on fine-tuning PaddleOCR-VL mention `Completion-Only Training`, i.e. only caring about the `completion` (the `no_mask` part) while leaving the `prompt` (the `mask` part) unchanged. This article, however, needs `Full-Sequence Training`: the focus is on fine-tuning the `prompt`, and the `completion` must change its generation behavior according to the `prompt`.
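A minimal, hypothetical sketch of one training record in the `mask` / `no_mask` style described above (the exact key names in ERNIEKIT's data format may differ; the field values are made up):

```python
import json

# One record: a masked prompt segment and an unmasked completion segment.
# Key names ("tag", "text") and values are illustrative, not ERNIEKIT's exact schema.
record = [
    {   # prompt part: masked, excluded from the loss
        "tag": "mask",
        "text": "OCR:" + json.dumps(
            {"购买方名称": {}, "销售方名称": {}}, ensure_ascii=False),
    },
    {   # completion part: full JSON, the tokens that drive the loss
        "tag": "no_mask",
        "text": json.dumps(
            {"购买方名称": {"名称": "某某公司"}, "销售方名称": {"名称": "某某酒店"}},
            ensure_ascii=False),
    },
]
print(json.dumps(record, ensure_ascii=False))
```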
Collaborator:

This should still be completion-only training, because the prompt part above ("text": "OCR:{\"发票名称\": \"\"}") is also masked out and does not participate in the loss computation.

FYI : https://huggingface.co/docs/trl/en/sft_trainer#train-on-completion-only
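A minimal sketch of what completion-only masking means in practice, assuming the usual convention that labels set to -100 are ignored by the cross-entropy loss (token ids below are made up):

```python
# Prompt tokens get label -100, which PyTorch-style cross-entropy losses
# ignore, so only the completion tokens contribute to the loss.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len positions."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

ids = [101, 102, 103, 201, 202]        # 3 prompt tokens + 2 completion tokens
print(mask_prompt_labels(ids, 3))      # → [-100, -100, -100, 201, 202]
```

Note that masking the prompt out of the loss does not stop the model from conditioning on it: the prompt tokens still flow through the forward pass, so different prompts still produce different completions.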

@megemini (Contributor, Author) commented Dec 11, 2025:

I fine-tuned with ERNIEKIT, and run_ocr_vl_sft_16k.yaml has no `completion_only_loss` parameter ~

Does ERNIEKIT mask out the prompt automatically? If the prompt does not count towards the loss, how does the model tell different prompts apart?

I misunderstood earlier: ERNIE masks this part out by tagging it as mask ~ so it is indeed completion-only training ~

Note two points:

- `use_layout_detection=False`: skip the layout model and feed the image directly into `PaddleOCR-VL-0.9B`
- `prompt_label="OCR:{}"`: use our fine-tuned `prompt` here, expecting the model to output complete JSON-formatted information
Collaborator:

Actually, I think an example of how to use the VLM part of the model directly would be enough here, since the fine-tuned model basically has nothing to do with the first stage anymore.

Contributor (Author):

I also think calling the VLM part directly should work. The problem is that neither ERNIEKIT nor PaddleOCR documents how to do that 😂

Collaborator:

You can run inference directly with the transformers library, or with an inference engine such as FastDeploy, sglang, or vllm; any of them should work.

Contributor (Author):

PaddlePaddle/FastDeploy#5525

FD can load the original PaddlePaddle/PaddleOCR-VL model ~ but it cannot load the fine-tuned model ~

Contributor (Author):

Also, do Paddle models need a conversion step before they can be used with transformers or llama-type tooling? What tool is used for that nowadays?

Collaborator:

I see you saved in safetensors format, so it should run with transformers, vllm, and sglang without any conversion.
It should not be the case that FD cannot run the fine-tuned model. What exactly is the error?

Contributor (Author):

An issue has been filed for FD: PaddlePaddle/FastDeploy#5525

Running the fine-tuned model locally, I used the two approaches mentioned in huggingface/transformers#42178:

```python
# Approach 1: the high-level pipeline API
# from transformers import pipeline

# pipe = pipeline(
#     "image-text-to-text",
#     model="/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt",
#     dtype="bfloat16")
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "image", "url": "https://ai-studio-static-online.cdn.bcebos.com/dc31c334d4664ca4955aa47d8e202a53a276fd0aab0840b09abe953fe51207d0"},
#             {"type": "text", "text": "OCR:{}"},
#         ]
#     }
# ]
# result = pipe(text=messages)
# print(result)


# Approach 2: AutoProcessor + AutoModelForImageTextToText
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt")
model = AutoModelForImageTextToText.from_pretrained("/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt", dtype="bfloat16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://ai-studio-static-online.cdn.bcebos.com/dc31c334d4664ca4955aa47d8e202a53a276fd0aab0840b09abe953fe51207d0"},
            {"type": "text", "text": "OCR:{}"},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)
```

Error:

```
(venv310)  ✘ shun@shun-B660M-Pro-RS  ~/workspace/Projects/erniekit_paddleocr_vl_ner   master ±✚  python paddleocr_vl_transformers.py
The repository /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt contains custom code which must be executed to correctly load the model. You can inspect the repository content at /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt .
 You can inspect the repository content at https://hf.co//media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
Traceback (most recent call last):
  File "/home/shun/workspace/Projects/erniekit_paddleocr_vl_ner/paddleocr_vl_transformers.py", line 22, in <module>
    processor = AutoProcessor.from_pretrained("/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt")
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/models/auto/processing_auto.py", line 382, in from_pretrained
    processor_class = get_class_from_dynamic_module(
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/dynamic_module_utils.py", line 572, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/dynamic_module_utils.py", line 390, in get_cached_module_file
    resolved_module_file = cached_file(
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/utils/hub.py", line 276, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/home/shun/workspace/Projects/github/transformers/src/transformers/utils/hub.py", line 377, in cached_files
    raise OSError(
OSError: /media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt does not appear to have a file named processing_ppocrvl.py. Checkout 'https://huggingface.co//media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR_VL_Prompt/tree/main' for available files.
```

A file seems to be missing ~ I can see this file in the hf repo, but the directory saved after fine-tuning does not have it ~

Collaborator:

@megemini These files are indeed missing after fine-tuning at the moment. For now you can pull the hf repo first and replace its safetensors with your fine-tuned safetensors; then it will run.

Contributor (Author):

Got it, thanks~ :)

```
{'res': {'input_path': '/home/aistudio/paddleocr_vl/data/test.jpg', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': False, 'use_chart_recognition': False, 'format_block_content': False}, 'parsing_res_list': [{'block_label': 'OCR:{"购买方名称": {}, "销售方名称": {}}', 'block_content': '{"购买方名称": {"名称": "中青旅联科(杭州)公关顾问有限公司", "统一社会信用代码": "91330105MA2H2DUJ92"}, "销售方名称": {"名称": "杭州万力酒店管理有限公司", "统一社会信用代码": "91330106MA2B1C4UXN"}}', 'block_bbox': [0, 0, 1260, 838]}]}}
```

As you can see, the model can basically follow our instructions and extract the corresponding information.
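Since `block_content` carries the extracted fields as a JSON string, a small helper can parse them out of the `predict` result; the sample data and the helper name below are a sketch, abridged from the output above:

```python
import json

# Abridged sample of the result dict shown above (only the nested keys we need).
res = {'res': {'parsing_res_list': [{
    'block_label': 'OCR:{"购买方名称": {}, "销售方名称": {}}',
    'block_content': '{"购买方名称": {"名称": "中青旅联科(杭州)公关顾问有限公司"}, '
                     '"销售方名称": {"名称": "杭州万力酒店管理有限公司"}}',
}]}}

def extract_fields(res: dict) -> dict:
    """Parse the JSON string in the first block's block_content."""
    block = res["res"]["parsing_res_list"][0]
    return json.loads(block["block_content"])

fields = extract_fields(res)
print(fields["购买方名称"]["名称"])  # → 中青旅联科(杭州)公关顾问有限公司
```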
Collaborator:

Do you still have the energy to add some simple metric comparisons? For example, the accuracy improvement on a validation set before and after training.

Contributor (Author):

That does not really seem comparable ~ the goal of the fine-tuning here is to control the output format.

Before fine-tuning, the output is a markdown table:

[image: markdown-table output before fine-tuning]

After fine-tuning, the output is JSON:

[image: JSON output after fine-tuning]

@megemini (Contributor, Author)

Update 20251211

Updated the data section and removed the explanation of the completion-only training part ~

I will download the model first and see whether it can be uploaded to aistudio ~ uploading to hf takes some tricks and is too much hassle; we will see who can help upload it later 😂

@megemini (Contributor, Author) commented Dec 12, 2025

@jzhang533 @zhang-prog

Thanks, transformers inference works now 👍️👍️👍️

[image: transformers inference result]

The only thing is that I have just 6 GB of VRAM here, so I used quantization (though it is still quite slow 😅):

```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch

path = "/media/shun/bigdata/Models/PaddleOCR_VL_SFT/PaddleOCR-VL"
processor = AutoProcessor.from_pretrained(path, local_files_only=True, use_fast=True)

# 4-bit quantization config to drastically reduce VRAM usage
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForImageTextToText.from_pretrained(
    path,
    quantization_config=quantization_config,
    # device_map="auto",
    local_files_only=True
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://ai-studio-static-online.cdn.bcebos.com/dc31c334d4664ca4955aa47d8e202a53a276fd0aab0840b09abe953fe51207d0"},
            {"type": "text", "text": "OCR:{\"发票日期\": \"\"}"},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)
```
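As an aside, the slicing inside the `decode` call above (`outputs[0][inputs["input_ids"].shape[-1]:-1]`) keeps only the newly generated tokens: it drops the echoed prompt at the front and the trailing end-of-sequence token. A plain-list illustration with made-up token ids:

```python
# generate() returns the prompt tokens followed by the new tokens, ending in EOS.
prompt_ids = [11, 12, 13]              # stands in for inputs["input_ids"][0]
generated = prompt_ids + [21, 22, 2]   # echoed prompt, new tokens, EOS (2)

# Slice off the prompt (first len(prompt_ids) positions) and the final EOS.
new_tokens = generated[len(prompt_ids):-1]
print(new_tokens)  # → [21, 22]
```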


@megemini megemini requested a review from jzhang533 December 15, 2025 15:15
@jzhang533 (Collaborator) left a comment

Cool~

@jzhang533 jzhang533 merged commit e89c9e1 into PaddlePaddle:master Dec 16, 2025
1 check passed