Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LLM-IE] Add pp-uie to Taskflow #9681

Open
wants to merge 28 commits into
base: develop
Choose a base branch
from

Conversation

Fantasy-02
Copy link

PR types

New features

PR changes

APIs

Description

add qwen2 to Taskflow

Copy link

paddle-bot bot commented Dec 24, 2024

Thanks for your contribution!

Copy link

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 10.47120% with 171 lines in your changes missing coverage. Please review.

Project coverage is 52.13%. Comparing base (67bc4e2) to head (1c0f41b).
Report is 38 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/taskflow/information_extraction.py 9.09% 160 Missing ⚠️
paddlenlp/taskflow/task.py 21.42% 11 Missing ⚠️

❌ Your patch check has failed because the patch coverage (10.47%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (52.13%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9681      +/-   ##
===========================================
- Coverage    52.19%   52.13%   -0.06%     
===========================================
  Files          728      730       +2     
  Lines       117770   116043    -1727     
===========================================
- Hits         61470    60501     -969     
+ Misses       56300    55542     -758     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ZHUI ZHUI changed the title add qwen2 to Taskflow [LLM-IE] Add qwen2 to Taskflow Dec 24, 2024
@@ -314,6 +314,17 @@
},
"information_extraction": {
"models": {
"llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有Llama吗?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有,这个是我当时测试的,可以删了

"llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
"qwen-1.5b": {
"task_class": QwenIETask,
"hidden_size": 768,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些 hidden_size 不对吧 @wawltor zeyang看一下,这个 hidden_size 参数有用不?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个hidden_size参数不需要用到

@@ -314,6 +314,17 @@
},
"information_extraction": {
"models": {
"llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
"qwen-1.5b": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

看看名字要不要换,ie-qwen-1.5b 或者其他 @wawltor

@@ -1,252 +0,0 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个文件是不需要了吗?

self._temperature = kwargs.get("temperature", 1.0)
self._decode_strategy = kwargs.get("decode_strategy", "sampling")
self._num_return_sequences = kwargs.get("num_return_sequences", 1)
self.prompt = """你是一个阅读理解专家,请提取所给句子与问题,提取实体。请注意,如果存在实体,则一定在原句中逐字出现,请输出对应实体的原文,不要进行额外修改;如果无法提取,请输出“无相应实体”。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

写成全局变量,大写。放在类定义的外面。

QWEN_IE_PROMPT = """"xxx"""

paddlenlp/taskflow/information_extraction.py Show resolved Hide resolved
llm/ie/README.md Outdated
@@ -0,0 +1,381 @@
# 通用信息抽取 UIE(Universal Information Extraction)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# 通用信息抽取 UIE(Universal Information Extraction)
# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)

llm/ie/README.md Outdated
Comment on lines 156 to 164
| `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 |
| `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 |
| `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 |
| `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 |
| `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 |
| `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 |
| `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 |
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 |
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | -->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

换成qwen

llm/ie/README.md Outdated
```

* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。
* `schema_lang`:设置 schema 的语言,默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同,因此需要指定 schema 的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

还没吃吃的先删除吧、

llm/ie/README.md Outdated
* `schema_lang`:设置 schema 的语言,默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同,因此需要指定 schema 的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。
* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。
* `model`:选择任务使用的模型,默认为`qwen-0.5b`,可选有`qwen-0.5b`, `qwen-1.5b`。
* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快,支持 GPU 和 NPU 硬件环境。如果选择`fp16`,在 GPU 硬件环境下,请先确保机器正确安装 NVIDIA 相关驱动和基础软件,**确保 CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保 GPU 设备的 CUDA 计算能力(CUDA Compute Capability)大于7.0,典型的设备包括 V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于 CUDA Compute Capability 和精度支持情况请参考 NVIDIA 文档:[GPU 硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bf16 支持

@DrownFish19
Copy link
Collaborator

Lint 问题需要安装pre-commit 后格式化代码,参考步骤如下:

# 安装
pip install pre-commit

# 在项目文件夹下注册pre-commit,每次commit提交时都会格式化代码
pre-commit install

# 单独处理之前的代码文件
pre-commit run --file XXXX.py



## 2. 应用示例

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

缺少应用case

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为啥不能直接用Taskflow,至少Taskflow放在前面

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已经把taskflow移到前面

schema = ['出发地', '目的地', '费用', '时间']
```

标注步骤如下:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

注意数据形式和训练形式的差异

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

考虑修改chat_template,避免重复设置

如果在 GPU 环境中使用,可以指定 gpus 参数进行多卡训练:

```shell
python -u -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/sft_argument.json
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修改run_finetune.py 路径,引导至llm目录


通过运行以下命令进行模型评估:
```shell
python ./predictor.py \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

引导至llm目录进行推理


## 3. 开箱即用

```paddlenlp.Taskflow```提供通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用,并满足各类信息抽取需求**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

增加llm直接调用case,如讨论

"""


class UIELLMTask(Task):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

注意输入是否应用chat_template

"""
Construct the inference model for the predictor.
"""
model_instance = AutoModelForCausalLM.from_pretrained(self._task_path, dtype=self._infer_precision)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修改为对应模型ID

@@ -0,0 +1,370 @@
# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Application -> applications,保持和之前的小模型一致


## 1. 模型简介

Yaojie Lu 等人在 ACL-2022中提出了通用信息抽取统一框架 UIE。该框架实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模,并使得不同任务间具备良好的迁移和泛化能力。然而,该模型在零样本场景下的表现仍存在不足。为此,PaddleNLP 借鉴 UIE 的方法,基于 Qwen2.5-0.5B 预训练模型,训练并开源了一款面向中文通用信息抽取的大模型。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PaddleNLP 借鉴 UIE 的方法 -> PaddleNLP基于百度 UIE 的建模思路,通过大模型的能力来训练并开源了一款面向中文通用信息抽取的大模型。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里可以先暂且不提 Qwen2.5-0.5B,首先我们开源不同尺寸的模型,后续要看看怎么突出Qwen模型

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
ie = Taskflow('information_extraction',
schema= ['时间', '选手', '赛事名称'],
schema_lang="zh",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里需要schema_lang字段吗?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果只是面向中文,就不需要

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

但好像yayi的数据集是有英文的,所以英文的应该也可以兼容下



## 2. 应用示例

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为啥不能直接用Taskflow,至少Taskflow放在前面

"""

sentences = [
"蒋经国在日记中也称蒋介石病逝时“天发雷电,继之以倾盆大雨,正是所谓风云异色,天地同哀",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

政治性的case不用发加

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已删除

if self.task == "uie-llm-0.5b" or self.task == "uie-llm-1.5b":
self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "float16"
else:
self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fp32 -> float32,不然跑不通吧

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是为了兼容原来的uie

@@ -0,0 +1,348 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2022 -> 2024

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

@@ -0,0 +1,146 @@
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2022 -> 2024

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

self._config.set_cpu_math_library_num_threads(self._num_threads)
self._config.switch_use_feed_fetch_ops(False)
self._config.disable_glog_info()
self._config.enable_memory_optim()

self._config.delete_pass("fused_rotary_position_embedding_pass")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete_pass的代码放到 embedding_eltwise_layernorm_fuse_pass 一起

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

放在这里也能执行,看起来这来IR已经构造好了

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改

paddlenlp/taskflow/information_extraction.py Show resolved Hide resolved
@DrownFish19 DrownFish19 changed the title [LLM-IE] Add qwen2 to Taskflow [LLM-IE] Add pp-uie to Taskflow Feb 6, 2025
schema= ['时间', '选手', '赛事名称'],
schema_lang="zh",
batch_size=1,
model='uie-llm-0.5b')
Copy link
Collaborator

@DrownFish19 DrownFish19 Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pp-uie-0.5b

修改此处的模型名称

from paddlenlp.generation import GenerationConfig
from paddlenlp.trl import llm_utils

model_id = "paddlenlp/LLM-UIE-0114"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修改此处的模型名称

sft_argument.json 的参考配置如下:
```shell
{
"model_name_or_path": "paddlenlp/LLM-UIE-0114",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paddlenlp/PP-UIE-0.5B

@@ -54,7 +55,10 @@ def __init__(self, model, task, priority_path=None, **kwargs):
self._param_updated = False

self._num_threads = self.kwargs["num_threads"] if "num_threads" in self.kwargs else math.ceil(cpu_count() / 2)
self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
if self.task == "uie-llm-0.5b" or self.task == "uie-llm-1.5b":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

注意此处的模型名称

@@ -314,16 +323,56 @@
},
"information_extraction": {
"models": {
"uie-base": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-base"},
"uie-llm-0.5b": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修改此处的模型名称 pp-uie-0.5b, 以下模型也修改

@@ -693,6 +745,10 @@
}

support_schema_list = [
"uie-llm-0.5b",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同修改

@@ -725,6 +781,8 @@
]

support_argument_list = [
"uie-llm-1.5b",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同修改

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Feb 7, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Feb 7, 2025
@Fantasy-02 Fantasy-02 closed this Feb 7, 2025
@Fantasy-02 Fantasy-02 reopened this Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants