[LLM-IE] Add pp-uie to Taskflow #9681
base: develop
Thanks for your contribution!
Codecov Report
❌ Your patch check has failed because the patch coverage (10.47%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #9681      +/-   ##
===========================================
- Coverage    52.19%   52.13%   -0.06%
===========================================
  Files          728      730       +2
  Lines       117770   116043    -1727
===========================================
- Hits         61470    60501     -969
+ Misses       56300    55542     -758
paddlenlp/taskflow/taskflow.py
Outdated
@@ -314,6 +314,17 @@
},
"information_extraction": {
"models": {
"llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
Is there a Llama model?
No, that entry was only for my own testing; it can be removed.
paddlenlp/taskflow/taskflow.py
Outdated
"llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
"qwen-1.5b": {
"task_class": QwenIETask,
"hidden_size": 768,
These hidden_size values look wrong. @wawltor Zeyang, could you take a look: is the hidden_size parameter actually used?
The hidden_size parameter is not needed.
paddlenlp/taskflow/taskflow.py
Outdated
@@ -314,6 +314,17 @@
},
"information_extraction": {
"models": {
"llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
"qwen-1.5b": {
Consider whether the name should change, for example to ie-qwen-1.5b or something else. @wawltor
@@ -1,252 +0,0 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Is this file no longer needed?
self._temperature = kwargs.get("temperature", 1.0)
self._decode_strategy = kwargs.get("decode_strategy", "sampling")
self._num_return_sequences = kwargs.get("num_return_sequences", 1)
self.prompt = """你是一个阅读理解专家,请提取所给句子与问题,提取实体。请注意,如果存在实体,则一定在原句中逐字出现,请输出对应实体的原文,不要进行额外修改;如果无法提取,请输出“无相应实体”。
Make this a module-level constant with an uppercase name, defined outside the class:
QWEN_IE_PROMPT = """xxx"""
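The suggestion above could look roughly like the following. This is a minimal sketch, not the PR's code: the class name `PromptedIETask`, the `build_input` helper, and the English placeholder prompt are all hypothetical and only illustrate moving the prompt to a module-level constant.

```python
# Hypothetical module-level constant, as the reviewer suggests. The prompt
# text is an English placeholder, not the prompt used in the PR.
QWEN_IE_PROMPT = (
    "You are a reading-comprehension expert. Extract the requested entity "
    "from the sentence.\n"
    "Sentence: {sentence}\n"
    "Question: {question}\n"
)


class PromptedIETask:
    """Illustrative stand-in for the task class (not the PaddleNLP class)."""

    def __init__(self, **kwargs):
        self._temperature = kwargs.get("temperature", 1.0)
        # Reference the shared constant instead of rebuilding it per instance.
        self.prompt = QWEN_IE_PROMPT

    def build_input(self, sentence, question):
        # Fill the template for one (sentence, question) pair.
        return self.prompt.format(sentence=sentence, question=question)
```

Keeping the prompt at module level means it is defined once, is easy to find when it needs editing, and can be imported by tests without constructing a task.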
llm/ie/README.md
Outdated
@@ -0,0 +1,381 @@
# 通用信息抽取 UIE(Universal Information Extraction)
Suggested change (retitle the README from "Universal Information Extraction" to "Large Language Model Information Extraction"):
- # 通用信息抽取 UIE(Universal Information Extraction)
+ # 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)
llm/ie/README.md
Outdated
| `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 |
| `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 |
| `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 |
| `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 |
| `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 |
| `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 |
| `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 |
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 |
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | -->
Replace these with the Qwen models.
llm/ie/README.md
Outdated
```

* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。
* `schema_lang`:设置 schema 的语言,默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同,因此需要指定 schema 的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。
Remove anything that is not supported yet.
llm/ie/README.md
Outdated
* `schema_lang`:设置 schema 的语言,默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同,因此需要指定 schema 的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。
* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。
* `model`:选择任务使用的模型,默认为`qwen-0.5b`,可选有`qwen-0.5b`, `qwen-1.5b`。
* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快,支持 GPU 和 NPU 硬件环境。如果选择`fp16`,在 GPU 硬件环境下,请先确保机器正确安装 NVIDIA 相关驱动和基础软件,**确保 CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保 GPU 设备的 CUDA 计算能力(CUDA Compute Capability)大于7.0,典型的设备包括 V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于 CUDA Compute Capability 和精度支持情况请参考 NVIDIA 文档:[GPU 硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。
Add bf16 support.
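To fix the lint failures, install pre-commit and reformat the code. Reference steps:
# Install pre-commit
pip install pre-commit
# Register pre-commit in the project folder; code is then formatted on every commit
pre-commit install
# Format previously committed files individually
pre-commit run --files XXXX.py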

## 2. 应用示例

Application examples are missing.
Why not use Taskflow directly here? At the very least, put Taskflow first.
Taskflow has been moved to the front.
schema = ['出发地', '目的地', '费用', '时间']
```

标注步骤如下:
Mind the difference between the data format and the training format.
Consider modifying the chat_template so it does not have to be set repeatedly.
如果在 GPU 环境中使用,可以指定 gpus 参数进行多卡训练:

```shell
python -u -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/sft_argument.json
Update the run_finetune.py path so that it points users to the llm directory.

通过运行以下命令进行模型评估:
```shell
python ./predictor.py \
Point users to the llm directory for inference.

## 3. 开箱即用

```paddlenlp.Taskflow```提供通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用,并满足各类信息抽取需求**
Add an example that calls the LLM directly, as discussed.
"""


class UIELLMTask(Task):
Check whether the chat_template is applied to the input.
"""
Construct the inference model for the predictor.
"""
model_instance = AutoModelForCausalLM.from_pretrained(self._task_path, dtype=self._infer_precision)
Change this to the corresponding model ID.
@@ -0,0 +1,370 @@
# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)
Application -> applications, to stay consistent with the earlier small models.

## 1. 模型简介

Yaojie Lu 等人在 ACL-2022中提出了通用信息抽取统一框架 UIE。该框架实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模,并使得不同任务间具备良好的迁移和泛化能力。然而,该模型在零样本场景下的表现仍存在不足。为此,PaddleNLP 借鉴 UIE 的方法,基于 Qwen2.5-0.5B 预训练模型,训练并开源了一款面向中文通用信息抽取的大模型。
Change "PaddleNLP 借鉴 UIE 的方法" to "PaddleNLP基于百度 UIE 的建模思路,通过大模型的能力来训练并开源了一款面向中文通用信息抽取的大模型。" (that is, say that PaddleNLP builds on Baidu's UIE modeling approach and uses the capabilities of large models to train and open-source a large model for Chinese universal information extraction).
We can leave out Qwen2.5-0.5B here for now. We are open-sourcing models of several sizes first, and can decide later how to highlight the Qwen models.
Done.
schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
ie = Taskflow('information_extraction',
schema= ['时间', '选手', '赛事名称'],
schema_lang="zh",
Is the schema_lang field needed here?
If we only target Chinese, it is not needed.
But the yayi dataset appears to include English, so English should probably be supported as well.

## 2. 应用示例

Why not use Taskflow directly here? At the very least, put Taskflow first.
"""

sentences = [
"蒋经国在日记中也称蒋介石病逝时“天发雷电,继之以倾盆大雨,正是所谓风云异色,天地同哀",
Do not include politically sensitive examples.
Removed.
if self.task == "uie-llm-0.5b" or self.task == "uie-llm-1.5b":
self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "float16"
else:
self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
fp32 -> float32; otherwise this will not run, will it?
This is kept for compatibility with the original UIE.
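One way to reconcile the two positions in this exchange is to keep accepting the short names at the API level and translate them only where a full dtype string is required. The sketch below is illustrative: `resolve_precision` and the alias table are hypothetical names, and it assumes the LLM loading path expects strings such as `float32` while the original UIE path keeps using `fp32`.

```python
# Hypothetical alias table: short precision names (as used by the original
# UIE path) mapped to full dtype strings. The mapping itself is an assumption.
_PRECISION_ALIASES = {"fp32": "float32", "fp16": "float16", "bf16": "bfloat16"}


def resolve_precision(kwargs, default="float32"):
    """Return a full dtype string for the user's `precision` argument."""
    requested = kwargs.get("precision", default)
    # Short names are translated; full names pass through unchanged.
    canonical = _PRECISION_ALIASES.get(requested, requested)
    if canonical not in _PRECISION_ALIASES.values():
        raise ValueError(f"Unsupported precision: {requested!r}")
    return canonical
```

With a helper like this, existing callers that pass `precision="fp32"` keep working, and the model-loading call always receives one spelling.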
@@ -0,0 +1,348 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
2022 -> 2024
Done.
@@ -0,0 +1,146 @@
# coding=utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
2022 -> 2024
Done.
paddlenlp/taskflow/task.py
Outdated
self._config.set_cpu_math_library_num_threads(self._num_threads)
self._config.switch_use_feed_fetch_ops(False)
self._config.disable_glog_info()
self._config.enable_memory_optim()

self._config.delete_pass("fused_rotary_position_embedding_pass")
Put this delete_pass call together with the one for embedding_eltwise_layernorm_fuse_pass.
It also runs when placed here; it looks like the IR has already been built at this point.
Done.
schema= ['时间', '选手', '赛事名称'],
schema_lang="zh",
batch_size=1,
model='uie-llm-0.5b')
pp-uie-0.5b
Change the model name here.
from paddlenlp.generation import GenerationConfig
from paddlenlp.trl import llm_utils

model_id = "paddlenlp/LLM-UIE-0114"
Change the model name here.
sft_argument.json 的参考配置如下:
```shell
{
"model_name_or_path": "paddlenlp/LLM-UIE-0114",
paddlenlp/PP-UIE-0.5B
paddlenlp/taskflow/task.py
Outdated
@@ -54,7 +55,10 @@ def __init__(self, model, task, priority_path=None, **kwargs):
self._param_updated = False

self._num_threads = self.kwargs["num_threads"] if "num_threads" in self.kwargs else math.ceil(cpu_count() / 2)
self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
if self.task == "uie-llm-0.5b" or self.task == "uie-llm-1.5b":
Mind the model name here.
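Because the same model names appear in this condition, in the model registry, and in the support lists, a rename has to touch every site. A minimal sketch of keeping them in one place follows; `PP_UIE_MODELS` and `default_precision` are hypothetical names, `pp-uie-1.5b` is an assumed rename that mirrors `pp-uie-0.5b` from the review, and the list contents are illustrative only.

```python
# Hypothetical single source of truth for the LLM-based model names.
# "pp-uie-0.5b" comes from the review; "pp-uie-1.5b" is an assumed rename.
PP_UIE_MODELS = ("pp-uie-0.5b", "pp-uie-1.5b")


def default_precision(model_name, kwargs):
    """Pick the inference precision following the defaults in the diff."""
    if "precision" in kwargs:
        return kwargs["precision"]
    # LLM-based models default to float16; the original UIE path keeps "fp32".
    return "float16" if model_name in PP_UIE_MODELS else "fp32"


# Support lists can reuse the same tuple, so one rename updates every site.
# The "uie-base" entry is illustrative, not the real list contents.
support_schema_list = ["uie-base", *PP_UIE_MODELS]
```

With this layout, the string comparison in `__init__` becomes a membership test, and a later rename changes only the tuple.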
paddlenlp/taskflow/taskflow.py
Outdated
@@ -314,16 +323,56 @@
},
"information_extraction": {
"models": {
"uie-base": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-base"},
"uie-llm-0.5b": {
Rename this model to pp-uie-0.5b, and rename the models below as well.
paddlenlp/taskflow/taskflow.py
Outdated
@@ -693,6 +745,10 @@
}

support_schema_list = [
"uie-llm-0.5b",
Same change here.
paddlenlp/taskflow/taskflow.py
Outdated
@@ -725,6 +781,8 @@
]

support_argument_list = [
"uie-llm-1.5b",
Same change here.
PR types
New features
PR changes
APIs
Description
add qwen2 to Taskflow