[LLM-IE] Add pp-uie to Taskflow #9681

Fantasy-02 · 2024-12-24T03:51:34Z

PR types

New features

PR changes

APIs

Description

add qwen2 to Taskflow

paddle-bot · 2024-12-24T03:51:38Z

Thanks for your contribution!

codecov · 2024-12-24T04:26:03Z

Codecov Report

Attention: Patch coverage is 10.47120% with 171 lines in your changes missing coverage. Please review.

Project coverage is 52.13%. Comparing base (67bc4e2) to head (1c0f41b).
Report is 38 commits behind head on develop.

Files with missing lines	Patch %	Lines
paddlenlp/taskflow/information_extraction.py	9.09%	160 Missing ⚠️
paddlenlp/taskflow/task.py	21.42%	11 Missing ⚠️

❌ Your patch check has failed because the patch coverage (10.47%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (52.13%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9681      +/-   ##
===========================================
- Coverage    52.19%   52.13%   -0.06%     
===========================================
  Files          728      730       +2     
  Lines       117770   116043    -1727     
===========================================
- Hits         61470    60501     -969     
+ Misses       56300    55542     -758

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ZHUI · 2024-12-30T06:46:33Z

paddlenlp/taskflow/taskflow.py

@@ -314,6 +314,17 @@
    },
    "information_extraction": {
        "models": {
+            "llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},


有Llama吗？

没有，这个是我当时测试的，可以删了

ZHUI · 2024-12-30T06:47:37Z

paddlenlp/taskflow/taskflow.py

+            "llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
+            "qwen-1.5b": {
+                "task_class": QwenIETask,
+                "hidden_size": 768,


这些 hidden_size 不对吧 @wawltor zeyang看一下，这个 hidden_size 参数有用不？

这个hidden_size参数不需要用到

ZHUI · 2024-12-30T06:48:39Z

paddlenlp/taskflow/taskflow.py

@@ -314,6 +314,17 @@
    },
    "information_extraction": {
        "models": {
+            "llama": {"task_class": QwenIETask, "hidden_size": 768, "task_flag": "information_extraction-llama"},
+            "qwen-1.5b": {


看看名字要不要换，ie-qwen-1.5b 或者其他 @wawltor

ZHUI · 2024-12-30T06:49:50Z

paddlenlp/taskflow/text2text_generation.py

@@ -1,252 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.


这个文件是不需要了吗?

ZHUI · 2024-12-30T06:54:02Z

paddlenlp/taskflow/information_extraction.py

+        self._temperature = kwargs.get("temperature", 1.0)
+        self._decode_strategy = kwargs.get("decode_strategy", "sampling")
+        self._num_return_sequences = kwargs.get("num_return_sequences", 1)
+        self.prompt = """你是一个阅读理解专家，请提取所给句子与问题，提取实体。请注意，如果存在实体，则一定在原句中逐字出现，请输出对应实体的原文，不要进行额外修改；如果无法提取，请输出“无相应实体”。


写成全局变量，大写。放在类定义的外面。

QWEN_IE_PROMPT = """"xxx"""

paddlenlp/taskflow/information_extraction.py

ZHUI · 2024-12-30T07:21:59Z

llm/ie/README.md

@@ -0,0 +1,381 @@
+# 通用信息抽取 UIE(Universal Information Extraction)


Suggested change

# 通用信息抽取 UIE(Universal Information Extraction)

# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)

ZHUI · 2024-12-30T07:23:31Z

llm/ie/README.md

+  | `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 |
+  | `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 |
+  | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 |
+  | `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 |
+  | `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 |
+  | `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 |
+  | `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 |
+  | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 |
+  | `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | -->


ZHUI · 2024-12-30T07:24:01Z

llm/ie/README.md

+```
+
+* `schema`：定义任务抽取目标，可参考开箱即用中不同任务的调用示例进行配置。
+* `schema_lang`：设置 schema 的语言，默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同，因此需要指定 schema 的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。


还没吃吃的先删除吧、

ZHUI · 2024-12-30T07:24:10Z

llm/ie/README.md

+* `schema_lang`：设置 schema 的语言，默认为`zh`, 可选有`zh`和`en`。因为中英 schema 的构造有所不同，因此需要指定 schema 的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。
+* `batch_size`：批处理大小，请结合机器情况进行调整，默认为1。
+* `model`：选择任务使用的模型，默认为`qwen-0.5b`，可选有`qwen-0.5b`, `qwen-1.5b`。
+* `precision`：选择模型精度，默认为`fp32`，可选有`fp16`和`fp32`。`fp16`推理速度更快，支持 GPU 和 NPU 硬件环境。如果选择`fp16`，在 GPU 硬件环境下，请先确保机器正确安装 NVIDIA 相关驱动和基础软件，**确保 CUDA>=11.2，cuDNN>=8.1.1**，初次使用需按照提示安装相关依赖。其次，需要确保 GPU 设备的 CUDA 计算能力（CUDA Compute Capability）大于7.0，典型的设备包括 V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于 CUDA Compute Capability 和精度支持情况请参考 NVIDIA 文档：[GPU 硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。


bf16 支持

DrownFish19 · 2025-01-06T07:11:35Z

Lint 问题需要安装pre-commit 后格式化代码，参考步骤如下：

# 安装
pip install pre-commit

# 在项目文件夹下注册pre-commit，每次commit提交时都会格式化代码
pre-commit install

# 单独处理之前的代码文件
pre-commit run --file XXXX.py

DrownFish19 · 2025-01-16T03:14:11Z

llm/Application/information_extraction/README.md

+
+
+## 2. 应用示例
+


缺少应用case

这里为啥不能直接用Taskflow，至少Taskflow放在前面

已经把taskflow移到前面

DrownFish19 · 2025-01-16T03:20:40Z

llm/Application/information_extraction/README.md

+schema = ['出发地', '目的地', '费用', '时间']
+```
+
+标注步骤如下：


注意数据形式和训练形式的差异

考虑修改chat_template，避免重复设置

DrownFish19 · 2025-01-16T03:33:18Z

llm/Application/information_extraction/README.md

+如果在 GPU 环境中使用，可以指定 gpus 参数进行多卡训练：
+
+```shell
+python -u  -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/sft_argument.json


修改run_finetune.py 路径，引导至llm目录

DrownFish19 · 2025-01-16T03:33:38Z

llm/Application/information_extraction/README.md

+
+通过运行以下命令进行模型评估：
+```shell
+python ./predictor.py \


引导至llm目录进行推理

DrownFish19 · 2025-01-16T03:34:20Z

llm/Application/information_extraction/README.md

+
+## 3. 开箱即用
+
+```paddlenlp.Taskflow```提供通用信息抽取、评价观点抽取等能力，可抽取多种类型的信息，包括但不限于命名实体识别（如人名、地名、机构名等）、关系（如电影的导演、歌曲的发行时间等）、事件（如某路口发生车祸、某地发生地震等）、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标，无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用，并满足各类信息抽取需求**


增加llm直接调用case，如讨论

DrownFish19 · 2025-01-16T03:39:05Z

paddlenlp/taskflow/information_extraction.py

+"""
+
+
+class UIELLMTask(Task):


注意输入是否应用chat_template

DrownFish19 · 2025-01-16T03:42:06Z

paddlenlp/taskflow/information_extraction.py

+        """
+        Construct the inference model for the predictor.
+        """
+        model_instance = AutoModelForCausalLM.from_pretrained(self._task_path, dtype=self._infer_precision)


修改为对应模型ID

wawltor · 2025-01-16T06:24:52Z

llm/Application/information_extraction/README.md

@@ -0,0 +1,370 @@
+# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)


Application -> applications，保持和之前的小模型一致

wawltor · 2025-01-16T06:27:23Z

llm/Application/information_extraction/README.md

+
+## 1. 模型简介
+
+Yaojie Lu 等人在 ACL-2022中提出了通用信息抽取统一框架 UIE。该框架实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模，并使得不同任务间具备良好的迁移和泛化能力。然而，该模型在零样本场景下的表现仍存在不足。为此，PaddleNLP 借鉴 UIE 的方法，基于 Qwen2.5-0.5B 预训练模型，训练并开源了一款面向中文通用信息抽取的大模型。


PaddleNLP 借鉴 UIE 的方法 -> PaddleNLP基于百度 UIE 的建模思路，通过大模型的能力来训练并开源了一款面向中文通用信息抽取的大模型。

这里可以先暂且不提 Qwen2.5-0.5B，首先我们开源不同尺寸的模型，后续要看看怎么突出Qwen模型

wawltor · 2025-01-16T06:29:16Z

llm/Application/information_extraction/README.md

+    schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
+    ie = Taskflow('information_extraction',
+                  schema= ['时间', '选手', '赛事名称'],
+                  schema_lang="zh",


这里需要schema_lang字段吗？

如果只是面向中文，就不需要

但好像yayi的数据集是有英文的，所以英文的应该也可以兼容下

wawltor · 2025-01-16T08:17:56Z

llm/Application/information_extraction/README.md

+
+
+## 2. 应用示例
+


这里为啥不能直接用Taskflow，至少Taskflow放在前面

wawltor · 2025-01-16T08:22:11Z

llm/Application/information_extraction/README.md

+ """
+
+sentences = [
+    "蒋经国在日记中也称蒋介石病逝时“天发雷电，继之以倾盆大雨，正是所谓风云异色，天地同哀",


政治性的case不用发加

wawltor · 2025-01-16T08:33:13Z

paddlenlp/taskflow/task.py

+        if self.task == "uie-llm-0.5b" or self.task == "uie-llm-1.5b":
+            self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "float16"
+        else:
+            self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"


fp32 -> float32，不然跑不通吧

这里是为了兼容原来的uie

wawltor · 2025-01-16T08:39:25Z

llm/Application/information_extraction/utils.py

@@ -0,0 +1,348 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.


2022 -> 2024

wawltor · 2025-01-16T08:39:36Z

llm/Application/information_extraction/doccano.py

@@ -0,0 +1,146 @@
+# coding=utf-8
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.


2022 -> 2024

wawltor · 2025-01-16T08:40:50Z

paddlenlp/taskflow/task.py

        self._config.set_cpu_math_library_num_threads(self._num_threads)
        self._config.switch_use_feed_fetch_ops(False)
        self._config.disable_glog_info()
        self._config.enable_memory_optim()
-
+        self._config.delete_pass("fused_rotary_position_embedding_pass")


delete_pass的代码放到 embedding_eltwise_layernorm_fuse_pass 一起

放在这里也能执行，看起来这来IR已经构造好了

paddlenlp/taskflow/information_extraction.py

DrownFish19 · 2025-02-06T07:42:08Z

llm/application/information_extraction/README.md

+                  schema= ['时间', '选手', '赛事名称'],
+                  schema_lang="zh",
+                  batch_size=1,
+                  model='uie-llm-0.5b')


pp-uie-0.5b

修改此处的模型名称

DrownFish19 · 2025-02-06T07:42:27Z

llm/application/information_extraction/README.md

+from paddlenlp.generation import GenerationConfig
+from paddlenlp.trl import llm_utils
+
+model_id = "paddlenlp/LLM-UIE-0114"


修改此处的模型名称

DrownFish19 · 2025-02-06T07:42:46Z

llm/application/information_extraction/README.md

+sft_argument.json 的参考配置如下：
+```shell
+{
+    "model_name_or_path": "paddlenlp/LLM-UIE-0114",


paddlenlp/PP-UIE-0.5B

DrownFish19 · 2025-02-06T07:45:02Z

paddlenlp/taskflow/task.py

@@ -54,7 +55,10 @@ def __init__(self, model, task, priority_path=None, **kwargs):
        self._param_updated = False

        self._num_threads = self.kwargs["num_threads"] if "num_threads" in self.kwargs else math.ceil(cpu_count() / 2)
-        self._infer_precision = self.kwargs["precision"] if "precision" in self.kwargs else "fp32"
+        if self.task == "uie-llm-0.5b" or self.task == "uie-llm-1.5b":


注意此处的模型名称

DrownFish19 · 2025-02-06T07:46:28Z

paddlenlp/taskflow/taskflow.py

@@ -314,16 +323,56 @@
    },
    "information_extraction": {
        "models": {
-            "uie-base": {"task_class": UIETask, "hidden_size": 768, "task_flag": "information_extraction-uie-base"},
+            "uie-llm-0.5b": {


修改此处的模型名称 pp-uie-0.5b, 以下模型也修改

DrownFish19 · 2025-02-06T07:47:13Z

paddlenlp/taskflow/taskflow.py

@@ -693,6 +745,10 @@
 }

 support_schema_list = [
+    "uie-llm-0.5b",


DrownFish19 · 2025-02-06T07:47:21Z

paddlenlp/taskflow/taskflow.py

@@ -725,6 +781,8 @@
 ]

 support_argument_list = [
+    "uie-llm-1.5b",


Fantasy-02 added 7 commits December 16, 2024 12:42

fix taskflow

ceeaf60

add llm/ie for SFT of ie task

31ae5f7

remove data folder

a99f41f

update readme

2eeb536

add taskflow

97f578e

remove taskflow

9abda7d

add sft for ie task

1458fe8

paddle-bot bot added the contributor label Dec 24, 2024

paddle-bot bot assigned wawltor Dec 24, 2024

ZHUI changed the title ~~add qwen2 to Taskflow~~ [LLM-IE] Add qwen2 to Taskflow Dec 24, 2024

add multi_stage_predict of taskflow

b64427c

ZHUI reviewed Dec 30, 2024

View reviewed changes

Fantasy-02 added 3 commits December 30, 2024 11:21

fix some bug and rename qwen2 as uie-llm

2f056bf

fixed test and lint

e5a1968

fixed test and lint

71d0060

Fantasy-02 added 10 commits January 6, 2025 07:25

fixed test and lint

5668e2c

Merge branch 'develop' of github.com:Fantasy-02/PaddleNLP into taskflow

712cf0a

update 0109

36c3228

update 0109

c118a87

update 0113

9d0bb53

updaate 0113

f6d6a8a

remove static mode in taskflow

3caabc3

update 0115

9d80118

update 0115

244b71d

update 0115

3e871a0

DrownFish19 reviewed Jan 16, 2025

View reviewed changes

update 0116

20faa12

Fantasy-02 added 3 commits January 16, 2025 06:32

update 0116

9b282ef

update readme.md

840b153

update readme.md

837a792

wawltor reviewed Jan 16, 2025

View reviewed changes

Fantasy-02 added 2 commits January 16, 2025 10:19

update 0116

4b92882

update taskflow.py and readme.md

992d20a

DrownFish19 changed the title ~~[LLM-IE] Add qwen2 to Taskflow~~ [LLM-IE] Add pp-uie to Taskflow Feb 6, 2025

DrownFish19 reviewed Feb 6, 2025

View reviewed changes

rename model name

1c0f41b

PaddlePaddle locked and limited conversation to collaborators Feb 7, 2025

PaddlePaddle unlocked this conversation Feb 7, 2025

Fantasy-02 closed this Feb 7, 2025

Fantasy-02 reopened this Feb 7, 2025

		@@ -1,252 +0,0 @@
		# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.

		@@ -0,0 +1,381 @@
		# 通用信息抽取 UIE(Universal Information Extraction)

	# 通用信息抽取 UIE(Universal Information Extraction)
	# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)


		## 3. 开箱即用

		```paddlenlp.Taskflow```提供通用信息抽取、评价观点抽取等能力，可抽取多种类型的信息，包括但不限于命名实体识别（如人名、地名、机构名等）、关系（如电影的导演、歌曲的发行时间等）、事件（如某路口发生车祸、某地发生地震等）、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标，无需训练即可统一抽取输入文本中的对应信息。实现开箱即用，并满足各类信息抽取需求

		@@ -0,0 +1,370 @@
		# 大模型信息抽取 LLM-IE(Large Language Model Information Extraction)


		## 1. 模型简介

		Yaojie Lu 等人在 ACL-2022中提出了通用信息抽取统一框架 UIE。该框架实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模，并使得不同任务间具备良好的迁移和泛化能力。然而，该模型在零样本场景下的表现仍存在不足。为此，PaddleNLP 借鉴 UIE 的方法，基于 Qwen2.5-0.5B 预训练模型，训练并开源了一款面向中文通用信息抽取的大模型。

		@@ -0,0 +1,348 @@
		# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.

		@@ -0,0 +1,146 @@
		# coding=utf-8
		# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.

[LLM-IE] Add pp-uie to Taskflow #9681

Are you sure you want to change the base?

[LLM-IE] Add pp-uie to Taskflow #9681

Conversation

Fantasy-02 commented Dec 24, 2024

PR types

PR changes

Description

paddle-bot bot commented Dec 24, 2024

codecov bot commented Dec 24, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DrownFish19 commented Jan 6, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DrownFish19 Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Dec 24, 2024 •

edited

Loading

DrownFish19 Feb 6, 2025 •

edited

Loading