Merge remote-tracking branch 'lemon/prompt_doc' into prompt_doc
LemonNoel committed Oct 10, 2022
2 parents 837a068 + 7d34a2f commit 9adebed
Showing 86 changed files with 12,797 additions and 473 deletions.
15 changes: 1 addition & 14 deletions README_cn.md
@@ -37,19 +37,6 @@
* 🍭 AIGC content generation: added the SOTA code generation model [**CodeGen**](./examples/code_generation/codegen), supporting code generation in multiple programming languages; integrated [**trending text-to-image models**](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md#%E6%96%87%E5%9B%BE%E7%94%9F%E6%88%90) DALL·E Mini, Disco Diffusion, and Stable Diffusion, with more fun models to come; added a [**Chinese text summarization application**](./applications/text_summarization), the first release of a Chinese summarization model trained on large-scale corpora, supporting one-click Taskflow inference and customized training;
* 💪 Framework upgrades: released the [**automatic model compression API**](./docs/compression.md), which automatically prunes and quantizes models and greatly lowers the barrier to applying model compression; released [**few-shot prompt**](./applications/text_classification/multi_class/few-shot) capabilities, integrating classic algorithms such as PET, P-Tuning, and RGL.


* 👀 **2022.9.6 PaddlePaddle Smart Finance Industry Live Course Series**

* Centered on the industrial practice and development trends of deep learning in the financial industry, with invited experts sharing real-world applications and discussing the future of fintech;

* The companion PaddleNLP course releases industrial practice examples: UIE-based information extraction from financial documents and a Pipelines-based FAQ question answering system;

* **Live every Tuesday and Thursday at 19:00 starting September 6**; scan the QR code to join the WeChat group for free, get the livestream link, and have in-depth exchanges with industry experts:

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/188596360-264415d4-5462-43ad-8517-5b7e690061ce.jpg" width="150" height="150" />
</div>

* 🔥 **2022.5.16 Released [PaddleNLP v2.3](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.3.0)**
* 💎 Released the universal information extraction technology [**UIE**](./model_zoo/uie): a single model handles entity recognition, relation and event extraction, sentiment analysis, and other open-domain information extraction tasks, with no restriction on domain or extraction target, and supports **zero-shot extraction** as well as efficient end-to-end **few-shot** customization;
* 😊 Released lightweight [**ERNIE 3.0**](./model_zoo/ernie-3.0) models, achieving the best results on [CLUE](https://www.cluebenchmarks.com/) among models of comparable size, together with **🗜️ lossless compression** and **⚙️ all-scenario deployment** solutions;
@@ -58,7 +45,7 @@

## Community

- Scan the QR code on WeChat and fill in the questionnaire to join the discussion group and claim the benefits
- Scan the QR code on WeChat, fill in the questionnaire, and reply to the assistant with the keyword (NLP) to join the discussion group and claim the benefits
- Have in-depth exchanges with many community developers and the official team.
- A 10 GB NLP learning gift pack!

2 changes: 1 addition & 1 deletion README_en.md
@@ -81,7 +81,7 @@ For more usage please refer to [Taskflow Docs](./docs/model_zoo/taskflow.md).

#### 🀄 Comprehensive Chinese Transformer Models

We provide **45+** network architectures and over **500+** pretrained models. Not only includes all the SOTA model like ERNIE, PLATO and SKEP released by Baidu, but also integrates most of the high-quality Chinese pretrained model developed by other organizations. Use `AutoModel` API to **⚡SUPER FAST⚡** download pretrained mdoels of different architecture. We welcome all developers to contribute your Transformer models to PaddleNLP!
We provide **45+** network architectures and **500+** pretrained models, including not only all the SOTA models released by Baidu such as ERNIE, PLATO, and SKEP, but also most of the high-quality Chinese pretrained models developed by other organizations. Use the `AutoModel` API to **⚡SUPER FAST⚡** download pretrained models of different architectures. We welcome all developers to contribute their Transformer models to PaddleNLP!

```python
from paddlenlp.transformers import *
8 changes: 3 additions & 5 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -204,9 +204,8 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
--hnsw_m 100 \
--hnsw_ef 100 \
--recall_num 50 \
--similar_text_pair "recall/dev.csv" \
--corpus_file "recall/corpus.csv" \
--similar_text_pair "recall/dev.csv"
--similar_text_pair_file "recall/dev.csv" \
--corpus_file "recall/corpus.csv"
```

Parameter descriptions
@@ -228,9 +227,8 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
* `hnsw_m`: HNSW algorithm parameter; keep the default
* `hnsw_ef`: HNSW algorithm parameter; keep the default
* `recall_num`: number of similar texts to recall for each text
* `similar_text_pair`: evaluation set made up of similar text pairs
* `similar_text_pair_file`: evaluation set made up of similar text pairs
* `corpus_file`: recall corpus data file
* `similar_text_pair`: evaluation set made up of similar text pairs, semantic_similar_pair.tsv

You can also use the bash script:

75 changes: 75 additions & 0 deletions docs/model_zoo/taskflow.md
@@ -43,6 +43,7 @@ PaddleNLP provides **out-of-the-box**, industrial-grade pre-built NLP task capabilities with no training required
| [Code Generation](#代码生成) | `Taskflow("code_generation")` |||| | | Large code generation model |
| [Text-to-Image Generation](#文图生成) | `Taskflow("text_to_image")` |||| | | Large text-to-image generation models |
| [Text Summarization](#文本摘要) | `Taskflow("text_summarization")` ||||| | Large text summarization model |
| [Document Intelligence](#document-intelligence) | `Taskflow("document_intelligence")` ||||| | Based on ERNIE-LayoutX, a cross-modal general-purpose document pretrained model |


## QuickStart
@@ -1546,6 +1547,80 @@ from paddlenlp import Taskflow

</div></details>

### Document Intelligence
<details><summary>&emsp; Based on ERNIE-LayoutX, a cross-modal general-purpose document pretrained model </summary><div>

#### Input Format

```
[
{"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]},
{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]
```

PaddleOCR is used for OCR by default. You can also pass in your own OCR results via ``word_boxes``, in the format ``List[str, List[float, float, float, float]]``:

```
[
{"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
```
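
A minimal sketch of this input form follows. The file name, texts, and box coordinates are made up for illustration only; check your OCR tool's output for the exact box convention it uses:

```python
>>> from paddlenlp import Taskflow

>>> # Hypothetical OCR output: each entry is [text, [x1, y1, x2, y2]].
>>> word_boxes = [
...     ["发票号码", [110.0, 42.0, 202.0, 66.0]],
...     ["No44527206", [210.0, 42.0, 330.0, 66.0]],
... ]
>>> docprompt = Taskflow("document_intelligence")
>>> # Pass the custom OCR result instead of running PaddleOCR on the image.
>>> docprompt([{"doc": "./invoice.jpg",
...             "prompt": ["发票号码是多少?"],
...             "word_boxes": word_boxes}])
```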

#### Single and Batch Prediction

- Local image path input

<div align="center">
<img src=https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/resume.png height=800 hspace='20'/>
</div>


```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> docprompt = Taskflow("document_intelligence")
>>> docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])
[{'prompt': '五百丁本次想要担任的是什么职位?',
'result': [{'end': 183, 'prob': 1.0, 'start': 180, 'value': '客户经理'}]},
{'prompt': '五百丁是在哪里上的大学?',
'result': [{'end': 38, 'prob': 1.0, 'start': 32, 'value': '广州五百丁学院'}]},
{'prompt': '大学学的是什么专业?',
'result': [{'end': 45, 'prob': 0.74, 'start': 39, 'value': '金融学(本科)'}]}]
```

- HTTP image URL input

<div align="center">
<img src=https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg height=400 hspace='10'/>
</div>


```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> docprompt = Taskflow("document_intelligence")
>>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])
[{'prompt': '发票号码是多少?',
'result': [{'end': 10, 'prob': 0.96, 'start': 7, 'value': 'No44527206'}]},
{'prompt': '校验码是多少?',
'result': [{'end': 271,
'prob': 1.0,
'start': 263,
'value': '01107 555427109891646'}]}]
```

#### Configurable Parameters
* `batch_size`: batch size; adjust it according to your machine. Defaults to 1.
* `lang`: language for PaddleOCR. `ch` can be used on images mixing Chinese and English, while `en` works better on English-only images. Defaults to `ch`.
* `topn`: if the model recognizes multiple results, return the top n results with the highest probabilities. Defaults to 1.
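
As a hedged sketch of setting the parameters above (passing them as keyword arguments when the task is created, following the convention of the other tasks in this document; the values are examples only):

```python
>>> from paddlenlp import Taskflow

>>> # Batch two inputs per forward pass, keep mixed Chinese-English OCR,
>>> # and return the three highest-probability answers for each prompt.
>>> docprompt = Taskflow("document_intelligence",
...                      batch_size=2,
...                      lang="ch",
...                      topn=3)
>>> docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?"]}])
```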


</div></details>


## PART Ⅱ &emsp; Customized Training

<details><summary>List of supported tasks</summary><div>
2 changes: 1 addition & 1 deletion examples/code_generation/codegen/run_clm.py
@@ -252,7 +252,7 @@ def do_train(args):
block_size)
dev_set = process_ds(dev_set, tokenizer, args.overwrite_cache, block_size)

batchify_fn = DataCollatorWithPadding(tokenizer)
batchify_fn = DataCollatorWithPadding(tokenizer, return_attention_mask=True)

train_batch_sampler = DistributedBatchSampler(
train_set, batch_size=args.train_batch_size, shuffle=True)
2 changes: 1 addition & 1 deletion examples/language_model/gpt-3/dygraph/args.py
@@ -286,7 +286,7 @@ def parse_args(MODEL_CLASSES):
parser.add_argument("--device",
type=str,
default="gpu",
choices=["cpu", "gpu", "xpu"],
choices=["cpu", "gpu", "xpu", "npu"],
help="select cpu, gpu, xpu devices.")
parser.add_argument("--lr_decay_style",
type=str,
22 changes: 20 additions & 2 deletions examples/language_model/gpt-3/dygraph/run_pretrain.py
@@ -37,6 +37,8 @@
from paddle.distributed import fleet
from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker
from paddle.distributed.fleet.meta_optimizers.dygraph_optimizer import DygraphShardingOptimizer
from paddle.fluid.dygraph.parallel import sync_params_buffers
from paddle.distributed.fleet.utils.hybrid_parallel_util import fused_allreduce_gradients

# add sharding stage2/3
from paddle.distributed.sharding import group_sharded_parallel
@@ -151,9 +153,10 @@ def do_train(args):
dp_rank = hcg.get_data_parallel_rank()
sharding_rank = hcg.get_sharding_parallel_rank()

# sharding stage2/3 not support hybrid parallel
# sharding stage2/3 not support hybrid parallel now
if args.sharding_stage in [2, 3]:
assert args.dp_degree == args.mp_degree == args.pp_degree == 1, "sharding stage2/3 will support hybrid parallel later"
assert args.mp_degree == args.pp_degree == 1, "sharding stage2/3 will support tensor/pipeline parallel later"
dp_group = hcg.get_data_parallel_group()

sharding_size = hcg.get_sharding_parallel_world_size()
data_world_rank = dp_rank * sharding_size + sharding_rank
@@ -275,6 +278,11 @@ def do_train(args):
# wrap sharding stage2/3 and add collective group
# TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature
if args.sharding_stage in [2, 3]:
if args.dp_degree > 1:
sync_params_buffers(model,
comm_group=dp_group,
src_rank=dp_group.ranks[0])

scaler = scaler if args.use_pure_fp16 else None
model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler,
args.sharding_offload)
@@ -359,6 +367,16 @@ def do_train(args):
loss_mbs.backward()
loss = loss + loss_mbs

if args.sharding_stage in [2, 3] and args.dp_degree > 1:
fused_allreduce_gradients(model.parameters(), hcg)
if args.sharding_stage == 3:
for p in model.parameters():
if hasattr(p, "bw_storage"):
assert p.grad is None, "This case shouldn't happen."
p.bw_storage.scale_(1.0 / dp_group.nranks)
paddle.distributed.all_reduce(
p.bw_storage, group=dp_group)

if args.use_pure_fp16:
if args.sharding_stage in [2, 3]:
scaler.step(optimizer)
@@ -1,3 +1,17 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys

@@ -28,7 +42,7 @@ def parse_args():
parser.add_argument("--device",
default="gpu",
type=str,
choices=["gpu", "xpu", "cpu"],
choices=["gpu", "xpu", "cpu", "npu"],
help="Device to use during inference. ")
parser.add_argument("--use_mkl",
default=False,
@@ -131,7 +145,9 @@ def create_predictor(cls,
if args.device == "gpu":
config.enable_use_gpu(100, 0)
elif args.device == "xpu":
config.enable_xpu(100)
config.enable_xpu()
elif args.device == "npu":
config.enable_npu()
else:
# CPU
config.disable_gpu()
24 changes: 24 additions & 0 deletions examples/machine_translation/transformer/predict.py
@@ -1,3 +1,17 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import yaml
import logging
@@ -60,6 +74,11 @@ def parse_args():
type=str,
help="The eos token. It should be provided when use custom vocab_file. "
)
parser.add_argument("--device",
default="gpu",
choices=["gpu", "cpu", "xpu", "npu"],
help="Device selected for inference.")

args = parser.parse_args()
return args

@@ -83,6 +102,10 @@ def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
def do_predict(args):
if args.device == "gpu":
place = "gpu"
elif args.device == "xpu":
place = "xpu"
elif args.device == "npu":
place = "npu"
else:
place = "cpu"

@@ -157,6 +180,7 @@ def do_predict(args):
args.unk_token = ARGS.unk_token
args.bos_token = ARGS.bos_token
args.eos_token = ARGS.eos_token
args.device = ARGS.device
pprint(args)

do_predict(args)
13 changes: 13 additions & 0 deletions examples/machine_translation/transformer/train.py
@@ -100,6 +100,10 @@ def parse_args():
type=str,
choices=['true', 'false', 'True', 'False'],
help="Whether to use amp to train Transformer. ")
parser.add_argument("--device",
default="gpu",
choices=["gpu", "cpu", "xpu", "npu"],
help="Device selected for inference.")
parser.add_argument(
"--amp_level",
default=None,
@@ -126,6 +130,14 @@ def do_train(args):
if args.device == "gpu":
rank = dist.get_rank()
trainer_count = dist.get_world_size()
elif args.device == "npu":
rank = dist.get_rank()
trainer_count = dist.get_world_size()
paddle.set_device("npu")
elif args.device == "xpu":
rank = dist.get_rank()
trainer_count = dist.get_world_size()
paddle.set_device("xpu")
else:
rank = 0
trainer_count = 1
@@ -401,6 +413,7 @@ def do_train(args):
args.bos_token = ARGS.bos_token
args.eos_token = ARGS.eos_token
args.to_static = ARGS.to_static
args.device = ARGS.device
pprint(args)

args.profiler_options = ARGS.profiler_options
