Merge remote-tracking branch 'lemon/prompt_doc' into prompt_doc
LemonNoel committed Oct 10, 2022
2 parents 837a068 + 7d34a2f commit 9adebed
Showing 86 changed files with 12,797 additions and 473 deletions.
15 changes: 1 addition & 14 deletions README_cn.md
@@ -37,19 +37,6 @@
* 🍭 AIGC content generation: added the SOTA code generation model [**CodeGen**](./examples/code_generation/codegen), supporting code generation in multiple programming languages; integrated [**trending text-to-image models**](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md#%E6%96%87%E5%9B%BE%E7%94%9F%E6%88%90) DALL·E Mini, Disco Diffusion, and Stable Diffusion, with more fun models to come; added a [**Chinese text summarization application**](./applications/text_summarization), the first release of a Chinese summarization model trained on large-scale corpora, supporting one-click Taskflow inference and customized training;
* 💪 Framework upgrades: released the [**automatic model compression API**](./docs/compression.md), which automatically prunes and quantizes models and greatly lowers the barrier to applying model compression; released [**few-shot prompt**](./applications/text_classification/multi_class/few-shot) capabilities, integrating classic algorithms such as PET, P-Tuning, and RGL.


* 👀 **2022.9.6 PaddlePaddle Smart Finance Industry Live Course Series**

* Centered on the industrial practice and development trends of deep learning in the financial industry, with invited experts sharing real-world applications and discussing the future of fintech;

* The companion PaddleNLP course releases industrial practice examples: UIE-based information extraction from financial documents and a Pipelines-based FAQ question answering system;

* **Live every Tuesday and Thursday at 19:00 starting September 6**; scan the QR code to join the WeChat group for free, get the livestream link, and have in-depth exchanges with industry experts:

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/188596360-264415d4-5462-43ad-8517-5b7e690061ce.jpg" width="150" height="150" />
</div>

* 🔥 **2022.5.16 Released [PaddleNLP v2.3](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.3.0)**
* 💎 Released the universal information extraction technology [**UIE**](./model_zoo/uie): a single model handles entity recognition, relation and event extraction, sentiment analysis, and other open-domain information extraction tasks, with no restriction on domain or extraction target, and supports **zero-shot extraction** as well as efficient end-to-end **few-shot** customization;
* 😊 Released lightweight [**ERNIE 3.0**](./model_zoo/ernie-3.0) models, achieving the best results on [CLUE](https://www.cluebenchmarks.com/) among models of comparable size, together with **🗜️ lossless compression** and **⚙️ all-scenario deployment** solutions;
@@ -58,7 +45,7 @@

## Community

- Scan the QR code on WeChat and fill in the questionnaire to join the discussion group and claim the benefits
- Scan the QR code on WeChat, fill in the questionnaire, and reply to the assistant with the keyword (NLP) to join the discussion group and claim the benefits
- Have in-depth exchanges with many community developers and the official team.
- A 10 GB NLP learning gift pack!

2 changes: 1 addition & 1 deletion README_en.md
@@ -81,7 +81,7 @@ For more usage please refer to [Taskflow Docs](./docs/model_zoo/taskflow.md).

#### 🀄 Comprehensive Chinese Transformer Models

We provide **45+** network architectures and over **500+** pretrained models. Not only includes all the SOTA model like ERNIE, PLATO and SKEP released by Baidu, but also integrates most of the high-quality Chinese pretrained model developed by other organizations. Use `AutoModel` API to **⚡SUPER FAST⚡** download pretrained mdoels of different architecture. We welcome all developers to contribute your Transformer models to PaddleNLP!
We provide **45+** network architectures and **500+** pretrained models, including not only all the SOTA models released by Baidu such as ERNIE, PLATO, and SKEP, but also most of the high-quality Chinese pretrained models developed by other organizations. Use the `AutoModel` API to **⚡SUPER FAST⚡** download pretrained models of different architectures. We welcome all developers to contribute their Transformer models to PaddleNLP!

```python
from paddlenlp.transformers import *
8 changes: 3 additions & 5 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -204,9 +204,8 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
--hnsw_m 100 \
--hnsw_ef 100 \
--recall_num 50 \
--similar_text_pair "recall/dev.csv" \
--corpus_file "recall/corpus.csv" \
--similar_text_pair "recall/dev.csv"
--similar_text_pair_file "recall/dev.csv" \
--corpus_file "recall/corpus.csv"
```

Parameter descriptions
@@ -228,9 +227,8 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
* `hnsw_m`: HNSW algorithm parameter; keep the default
* `hnsw_ef`: HNSW algorithm parameter; keep the default
* `recall_num`: number of similar texts to recall for each text
* `similar_text_pair`: evaluation set made up of similar text pairs
* `similar_text_pair_file`: evaluation set made up of similar text pairs
* `corpus_file`: recall corpus data file
* `similar_text_pair`: evaluation set made up of similar text pairs, semantic_similar_pair.tsv

You can also use the bash script:

75 changes: 75 additions & 0 deletions docs/model_zoo/taskflow.md
@@ -43,6 +43,7 @@ PaddleNLP provides **out-of-the-box**, industrial-grade pre-built NLP task capabilities with no training required
| [Code Generation](#代码生成) | `Taskflow("code_generation")` |||| | | Large code generation model |
| [Text-to-Image Generation](#文图生成) | `Taskflow("text_to_image")` |||| | | Large text-to-image generation models |
| [Text Summarization](#文本摘要) | `Taskflow("text_summarization")` ||||| | Large text summarization model |
| [Document Intelligence](#document-intelligence) | `Taskflow("document_intelligence")` ||||| | Based on ERNIE-LayoutX, a cross-modal general-purpose document pretrained model |


## QuickStart
@@ -1546,6 +1547,80 @@ from paddlenlp import Taskflow

</div></details>

### Document Intelligence
<details><summary>&emsp; Based on ERNIE-LayoutX, a cross-modal general-purpose document pretrained model </summary><div>

#### Input Format

```
[
{"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]},
{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]
```

PaddleOCR is used for OCR by default. You can also pass in your own OCR results via ``word_boxes``, in the format ``List[str, List[float, float, float, float]]``:

```
[
{"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
```
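
A minimal sketch of this input form follows. The file name, texts, and box coordinates are made up for illustration only; check your OCR tool's output for the exact box convention it uses:

```python
>>> from paddlenlp import Taskflow

>>> # Hypothetical OCR output: each entry is [text, [x1, y1, x2, y2]].
>>> word_boxes = [
...     ["发票号码", [110.0, 42.0, 202.0, 66.0]],
...     ["No44527206", [210.0, 42.0, 330.0, 66.0]],
... ]
>>> docprompt = Taskflow("document_intelligence")
>>> # Pass the custom OCR result instead of running PaddleOCR on the image.
>>> docprompt([{"doc": "./invoice.jpg",
...             "prompt": ["发票号码是多少?"],
...             "word_boxes": word_boxes}])
```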

#### Single and Batch Prediction

- Local image path input

<div align="center">
<img src=https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/resume.png height=800 hspace='20'/>
</div>


```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> docprompt = Taskflow("document_intelligence")
>>> docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])
[{'prompt': '五百丁本次想要担任的是什么职位?',
'result': [{'end': 183, 'prob': 1.0, 'start': 180, 'value': '客户经理'}]},
{'prompt': '五百丁是在哪里上的大学?',
'result': [{'end': 38, 'prob': 1.0, 'start': 32, 'value': '广州五百丁学院'}]},
{'prompt': '大学学的是什么专业?',
'result': [{'end': 45, 'prob': 0.74, 'start': 39, 'value': '金融学(本科)'}]}]
```

- HTTP image URL input

<div align="center">
<img src=https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg height=400 hspace='10'/>
</div>


```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> docprompt = Taskflow("document_intelligence")
>>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])
[{'prompt': '发票号码是多少?',
'result': [{'end': 10, 'prob': 0.96, 'start': 7, 'value': 'No44527206'}]},
{'prompt': '校验码是多少?',
'result': [{'end': 271,
'prob': 1.0,
'start': 263,
'value': '01107 555427109891646'}]}]
```

#### Configurable Parameters
* `batch_size`: batch size; adjust it according to your machine. Defaults to 1.
* `lang`: language for PaddleOCR. `ch` can be used on images mixing Chinese and English, while `en` works better on English-only images. Defaults to `ch`.
* `topn`: if the model recognizes multiple results, return the top n results with the highest probabilities. Defaults to 1.
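
As a hedged sketch of setting the parameters above (passing them as keyword arguments when the task is created, following the convention of the other tasks in this document; the values are examples only):

```python
>>> from paddlenlp import Taskflow

>>> # Batch two inputs per forward pass, keep mixed Chinese-English OCR,
>>> # and return the three highest-probability answers for each prompt.
>>> docprompt = Taskflow("document_intelligence",
...                      batch_size=2,
...                      lang="ch",
...                      topn=3)
>>> docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?"]}])
```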


</div></details>


## PART Ⅱ &emsp; Customized Training

<details><summary>List of supported tasks</summary><div>
2 changes: 1 addition & 1 deletion examples/code_generation/codegen/run_clm.py
@@ -252,7 +252,7 @@ def do_train(args):
block_size)
dev_set = process_ds(dev_set, tokenizer, args.overwrite_cache, block_size)

batchify_fn = DataCollatorWithPadding(tokenizer)
batchify_fn = DataCollatorWithPadding(tokenizer, return_attention_mask=True)

train_batch_sampler = DistributedBatchSampler(
train_set, batch_size=args.train_batch_size, shuffle=True)
2 changes: 1 addition & 1 deletion examples/language_model/gpt-3/dygraph/args.py
@@ -286,7 +286,7 @@ def parse_args(MODEL_CLASSES):
parser.add_argument("--device",
type=str,
default="gpu",
choices=["cpu", "gpu", "xpu"],
choices=["cpu", "gpu", "xpu", "npu"],
help="select cpu, gpu, xpu devices.")
parser.add_argument("--lr_decay_style",
type=str,
22 changes: 20 additions & 2 deletions examples/language_model/gpt-3/dygraph/run_pretrain.py
@@ -37,6 +37,8 @@
from paddle.distributed import fleet
from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker
from paddle.distributed.fleet.meta_optimizers.dygraph_optimizer import DygraphShardingOptimizer
from paddle.fluid.dygraph.parallel import sync_params_buffers
from paddle.distributed.fleet.utils.hybrid_parallel_util import fused_allreduce_gradients

# add sharding stage2/3
from paddle.distributed.sharding import group_sharded_parallel
@@ -151,9 +153,10 @@ def do_train(args):
dp_rank = hcg.get_data_parallel_rank()
sharding_rank = hcg.get_sharding_parallel_rank()

# sharding stage2/3 not support hybrid parallel
# sharding stage2/3 not support hybrid parallel now
if args.sharding_stage in [2, 3]:
assert args.dp_degree == args.mp_degree == args.pp_degree == 1, "sharding stage2/3 will support hybrid parallel later"
assert args.mp_degree == args.pp_degree == 1, "sharding stage2/3 will support tensor/pipeline parallel later"
dp_group = hcg.get_data_parallel_group()

sharding_size = hcg.get_sharding_parallel_world_size()
data_world_rank = dp_rank * sharding_size + sharding_rank
@@ -275,6 +278,11 @@ def do_train(args):
# wrap sharding stage2/3 and add collective group
# TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature
if args.sharding_stage in [2, 3]:
if args.dp_degree > 1:
sync_params_buffers(model,
comm_group=dp_group,
src_rank=dp_group.ranks[0])

scaler = scaler if args.use_pure_fp16 else None
model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler,
args.sharding_offload)
@@ -359,6 +367,16 @@ def do_train(args):
loss_mbs.backward()
loss = loss + loss_mbs

if args.sharding_stage in [2, 3] and args.dp_degree > 1:
fused_allreduce_gradients(model.parameters(), hcg)
if args.sharding_stage == 3:
for p in model.parameters():
if hasattr(p, "bw_storage"):
assert p.grad is None, "This case shouldn't happen."
p.bw_storage.scale_(1.0 / dp_group.nranks)
paddle.distributed.all_reduce(
p.bw_storage, group=dp_group)

if args.use_pure_fp16:
if args.sharding_stage in [2, 3]:
scaler.step(optimizer)
@@ -1,3 +1,17 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys

@@ -28,7 +42,7 @@ def parse_args():
parser.add_argument("--device",
default="gpu",
type=str,
choices=["gpu", "xpu", "cpu"],
choices=["gpu", "xpu", "cpu", "npu"],
help="Device to use during inference. ")
parser.add_argument("--use_mkl",
default=False,
@@ -131,7 +145,9 @@ def create_predictor(cls,
if args.device == "gpu":
config.enable_use_gpu(100, 0)
elif args.device == "xpu":
config.enable_xpu(100)
config.enable_xpu()
elif args.device == "npu":
config.enable_npu()
else:
# CPU
config.disable_gpu()
24 changes: 24 additions & 0 deletions examples/machine_translation/transformer/predict.py
@@ -1,3 +1,17 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import yaml
import logging
@@ -60,6 +74,11 @@ def parse_args():
type=str,
help="The eos token. It should be provided when use custom vocab_file. "
)
parser.add_argument("--device",
default="gpu",
choices=["gpu", "cpu", "xpu", "npu"],
help="Device selected for inference.")

args = parser.parse_args()
return args

@@ -83,6 +102,10 @@ def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
def do_predict(args):
if args.device == "gpu":
place = "gpu"
elif args.device == "xpu":
place = "xpu"
elif args.device == "npu":
place = "npu"
else:
place = "cpu"

@@ -157,6 +180,7 @@ def do_predict(args):
args.unk_token = ARGS.unk_token
args.bos_token = ARGS.bos_token
args.eos_token = ARGS.eos_token
args.device = ARGS.device
pprint(args)

do_predict(args)
13 changes: 13 additions & 0 deletions examples/machine_translation/transformer/train.py
@@ -100,6 +100,10 @@ def parse_args():
type=str,
choices=['true', 'false', 'True', 'False'],
help="Whether to use amp to train Transformer. ")
parser.add_argument("--device",
default="gpu",
choices=["gpu", "cpu", "xpu", "npu"],
help="Device selected for inference.")
parser.add_argument(
"--amp_level",
default=None,
@@ -126,6 +130,14 @@ def do_train(args):
if args.device == "gpu":
rank = dist.get_rank()
trainer_count = dist.get_world_size()
elif args.device == "npu":
rank = dist.get_rank()
trainer_count = dist.get_world_size()
paddle.set_device("npu")
elif args.device == "xpu":
rank = dist.get_rank()
trainer_count = dist.get_world_size()
paddle.set_device("xpu")
else:
rank = 0
trainer_count = 1
@@ -401,6 +413,7 @@ def do_train(args):
args.bos_token = ARGS.bos_token
args.eos_token = ARGS.eos_token
args.to_static = ARGS.to_static
args.device = ARGS.device
pprint(args)

args.profiler_options = ARGS.profiler_options
