
Commit

w5688414 committed Nov 16, 2022
2 parents 1cd6407 + 5954d54 commit 3fc25aa
Showing 331 changed files with 13,298 additions and 3,738 deletions.
7 changes: 5 additions & 2 deletions README_cn.md
@@ -30,6 +30,9 @@
**PaddleNLP** is an **easy-to-use** and **powerful** natural language processing development library. It aggregates the industry's **high-quality pre-trained models** and delivers an **out-of-the-box** development experience; a model zoo covering a wide range of NLP scenarios, paired with **industrial practice examples**, meets developers' needs for **flexible customization**.

## News 📢

* 🔥 **2022.11.12 PaddleNLP adds AutoPrompt, an automated prompting feature, and tops the FewCLUE few-shot learning leaderboard!**
* 🥇 The PaddleNLP team open-sourced the **AutoPrompt** solution. Built on the open-source Wenxin ERNIE pre-trained language model and combining domain pre-training with automated prompt learning, a model of only 291M parameters ranked first on FewCLUE, the authoritative few-shot learning leaderboard. [See details](https://mp.weixin.qq.com/s/_JPiAzFA1f0BZ0igdv-EKA).
* 🔥 **2022.10.27 [PaddleNLP v2.4.2](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.2) released!**
* Expanded NLG capabilities: added 📄[**a Pegasus-based Chinese text summarization solution**](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_summarization/pegasus) with leading results, and ❓[**a question generation solution**](./examples/question_generation) providing a general question generation pre-trained model built on the industry-leading UNIMO-Text model and a large-scale multi-domain question generation dataset. Both support one-line Taskflow calls and FasterGeneration high-performance inference, with the full training, inference, and deployment pipeline covered (a hedged Taskflow sketch follows this news list).
* Released 🖼[**PPDiffusers**](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers): a diffusion model (Diffusion Model) toolbox supporting cross-modal (e.g., image and audio) training and inference, enabling quick trials of and secondary development on **Stable Diffusion**, with support for more models to come.
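
For readers who want to try the new summarization capability quickly, here is a minimal, hedged Taskflow sketch; the task name `"text_summarization"` and the shape of the return value are assumptions based on the release notes above, not something verified in this commit:

```python
from paddlenlp import Taskflow

# Assumed task name; the release notes say the Pegasus summarizer is callable via Taskflow.
summarizer = Taskflow("text_summarization")

document = ("2022年,PaddleNLP新增了基于Pegasus的中文文本摘要方案和问题生成方案,"
            "均支持Taskflow一键调用与FasterGeneration高性能推理。")
print(summarizer(document))  # expected: the generated summary (exact return type is assumed)
```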
@@ -234,7 +237,7 @@ PaddleNLP targets information extraction, semantic retrieval, intelligent question answering, sentiment analysis, and other high…

### High-Performance Distributed Training and Inference

#### FasterTokenizer: High-Performance Text Processing Library
#### FastTokenizer: High-Performance Text Processing Library

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
@@ -244,7 +247,7 @@ PaddleNLP targets information extraction, semantic retrieval, intelligent question answering, sentiment analysis, and other high…
AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)
```

To achieve the best model deployment performance, after installing FastTokenizer you only need to turn on the `use_faster=True` option in the `AutoTokenizer` API to invoke the high-performance C++ tokenization operator and obtain text processing speedups of more than 100x over Python. For more usage instructions, see the [FasterTokenizer documentation](./faster_tokenizer).
To achieve the best model deployment performance, after installing FastTokenizer you only need to turn on the `use_faster=True` option in the `AutoTokenizer` API to invoke the high-performance C++ tokenization operator and obtain text processing speedups of more than 100x over Python. For more usage instructions, see the [FastTokenizer documentation](./fast_tokenizer).
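
As a concrete illustration, here is a small, hedged end-to-end sketch; the import path and the keys of the returned dictionary follow common PaddleNLP usage and are not taken from this commit:

```python
from paddlenlp.transformers import AutoTokenizer

# use_faster=True switches to the C++ FastTokenizer operator when the package is installed;
# otherwise the pure-Python tokenizer is used.
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)

encoded = tokenizer("PaddleNLP是一款功能强大的自然语言处理开发库。")
print(encoded["input_ids"])       # token ids
print(encoded["token_type_ids"])  # segment ids
```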

#### ⚡️ FasterGeneration: High-Performance Generation Acceleration Library

2 changes: 2 additions & 0 deletions README_en.md
@@ -30,6 +30,8 @@

## News 📢

* 🔥 **2022.11.12 PaddleNLP adds AutoPrompt and takes first place on FewCLUE!**
* 🥇 The PaddleNLP team has open-sourced the **AutoPrompt** solution. Built on the open-source Wenxin ERNIE pre-trained language model and combining domain pre-training with automated prompt learning, it ranked first on FewCLUE (an authoritative few-shot learning benchmark) with a model of only 291M parameters. [See details](https://mp.weixin.qq.com/s/_JPiAzFA1f0BZ0igdv-EKA).
* 🔥 **2022.10.27 [PaddleNLP v2.4.2](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.2) Released!**
* NLG upgrade: 📄 Released a [**text summarization solution**](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_summarization/pegasus) based on Pegasus; ❓ released a [**question generation solution**](./examples/question_generation) providing a **general question generation pre-trained model** based on Baidu's UNIMO-Text and a large-scale multi-domain question generation dataset. Both support high-performance inference with FasterGeneration and cover the whole training, inference, and deployment pipeline.
* 🔥 **2022.10.14 [PaddleNLP v2.4.1](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.1) Released!**
3 changes: 3 additions & 0 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -229,6 +229,9 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
* `recall_num`: number of similar texts to recall for each text
* `similar_text_pair_file`: evaluation set consisting of similar text pairs
* `corpus_file`: the recall corpus data file
* `use_recompute`: enable the Recompute strategy to save GPU memory, trading time for space
* `use_gradient_cache`: enable the Gradient Cache strategy to save GPU memory, likewise trading time for space
* `chunk_numbers`: a Gradient Cache parameter specifying how many sub-batches one batch is split into (see the sketch after this list)
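
To make the new memory-saving options concrete, the training command could be extended roughly as follows. This is a hedged sketch: only the flags listed above come from this commit; the script name, the remaining argument values, and the assumption that the two `use_*` options are boolean switches mirror the training command shown earlier in this README rather than anything verified here.

```shell
python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
    train_batch_neg.py \
    --batch_size 64 \
    --use_gradient_cache \
    --chunk_numbers 8 \
    --recall_num 50 \
    --similar_text_pair_file "recall/dev.csv" \
    --corpus_file "recall/corpus.csv"
```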

A bash script can also be used:

@@ -60,14 +60,14 @@ def forward(self,
title_cls_embedding,
transpose_y=True)

# substract margin from all positive samples cosine_sim()
# Substract margin from all positive samples cosine_sim()
margin_diag = paddle.full(shape=[query_cls_embedding.shape[0]],
fill_value=self.margin,
dtype=paddle.get_default_dtype())

cosine_sim = cosine_sim - paddle.diag(margin_diag)

# scale cosine to ease training converge
# Scale cosine to ease training converge
cosine_sim *= self.sacle

labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
@@ -76,3 +76,56 @@ def forward(self,
loss = F.cross_entropy(input=cosine_sim, label=labels)

return loss


class SemanticIndexCacheNeg(SemanticIndexBase):

def __init__(self,
pretrained_model,
dropout=None,
margin=0.3,
scale=30,
output_emb_size=None):
super().__init__(pretrained_model, dropout, output_emb_size)
self.margin = margin
# Used scaling cosine similarity to ease converge
self.sacle = scale

def forward(self,
query_input_ids,
title_input_ids,
query_token_type_ids=None,
query_position_ids=None,
query_attention_mask=None,
title_token_type_ids=None,
title_position_ids=None,
title_attention_mask=None):

query_cls_embedding = self.get_pooled_embedding(query_input_ids,
query_token_type_ids,
query_position_ids,
query_attention_mask)

title_cls_embedding = self.get_pooled_embedding(title_input_ids,
title_token_type_ids,
title_position_ids,
title_attention_mask)

cosine_sim = paddle.matmul(query_cls_embedding,
title_cls_embedding,
transpose_y=True)

# Substract margin from all positive samples cosine_sim()
margin_diag = paddle.full(shape=[query_cls_embedding.shape[0]],
fill_value=self.margin,
dtype=cosine_sim.dtype)

cosine_sim = cosine_sim - paddle.diag(margin_diag)

# Scale cosine to ease training converge
cosine_sim *= self.sacle

labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
labels = paddle.reshape(labels, shape=[-1, 1])

return [cosine_sim, labels, query_cls_embedding, title_cls_embedding]
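
Unlike the loss-returning `forward` above, `SemanticIndexCacheNeg.forward` hands back the similarity matrix and the embeddings so that the caller (for example, a Gradient Cache training loop that processes the batch in chunks) can compute the loss itself. Below is a hedged sketch of that outer step, reusing the same cross-entropy formulation as the class above; the usage comment is an assumption, not code from this commit.

```python
import paddle.nn.functional as F

def in_batch_negative_loss(cosine_sim, labels):
    # Row i of cosine_sim holds query i's (margin-adjusted, scaled) similarity to every
    # title in the batch; labels[i] == i marks its own title as the positive, so all
    # other titles in the batch act as in-batch negatives.
    return F.cross_entropy(input=cosine_sim, label=labels)

# Assumed usage:
# cosine_sim, labels, q_emb, t_emb = model(query_input_ids, title_input_ids)
# loss = in_batch_negative_loss(cosine_sim, labels)
```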
@@ -1,6 +1,20 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# GPU version
root_dir="checkpoints/inbatch"
python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \
recall.py \
--device gpu \
--recall_result_dir "recall_result_dir" \
@@ -11,7 +25,7 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
--hnsw_ef 100 \
--batch_size 64 \
--output_emb_size 256\
--max_seq_length 60 \
--max_seq_length 64 \
--recall_num 50 \
--similar_text_pair "recall/dev.csv" \
--corpus_file "recall/corpus.csv"
