
Commit

w5688414 committed Nov 16, 2022
2 parents 1cd6407 + 5954d54 commit 3fc25aa
Showing 331 changed files with 13,298 additions and 3,738 deletions.
7 changes: 5 additions & 2 deletions README_cn.md
@@ -30,6 +30,9 @@
**PaddleNLP** is an **easy-to-use** and **powerful** natural language processing development library. It aggregates the industry's **high-quality pre-trained models** and delivers an **out-of-the-box** development experience; a model zoo covering a wide range of NLP scenarios, paired with **industrial practice examples**, meets developers' needs for **flexible customization**.

## News 📢

* 🔥 **2022.11.12 PaddleNLP adds AutoPrompt, an automated prompting feature, and tops the FewCLUE few-shot learning leaderboard!**
* 🥇 The PaddleNLP team open-sourced the **AutoPrompt** solution. Built on the open-source Wenxin ERNIE pre-trained language model and combining domain pre-training with automated prompt learning, a model of only 291M parameters ranked first on FewCLUE, the authoritative few-shot learning leaderboard. [See details](https://mp.weixin.qq.com/s/_JPiAzFA1f0BZ0igdv-EKA).
* 🔥 **2022.10.27 [PaddleNLP v2.4.2](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.2) released!**
* Expanded NLG capabilities: added 📄[**a Pegasus-based Chinese text summarization solution**](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_summarization/pegasus) with leading results, and ❓[**a question generation solution**](./examples/question_generation) providing a general question generation pre-trained model built on the industry-leading UNIMO-Text model and a large-scale multi-domain question generation dataset. Both support one-line Taskflow calls and FasterGeneration high-performance inference, with the full training, inference, and deployment pipeline covered (a hedged Taskflow sketch follows this news list).
* Released 🖼[**PPDiffusers**](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers): a diffusion model (Diffusion Model) toolbox supporting cross-modal (e.g., image and audio) training and inference, enabling quick trials of and secondary development on **Stable Diffusion**, with support for more models to come.
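
For readers who want to try the new summarization capability quickly, here is a minimal, hedged Taskflow sketch; the task name `"text_summarization"` and the shape of the return value are assumptions based on the release notes above, not something verified in this commit:

```python
from paddlenlp import Taskflow

# Assumed task name; the release notes say the Pegasus summarizer is callable via Taskflow.
summarizer = Taskflow("text_summarization")

document = ("2022年,PaddleNLP新增了基于Pegasus的中文文本摘要方案和问题生成方案,"
            "均支持Taskflow一键调用与FasterGeneration高性能推理。")
print(summarizer(document))  # expected: the generated summary (exact return type is assumed)
```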
@@ -234,7 +237,7 @@ PaddleNLP targets information extraction, semantic retrieval, intelligent question answering, sentiment analysis, and other high…

### High-Performance Distributed Training and Inference

#### FasterTokenizer: High-Performance Text Processing Library
#### FastTokenizer: High-Performance Text Processing Library

<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
@@ -244,7 +247,7 @@ PaddleNLP targets information extraction, semantic retrieval, intelligent question answering, sentiment analysis, and other high…
AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)
```

To achieve the best model deployment performance, after installing FastTokenizer you only need to turn on the `use_faster=True` option in the `AutoTokenizer` API to invoke the high-performance C++ tokenization operator and obtain text processing speedups of more than 100x over Python. For more usage instructions, see the [FasterTokenizer documentation](./faster_tokenizer).
To achieve the best model deployment performance, after installing FastTokenizer you only need to turn on the `use_faster=True` option in the `AutoTokenizer` API to invoke the high-performance C++ tokenization operator and obtain text processing speedups of more than 100x over Python. For more usage instructions, see the [FastTokenizer documentation](./fast_tokenizer).
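
As a concrete illustration, here is a small, hedged end-to-end sketch; the import path and the keys of the returned dictionary follow common PaddleNLP usage and are not taken from this commit:

```python
from paddlenlp.transformers import AutoTokenizer

# use_faster=True switches to the C++ FastTokenizer operator when the package is installed;
# otherwise the pure-Python tokenizer is used.
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True)

encoded = tokenizer("PaddleNLP是一款功能强大的自然语言处理开发库。")
print(encoded["input_ids"])       # token ids
print(encoded["token_type_ids"])  # segment ids
```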

#### ⚡️ FasterGeneration: High-Performance Generation Acceleration Library

2 changes: 2 additions & 0 deletions README_en.md
@@ -30,6 +30,8 @@

## News 📢

* 🔥 **2022.11.12 PaddleNLP adds AutoPrompt and takes first place on FewCLUE!**
* 🥇 The PaddleNLP team has open-sourced the **AutoPrompt** solution. Built on the open-source Wenxin ERNIE pre-trained language model and combining domain pre-training with automated prompt learning, it ranked first on FewCLUE (an authoritative few-shot learning benchmark) with a model of only 291M parameters. [See details](https://mp.weixin.qq.com/s/_JPiAzFA1f0BZ0igdv-EKA).
* 🔥 **2022.10.27 [PaddleNLP v2.4.2](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.2) Released!**
* NLG upgrade: 📄 Released a [**text summarization solution**](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_summarization/pegasus) based on Pegasus; ❓ released a [**question generation solution**](./examples/question_generation) providing a **general question generation pre-trained model** based on Baidu's UNIMO-Text and a large-scale multi-domain question generation dataset. Both support high-performance inference with FasterGeneration and cover the whole training, inference, and deployment pipeline.
* 🔥 **2022.10.14 [PaddleNLP v2.4.1](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.4.1) Released!**
3 changes: 3 additions & 0 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -229,6 +229,9 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
* `recall_num`: number of similar texts to recall for each text
* `similar_text_pair_file`: evaluation set consisting of similar text pairs
* `corpus_file`: the recall corpus data file
* `use_recompute`: enable the Recompute strategy to save GPU memory, trading time for space
* `use_gradient_cache`: enable the Gradient Cache strategy to save GPU memory, likewise trading time for space
* `chunk_numbers`: a Gradient Cache parameter specifying how many sub-batches one batch is split into (see the sketch after this list)
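
To make the new memory-saving options concrete, the training command could be extended roughly as follows. This is a hedged sketch: only the flags listed above come from this commit; the script name, the remaining argument values, and the assumption that the two `use_*` options are boolean switches mirror the training command shown earlier in this README rather than anything verified here.

```shell
python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
    train_batch_neg.py \
    --batch_size 64 \
    --use_gradient_cache \
    --chunk_numbers 8 \
    --recall_num 50 \
    --similar_text_pair_file "recall/dev.csv" \
    --corpus_file "recall/corpus.csv"
```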

A bash script can also be used:

@@ -60,14 +60,14 @@ def forward(self,
title_cls_embedding,
transpose_y=True)

# substract margin from all positive samples cosine_sim()
# Substract margin from all positive samples cosine_sim()
margin_diag = paddle.full(shape=[query_cls_embedding.shape[0]],
fill_value=self.margin,
dtype=paddle.get_default_dtype())

cosine_sim = cosine_sim - paddle.diag(margin_diag)

# scale cosine to ease training converge
# Scale cosine to ease training converge
cosine_sim *= self.sacle

labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
@@ -76,3 +76,56 @@ def forward(self,
loss = F.cross_entropy(input=cosine_sim, label=labels)

return loss


class SemanticIndexCacheNeg(SemanticIndexBase):

def __init__(self,
pretrained_model,
dropout=None,
margin=0.3,
scale=30,
output_emb_size=None):
super().__init__(pretrained_model, dropout, output_emb_size)
self.margin = margin
# Used scaling cosine similarity to ease converge
self.sacle = scale

def forward(self,
query_input_ids,
title_input_ids,
query_token_type_ids=None,
query_position_ids=None,
query_attention_mask=None,
title_token_type_ids=None,
title_position_ids=None,
title_attention_mask=None):

query_cls_embedding = self.get_pooled_embedding(query_input_ids,
query_token_type_ids,
query_position_ids,
query_attention_mask)

title_cls_embedding = self.get_pooled_embedding(title_input_ids,
title_token_type_ids,
title_position_ids,
title_attention_mask)

cosine_sim = paddle.matmul(query_cls_embedding,
title_cls_embedding,
transpose_y=True)

# Substract margin from all positive samples cosine_sim()
margin_diag = paddle.full(shape=[query_cls_embedding.shape[0]],
fill_value=self.margin,
dtype=cosine_sim.dtype)

cosine_sim = cosine_sim - paddle.diag(margin_diag)

# Scale cosine to ease training converge
cosine_sim *= self.sacle

labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
labels = paddle.reshape(labels, shape=[-1, 1])

return [cosine_sim, labels, query_cls_embedding, title_cls_embedding]
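
Unlike the loss-returning `forward` above, `SemanticIndexCacheNeg.forward` hands back the similarity matrix and the embeddings so that the caller (for example, a Gradient Cache training loop that processes the batch in chunks) can compute the loss itself. Below is a hedged sketch of that outer step, reusing the same cross-entropy formulation as the class above; the usage comment is an assumption, not code from this commit.

```python
import paddle.nn.functional as F

def in_batch_negative_loss(cosine_sim, labels):
    # Row i of cosine_sim holds query i's (margin-adjusted, scaled) similarity to every
    # title in the batch; labels[i] == i marks its own title as the positive, so all
    # other titles in the batch act as in-batch negatives.
    return F.cross_entropy(input=cosine_sim, label=labels)

# Assumed usage:
# cosine_sim, labels, q_emb, t_emb = model(query_input_ids, title_input_ids)
# loss = in_batch_negative_loss(cosine_sim, labels)
```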
@@ -1,6 +1,20 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# GPU version
root_dir="checkpoints/inbatch"
python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \
recall.py \
--device gpu \
--recall_result_dir "recall_result_dir" \
@@ -11,7 +25,7 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
--hnsw_ef 100 \
--batch_size 64 \
--output_emb_size 256\
--max_seq_length 60 \
--max_seq_length 64 \
--recall_num 50 \
--similar_text_pair "recall/dev.csv" \
--corpus_file "recall/corpus.csv"
