
Add faster transformer for decoding #37

Merged (24 commits) on Mar 5, 2021

Conversation

FrostML (Contributor) commented on Feb 25, 2021

Add faster transformer for decoding.

The performance is as follows (V100):

  • PaddlePaddle dygraph without Faster Transformer: 257s
  • PaddlePaddle dygraph with Faster Transformer:
    • FP32: 25.18s
    • FP16: 13.38s

The above results are based on:

  • Comparing dygraph inference after the Paddle 2.0 refactor against dygraph inference with Faster Transformer
    • Note that the dygraph network after the Paddle 2.0 refactor had to support dygraph-static unification, so the adjusted native dygraph network is slower in the inference part; this may not be representative of a comparison between the previous model and Faster Transformer
  • Test samples:
    • The generated length strongly affects the benchmark result, so a fixed set of 3003 English-German sentences is used and the total time is measured, with max output length set to 256
    • For generation tasks, batch size strongly affects QPS (a larger batch size can give higher QPS), so the batch size is fixed at 64
  • Standard transformer architecture; see modeling.py for the exact network definition
  • Base model; see transformer.base.yaml for the exact hyperparameters:
    • d_model: 512
    • inner hidden dims: 2048
    • num of heads: 8
    • num of layers: 6
    • beam size: 5
  • Test machine:
    • GPU: V100
    • CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  • Actual performance depends on the actual test environment

.
├── sample/      # Usage samples for Transformer machine translation (beam search)
├── src/         # C++/CUDA source of the custom OP
└── transformer/ # Python API wrapper scripts
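
To make the API shape concrete, a minimal usage sketch follows, using the base-model hyperparameters above. Only the import path, d_model, weight_sharing, use_fp16_decoding, and the load call appear in this PR; the remaining keyword names and the vocab size are assumptions for illustration.

```python
import paddle
from paddlenlp.ext_op import FasterTransformer  # import path as used in this PR

paddle.set_device("gpu")  # the custom op has no CPU implementation

# Values mirror transformer.base.yaml and the benchmark setup; keyword names
# other than d_model / weight_sharing / use_fp16_decoding are assumptions.
transformer = FasterTransformer(
    src_vocab_size=30000,        # hypothetical vocab sizes
    trg_vocab_size=30000,
    max_length=256,              # max output length used in the benchmark
    d_model=512,
    n_head=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    d_inner_hid=2048,
    weight_sharing=True,
    beam_size=5,
    use_fp16_decoding=True)      # FP16 decoding path (13.38s above)

transformer.load("./base_model/transformer.pdparams")
transformer.eval()

# batch_size=64 as in the benchmark; random ids stand in for real source tokens.
src_word = paddle.randint(0, 30000, shape=[64, 32], dtype="int64")
with paddle.no_grad():
    finished_seq = transformer(src_word=src_word)
```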
Contributor:

The sample entry point would ideally live under the Transformer example in example/, to show that the example's training code connects directly to the optimized inference; that pairing is what we want to highlight, and it would also let us reuse the reader, config, and other code from the example. For instance, could we add something like a faster-infer directory under the Transformer example to hold the sample?

Contributor Author:

I agree we can add one under example/, but I think the current path should stay.
sample/ is not only inference scripts; it also contains scripts like decoding_sample.py for verifying decoding performance (following the Faster Transformer repo). Besides, decoding_gemm must be run beforehand, and it normally lives under the build directory; the currently recommended place to build is under ext_op, so a script like decoding_sample.py placed under example/ would have to be invoked as ../../../../paddlenlp/ext_op/build/third_party/..., which is too long.
We can add a complete inference flow under example/ so that the reader/config code is not duplicated, but the compiled lib would still be under ext_op/build/lib, and specifying the lib path would then also be complicated.

After discussion:
The verification scripts under sample/ have been switched to functional verification with a randomly initialized model, and a faster_transformer directory has been added under example/ with the corresponding scripts and documentation for running inference with Faster Transformer.

d_model=args.d_model,
pad_idx=args.bos_idx,
weight_sharing=args.weight_sharing,
use_fp16_decoding=args.use_fp16_decoding)
Contributor:

Let's wrap this function as a method of FasterTransformer. It belongs with FasterTransformer anyway and can use its parameters and configuration directly, which makes the interface simpler.

Contributor Author:

Done. Thanks.
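
For illustration, a hedged sketch of the refactor being suggested: the helper that was called with d_model, pad_idx, weight_sharing, and use_fp16_decoding (the diff context above) becomes a method that reads those values from self, so the call site needs no arguments. The method name convert_params is hypothetical.

```python
import paddle.nn as nn

class FasterTransformer(nn.Layer):
    # Only the configuration handling is sketched; the real class also
    # builds the decoding network.
    def __init__(self, d_model, pad_idx, weight_sharing, use_fp16_decoding):
        super().__init__()
        self.d_model = d_model
        self.pad_idx = pad_idx
        self.weight_sharing = weight_sharing
        self.use_fp16_decoding = use_fp16_decoding

    def convert_params(self):
        # Hypothetical method name. As a method, the configuration comes
        # from self instead of being threaded through keyword arguments,
        # so the public interface stays small.
        dtype = "float16" if self.use_fp16_decoding else "float32"
        # ... cast / rearrange the decoding weights to `dtype` here ...
        return dtype
```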

decoding_params.stream = stream;
int device_id;
cudaGetDevice(&device_id);
fastertransformer::Allocator<AllocatorType::CUDA> allocator_(device_id);
Contributor:

We should still wrap a Paddle-native allocator for use here. Using the raw CUDA allocator directly allocates GPU memory outside what the Paddle allocator has pre-reserved; the earlier out-of-memory failures with batch sizes above 32 were probably caused by this.

Contributor:

If fastertransformer::Allocator<AllocatorType::Paddle> really requires modifying the allocator.h file in fastertransformer, let's try to handle that through CMake as much as possible.

Contributor Author:

A concrete approach could be as follows:
Modify Faster Transformer so that it builds against Paddle, then build the custom op on top of the modified Faster Transformer.
To modify Faster Transformer, we can prepare the edited code in advance and, at build time, swap in the replacement CMakeLists.txt and allocator.h before compiling.
There is no precedent for this approach, so it is unclear how easily the build issues can be solved, and the current prototype already runs inference correctly. After discussion, we will attempt this upgrade in a later PR.

guoshengCS (Contributor) left a comment:

Please also have @ZeyuChen check whether the overall code organization is appropriate.

transformer = FasterTransformer.load_dygraph_ckpt(
transformer,
init_from_params=args.init_from_params,
trg_vocab_size=args.trg_vocab_size,
Contributor:

Don't make load_dygraph_ckpt a staticmethod here; a plain transformer.load(args.init_from_params) is enough.

Contributor Author:

Done. Thanks.

public:
void Compute(const framework::ExecutionContext& ctx) const override {
PADDLE_THROW("CPU is not supported for this kernel now. Please use GPU.");
}
Contributor:

Is this CPU NotImplemented kernel necessary? If not, let's remove it.

Contributor Author:

I think it is necessary: CPU is the default kernel for an OP, and without it things crash when the GPU kernel is not compiled.

@@ -0,0 +1,132 @@
# Faster Transformer Inference


Contributor:

Start with a brief overview of what this provides, e.g. NV's Faster Transformer is integrated as a custom OP, only the decoding part with beam search is included, and training is connected with accelerated inference.

Contributor Author:

Done. Thanks.


The translation results are written to the file specified by `output_file`. When running inference, set `init_from_params` to the directory containing the model; the other options are documented with comments in `./sample/config/transformer.base.yaml` and can be changed there. If the `--config` option is not given, the base-model configuration is used by default.

Note that inference is currently single-card only: the downstream model evaluation depends on the order in which translation results are written to the file, and writing results in a specified order is not yet supported in the multi-card case.
Contributor:

If the sample part uses random data, remove this evaluation-related content.

Contributor Author:

Done. Thanks.

"Please set init_from_params to load the infer model.")

model_dict = paddle.load(
os.path.join(init_from_params, "transformer.pdparams"))
Contributor:

Don't use a staticmethod here; just read the parameters from self. Also, if "transformer.pdparams" needs to be appended, do it in the predict code, consistent with how saving works during training.

Contributor Author:

Done. Thanks.
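
A minimal sketch of the agreed shape, assuming a load instance method on FasterTransformer and that the predict script appends the checkpoint file name (following the comments above, not necessarily the merged code):

```python
import os  # used by the predict-side call shown below
import paddle
import paddle.nn as nn

class FasterTransformer(nn.Layer):
    # Only the loading logic is sketched; the real class defines the
    # full decoding network.
    def load(self, init_from_params):
        # An instance method rather than a staticmethod: anything it
        # needs (e.g. self.use_fp16_decoding) can be read from self.
        model_dict = paddle.load(init_from_params)
        self.set_state_dict(model_dict)

# In the predict script, the caller joins the file name, matching how
# parameters are saved during training:
#   transformer.load(
#       os.path.join(args.init_from_params, "transformer.pdparams"))
```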

@FrostML FrostML requested a review from ZeyuChen March 4, 2021 04:39
FrostML (Contributor Author) commented on Mar 4, 2021:

Current code layout:
Under example/, inference scripts that are based on Faster Transformer and aligned with the dygraph model:
(screenshot of the example/ directory layout)
Under paddlenlp/ext_op/, the custom op implementation plus its wrapper; compilation currently also happens in this directory:
(screenshot of the paddlenlp/ext_op/ directory layout)
transformer/ holds the wrapped API
src/ holds the custom op implementation
sample/ holds the API usage samples


return ids

def load_dygraph_ckpt(self, init_from_params, max_length):
Contributor:

As discussed, let's just call this load.

Contributor Author:

Done. Thanks.

transformer.load_dygraph_ckpt(
init_from_params=os.path.join(args.init_from_params,
"transformer.pdparams"),
max_length=args.max_length)
Contributor:

Please double-check this against the max_length=args.max_length + 1 above.

Contributor Author:

Done. Thanks.

@@ -0,0 +1,85 @@
import paddle
Member:

Suggest renaming this file to faster_transformer.py.

Contributor Author:

Done. Thanks.

@@ -0,0 +1,2 @@
from .transformer.decoding import *
from .transformer.fastertransformer import *
Member:

Renaming this module to faster_transformer would be clearer.

Contributor Author:

Done. Thanks.



def do_predict(args):
place = "gpu:0"
Member:

Do we need to pin this to exactly gpu:0?

Contributor Author:

Inference writes results to a file, and the order must stay consistent so that BLEU can be computed, so single-card inference is used. Writing 0 here is meant to let CUDA_VISIBLE_DEVICES choose the concrete card.
Also, since Faster Transformer has no CPU implementation, the place is restricted to "gpu:0".

Contributor:

Using just gpu here has the same effect; I'd prefer to drop the 0. Other models all use gpu directly, so let's unify on gpu.

Contributor Author:

Done. Thanks.
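
A minimal sketch of the agreed device handling, assuming paddle.set_device and eliding the rest of do_predict:

```python
import paddle

def do_predict(args):
    # "gpu" rather than "gpu:0": the concrete card is selected through
    # CUDA_VISIBLE_DEVICES, e.g. `CUDA_VISIBLE_DEVICES=2 python predict.py`.
    # Faster Transformer has no CPU implementation, so GPU is required.
    paddle.set_device("gpu")
    # ... build the model and run decoding here ...
```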

from paddlenlp.transformers import position_encoding_init
from paddlenlp.ext_op import FasterTransformer

sys.path.append("../")
Member:

Does this really need to be added via sys.path?

Contributor Author:

The reader lives in examples/machine_translation/transformer/; since this script is placed inside example/, there do not seem to be many alternatives, though the concerns should also be limited.
Is there a better example pattern you would recommend?

@guoshengCS guoshengCS merged commit 80c1f77 into PaddlePaddle:develop Mar 5, 2021