Add faster transformer for decoding #37
Conversation
.
├── sample/        # Transformer machine translation usage example (beam search)
├── src/           # custom op C++/CUDA source
└── transformer/   # Python API wrapper scripts
It feels like the sample entry point would be best placed under the Transformer directory in example, to show that the training code in example connects with the inference optimization as a matched set; that is what we want to highlight. It would also let us reuse the reader, config, and other code from example. For instance, could we add a directory like faster infer under the Transformer directory in example to hold the sample?
I agree we can add one under example, but I don't think the current path needs to be removed.
sample holds more than the inference scripts; it also contains scripts such as decoding_sample.py for verifying decoding performance (following the Faster Transformer repo). Besides, decoding_gemm must be run beforehand, and it normally sits under the build path. Since we currently recommend building under ext_op, a script like decoding_sample.py placed under example would have to be run as ../../../../paddlenlp/ext_op/build/third_party/..., which is far too long.
We can add a complete inference workflow under example so the reader/config code is not duplicated, but the compiled lib would still sit under ext_op/build/lib, so specifying the lib path also gets complicated.
After discussion:
The verification scripts in sample have been switched to functional checks with a randomly initialized model, and a faster_transformer directory has been added under example, together with the scripts and documentation for running inference with Faster Transformer.
    d_model=args.d_model,
    pad_idx=args.bos_idx,
    weight_sharing=args.weight_sharing,
    use_fp16_decoding=args.use_fp16_decoding)
Let's wrap this function as a method of FasterTransformer. It belongs with FasterTransformer anyway, so it can use the parameters and configuration stored there directly, and the interface becomes simpler.
Done. Thanks.
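A minimal sketch of the suggested refactor, with illustrative names (`export_params` and the exact field set are assumptions, not the PR's actual code): the former free function becomes a method that reads its configuration straight from the instance.

```python
import paddle.nn as nn


class FasterTransformer(nn.Layer):
    def __init__(self, d_model, bos_idx, weight_sharing, use_fp16_decoding):
        super().__init__()
        # Configuration is stored once on the instance ...
        self.d_model = d_model
        self.bos_idx = bos_idx
        self.weight_sharing = weight_sharing
        self.use_fp16_decoding = use_fp16_decoding

    def export_params(self):
        # ... so the method needs no long argument list: it pulls
        # d_model, pad_idx, etc. from self instead of from args.
        return dict(
            d_model=self.d_model,
            pad_idx=self.bos_idx,
            weight_sharing=self.weight_sharing,
            use_fp16_decoding=self.use_fp16_decoding)
```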
decoding_params.stream = stream;
int device_id;
cudaGetDevice(&device_id);
fastertransformer::Allocator<AllocatorType::CUDA> allocator_(device_id);
We should still wrap a Paddle-native allocator for use here. Using the raw CUDA allocator directly will allocate GPU memory outside of what the Paddle allocator has pre-allocated; the earlier out-of-memory failures with batch_size over 32 were likely caused by this.
If `fastertransformer::Allocator<AllocatorType::Paddle>` really requires modifying the allocator.h file in fastertransformer, please try to accomplish that through CMake as much as possible.
A concrete implementation could look like this:
Modify Faster Transformer so that it is compiled against paddle, then build the custom op on top of it, making the custom op depend on Faster Transformer.
To modify Faster Transformer, for now we can prepare the edited code in advance and, at build time, swap in the replacement CMakeLists.txt and allocator.h before compiling.
There is no existing reference for this approach, so it is unclear whether the build issues are easy to resolve, and the current prototype already runs inference correctly. After discussion, we will attempt this upgrade in a follow-up PR.
Also asking @ZeyuChen to take a look at whether the overall code organization is appropriate.
transformer = FasterTransformer.load_dygraph_ckpt(
    transformer,
    init_from_params=args.init_from_params,
    trg_vocab_size=args.trg_vocab_size,
Don't make `load_dygraph_ckpt` a staticmethod here; `transformer.load(args.init_from_params)` is all that's needed.
Done. Thanks.
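A minimal sketch of the agreed-on shape, assuming a `paddle.nn.Layer` subclass (the method body is illustrative; per the later comment in this thread, the caller joins `transformer.pdparams` onto the path so it matches how training saves the checkpoint):

```python
import os

import paddle


class FasterTransformer(paddle.nn.Layer):
    def load(self, init_from_params):
        # Instance method instead of a staticmethod: the weights are
        # applied directly to self, so the call site shrinks to one line.
        model_dict = paddle.load(init_from_params)
        self.set_state_dict(model_dict)


# At the predict call site (path join stays in the predict code):
# transformer.load(
#     os.path.join(args.init_from_params, "transformer.pdparams"))
```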
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
    PADDLE_THROW("CPU is not supported for this kernel now. Please use GPU.");
  }
Is this CPU NotImplemented kernel necessary? If it isn't, remove it.
It's necessary: CPU is the default kernel for an op, and without it things would break when the GPU kernel isn't compiled.
@@ -0,0 +1,132 @@
# Faster Transformer Inference
Start with a brief overview of what this provides, e.g.: it integrates NVIDIA's Faster Transformer as a custom op, covers only the decoding part with beam search, and connects the training code with accelerated inference.
Done. Thanks.
paddlenlp/ext_op/README.md
The translation results are written to the file specified by `output_file`. When running inference, `init_from_params` must be set to the directory containing the model; the comments in `./sample/config/transformer.base.yaml` describe the remaining parameters and how to change them. If the `--config` option is not provided, the program defaults to the base model configuration.

Note that inference is currently implemented for a single card only. The reason is that the model evaluation following translation depends on the order in which prediction results are written to the file, and writing results in a specified order is not yet supported in the multi-card case.
If the sample part runs on random data, remove this evaluation-related content.
Done. Thanks.
"Please set init_from_params to load the infer model.") | ||
|
||
model_dict = paddle.load( | ||
os.path.join(init_from_params, "transformer.pdparams")) |
Don't use a staticmethod here; just use the parameters on self. Also, if "transformer.pdparams" needs to be joined onto the path, do that in the predict code, consistent with how the checkpoint is saved during training.
Done. Thanks.
        return ids

    def load_dygraph_ckpt(self, init_from_params, max_length):
As discussed, just name this `load`.
Done. Thanks.
transformer.load_dygraph_ckpt(
    init_from_params=os.path.join(args.init_from_params,
                                  "transformer.pdparams"),
    max_length=args.max_length)
Please double-check this against the `max_length=args.max_length + 1` used above.
Done. Thanks.
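For illustration, the consistency being checked (assuming the `+ 1` above exists to account for the appended start/end token in the position-encoding table; a sketch, not the PR's final code):

```python
# If the model sized its position-encoding table with max_length + 1 at
# construction time, loading must pass the same value, otherwise the
# table shapes will not match when the weights are set.
transformer.load_dygraph_ckpt(
    init_from_params=os.path.join(args.init_from_params,
                                  "transformer.pdparams"),
    max_length=args.max_length + 1)
```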
@@ -0,0 +1,85 @@
import paddle |
Suggest renaming this file to `faster_transformer.py`.
Done. Thanks.
paddlenlp/ext_op/__init__.py
@@ -0,0 +1,2 @@
from .transformer.decoding import *
from .transformer.fastertransformer import *
Renaming this module to faster_transformer would be clearer.
Done. Thanks.
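After the rename, the package exports would presumably read (the final module layout is an assumption here):

```python
from .transformer.decoding import *
from .transformer.faster_transformer import *
```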
def do_predict(args):
    place = "gpu:0"
Does this need to be pinned precisely to gpu:0?
Inference writes results to a file, and the order must stay consistent for the BLEU computation, so single-card inference is used. The 0 here is so that the specific card can be chosen by setting CUDA_VISIBLE_DEVICES.
Also, since Faster Transformer has no CPU implementation, this is restricted to "gpu:0".
Using just gpu here has the same effect, so I'd prefer dropping the 0. The other models all use gpu directly; let's unify on gpu.
Done. Thanks.
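A minimal sketch of the agreed-on device setup (the surrounding predict code is assumed):

```python
import paddle


def do_predict(args):
    # "gpu" binds to the first visible device, so the physical card is
    # chosen externally, e.g. CUDA_VISIBLE_DEVICES=3 python predict.py
    paddle.set_device("gpu")
```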
from paddlenlp.transformers import position_encoding_init
from paddlenlp.ext_op import FasterTransformer

sys.path.append("../")
Does this really have to be added via sys.path?
reader is in examples/machine_translation/transformer/; since this script lives inside example, there doesn't seem to be much choice, though there shouldn't be many concerns either.
Is there a better pattern you'd recommend?
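For context, a sketch of the import pattern under discussion (paths are the ones named in this thread; `reader` is the shared module in examples/machine_translation/transformer/). Anchoring on `__file__` rather than the bare `"../"` makes the import independent of the working directory:

```python
import os
import sys

# predict.py sits one level below the shared Transformer example code,
# so the parent directory is put on sys.path before importing reader.
sys.path.append(
    os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
import reader  # noqa: E402
```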
Add faster transformer for decoding.
The performance is as follows (V100):
The above test results are based on: