
[Paddle-Inference] Matmul_int8_convert: tensor*tensor #37285

Merged

merged 6 commits into PaddlePaddle:develop on Nov 24, 2021

Conversation

Wangzheee
Contributor

PR types: Others

PR changes: Others

Describe

Add an inference op_convert and plugin for int8-quantized matmul: it calls the Tensor Cores on NVIDIA GPUs to speed up matrix multiplication, and the plugin implementation covers int8, fp16, and fp32. By passing alpha into the plugin and computing it together with the matrix multiplication, matmul+scale fusion is achieved, which accelerates inference. Also adds a dynload implementation for dynamically loading libcublasLt.so, plus the corresponding quantization unit tests.

Benchmark: A(1, 28, 256, 1024) * B(1, 28, 1024, 256)

Kernel execution time (matmul fused with scale):

| matmul int8 layer | matmul half layer | matmul float32 layer |
| --- | --- | --- |
| 0.027 ms | 0.123 ms | 0.751 ms |

Single-op network execution time (matmul fused with scale): int8 matmul has to re-layout its input data to use Tensor Cores, which adds overhead, so the speedup only shows up when the matrices are very large. This op's implementation pre-analyzes the tensors and automatically selects whichever of the int8, fp16, and fp32 plugins performs best.

| matmul int8 op | matmul half op | matmul float32 op |
| --- | --- | --- |
| 37.2 ms | 35.6 ms | 57.1 ms |

Kernel execution time (separate matmul and scale):

| matmul int8 layer + scale layer | matmul half layer + scale layer | matmul float32 layer + scale layer |
| --- | --- | --- |
| 0.053 ms | 0.152 ms | 0.815 ms |

Single-op network execution time (separate matmul and scale):

| matmul int8 op + scale op | matmul half op + scale op | matmul float32 op + scale op |
| --- | --- | --- |
| 41.2 ms | 39.1 ms | 65.1 ms |

Summary: when the matrices are large, the matmul int8 op shows a clear speedup, and the gain is especially visible when a scale op can be fused into the matmul.
Also, matmul int8 reduces GPU memory usage slightly, by about 5%.

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

Member

@shangzhizhou left a comment


LGTM

Contributor

@Superjomn left a comment


LGTM

Contributor

@chenwhql left a comment


LGTM for PADDLE_ENFORCE

```cpp
int32_t pos, nvinfer1::PluginTensorDesc const* inOut, int32_t nbInputs,
    int32_t nbOutputs) const TRT_NOEXCEPT {
  PADDLE_ENFORCE_EQ(nbInputs, 2,
                    platform::errors::InvalidArgument("Must have 2 inputs, "
```
Contributor


Suggestion: include some environment/context information in the error message. As written, the user may not know in which scenario, or where, two inputs are required. This can be added in a follow-up.

Contributor Author


OK, I'll add it in the next PR~ thanks~

@shangzhizhou shangzhizhou merged commit 1659079 into PaddlePaddle:develop Nov 24, 2021
Zjq9409 pushed a commit to Zjq9409/Paddle that referenced this pull request Dec 10, 2021
…7285)

* matmul_convert_int8

* matmul_convert_int8

* matmulconvert_int8

* Matmul_int8_convert: tensor*tensor

* Matmul_int8_convert: tensor*tensor

* Matmul_int8_convert: tensor*tensor