
optimize performance of lookup_table_v2_op #39856

Merged · 2 commits merged into PaddlePaddle:develop on Feb 24, 2022

Conversation


@limin2021 (Contributor) commented on Feb 23, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Optimize the lookup_table_v2_op CUDA implementation:

  1. Modify the block configuration to improve parallelism, and replace the Eigen fill with memset operations (see the sketch right after this list).
  2. Use vectorized half2 operations to accelerate lookup_table_v2_grad for the fp16 type (see the sketch after the fp16 results below).
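
For illustration, here is a minimal sketch of what change 1 could look like for the fp32 gradient path. The kernel name, launcher, and block shape are assumptions for this sketch, not the PR's actual code:

```cuda
// Hedged sketch (assumed names, not Paddle's actual code): zero the
// weight-gradient table with cudaMemsetAsync instead of an Eigen constant
// fill, then scatter-add rows of the output gradient into it with a launch
// shape that scales with the number of ids.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void LookupTableV2GradSketch(float *d_table, const float *d_output,
                                        const int64_t *ids, int64_t num_ids,
                                        int64_t D) {
  // blockDim.x threads stride over the embedding dim D, while
  // blockDim.y * gridDim.x "rows" of threads stride over the ids, so
  // parallelism grows with both the id count and the embedding width.
  int64_t idy = blockIdx.x * blockDim.y + threadIdx.y;
  while (idy < num_ids) {
    int64_t id = ids[idy];
    const float *src = d_output + idy * D;
    float *dst = d_table + id * D;
    for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
      atomicAdd(&dst[i], src[i]);  // ids may repeat, so the add is atomic
    }
    idy += blockDim.y * gridDim.x;
  }
}

void LookupTableV2GradLaunch(float *d_table, const float *d_output,
                             const int64_t *ids, int64_t num_rows, int64_t D,
                             int64_t num_ids, cudaStream_t stream) {
  // The memset replaces the Eigen fill; a plain byte-clear is valid here
  // because fp32 zero is the all-zero byte pattern.
  cudaMemsetAsync(d_table, 0, num_rows * D * sizeof(float), stream);
  dim3 block(32, 8);  // illustrative config; the PR tunes this per problem
  int grid = static_cast<int>((num_ids + block.y - 1) / block.y);
  LookupTableV2GradSketch<<<grid, block, 0, stream>>>(d_table, d_output, ids,
                                                      num_ids, D);
}
```

Because the grid is sized from the id count rather than fixed, small and large index tensors both keep the GPU occupied.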

Performance results:

weight type: fp32

Forward (nv denotes torch; the ratios below confirm nv time = torch time):

Average time per call (ns):

| #rows | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torch | 388132.6945 | 198200.7055 | 102779.22 | 55352.4235 | 30584.144 | 18323.924 | 11965.79025 | 9187.1955 | 7767.1485 | 6862.596 |
| paddle | 297181.339 | 151734.7933 | 80506.91075 | 44245.821 | 25124.6865 | 15403.02575 | 11295.889 | 7945.50275 | 7471.5335 | 7235.57725 |
| paddle_opt | 196586.3 | 102349.4288 | 54887.281 | 31177.76275 | 19231.03625 | 12930.68825 | 8984.6565 | 7957.018 | 7522.75525 | 7291.37975 |

Speedup:

| ratio | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nv/paddle | 1.306046658 | 1.306231097 | 1.276650899 | 1.251020373 | 1.217294552 | 1.189631459 | 1.059304872 | 1.156276171 | 1.039565506 | 0.948451763 |
| nv/paddle_opt | 1.974362885 | 1.936510129 | 1.872550765 | 1.77538151 | 1.590353406 | 1.417088066 | 1.331802752 | 1.15460283 | 1.032487199 | 0.941193057 |
| paddle/paddle_opt | 1.511709305 | 1.482517246 | 1.466768061 | 1.41914676 | 1.306465558 | 1.191199219 | 1.257242166 | 0.998552818 | 0.993191092 | 0.992346785 |

Conclusion: after optimization, the forward pass is 0.99-1.51x faster than before and 0.94-1.97x faster than the competing torch implementation.

Backward (torch's backward spans several kernels, so its times are summed):

Average time per call (ns):

| #rows | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torch (sum of multiple kernels) | 1030748.672 | 800320.438 | 638265.6688 | 539160.6598 | 480192.4818 | 432415.758 | 407163.4263 | 395145.133 | 390088.3185 | 385025.1205 |
| paddle | 2137766.472 | 1073019.323 | 544362.5275 | 274703.8445 | 139399.8438 | 73540.7325 | 39879.46975 | 24690.522 | 15014.83375 | 10034.43125 |
| paddle_opt | 274375.2188 | 141527.2588 | 78582.06275 | 43410.83175 | 26401.143 | 15902.55775 | 12274.17275 | 10537.92 | 10016.00775 | 9673.0915 |

Speedup:

| ratio | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nv/paddle | 0.482161492 | 0.745858365 | 1.172501112 | 1.962697904 | 3.444713199 | 5.87994902 | 10.20985055 | 16.00391976 | 25.98019565 | 38.37039797 |
| nv/paddle_opt | 3.756711983 | 5.654885462 | 8.122281936 | 12.41995691 | 18.18832169 | 27.19158545 | 33.17237215 | 37.49745045 | 38.94648729 | 39.80372981 |
| paddle/paddle_opt | 7.791397787 | 7.581714872 | 6.927312779 | 6.328002331 | 5.280068509 | 4.624459389 | 3.249055603 | 2.343016648 | 1.499083679 | 1.037355146 |

Conclusion: after optimization, the backward pass is 1.03-7.79x faster than before and 3.75-39.8x faster than the competing torch implementation.

weight type: fp16

| config | paddle (ns) | torch (ns) | paddle-opt (ns) | paddle/paddle-opt | torch/paddle-opt |
| --- | --- | --- | --- | --- | --- |
| config-0 | 8204884.6 | 126850.5498 | 24090.1035 | 340.59x | 5.26x |

Note: config-0: table shape = [2, 768], index shape = [16*128]

| config | paddle (ns) | torch (ns) | paddle-opt (ns) | paddle/paddle-opt | torch/paddle-opt |
| --- | --- | --- | --- | --- | --- |
| all #rows from 56 to 28672 (summed) | 102822673.7 | 897208.361 | 226602.5423 | 453.75x | 3.95x |

Conclusion: the optimized version is over 100x faster than before, and more than 3x faster than the competitor (torch) in most cases.
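
As a rough illustration of change 2, fp16 gradients can be accumulated two elements at a time through __half2 atomics, which require compute capability 6.0+ and an even embedding dim. The kernel name and launch shape below are hypothetical, not the PR's code:

```cuda
// Hedged sketch (not Paddle's actual code): vectorized fp16 gradient
// accumulation. atomicAdd on __half2 needs sm_60+ (compile with an
// appropriate -arch flag) and assumes the embedding dim D is even.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void LookupTableV2GradHalf2Sketch(__half2 *d_table,
                                             const __half2 *d_output,
                                             const int64_t *ids,
                                             int64_t num_ids, int64_t D2) {
  // D2 = D / 2: the embedding dim counted in __half2 elements. Each
  // atomicAdd updates two fp16 values at once, halving the atomic traffic
  // and doubling the effective memory transaction width.
  int64_t idy = blockIdx.x * blockDim.y + threadIdx.y;
  while (idy < num_ids) {
    int64_t id = ids[idy];
    const __half2 *src = d_output + idy * D2;
    __half2 *dst = d_table + id * D2;
    for (int64_t i = threadIdx.x; i < D2; i += blockDim.x) {
      atomicAdd(&dst[i], src[i]);  // one atomic covers two fp16 values
    }
    idy += blockDim.y * gridDim.x;
  }
}
```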

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@sneaxiy (Collaborator) left a review:

Almost LGTM.

Inline comment on this diff hunk:

```cuda
CudaAtomicAdd(arr + index, value);
}

#if 0
```
@sneaxiy (Collaborator):

Useless code? Please remove it.

@limin2021 (Contributor, author):

Done.

@sneaxiy (Collaborator) left a review:

LGTM. Great job!

@limin2021 merged commit d6038c2 into PaddlePaddle:develop on Feb 24, 2022.