
optimize performance of lookup_table_v2_op #39856

Merged · 2 commits merged into PaddlePaddle:develop on Feb 24, 2022

Conversation


@limin2021 (Contributor) commented on Feb 23, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Optimize the lookup_table_v2_op CUDA implementation:

  1. Modify the block configuration to improve parallelism, and replace the Eigen fill with memset operations (see the sketch right after this list).
  2. Use vectorized half2 operations to accelerate lookup_table_v2_grad for the fp16 type (see the sketch after the fp16 results below).
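
For illustration, here is a minimal sketch of what change 1 could look like for the fp32 gradient path. The kernel name, launcher, and block shape are assumptions for this sketch, not the PR's actual code:

```cuda
// Hedged sketch (assumed names, not Paddle's actual code): zero the
// weight-gradient table with cudaMemsetAsync instead of an Eigen constant
// fill, then scatter-add rows of the output gradient into it with a launch
// shape that scales with the number of ids.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void LookupTableV2GradSketch(float *d_table, const float *d_output,
                                        const int64_t *ids, int64_t num_ids,
                                        int64_t D) {
  // blockDim.x threads stride over the embedding dim D, while
  // blockDim.y * gridDim.x "rows" of threads stride over the ids, so
  // parallelism grows with both the id count and the embedding width.
  int64_t idy = blockIdx.x * blockDim.y + threadIdx.y;
  while (idy < num_ids) {
    int64_t id = ids[idy];
    const float *src = d_output + idy * D;
    float *dst = d_table + id * D;
    for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
      atomicAdd(&dst[i], src[i]);  // ids may repeat, so the add is atomic
    }
    idy += blockDim.y * gridDim.x;
  }
}

void LookupTableV2GradLaunch(float *d_table, const float *d_output,
                             const int64_t *ids, int64_t num_rows, int64_t D,
                             int64_t num_ids, cudaStream_t stream) {
  // The memset replaces the Eigen fill; a plain byte-clear is valid here
  // because fp32 zero is the all-zero byte pattern.
  cudaMemsetAsync(d_table, 0, num_rows * D * sizeof(float), stream);
  dim3 block(32, 8);  // illustrative config; the PR tunes this per problem
  int grid = static_cast<int>((num_ids + block.y - 1) / block.y);
  LookupTableV2GradSketch<<<grid, block, 0, stream>>>(d_table, d_output, ids,
                                                      num_ids, D);
}
```

Because the grid is sized from the id count rather than fixed, small and large index tensors both keep the GPU occupied.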

Performance results:

weight type: fp32

Forward (nv denotes torch; the ratios below confirm nv time = torch time):

Average time per call (ns):

| #rows | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torch | 388132.6945 | 198200.7055 | 102779.22 | 55352.4235 | 30584.144 | 18323.924 | 11965.79025 | 9187.1955 | 7767.1485 | 6862.596 |
| paddle | 297181.339 | 151734.7933 | 80506.91075 | 44245.821 | 25124.6865 | 15403.02575 | 11295.889 | 7945.50275 | 7471.5335 | 7235.57725 |
| paddle_opt | 196586.3 | 102349.4288 | 54887.281 | 31177.76275 | 19231.03625 | 12930.68825 | 8984.6565 | 7957.018 | 7522.75525 | 7291.37975 |

Speedup:

| ratio | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nv/paddle | 1.306046658 | 1.306231097 | 1.276650899 | 1.251020373 | 1.217294552 | 1.189631459 | 1.059304872 | 1.156276171 | 1.039565506 | 0.948451763 |
| nv/paddle_opt | 1.974362885 | 1.936510129 | 1.872550765 | 1.77538151 | 1.590353406 | 1.417088066 | 1.331802752 | 1.15460283 | 1.032487199 | 0.941193057 |
| paddle/paddle_opt | 1.511709305 | 1.482517246 | 1.466768061 | 1.41914676 | 1.306465558 | 1.191199219 | 1.257242166 | 0.998552818 | 0.993191092 | 0.992346785 |

Conclusion: after optimization, the forward pass is 0.99-1.51x faster than before and 0.94-1.97x faster than the competing torch implementation.

Backward (torch's backward spans several kernels, so its times are summed):

Average time per call (ns):

| #rows | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torch (sum of multiple kernels) | 1030748.672 | 800320.438 | 638265.6688 | 539160.6598 | 480192.4818 | 432415.758 | 407163.4263 | 395145.133 | 390088.3185 | 385025.1205 |
| paddle | 2137766.472 | 1073019.323 | 544362.5275 | 274703.8445 | 139399.8438 | 73540.7325 | 39879.46975 | 24690.522 | 15014.83375 | 10034.43125 |
| paddle_opt | 274375.2188 | 141527.2588 | 78582.06275 | 43410.83175 | 26401.143 | 15902.55775 | 12274.17275 | 10537.92 | 10016.00775 | 9673.0915 |

Speedup:

| ratio | 28672 | 14336 | 7168 | 3584 | 1792 | 896 | 448 | 224 | 112 | 56 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nv/paddle | 0.482161492 | 0.745858365 | 1.172501112 | 1.962697904 | 3.444713199 | 5.87994902 | 10.20985055 | 16.00391976 | 25.98019565 | 38.37039797 |
| nv/paddle_opt | 3.756711983 | 5.654885462 | 8.122281936 | 12.41995691 | 18.18832169 | 27.19158545 | 33.17237215 | 37.49745045 | 38.94648729 | 39.80372981 |
| paddle/paddle_opt | 7.791397787 | 7.581714872 | 6.927312779 | 6.328002331 | 5.280068509 | 4.624459389 | 3.249055603 | 2.343016648 | 1.499083679 | 1.037355146 |

Conclusion: after optimization, the backward pass is 1.03-7.79x faster than before and 3.75-39.8x faster than the competing torch implementation.

weight type: fp16

| config | paddle (ns) | torch (ns) | paddle-opt (ns) | paddle/paddle-opt | torch/paddle-opt |
| --- | --- | --- | --- | --- | --- |
| config-0 | 8204884.6 | 126850.5498 | 24090.1035 | 340.59x | 5.26x |

Note: config-0: table shape = [2, 768], index shape = [16*128]

| config | paddle (ns) | torch (ns) | paddle-opt (ns) | paddle/paddle-opt | torch/paddle-opt |
| --- | --- | --- | --- | --- | --- |
| all #rows from 56 to 28672 (summed) | 102822673.7 | 897208.361 | 226602.5423 | 453.75x | 3.95x |

Conclusion: the optimized version is over 100x faster than before, and more than 3x faster than the competitor (torch) in most cases.
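
As a rough illustration of change 2, fp16 gradients can be accumulated two elements at a time through __half2 atomics, which require compute capability 6.0+ and an even embedding dim. The kernel name and launch shape below are hypothetical, not the PR's code:

```cuda
// Hedged sketch (not Paddle's actual code): vectorized fp16 gradient
// accumulation. atomicAdd on __half2 needs sm_60+ (compile with an
// appropriate -arch flag) and assumes the embedding dim D is even.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void LookupTableV2GradHalf2Sketch(__half2 *d_table,
                                             const __half2 *d_output,
                                             const int64_t *ids,
                                             int64_t num_ids, int64_t D2) {
  // D2 = D / 2: the embedding dim counted in __half2 elements. Each
  // atomicAdd updates two fp16 values at once, halving the atomic traffic
  // and doubling the effective memory transaction width.
  int64_t idy = blockIdx.x * blockDim.y + threadIdx.y;
  while (idy < num_ids) {
    int64_t id = ids[idy];
    const __half2 *src = d_output + idy * D2;
    __half2 *dst = d_table + id * D2;
    for (int64_t i = threadIdx.x; i < D2; i += blockDim.x) {
      atomicAdd(&dst[i], src[i]);  // one atomic covers two fp16 values
    }
    idy += blockDim.y * gridDim.x;
  }
}
```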

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@sneaxiy (Collaborator) left a review:

Almost LGTM.

Inline comment on this diff hunk:

```cuda
CudaAtomicAdd(arr + index, value);
}

#if 0
```
@sneaxiy (Collaborator):

Useless code? Please remove it.

@limin2021 (Contributor, author):

Done.

@sneaxiy (Collaborator) left a review:

LGTM. Great job!

@limin2021 merged commit d6038c2 into PaddlePaddle:develop on Feb 24, 2022.