Support Mod in elementwise system #33052
Thanks for your contribution!
a58de7f to 8b06d3c
Sorry to inform you that the CI runs for 8b06d3c passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.
2021-06-22 20:01:35 [check_op_benchmark_result.py:80] [INFO] ------ OP: remainder_2 (forward) ------
2021-06-22 20:01:35 [check_op_benchmark_result.py:82] [INFO] GPU time change: 7.25394% (develop: 0.0223620 -> PR: 0.0239841)
2021-06-22 20:01:35 [check_op_benchmark_result.py:84] [INFO] Total time change: 1.92571% (develop: 0.0402996 -> PR: 0.0410757)
2021-06-22 20:01:35 [check_op_benchmark_result.py:85] [INFO] backward: False
2021-06-22 20:01:35 [check_op_benchmark_result.py:86] [INFO] parameters:
2021-06-22 20:01:35 [check_op_benchmark_result.py:88] [INFO] x (Variable) - dtype: float32, shape: [16, 2048, 7, 7]
2021-06-22 20:01:35 [check_op_benchmark_result.py:88] [INFO] y (Variable) - dtype: float32, shape: [16, 2048]
2021-06-22 20:01:35 [check_op_benchmark_result.py:88] [INFO] axis (int): 0
2021-06-22 20:01:35 [check_op_benchmark_result.py:153] [ERROR] Check speed result with case "remainder_2 (forward)" failed.
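For readers checking the numbers, the percentages in the log appear to be plain relative changes of the PR timing against the develop timing. The snippet below is a minimal sketch under that assumption; `relative_change_percent` is a hypothetical helper written for illustration and is not taken from `check_op_benchmark_result.py`.

```python
def relative_change_percent(develop: float, pr: float) -> float:
    """Relative change of the PR timing against the develop timing, in percent.

    Assumption: this is how the CI log derives its "time change" figures.
    """
    return (pr - develop) / develop * 100.0


# Timings copied from the remainder_2 (forward) entry in the log above.
gpu_change = relative_change_percent(0.0223620, 0.0239841)
total_change = relative_change_percent(0.0402996, 0.0410757)

print(f"GPU time change: {gpu_change:.2f}%")
print(f"Total time change: {total_change:.2f}%")
```

The recomputed values agree with the logged 7.25394% and 1.92571% to within the rounding of the displayed timings.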
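This configuration shows a 7% performance drop in CI, but all other configurations show a performance improvement, so this PR can be merged first.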
PR types
Performance optimization
PR changes
OPs
Describe
Based on the new elementwise + broadcast system, this PR adds support for the following binary functor: Mod
The performance variation is recorded in the statistics table below:
As the table shows, in most cases the mod operation spends less CUDA time after adopting the new elementwise + broadcast system. The 2nd test case is the exception: the old broadcast implementation performs better there. The old implementation consists of a perf-optimized branch and a common broadcast branch, and the former works well when the input tensors are relatively small and their dims meet its special requirements. The 2nd test case meets those requirements and its data size is relatively small, so the old implementation beats the new elementwise + broadcast op in that case. However, keeping this branch makes the code less compact and harder to maintain, and the range of cases it helps is narrow. Furthermore, elementwise ops in Paddle are mainly intended for neural networks, whose data sizes are usually large, so I suggest approving the 2nd test case and adopting the new elementwise + broadcast system in the mod op.
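To make the broadcast pattern of the failing `remainder_2` case concrete, here is a small pure-Python sketch of how `y` lines up with `x` when `axis=0`: `y` is matched against the dimensions of `x` starting at `axis` and is treated as size 1 elsewhere. The helper names `align_y_shape` and `broadcast_mod` are hypothetical and only illustrate the indexing; they are not the CUDA kernel from this PR. The demo uses positive integers only, so it makes no claim about how the kernel handles negative operands.

```python
import itertools


def align_y_shape(x_shape, y_shape, axis):
    """Pad y_shape with 1s so it lines up with x_shape starting at `axis`."""
    if axis < 0 or axis + len(y_shape) > len(x_shape):
        raise ValueError("y does not fit into x at the given axis")
    for x_dim, y_dim in zip(x_shape[axis:], y_shape):
        if y_dim != x_dim and y_dim != 1:
            raise ValueError("y dims must match x dims or be 1")
    trailing = len(x_shape) - axis - len(y_shape)
    return [1] * axis + list(y_shape) + [1] * trailing


def broadcast_mod(x, x_shape, y, y_shape, axis):
    """Compute x % y on flat row-major lists, broadcasting y over x."""
    aligned = align_y_shape(x_shape, y_shape, axis)
    # Row-major strides for y; broadcast dimensions (size 1) get stride 0.
    y_strides = []
    stride = 1
    for dim in reversed(aligned):
        y_strides.append(0 if dim == 1 else stride)
        stride *= dim
    y_strides.reverse()

    out = []
    indices = itertools.product(*(range(d) for d in x_shape))
    for flat_index, index in enumerate(indices):
        y_offset = sum(i * s for i, s in zip(index, y_strides))
        out.append(x[flat_index] % y[y_offset])
    return out


# Shapes from the failing CI case: y [16, 2048] is viewed as [16, 2048, 1, 1].
print(align_y_shape([16, 2048, 7, 7], [16, 2048], 0))

# A scaled-down version of the same pattern: x is [2, 3, 2], y is [2, 3].
small = broadcast_mod(list(range(10, 22)), [2, 3, 2], [2, 3, 4, 5, 6, 7], [2, 3], 0)
print(small)
```

In this layout each element of `y` is reused across the trailing 7 x 7 block of `x`, which is the small, regular pattern described above as favoring the old perf-optimized branch.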