optimize elementwise_mul_grad using new interfaces #37728

Merged — 34 commits, Jan 5, 2022

Conversation

@AshburnLee (Contributor) commented on Nov 30, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Function

This PR optimizes the backward computation of elementwise_mul using the new interfaces.

opbenchmark results

[Screenshot: opbenchmark results, 2021-12-30 14:53:57]

On the 3 configurations where CI-opbenchmark performance regressed relative to develop: after the first optimization using the reduce interface, these 3 configurations were about 15% slower than dev; after adapting the multi-output code optimization, they matched dev, while the remaining configurations either exceeded dev (1.85x–12.16x) or matched it. Overall, the result is no worse than dev or competing implementations.

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

std::vector<int> reduce_dims = GetReduceDim(x->dims(), out->dims(), axis);
gpuStream_t stream = ctx.cuda_device_context().stream();

framework::Tensor wayto_dx;
Contributor:

The variable name should be changed.

Contributor (Author):

Done

wayto_dx.Resize(dout->dims());
default_elementwise_mul<DeviceContext, T>(ctx, dout, y, &wayto_dx);

const framework::Tensor* const_to_dx =
Contributor:

This line can be merged with the one below.

Contributor (Author):

Done

@@ -48,6 +49,17 @@ template <typename T>
struct MulFunctor {
inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};

template <typename T>
struct MulFunctor<paddle::platform::complex<T>> {
Contributor:

This function can be deleted; when the type is complex, you can simply construct y_conj(y.real, -y.imag) from the original y(y.real, y.imag) and pass it into the multiplication.

Contributor (Author):

Placing this function here is indeed inappropriate; it conflicts with the semantics of MulFunctor. Fixed.
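The reviewer's suggestion can be sketched in plain C++ as follows. std::complex stands in for paddle::platform::complex<T>, and the helper names are illustrative only, not Paddle's API.

```cpp
#include <cassert>
#include <complex>

// Sketch of the suggestion: instead of a dedicated complex MulFunctor
// specialization, build y_conj = (y.real, -y.imag) from the original y
// and pass it to the ordinary multiply.
template <typename T>
std::complex<T> Conj(const std::complex<T>& y) {
  return std::complex<T>(y.real(), -y.imag());  // flip the imaginary part
}

// dx = dout * conj(y), as the complex elementwise_mul backward requires.
template <typename T>
std::complex<T> MulGradWithConj(const std::complex<T>& dout,
                                const std::complex<T>& y) {
  return dout * Conj(y);
}
```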

inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};
template <typename T>
struct MulDxDyFunctor<paddle::platform::complex<T>> {
Contributor (Author):

@AshburnLee, Dec 10, 2021

MulGradDY takes 4 parameters. Using it would require changing the caller to ElemwiseGradCompute, which ultimately invokes the pre-optimization kernel and carries a different meaning: what is needed here is ElemwiseCompute, not ElemwiseGradCompute. So I don't think MulGradDY can be reused here, which is why I provided MulDxDyFunctor.

inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};
template <typename T>
struct MulDxDyFunctor<paddle::platform::complex<T>> {
Contributor:

The name should be changed, and the functor extracted into the shared elementwise_functor.h file.
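The functor-plus-complex-specialization pattern under discussion might look roughly like this. It is a sketch using std::complex rather than paddle::platform::complex<T>, and MulGradFunctor is a placeholder name, not the name finally adopted.

```cpp
#include <cassert>
#include <complex>

// Generic multiply functor for real types.
template <typename T>
struct MulGradFunctor {
  inline T operator()(const T& x, const T& y) const { return x * y; }
};

// Specialization for complex types: conjugate the second operand,
// as elementwise_mul's backward pass requires for complex inputs.
template <typename T>
struct MulGradFunctor<std::complex<T>> {
  inline std::complex<T> operator()(const std::complex<T>& x,
                                    const std::complex<T>& y) const {
    std::complex<T> y_conj(y.real(), -y.imag());
    return x * y_conj;
  }
};
```

Placing such a functor in a shared header (as the review suggests) lets both the forward and backward kernels instantiate it without duplicating the complex-conjugate logic.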

std::vector<int> reduce_dims = GetReduceDim(x->dims(), out->dims(), axis);
gpuStream_t stream = ctx.cuda_device_context().stream();

framework::Tensor dx_tmp;
Contributor:

Suggest changing the naming.

Contributor (Author):

done. Renamed to dx_origin_dims, meaning the dx result before the reduce.

template <typename DeviceContext, typename T>
typename std::enable_if<
std::is_same<DeviceContext, platform::CPUDeviceContext>::value>::type
default_elementwise_mul_grad(const framework::ExecutionContext& ctx,
Contributor:

default_elementwise_mul_grad and elementwise_mul_grad contain duplicated code.

Contributor (Author):

done.

@@ -114,6 +116,73 @@ __global__ void SimpleElemwiseMulGradCUDAKernel<plat::complex<double>>(
}
}

template <typename T>
struct MulDxDyFunctor {
inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
Contributor:

The parameters a, b here are inconsistent with the parameters x, y below.

Contributor (Author):

done

}
}
}
*/
Contributor:

Delete the invalid comment.

Contributor (Author):

done.

template <typename DeviceContext, typename T>
typename std::enable_if<
std::is_same<DeviceContext, platform::CUDADeviceContext>::value>::type
default_elementwise_mul_grad(const framework::ExecutionContext& ctx,
Contributor:

Update this part of the code according to Zjq9409's latest merged PR.

Contributor (Author):

done.

@@ -113,6 +114,181 @@ __global__ void SimpleElemwiseMulGradCUDAKernel<plat::complex<double>>(
}
Contributor:

The SimpleElemwiseMulGradCUDAKernel function can be deleted.

Contributor (Author):

done

Contributor:

@JamesLim-sy left a comment:

I agree with this PR; if the other reviewers also agree with it, it can be merged.

Contributor:

@Zjq9409 left a comment:

#include "paddle/fluid/operators/elementwise/elementwise_op_broadcast.cu.h"
#include "paddle/fluid/platform/complex.h"
#include "paddle/fluid/platform/float16.h"

The above header files can be removed from elementwise_mul_op.cu; this can be done in the next PR.

@JamesLim-sy JamesLim-sy merged commit 36a102f into PaddlePaddle:develop Jan 5, 2022
@AshburnLee AshburnLee deleted the elem_mul_grad branch January 5, 2022 14:03