Add ability to query cuDNN BatchNorm min. epsilon. Allow ONNX importer to use cuDNN BN if chosen eps >= cuDNN min. eps. #11380
Conversation
@Roshrini ^^ in case you're interested.
Thanks @mkolod for making this change. This will definitely be helpful.
@marcoabreu It seems like all tests pass on all platforms, except for Windows-GPU, which is failing on all tests with:
```diff
@@ -209,7 +211,9 @@ def batch_norm(attrs, inputs, proto_obj):
                                                          'is_test': 'fix_gamma'})
     new_attrs = translation_utils._remove_attributes(new_attrs,
                                                      ['spatial', 'consumed_inputs'])
-    new_attrs = translation_utils._add_extra_attributes(new_attrs, {'cudnn_off': 1})
+    cudnn_eps = get_cudnn_epsilon()
+    cudnn_off = 0 if not math.isnan(cudnn_eps) and attrs['epsilon'] >= cudnn_eps else 1
```
If attrs does not contain 'epsilon', then it should default to 1e-5. (reference)
Good point!
@anirudhacharya I just updated the PR. Let me know if you need anything else.
@szha @marcoabreu Can this PR's CI be re-run once the Caffe CI issue is resolved? Thanks!
Sure thing! I have retriggered. It might be possible that it didn't purge the git cache (thus merging onto a commit with the broken Caffe test). If that's the case, please rebase.
@szha @marcoabreu I rebased because of the git caching issue with the Caffe test that you mentioned. It looks like this will require a new review because of that.
@marcoabreu The Python3-MKLDNN-GPU build is failing with "CUDA: unspecified launch failure." Every other build seems to have succeeded, along with the tests. What do we do now? I reported a similar problem before for the same PR (see here) but that time it was the Windows-GPU. As you can see, Python2-GPU, Python2-MKLDNN-GPU, Python2-Quantize-GPU, Python2-GPU-Win, Perl-GPU, Cpp-MKLDNN-GPU, Cpp-GPU, R-GPU and Scala-GPU, as well as all CPU builds and unit test runs, were OK. It seems like a random failure to me. What now?
I'm not sure about adding this API. This doesn't seem like a scalable way of managing these constants but rather like a one-off case.
@szha Makes sense. In that case, I'd propose the following. How about I remove the Python function to query that cuDNN constant and inline that call in the ONNX importer, so it's not exposed as an API if it's indeed a one-off? That said, even in that case, I'd need the actual C function because that constant is not a static const float that can be accessed, and it's not a cuDNN query function either. It's a macro, so it needs the C query function. The use in Python could be inlined though. I hope that we agree on the issue that it would be useful not to disable cuDNN BN for ONNX imports if the epsilon is big enough that it is equal to, or exceeds, the cuDNN minimum. cuDNN BN is faster for both fp32 and fp16.
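The inlined query described above could look roughly like the following. This is a minimal sketch, assuming the proposed MXGetCudnnBnEpsilon C call exists in the loaded libmxnet; the helper name and the return-NaN-on-failure convention follow the discussion in this thread, and the function parameter lets the sketch be exercised without cuDNN installed:

```python
import ctypes
import math

def get_cudnn_bn_min_epsilon(lib):
    """Query the proposed MXGetCudnnBnEpsilon C call for the cuDNN
    BatchNorm minimum epsilon; return NaN when the call fails
    (e.g. MXNet built without cuDNN). `lib` is assumed to behave like
    the ctypes handle to libmxnet."""
    result = ctypes.c_float()
    # A non-zero return code signals failure, per the usual MXNet
    # C API convention; in that case we fall back to NaN.
    if lib.MXGetCudnnBnEpsilon(ctypes.byref(result)) != 0:
        return float('nan')
    return result.value
```

In the real importer, `lib` would be the handle exposed as `mx.base._LIB`; passing it as a parameter here just keeps the sketch self-contained.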
@marcoabreu No problem, I'm trying to address @szha's latest concern and will have to update the PR anyway. Thank you for your help.
I think adding a C API just for this is way overkill.
@piiswrong Do you have suggestions as to what to do instead? (1) We can hard-code the value to 1e-5 in Python, and not check what cuDNN provides. In that case, we have no C API, but we risk getting the wrong value in case cuDNN changes the constant. It's extremely unlikely that cuDNN will ever change it, but it's not a guarantee. (2) Ignore the feature altogether. This results in ONNX imported models working more slowly than the MXNet checkpointed models. To the extent that there are any numerical differences, or differences in memory use, that will make the ONNX experience different from the MXNet checkpoint experience in more ways than just lower performance. What should I do next?
@mkolod please work with @piiswrong and find a more suitable solution. Thanks.
@szha Will do. @piiswrong, please see my previous comment where I proposed 2 solutions. Please either choose among them, or suggest something else. Thank you.
I think constant 1e-5 is good enough.
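With the constant hard-coded as suggested, the importer-side logic reduces to a few lines. A hedged sketch — the helper name `bn_cudnn_flags` is illustrative, while the 1e-5 values for both CUDNN_BN_MIN_EPSILON and the default ONNX epsilon come from the discussion above:

```python
# The value of CUDNN_BN_MIN_EPSILON from cudnn.h, hard-coded per the
# maintainers' suggestion instead of being queried through a C API.
CUDNN_BN_MIN_EPSILON = 1e-5

def bn_cudnn_flags(attrs):
    """Decide whether cuDNN BatchNorm can stay enabled for an imported
    ONNX node: disable it only when the model's epsilon is below the
    cuDNN minimum (which would make the cuDNN call fail)."""
    eps = attrs.get('epsilon', 1e-5)  # ONNX defaults epsilon to 1e-5
    return {'epsilon': eps,
            'cudnn_off': 0 if eps >= CUDNN_BN_MIN_EPSILON else 1}
```

The trade-off named earlier still applies: if a future cuDNN release ever changed the macro, this constant would silently drift out of sync.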
include/mxnet/c_api.h
```c
/*!
 * \brief Query cuDNN for minimum permissible epsilon for BatchNorm. If not installed, return NaN.
 * \param result the minimum epsilon provided by cuDNN
 */
MXNET_DLL int MXGetCudnnBnEpsilon(float* result);
```
Should it be renamed to MXGetCudnnBNEpsilon for consistency?
@larroy Makes sense, but given what @piiswrong said, I'll get rid of the C call altogether and check against the 1e-5 magic value instead of adding the API that expands the macro from the particular version of cudnn.h.
@szha I addressed @piiswrong's request from here. If you agree, can you approve this change? Thanks!
@mkolod did you intend to remove the C API?
@szha Please check the commit now.
@marcoabreu Note that all tests succeed except for MKLDNN-GPU, which gives: test_operator_gpu.test_sparse_hybrid_block_grad ... [19:36:39] src/operator/tensor/./.././../common/../operator/mxnet_op.h:576: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr: an illegal memory access was encountered. No other GPU build reports such an issue. 48 tests fail this way on that particular runner, but they don't fail for any other GPU build. Seems like that runner is having issues. Note that my code only pertains to the ONNX importer (see here), and these tests aren't even related to the ONNX importer.
@marcoabreu ping |
Hi Marek, I'm currently on vacation. @KellenSunderland could you follow up please? If you can't solve the problem, please don't hesitate to ping me again and I'll check what I can do :) |
Thanks @marcoabreu. @KellenSunderland, could you check this out? |
Hey, I can take a look (but it might take a day or so)
@larroy Ping.
@mkolod please rebase and push; the Gluon test is disabled now.
@larroy It looks like the CI is happy now.
Description
The ONNX importer always disables cuDNN BatchNorm when importing models. The reason for this is that if the epsilon chosen in the model is less than the cuDNN minimum (CUDNN_BN_MIN_EPSILON), the cuDNN call will fail. However, cuDNN BatchNorm need not always be disabled when importing models. This PR adds the ability to query the value of CUDNN_BN_MIN_EPSILON, and only disables cuDNN BatchNorm during ONNX import if the value from the ONNX model is less than the cuDNN minimum. The ability to query the minimum cuDNN value can also be helpful in other scenarios, such as when verifying whether the value in a model is too small for numeric stability (the cuDNN value is quite reasonable as a minimum).
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
- mx.base.get_cudnn_epsilon() implementation (python/mxnet/base.py)
- get_cudnn_epsilon() test at tests/python/gpu/test_cudnn_eps.py
- ONNX importer changes (python/mxnet/contrib/onnx/_import/op_translations.py)