This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add ability to query cuDNN BatchNorm min. epsilon. Allow ONNX importer to use cuDNN BN if chosen eps >= cuDNN min. eps. #11380

Merged (1 commit, Jul 11, 2018)

Conversation

@mkolod (Contributor) commented Jun 24, 2018

Description

The ONNX importer currently always disables cuDNN BatchNorm when importing models, because if the epsilon chosen in the model is less than the cuDNN minimum (CUDNN_BN_MIN_EPSILON), the cuDNN call will fail. However, cuDNN BatchNorm need not always be disabled during import. This PR adds the ability to query the value of CUDNN_BN_MIN_EPSILON, and disables cuDNN BatchNorm during ONNX import only if the value from the ONNX model is less than the cuDNN minimum. The ability to query the cuDNN minimum can also help in other scenarios, such as verifying whether the epsilon in a model is too small for numeric stability (the cuDNN value is quite reasonable as a minimum).
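As a minimal sketch of the gating logic described above (the helper name is illustrative, not the PR's exact code; the queried minimum is assumed to be NaN when cuDNN is unavailable, as in the PR's C API):

```python
import math

def choose_cudnn_off(model_epsilon, cudnn_min_epsilon):
    """Disable cuDNN BatchNorm (cudnn_off=1) only when cuDNN is unavailable
    (queried minimum is NaN) or the model's epsilon is below the cuDNN
    minimum; otherwise leave cuDNN BN enabled (cudnn_off=0)."""
    if math.isnan(cudnn_min_epsilon):  # cuDNN not installed
        return 1
    return 0 if model_epsilon >= cudnn_min_epsilon else 1

print(choose_cudnn_off(1e-4, 1e-5))          # 0: cuDNN BN stays enabled
print(choose_cudnn_off(1e-6, 1e-5))          # 1: epsilon too small for cuDNN
print(choose_cudnn_off(1e-4, float("nan")))  # 1: cuDNN not installed
```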

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • This is a tiny change so only the call to get the value of CUDNN_BN_MIN_EPSILON (mx.base.get_cudnn_epsilon) needs documentation
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with it

Changes

  • mx.base.get_cudnn_epsilon() implementation (python/mxnet/base.py)
  • test of get_cudnn_epsilon() at tests/python/gpu/test_cudnn_eps.py
  • adaptation of the ONNX importer to only disable cuDNN BatchNorm if cuDNN is not available, or the epsilon from the ONNX model is less than CUDNN_BN_MIN_EPSILON (python/mxnet/contrib/onnx/_import/op_translations.py)

@mkolod mkolod requested a review from szha as a code owner June 24, 2018 05:17
@mkolod mkolod changed the title Add ability to query cuDNN BatchNorm min. epsilon. Allow ONNX importer to use cuDNN BN if chosen eps >= cuDNN min. aps. Add ability to query cuDNN BatchNorm min. epsilon. Allow ONNX importer to use cuDNN BN if chosen eps >= cuDNN min. eps. Jun 24, 2018
@mkolod mkolod force-pushed the cudnn_epsilon_query branch 2 times, most recently from 1978538 to bef6003 Compare June 25, 2018 16:31
@mkolod (Contributor, Author) commented Jun 25, 2018

@Roshrini ^^ in case you're interested.

@Roshrini (Member):

Thanks @mkolod for making this change. This will definitely be helpful.
@anirudhacharya @spidydev

@mkolod mkolod force-pushed the cudnn_epsilon_query branch 4 times, most recently from 663cd7e to a840509 Compare June 25, 2018 23:42
@mkolod (Contributor, Author) commented Jun 26, 2018

@marcoabreu It seems like all tests pass on all platforms, except for Windows-GPU, which is failing on all tests with "CUDA: unspecified launch failure." It seems like the issue may be with the Windows runner.

@@ -209,7 +211,9 @@ def batch_norm(attrs, inputs, proto_obj):
'is_test': 'fix_gamma'})
new_attrs = translation_utils._remove_attributes(new_attrs,
['spatial', 'consumed_inputs'])
new_attrs = translation_utils._add_extra_attributes(new_attrs, {'cudnn_off': 1})
cudnn_eps = get_cudnn_epsilon()
cudnn_off = 0 if not math.isnan(cudnn_eps) and attrs['epsilon'] >= cudnn_eps else 1
A Member commented on the diff:
if attrs does not contain 'epsilon', then it should be defaulted to 1e-5. (reference)

@mkolod (Contributor, Author) replied:

Good point!

@mkolod (Contributor, Author) replied:

@anirudhacharya I just updated the PR. Let me know if you need anything else.

@mkolod (Contributor, Author) commented Jun 27, 2018

@szha @marcoabreu Can this PR's CI be re-run once the Caffe CI issue is resolved? Thanks!

@marcoabreu (Contributor):

Sure thing! I have retriggered. It might be possible that it didn't purge the git cache (thus merging onto a commit with the broken Caffe test). If that's the case, please rebase.

@mkolod (Contributor, Author) commented Jun 28, 2018

@szha @marcoabreu I rebased because of the git caching issue with the Caffe test that you mentioned. It looks like this will require a new review because of that.

@mkolod (Contributor, Author) commented Jun 28, 2018

@marcoabreu The Python3-MKLDNN-GPU build is failing with "CUDA: unspecified launch failure." Every other build seems to have succeeded, along with the tests. What do we do now? I reported a similar problem before for the same PR (see here) but that time it was the Windows-GPU. As you can see, Python2-GPU, Python2-MKLDNN-GPU, Python2-Quantize-GPU, Python2-GPU-Win, Perl-GPU, Cpp-MKLDNN-GPU, Cpp-GPU, R-GPU and Scala-GPU, as well as all CPU builds and unit test runs, were OK. It seems like a random failure to me. What now?

@szha (Member) commented Jun 28, 2018

I'm not sure about adding this API. This doesn't seem like a scalable way of managing these constants but rather like a one-off case.

@marcoabreu (Contributor):

Sorry Marek, we identified the issue at #11395 and disabled the problematic test at #11440. Please rebase and everything should be going smooth again.

@mkolod (Contributor, Author) commented Jun 28, 2018

@szha Makes sense. In that case, I'd propose the following. How about I remove the Python function to query that cuDNN constant and inline that call in the ONNX importer, so it's not exposed as an API if it's indeed a one-off? That said, even in that case, I'd need the actual C function because that constant is not a static const float that can be accessed, and it's not a cuDNN query function either. It's a macro, so it needs the C query function. The use in Python could be inlined though. I hope that we agree on the issue that it would be useful not to disable cuDNN BN for ONNX imports if the epsilon is big enough that it is equal to, or exceeds, the cuDNN minimum. cuDNN BN is faster for both fp32 and fp16.

@mkolod (Contributor, Author) commented Jun 28, 2018

@marcoabreu No problem, I'm trying to address @szha's latest concern and will have to update the PR anyway. Thank you for your help.

@piiswrong (Contributor):

I think adding a C API just for this is way overkill.

@mkolod (Contributor, Author) commented Jun 28, 2018

@piiswrong Do you have suggestions as to what to do instead? CUDNN_BN_MIN_EPSILON is a macro, not a constant that can be called from ctypes, so it requires a C wrapper to extract the macro's value at compile time. If adding the C API is overkill, then I see 2 options:

(1) We can hard-code the value to 1e-5 in Python, and not check what cuDNN provides. In that case, we have no C API, but we risk getting the wrong value in case cuDNN changes the constant. It's extremely unlikely that cuDNN will ever change it, but it's not a guarantee.

(2) Ignore the feature altogether. This results in ONNX-imported models running more slowly than MXNet checkpointed models. To the extent that there are any numerical differences, or differences in memory use, the ONNX experience will differ from the MXNet checkpoint experience in more ways than just lower performance.

What should I do next?
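The constraint described above (a preprocessor macro can't be read through ctypes) can be illustrated with a sketch of how the Python side might wrap the PR's proposed C function. The library path and error handling here are assumptions for illustration, not the PR's actual implementation:

```python
import ctypes
import math

def get_cudnn_epsilon(lib_path="libmxnet.so"):
    # CUDNN_BN_MIN_EPSILON is a preprocessor macro in cudnn.h, so its value
    # cannot be read directly via ctypes; a C function compiled against
    # cudnn.h (the PR's MXGetCudnnBnEpsilon) must expand it at compile time
    # and hand the value back at runtime.
    try:
        lib = ctypes.CDLL(lib_path)
        fn = lib.MXGetCudnnBnEpsilon
    except (OSError, AttributeError):
        return float("nan")  # library or symbol missing -> NaN, as in the PR
    fn.argtypes = [ctypes.POINTER(ctypes.c_float)]
    fn.restype = ctypes.c_int
    result = ctypes.c_float()
    if fn(ctypes.byref(result)) != 0:
        return float("nan")  # nonzero status signals failure
    return float(result.value)

# Without libmxnet present this degrades gracefully to NaN:
print(math.isnan(get_cudnn_epsilon("nonexistent_lib.so")))  # True
```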

@szha (Member) previously requested changes Jun 28, 2018:
@mkolod please work with @piiswrong and find a more suitable solution. Thanks.

@mkolod (Contributor, Author) commented Jun 28, 2018

@szha Will do. @piiswrong, please see my previous comment where I proposed 2 solutions. Please either choose among them, or suggest something else. Thank you.

@piiswrong (Contributor):

I think the constant 1e-5 is good enough.

* \brief Query cuDNN for minimum permissible epsilon for BatchNorm. If not installed, return NaN.
* \param result the minimum epsilon provided by cuDNN
*/
MXNET_DLL int MXGetCudnnBnEpsilon(float* result);
@larroy (Contributor) commented:
should it be renamed to MXGetCudnnBNEpsilon for consistency?

@mkolod (Contributor, Author) replied:

@larroy Makes sense, but given what @piiswrong said, I'll get rid of the C call altogether and check against the 1e-5 magic value instead of adding the API that expands the macro from the particular version of cudnn.h.
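A minimal sketch of the hard-coded alternative settled on here (the function name is hypothetical; the 1e-5 default when the ONNX node omits epsilon follows the earlier review note):

```python
CUDNN_BN_MIN_EPSILON = 1e-5  # hard-coded stand-in for the cudnn.h macro

def batch_norm_cudnn_off(attrs):
    """Return cudnn_off=1 only when the model's epsilon is below the
    (hard-coded) cuDNN minimum; no C API query is needed."""
    # ONNX BatchNormalization defaults epsilon to 1e-5 when omitted.
    eps = attrs.get("epsilon", 1e-5)
    return 0 if eps >= CUDNN_BN_MIN_EPSILON else 1

print(batch_norm_cudnn_off({}))                 # 0: default eps meets minimum
print(batch_norm_cudnn_off({"epsilon": 1e-6}))  # 1: below cuDNN minimum
```

The trade-off, as noted above, is that if a future cuDNN release ever changed the macro, this hard-coded value would silently drift out of sync.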

@mkolod (Contributor, Author) commented Jul 3, 2018

@szha I addressed @piiswrong's request from here. If you agree, can you approve this change? Thanks!

@szha (Member) commented Jul 3, 2018

@mkolod did you intend to remove the C API?

@mkolod (Contributor, Author) commented Jul 3, 2018

@szha Please check the commit now.

@szha dismissed their stale review July 3, 2018 19:26 ("problem corrected")

@mkolod (Contributor, Author) commented Jul 3, 2018

@marcoabreu Note that all tests succeed except for MKLDNN-GPU, which gives:

test_operator_gpu.test_sparse_hybrid_block_grad ... [19:36:39] src/operator/tensor/./.././../common/../operator/mxnet_op.h:576: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

No other GPU build reports such an issue. 48 tests fail this way on that particular runner, but they don't fail for any other GPU build. Seems like that runner is having issues. Note that my code only pertains to the ONNX importer (see here), and these tests aren't even related to the ONNX importer.

@mkolod (Contributor, Author) commented Jul 6, 2018

@marcoabreu ping

@marcoabreu (Contributor) commented Jul 6, 2018

Hi Marek, I'm currently on vacation. @KellenSunderland could you follow up please? If you can't solve the problem, please don't hesitate to ping me again and I'll check what I can do :)

@mkolod (Contributor, Author) commented Jul 6, 2018

Thanks @marcoabreu. @KellenSunderland, could you check this out?

@KellenSunderland (Contributor) commented Jul 6, 2018 via email

@mkolod (Contributor, Author) commented Jul 10, 2018

@larroy Ping.

@larroy (Contributor) commented Jul 10, 2018

@mkolod please rebase and push; the gluon test is disabled now.

@mkolod (Contributor, Author) commented Jul 10, 2018

@larroy It looks like the CI is happy now.

@szha szha merged commit d814733 into apache:master Jul 11, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

8 participants