This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet GPU MKLDNN Build failure #15084

Closed
roywei opened this issue May 28, 2019 · 12 comments · Fixed by #15171

Comments

@roywei
Member

roywei commented May 28, 2019

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15082/2/pipeline

This is currently blocking all PRs in CI.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Build

@jlcontreras
Contributor

Thanks for pointing it out! We are having a look to try and identify the problem. Any ideas what might have caused it?

@frankfliu
Contributor

@mxnet-label-bot add [mkldnn, build, CI]

@aaronmarkham
Contributor

aaronmarkham commented May 29, 2019

> Thanks for pointing it out! We are having a look to try and identify the problem. Any ideas what might have caused it?

If we look at the trend, there's a cut-off point. Should these commits be rolled back?
http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-gpu/job/master/

@roywei
Member Author

roywei commented May 30, 2019

See the detailed discussion in #15099. We need to keep this open until the root cause is found.

As a next step, maybe try removing some build flags and see if that helps.

@anirudh2290
Member

I tried again with

make DEV=1 ENABLE_TESTCOVERAGE=1 USE_CPP_PACKAGE=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 'CUDA_ARCH=-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70' USE_SIGNAL_HANDLER=1 -j$(nproc)

and now I am not able to reproduce the bug :). It would be really helpful if you could try the same, @roywei.

In the meantime I am looking more into this, to see whether something in the archive utility could have caused it.

@anirudh2290
Member

This seems to be related to a bug in the archive utility ar: https://sourceware.org/bugzilla/show_bug.cgi?id=14625. It was fixed in a later binutils release by this patch: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=e6cc316af931911da20249e19f9342e5cf8aeeff. I am trying to reproduce the issue and have also downloaded the binutils source for the version where the fix landed. It is very likely that the MXNet static archive crossed the 4 GB limit. If I can confirm this, the CI will need to use a newer version of the archive utility.
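A quick way to check both suspicions locally, as a rough sketch (the archive path lib/libmxnet.a is an assumption about the build output and may differ in your tree):

# Show which ar version the build is using and whether the static
# archive is anywhere near the 4 GB limit from the binutils bug above.
ar --version | head -n1
ls -lh lib/libmxnet.a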

@anirudh2290
Member

anirudh2290 commented Jun 6, 2019

Alright, I was able to reproduce the issue with my commit, and I was able to fix it on the same commit by using ar version 2.27 (GNU ar (GNU Binutils) 2.27). The CI needs to be upgraded to this ar version.
@marcoabreu @jlcontreras @lebeg @perdasilva @Chancebair Please help with the upgrade. This is currently blocking an important perf regression fix and can reoccur: #15033

@roywei
Member Author

roywei commented Jun 6, 2019

Maybe we can add a script to upgrade the archive utility during the GPU: MKLDNN build stage, if this only happens with this set of build flags? This would ensure all auto-scaled machines are upgraded.
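As a rough sketch of what such a build-stage step could look like (this is not the actual CI change; the 2.27 threshold comes from the comment above):

# Hypothetical guard: fail early if the installed ar is older than 2.27,
# the version reported to contain the fix for the 4 GB archive bug.
required=2.27
current=$(ar --version | head -n1 | grep -oE '[0-9]+\.[0-9]+' | head -n1)
if [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)" != "$required" ]; then
    echo "ar $current is older than $required; upgrade binutils first" >&2
    exit 1
fi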

@anirudh2290
Member

@roywei yes, good point. This will be required in all stages sooner or later, but for now the GPU stages should be enough, I think.

@Chancebair
Contributor

Our CI instances use Ubuntu 16.04, so the latest version of ar available from apt is 2.26:

sudo apt-get install binutils
Reading package lists... Done
Building dependency tree
Reading state information... Done
binutils is already the newest version (2.26.1-1ubuntu1~16.04.8).
0 upgraded, 0 newly installed, 0 to remove and 7 not upgraded.
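
One option on 16.04 would be installing a newer ar from source during image setup; a rough sketch, assuming binutils 2.27 from the GNU mirror and a /usr/local prefix:

# Build binutils 2.27 from source so its ar shadows the distribution's 2.26.
curl -fsSL https://ftp.gnu.org/gnu/binutils/binutils-2.27.tar.gz -o /tmp/binutils-2.27.tar.gz
tar -xzf /tmp/binutils-2.27.tar.gz -C /tmp
cd /tmp/binutils-2.27
./configure --prefix=/usr/local
make -j"$(nproc)"
sudo make install
hash -r
ar --version | head -n1   # should now report 2.27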

@larroy
Contributor

larroy commented Jun 7, 2019

Would upgrading the containers to Ubuntu 18.04 alleviate this and help move our CI forward?
I think it is better not to deviate from the distribution if possible.

anirudh2290 added a commit to anirudh2290/mxnet that referenced this issue Jun 10, 2019
anirudh2290 added a commit to anirudh2290/mxnet that referenced this issue Jun 10, 2019
ptrendx pushed a commit that referenced this issue Jun 28, 2019
* Initial AMP commit

* Fix

* Merge AMP Changes

* AMP Changes to support conditional op names switch

* Add example and fix issues with AMP conversion

* Remove amp convert symbol test

* Fix comment for inference use case

* Remove input_names for convert_hybrid_block

* Check all conditions

* Fix lint

* Fix error_str for load_dict

* Fix lint, Add tests, fix bugs, add examples

* Fix warnings

* Add license for example script

* Remove gpu test and move tests to test_contrib_amp

* Clean up AMP tests

* Add additional comments, add tutorial

* Move the test to gpu dir

* Make the code python3 compatible

* Upgrade archive utility, fixes: #15084

* Allow AR path to be chosen by user

* Use current_context in tutorial

* Update __all__

* Merge with load params API changes

* Revert "Allow AR path to be chosen by user"

This reverts commit 94156b6.

* Revert "Upgrade archive utility, fixes: #15084"

This reverts commit ea7dd32.

* Set numpy dtype to float32

* Address review comments

* Add range based for

* Change quantized to low precision

* Fix lint

* Fix pylint

* Forward args for Node::Create

* Fixes

* Add dtype casting wherever needed

* Fix lint in source

* Add cast_optional_params to example

* Tweak example

* Add README

* Add README

* Add cast_optional_params test for convert_model and convert_hybrid_block