This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet GPU MKLDNN Build failure #15084

Closed
roywei opened this issue May 28, 2019 · 12 comments · Fixed by #15171

Comments

@roywei
Member

roywei commented May 28, 2019

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15082/2/pipeline

This is currently blocking all PRs in CI.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Build

@jlcontreras
Contributor

Thanks for pointing it out! We are having a look to try and identify the problem. Any ideas what might have caused it?

@frankfliu
Contributor

@mxnet-label-bot add [mkldnn, build, CI]

@aaronmarkham
Contributor

aaronmarkham commented May 29, 2019

> Thanks for pointing it out! We are having a look to try and identify the problem. Any ideas what might have caused it?

If we look at the trend, there's a cut-off point. Should these commits be rolled back?
http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-gpu/job/master/

@roywei
Member Author

roywei commented May 30, 2019

See the detailed discussion in #15099. We need to keep this open until the root cause is found.

As a next step, maybe try removing some build flags and see if that helps.

@anirudh2290
Member

I tried again with

make DEV=1 ENABLE_TESTCOVERAGE=1 USE_CPP_PACKAGE=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 'CUDA_ARCH=-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70' USE_SIGNAL_HANDLER=1 -j$(nproc)

and now I am not able to reproduce the bug :). It would be really helpful if you could try the same, @roywei.

In the meantime I am looking more into this, to see whether something in the archive utility could have caused it.

@anirudh2290
Member

This seems to be related to a bug in the archive utility ar: https://sourceware.org/bugzilla/show_bug.cgi?id=14625. It was fixed in a later binutils release by this patch: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=e6cc316af931911da20249e19f9342e5cf8aeeff. I am trying to reproduce the issue and have also downloaded the binutils source for the version where the fix landed. It is very likely that the MXNet static archive crossed the 4 GB limit. If I can confirm this, the CI will need to use a newer version of the archive utility.
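A quick way to check both suspicions locally, as a rough sketch (the archive path lib/libmxnet.a is an assumption about the build output and may differ in your tree):

# Show which ar version the build is using and whether the static
# archive is anywhere near the 4 GB limit from the binutils bug above.
ar --version | head -n1
ls -lh lib/libmxnet.a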

@anirudh2290
Member

anirudh2290 commented Jun 6, 2019

Alright, I was able to reproduce the issue with my commit, and I was able to fix it on the same commit by using ar version 2.27 (GNU ar (GNU Binutils) 2.27). The CI needs to be upgraded to this ar version.
@marcoabreu @jlcontreras @lebeg @perdasilva @Chancebair Please help with the upgrade. This is currently blocking an important perf regression fix and can reoccur: #15033

@roywei
Member Author

roywei commented Jun 6, 2019

Maybe we can add a script to upgrade the archive utility during the GPU: MKLDNN build stage, if this only happens with this set of build flags? This would ensure all auto-scaled machines are upgraded.
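As a rough sketch of what such a build-stage step could look like (this is not the actual CI change; the 2.27 threshold comes from the comment above):

# Hypothetical guard: fail early if the installed ar is older than 2.27,
# the version reported to contain the fix for the 4 GB archive bug.
required=2.27
current=$(ar --version | head -n1 | grep -oE '[0-9]+\.[0-9]+' | head -n1)
if [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)" != "$required" ]; then
    echo "ar $current is older than $required; upgrade binutils first" >&2
    exit 1
fi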

@anirudh2290
Member

@roywei yes, good point. This will be required in all stages sooner or later, but for now the GPU stages should be enough, I think.

@Chancebair
Contributor

Our CI instances use Ubuntu 16.04, so the latest version of ar available from apt is 2.26:

sudo apt-get install binutils
Reading package lists... Done
Building dependency tree
Reading state information... Done
binutils is already the newest version (2.26.1-1ubuntu1~16.04.8).
0 upgraded, 0 newly installed, 0 to remove and 7 not upgraded.
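
One option on 16.04 would be installing a newer ar from source during image setup; a rough sketch, assuming binutils 2.27 from the GNU mirror and a /usr/local prefix:

# Build binutils 2.27 from source so its ar shadows the distribution's 2.26.
curl -fsSL https://ftp.gnu.org/gnu/binutils/binutils-2.27.tar.gz -o /tmp/binutils-2.27.tar.gz
tar -xzf /tmp/binutils-2.27.tar.gz -C /tmp
cd /tmp/binutils-2.27
./configure --prefix=/usr/local
make -j"$(nproc)"
sudo make install
hash -r
ar --version | head -n1   # should now report 2.27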

@larroy
Contributor

larroy commented Jun 7, 2019

Would upgrading the containers to Ubuntu 18.04 alleviate this and help move our CI forward?
I think it is better not to deviate from the distribution if possible.

anirudh2290 added a commit to anirudh2290/mxnet that referenced this issue Jun 10, 2019
anirudh2290 added a commit to anirudh2290/mxnet that referenced this issue Jun 10, 2019
ptrendx pushed a commit that referenced this issue Jun 28, 2019
* Initial AMP commit

* Fix

* Merge AMP Changes

* AMP Changes to support conditional op names switch

* Add example and fix issues with AMP conversion

* Remove amp convert symbol test

* Fix comment for inference use case

* Remove input_names for convert_hybrid_block

* Check all conditions

* Fix lint

* Fix error_str for load_dict

* Fix lint, Add tests, fix bugs, add examples

* Fix warnings

* Add license for example script

* Remove gpu test and move tests to test_contrib_amp

* Clean up AMP tests

* Add additional comments, add tutorial

* Move the test to gpu dir

* Make the code python3 compatible

* Upgrade archive utility, fixes: #15084

* Allow AR path to be chosen by user

* Use current_context in tutorial

* Update __all__

* Merge with load params API changes

* Revert "Allow AR path to be chosen by user"

This reverts commit 94156b6.

* Revert "Upgrade archive utility, fixes: #15084"

This reverts commit ea7dd32.

* Set numpy dtype to float32

* Address review comments

* Add range based for

* Change quantized to low precision

* Fix lint

* Fix pylint

* Forward args for Node::Create

* Fixes

* Add dtype casting wherever needed

* Fix lint in source

* Add cast_optional_params to example

* Tweak example

* Add README

* Add README

* Add cast_optional_params test for convert_model and convert_hybrid_block