MXNet GPU MKLDNN Build failure #15084
Comments
Hey, this is the MXNet Label Bot.
Thanks for pointing it out! We are having a look to try and identify the problem. Any ideas what might have caused it?
@mxnet-label-bot add [mkldnn, build, CI]
If we look at the trend, there's a cutoff point. Should these commits be rolled back?
See the detailed discussion in #15099. As a next step, maybe try removing some build flags and see if it helps.
I tried again with
and now I am not able to reproduce the bug :). It would be really helpful if you could try the same, @roywei. In the meantime I am looking into this further, to see whether something in the archive utility could have caused it.
This seems to be related to a bug in the archive utility ar: https://sourceware.org/bugzilla/show_bug.cgi?id=14625 . It has been fixed in a later version of the archive utility by this patch: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=e6cc316af931911da20249e19f9342e5cf8aeeff. I am trying to reproduce the issue and have also downloaded the binutils source for the version where the issue is fixed. It is very likely that the mxnet archive crossed the 4G limit. If I am able to confirm this, the CI will need to use a newer version of the archive utility.
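To check the "crossed the 4G limit" hypothesis on a build machine, something like the following could be used. This is only a rough sketch: it assumes the relevant threshold is 2**32 bytes (a 32-bit offset), which is my reading of "4G" above, and the helper names are made up for illustration. The archive path has to be supplied by whoever runs it.

```python
import os

# Assumption: "4G" is taken to mean 2**32 bytes, i.e. the largest value
# a 32-bit offset can address. The exact threshold in ar may differ.
FOUR_GIB = 2 ** 32


def exceeds_4gib(size_bytes: int) -> bool:
    """Return True if a file of this size is larger than 4 GiB."""
    return size_bytes > FOUR_GIB


def archive_exceeds_limit(path: str) -> bool:
    """Return True if the file at `path` is larger than 4 GiB."""
    return exceeds_4gib(os.path.getsize(path))
```

Calling `archive_exceeds_limit(...)` on the static archive produced by the failing build would show whether the size theory holds before anyone touches the CI images.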
Alright, I was able to reproduce the issue with my commit, and I was able to fix it on that commit by using ar version 2.27: GNU ar (GNU Binutils) 2.27. The CI needs to be upgraded to this ar version.
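A build script could guard against an old ar along these lines. Treat it as a sketch under assumptions: it relies on GNU ar printing a banner such as `GNU ar (GNU Binutils) 2.27` as the first line of `ar --version`, the 2.27 minimum comes from the comment above, and the function names are hypothetical.

```python
import re
import subprocess

# Minimum version taken from this thread (ar 2.27 fixed the build here).
MIN_AR_VERSION = (2, 27)


def parse_ar_version(banner: str) -> tuple:
    """Extract the version tuple from the first line of `ar --version`."""
    lines = banner.splitlines()
    first_line = lines[0] if lines else ""
    # Take the last dotted number on the line, e.g. "2.27" or "2.26.1".
    matches = re.findall(r"\d+(?:\.\d+)+", first_line)
    if not matches:
        raise ValueError("could not find a version number in: %r" % first_line)
    return tuple(int(part) for part in matches[-1].split("."))


def ar_is_new_enough(banner: str, minimum: tuple = MIN_AR_VERSION) -> bool:
    """Return True if the banner reports a version >= `minimum`."""
    return parse_ar_version(banner) >= minimum


def installed_ar_banner(ar: str = "ar") -> str:
    """Run `<ar> --version` and return its output (needs GNU ar on PATH)."""
    result = subprocess.run(
        [ar, "--version"], stdout=subprocess.PIPE, check=True,
        universal_newlines=True,
    )
    return result.stdout
```

In a build stage, `ar_is_new_enough(installed_ar_banner())` could be checked up front so the job fails early with a clear message instead of failing at link time.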
Maybe we can add a script to upgrade the archive utility during the build stage of |
@roywei yes, good point. This will be required in all stages sooner or later, but for now the GPU stages should be enough, I think.
Our CI instances use Ubuntu 16.04, so the latest version of ar available is 2.26
Would upgrading the containers to 18.04 alleviate this and help move our CI forward?
This reverts commit ea7dd32.
* Initial AMP commit
* Fix
* Merge AMP Changes
* AMP Changes to support conditional op names switch
* Add example and fix issues with AMP conversion
* Remove amp convert symbol test
* Fix comment for inference use case
* Remove input_names for convert_hybrid_block
* Check all conditions
* Fix lint
* Fix error_str for load_dict
* Fix lint, Add tests, fix bugs, add examples
* Fix warnings
* Add license for example script
* Remove gpu test and move tests to test_contrib_amp
* Clean up AMP tests
* Add additional comments, add tutorial
* Move the test to gpu dir
* Make the code python3 compatible
* Upgrade archive utility, fixes: #15084
* Allow AR path to be chosen by user
* Use current_context in tutorial
* Update __all__
* Merge with load params API changes
* Revert "Allow AR path to be chosen by user" This reverts commit 94156b6.
* Revert "Upgrade archive utility, fixes: #15084" This reverts commit ea7dd32.
* Set numpy dtype to float32
* Address review comments
* Add range based for
* Change quantized to low precision
* Fix lint
* Fix pylint
* Forward args for Node::Create
* Fixes
* Add dtype casting wherever needed
* Fix lint in source
* Add cast_optional_params to example
* Tweak example
* Add README
* Add README
* Add cast_optional_params test for convert_model and convert_hybrid_bloc
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15082/2/pipeline
This is currently blocking all PRs in CI.