This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.9.x] mkldnn build and test failures #20643

Closed
josephevans opened this issue Oct 7, 2021 · 3 comments


@josephevans
Contributor

josephevans commented Oct 7, 2021

Description

The following tests fail consistently on the v1.9.x branch:

  1. tests/python/unittest/test_gluon.py - test_hybrid_static_memory_switching
  2. tests/cpp/operator/mkldnn_test.cc:103 - MKLDNN_UTIL_FUNC.MemFormat
  3. tests/cpp/thread_safety/thread_safety_test.cc:314 - ThreadSafety.CachedOpFullModel

Also, the mkldnn Windows builds keep failing on the v1.9.x branch. Some example failures:

Occurrences

  1. test_gluon.test_hybrid_static_memory_switching examples:
  2. MKLDNN_UTIL_FUNC.MemFormat examples:
  3. ThreadSafety.CachedOpFullModel examples:

Test/Build Failure Log Output

  1. test_gluon.test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] ======================================================================
[2021-10-07T01:27:03.835Z] ERROR: test_gluon.test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] ----------------------------------------------------------------------
[2021-10-07T01:27:03.835Z] Traceback (most recent call last):
[2021-10-07T01:27:03.835Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2021-10-07T01:27:03.835Z]     self.test(*self.arg)
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/common.py", line 218, in test_new
[2021-10-07T01:27:03.835Z]     orig_test(*args, **kwargs)
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1760, in test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z]     check_hybrid_static_memory_switching(static_alloc=True)
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1755, in check_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z]     mx.nd.waitall()
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
[2021-10-07T01:27:03.835Z]     check_call(_LIB.MXNDArrayWaitAll())
[2021-10-07T01:27:03.835Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2021-10-07T01:27:03.835Z]     raise get_last_ffi_error()
[2021-10-07T01:27:03.835Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-10-07T01:27:03.835Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x147) [0x7f9183955ee7]
[2021-10-07T01:27:03.835Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2d8) [0x7f9183942738]
[2021-10-07T01:27:03.835Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1c6) [0x7f9183940056]
[2021-10-07T01:27:03.835Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f91838680d7]
[2021-10-07T01:27:03.835Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x293) [0x7f9183867f43]
[2021-10-07T01:27:03.835Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5658020) [0x7f91835bb020]
[2021-10-07T01:27:03.835Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::MKLDNNRun(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x264) [0x7f917f3eacf4]
[2021-10-07T01:27:03.835Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x4f0) [0x7f917f3d6280]
[2021-10-07T01:27:03.835Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForwardFullFeature(mxnet::op::MKLDNNConvFullParam const&, mxnet::OpContext const&, mxnet::op::MKLDNNConvForward*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x580) [0x7f917f3d5540]
[2021-10-07T01:27:03.835Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f917ea45852]
[2021-10-07T01:27:03.835Z]   File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
[2021-10-07T01:27:03.835Z] MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): 
[2021-10-07T01:27:03.835Z] -------------------- >> begin captured logging << --------------------
[2021-10-07T01:27:03.835Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=1188622132 to reproduce.
[2021-10-07T01:27:03.835Z] --------------------- >> end captured logging << ---------------------
  2. MKLDNN_UTIL_FUNC.MemFormat
[2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC
[2021-10-07T02:39:35.601Z] [ RUN      ] MKLDNN_UTIL_FUNC.AlignMem
[2021-10-07T02:39:35.601Z] [       OK ] MKLDNN_UTIL_FUNC.AlignMem (1 ms)
[2021-10-07T02:39:35.601Z] [ RUN      ] MKLDNN_UTIL_FUNC.MemFormat
[2021-10-07T02:39:35.601Z] unknown file: Failure
[2021-10-07T02:39:35.601Z] C++ exception with description "[02:39:59] /work/mxnet/tests/cpp/operator/mkldnn_test.cc:103: Check failed: (dnnl_format_tag_last) == (222) 
[2021-10-07T02:39:35.601Z] 
[2021-10-07T02:39:35.601Z] " thrown in the test body.
[2021-10-07T02:39:35.601Z] [  FAILED  ] MKLDNN_UTIL_FUNC.MemFormat (0 ms)
[2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC (1 ms total)
  3. ThreadSafety.CachedOpFullModel
[2021-10-07T02:32:31.459Z] [ RUN      ] ThreadSafety.CachedOpFullModel
[2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:208: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:216: Symbol successfully upgraded!
[2021-10-07T02:32:34.725Z] terminate called after throwing an instance of 'dmlc::Error'
[2021-10-07T02:32:34.725Z]   what():  [02:32:57] tests/cpp/thread_safety/thread_safety_test.cc:314: MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): 
[2021-10-07T02:32:34.725Z] Stack trace:
[2021-10-07T02:32:34.725Z]   File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
[2021-10-07T02:32:34.725Z] 
[2021-10-07T02:32:34.725Z] 
[2021-10-07T02:32:34.725Z] 
[2021-10-07T02:32:34.725Z] /work/runtime_functions.sh: line 1306:  1730 Aborted                 (core dumped) build/tests/cpp/mxnet_unit_tests --gtest_filter="ThreadSafety.*"
  4. Windows build failures:
[2021-10-07T01:49:05.521Z] [748/749] Linking CXX shared library libmxnet.dll
[2021-10-07T01:49:05.521Z] FAILED: libmxnet.dll libmxnet.lib 
[2021-10-07T01:49:05.521Z] cmd.exe /C "cd . && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E vs_link_dll --intdir=CMakeFiles\mxnet.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100162~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100162~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~1\2019\COMMUN~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\mxnet.rsp  /out:libmxnet.dll /implib:libmxnet.lib /pdb:libmxnet.pdb /dll /version:0.0 /machine:x64  /INCREMENTAL:NO /OPT:REF /OPT:ICF  && cmd.exe /C "cd /D C:\jenkins_slave\workspace\build-cpu-mkldnn\build && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E copy C:/jenkins_slave/workspace/build-cpu-mkldnn/build/3rdparty/mkldnn/include/oneapi/dnnl/dnnl_config.h C:/jenkins_slave/workspace/build-cpu-mkldnn/include/mkldnn/oneapi/dnnl/ && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E copy C:/jenkins_slave/workspace/build-cpu-mkldnn/build/3rdparty/mkldnn/include/oneapi/dnnl/dnnl_version.h C:/jenkins_slave/workspace/build-cpu-mkldnn/include/mkldnn/oneapi/dnnl/""
[2021-10-07T01:49:05.521Z] LINK: command "C:\PROGRA~2\MICROS~1\2019\COMMUN~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\mxnet.rsp /out:libmxnet.dll /implib:libmxnet.lib /pdb:libmxnet.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /OPT:REF /OPT:ICF /MANIFEST /MANIFESTFILE:libmxnet.dll.manifest" failed (exit code 1120) with the following output:
[2021-10-07T01:49:05.521Z]    Creating library libmxnet.lib and object libmxnet.exp
[2021-10-07T01:49:05.521Z] LINK : warning LNK4098: defaultlib 'MSVCRT' conflicts with use of other libs; use /NODEFAULTLIB:library
[2021-10-07T01:49:05.521Z] LINK : warning LNK4217: symbol '_wcsdup' defined in 'libucrt.lib(wcsdup.obj)' is imported by 'dnnl.lib(ittnotify_static.c.obj)' in function '__itt_domain_createW_init_3_0'
[2021-10-07T01:49:05.521Z] LINK : warning LNK4217: symbol 'strncpy_s' defined in 'libucrt.lib(strncpy_s.obj)' is imported by 'dnnl.lib(ittnotify_static.c.obj)' in function '__itt_get_groups'
[2021-10-07T01:49:05.521Z] LINK : warning LNK4217: symbol 'malloc' defined in 'libucrt.lib(malloc.obj)' is imported by 'dnnl.lib(ittnotify_static.c.obj)' in function '__itt_domain_createA_init_3_0'
[2021-10-07T01:49:05.521Z] dnnl.lib(ittnotify_static.c.obj) : error LNK2019: unresolved external symbol __imp__strdup referenced in function __itt_domain_createA_init_3_0
[2021-10-07T01:49:05.521Z] libmxnet.dll : fatal error LNK1120: 1 unresolved externals
[2021-10-07T01:49:05.521Z] ninja: build stopped: subcommand failed.
[2021-10-07T01:49:05.521Z] 2021-10-07 01:49:29,320 5 build(s) have failed
[2021-10-07T01:49:05.521Z] 2021-10-07 01:49:29,320 Build failed
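
Two of the three test failures above stop at the same check in src/operator/nn/mkldnn/mkldnn_convolution.cc, which is easier to see once the Jenkins timestamps are stripped. A minimal Python sketch of that triage follows. The function name `group_check_failures` and the regular expressions are illustrative assumptions, not part of MXNet or its CI tooling; the sample lines are copied from the logs above.

```python
import re
from collections import defaultdict

# Jenkins prefixes each line with a timestamp such as "[2021-10-07T01:27:03.835Z] ".
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\]\s?")
# dmlc-style CHECK failures print "Check failed: <expression>".
CHECK_FAILED = re.compile(r"Check failed: (.+?)\s*$")


def group_check_failures(log_text):
    """Map each failed check expression to the number of log lines reporting it."""
    counts = defaultdict(int)
    for line in log_text.splitlines():
        body = TIMESTAMP.sub("", line)
        match = CHECK_FAILED.search(body)
        if match:
            # Drop the trailing ": " that precedes an (empty) failure message.
            counts[match.group(1).rstrip(": ")] += 1
    return dict(counts)


# Lines taken from the three test logs in this issue.
SAMPLE_LINES = [
    '[2021-10-07T01:27:03.835Z] MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): ',
    '[2021-10-07T02:39:35.601Z] C++ exception with description "[02:39:59] /work/mxnet/tests/cpp/operator/mkldnn_test.cc:103: Check failed: (dnnl_format_tag_last) == (222) ',
    '[2021-10-07T02:32:34.725Z]   what():  [02:32:57] tests/cpp/thread_safety/thread_safety_test.cc:314: MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): ',
]

for expression, count in group_check_failures("\n".join(SAMPLE_LINES)).items():
    print(count, expression)
```

Grouping by the check expression rather than by test name puts failures 1 and 3 in the same bucket, which hints at a single underlying cause.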
josephevans changed the title from "[v1.9.x] mkldnn test failures" to "[v1.9.x] mkldnn build and test failures" Oct 7, 2021
josephevans added a commit to josephevans/mxnet that referenced this issue Oct 7, 2021
@bartekkuncer
Contributor

@josephevans Hi, all the problems you are experiencing are caused by bumping the version of oneDNN from 2.0 to 2.3.2 in your PR (PR) without all the necessary changes around it. If you want to have v2.3.2 of oneDNN on branch v1.9.x (and presumably in the 1.9.0 release), let me know and I will create a complete PR for you.
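
Since the failures trace back to the oneDNN version under 3rdparty/mkldnn, one way to catch such a bump early is to read the version macros from the generated dnnl_version.h (the header that the Windows build log above copies into include/mkldnn/oneapi/dnnl/). The sketch below is only an illustration under stated assumptions: `parse_dnnl_version`, `matches_expected`, and `EXPECTED_MAJOR_MINOR` are hypothetical names, the sample header text is made up to mirror the 2.3.2 bump, and only major and minor are compared because this thread gives the expected version as 2.0.

```python
import re

# oneDNN's generated dnnl_version.h defines DNNL_VERSION_MAJOR/MINOR/PATCH.
VERSION_MACRO = re.compile(
    r"^\s*#\s*define\s+DNNL_VERSION_(MAJOR|MINOR|PATCH)\s+(\d+)", re.MULTILINE
)

# Assumption: v1.9.x expects oneDNN 2.0, as stated in the comment above.
EXPECTED_MAJOR_MINOR = (2, 0)


def parse_dnnl_version(header_text):
    """Return (major, minor, patch) parsed from dnnl_version.h contents."""
    parts = {name: int(value) for name, value in VERSION_MACRO.findall(header_text)}
    return (parts["MAJOR"], parts["MINOR"], parts["PATCH"])


def matches_expected(version, expected=EXPECTED_MAJOR_MINOR):
    """True when the vendored major.minor equals the expected pair."""
    return tuple(version[:2]) == tuple(expected)


# Hypothetical header contents after the accidental bump to 2.3.2.
SAMPLE_HEADER = """\
#define DNNL_VERSION_MAJOR 2
#define DNNL_VERSION_MINOR 3
#define DNNL_VERSION_PATCH 2
"""

version = parse_dnnl_version(SAMPLE_HEADER)
print(version, matches_expected(version))
```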

josephevans added a commit to josephevans/mxnet that referenced this issue Oct 7, 2021
@josephevans
Contributor Author

@bartekkuncer Thanks so much for pointing this out! I accidentally updated the submodule and didn't realize it. Reverted in my PR and will close this when it passes CI. :)
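
An accidental submodule update like this one shows up in `git diff --raw` as an entry with mode 160000 (a gitlink). The following sketch, with a hypothetical helper `changed_submodules` and made-up abbreviated hashes, filters such entries out of raw diff text. It is one possible pre-push check, not something the MXNet repository is known to ship.

```python
# Git records a submodule pointer as a "gitlink" tree entry with mode 160000.
GITLINK_MODE = "160000"


def changed_submodules(raw_diff):
    """Return (path, old_sha, new_sha) for each gitlink entry in `git diff --raw` text."""
    changes = []
    for line in raw_diff.splitlines():
        if not line.startswith(":"):
            continue  # not a raw diff record
        # Record layout: ":<src mode> <dst mode> <src sha> <dst sha> <status>\t<path>"
        meta, _, path = line.partition("\t")
        src_mode, dst_mode, old_sha, new_sha, _status = meta[1:].split()
        if GITLINK_MODE in (src_mode, dst_mode):
            changes.append((path, old_sha, new_sha))
    return changes


# Hypothetical raw diff: one ordinary file and one submodule pointer change.
# The hashes are placeholders, not the real commits from this issue.
SAMPLE_RAW_DIFF = (
    ":100644 100644 1111111 2222222 M\tLICENSE\n"
    ":160000 160000 aaaaaaa bbbbbbb M\t3rdparty/mkldnn\n"
)

for path, old_sha, new_sha in changed_submodules(SAMPLE_RAW_DIFF):
    print(f"submodule {path} moved from {old_sha} to {new_sha}")
```

In practice the input would come from something like `git diff --raw <base> HEAD`, where `<base>` is the branch the PR targets.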

@josephevans
Contributor Author

Closing, as this problem is resolved. Thanks again @bartekkuncer!

josephevans added a commit that referenced this issue Oct 7, 2021
* Update LICENSE to include symlinks in include/dmlc licensed under non-ASF-2.0 licenses.

* Update ca-certificates package on centos7 due to let's encrypt recent issue (see https://blog.devgenius.io/rhel-centos-7-fix-for-lets-encrypt-change-8af2de587fe4)

* Update PDL package before installing PDL::CCS to prevent dependency issue.

* Install latest ca-certificates package on aarch64 as well.

* Change libtiff download URL to http to prevent let's encrypt CA chain issue.

* update Dockerfile.build.ubuntu_cpu_jekyll

* Use http to download libcurl to avoid let's encrypt intermediate CA cert expiration issue.

* Lock down perl PDL version to specific version to prevent test failures.

* No need to source rvm.sh in environment now that we are using a different container.

* Update license_header.py tool to trigger error when two licenses are found.

* Remove expired CA cert from ubuntu14.04 containers.

* Revert "Change libtiff download URL to http to prevent let's encrypt CA chain issue."

This reverts commit 3ae1192.

* Revert "Use http to download libcurl to avoid let's encrypt intermediate CA cert expiration issue."

This reverts commit 92432a6.

* Back off retry count for windows builds to reduce cost.

* Split test_hybrid_static_memory_switching() into 3 tests in order to isolate failures.

* Skip mkldnn test, tracking at #20643.

* Fix lint

* Attempt to fix windows build parameters with MKLDNN builds - do not use debug builds when linking against MKLDNN.

* Revert "Update ca-certificates package on centos7 due to let's encrypt recent issue (see https://blog.devgenius.io/rhel-centos-7-fix-for-lets-encrypt-change-8af2de587fe4)"

This reverts commit 8b64859.

* Add back change after revert.

* Revert "Fix lint"

This reverts commit 34b430c.

* Revert "Skip mkldnn test, tracking at #20643."

This reverts commit f45a6e3.

* Revert "Split test_hybrid_static_memory_switching() into 3 tests in order to isolate failures."

This reverts commit 23db9ba.

* Revert changing windows build flags.

Co-authored-by: Wei Chu <[email protected]>
josephevans added a commit that referenced this issue Oct 8, 2021
* [v1.9.x] LICENSE and CI fixes for 1.9 release (#20626)

* Update LICENSE to include symlinks in include/dmlc licensed under non-ASF-2.0 licenses.

* Update ca-certificates package on centos7 due to let's encrypt recent issue (see https://blog.devgenius.io/rhel-centos-7-fix-for-lets-encrypt-change-8af2de587fe4)

* Update PDL package before installing PDL::CCS to prevent dependency issue.

* Install latest ca-certificates package on aarch64 as well.

* Change libtiff download URL to http to prevent let's encrypt CA chain issue.

* update Dockerfile.build.ubuntu_cpu_jekyll

* Use http to download libcurl to avoid let's encrypt intermediate CA cert expiration issue.

* Lock down perl PDL version to specific version to prevent test failures.

* No need to source rvm.sh in environment now that we are using a different container.

* Update license_header.py tool to trigger error when two licenses are found.

* Remove expired CA cert from ubuntu14.04 containers.

* Revert "Change libtiff download URL to http to prevent let's encrypt CA chain issue."

This reverts commit 3ae1192.

* Revert "Use http to download libcurl to avoid let's encrypt intermediate CA cert expiration issue."

This reverts commit 92432a6.

* Back off retry count for windows builds to reduce cost.

* Split test_hybrid_static_memory_switching() into 3 tests in order to isolate failures.

* Skip mkldnn test, tracking at #20643.

* Fix lint

* Attempt to fix windows build parameters with MKLDNN builds - do not use debug builds when linking against MKLDNN.

* Revert "Update ca-certificates package on centos7 due to let's encrypt recent issue (see https://blog.devgenius.io/rhel-centos-7-fix-for-lets-encrypt-change-8af2de587fe4)"

This reverts commit 8b64859.

* Add back change after revert.

* Revert "Fix lint"

This reverts commit 34b430c.

* Revert "Skip mkldnn test, tracking at #20643."

This reverts commit f45a6e3.

* Revert "Split test_hybrid_static_memory_switching() into 3 tests in order to isolate failures."

This reverts commit 23db9ba.

* Revert changing windows build flags.

Co-authored-by: Wei Chu <[email protected]>

* Update openssl package in ubuntu_core.sh (used in ubuntu 16.04 images) to avoid bug triggered by let's encrypt expired ca cert.

Co-authored-by: Wei Chu <[email protected]>