
Gluon-based image_classification example fails to run with dummy data on a CPU machine in "symbolic" mode #10765

Closed
juliusshufan opened this issue May 1, 2018 · 4 comments

Comments

@juliusshufan
Contributor

juliusshufan commented May 1, 2018

Description

The example/gluon/image_classification.py script fails to run on a CPU machine in "symbolic" mode,
i.e., with a command like the one below:
python image_classification.py --model=alexnet --mode=symbolic --dataset=dummy
When the mode is NOT set to "symbolic", the script runs smoothly.
Please kindly note: a build without MKLDNN has no issue. (Build command: make -j$(nproc) USE_OPENCV=1 USE_BLAS=openblas)
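
For context, below is a minimal sketch (my own, not the example script itself; batch size, input shape, and the dummy-data setup are assumptions) of the code path the script takes in symbolic mode: the Gluon AlexNet is converted to a Symbol and trained with the Module API on random data, which is where the failure reported below is raised.

# Hypothetical minimal reproduction sketch, mirroring --mode=symbolic:
# build the Gluon AlexNet, convert it to a Symbol, and train it through
# the Module API on random (dummy) data.
import numpy as np
import mxnet as mx
from mxnet.gluon.model_zoo import vision

batch_size, num_classes = 32, 1000    # assumed values
data = np.random.uniform(size=(4 * batch_size, 3, 224, 224)).astype('float32')
label = np.random.randint(0, num_classes, (4 * batch_size,)).astype('float32')
train_iter = mx.io.NDArrayIter(data, label, batch_size)

net = vision.alexnet(classes=num_classes)         # Gluon model
out = net(mx.sym.var('data'))                     # turn it into a Symbol
softmax = mx.sym.SoftmaxOutput(out, name='softmax')

mod = mx.mod.Module(softmax, context=mx.cpu())
mod.fit(train_iter,
        num_epoch=1,
        initializer=mx.init.Xavier(magnitude=2))  # fails here on the MKLDNN build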

Environment info (Required)

CentOS-7.2

What to do:
Run python image_classification.py --model=alexnet --mode=symbolic --dataset=dummy


Package used (Python/R/Scala/Julia):
Python

Build info (Required if built from source)

GCC 4.8.5

MXNet commit hash:
9f8f042

Build config:
The issue is triggered by the MKLDNN build (as noted above, a build without MKLDNN does not hit it):
make -j$(nproc) USE_MKLDNN=1 USE_OPENCV=1 USE_BLAS=mkl

Error Message:

Traceback (most recent call last):
File "image_classification.py", line 290, in
main()
File "image_classification.py", line 269, in main
initializer = mx.init.Xavier(magnitude=2))
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/module/base_module.py", line 520, in fit
self.update_metric(eval_metric, data_batch.label)
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/module/module.py", line 757, in update_metric
self.exec_group.update_metric(eval_metric, labels)
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/module/executor_group.py", line 616, in update_metric
eval_metric.update_dict(labels_, preds)
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/metric.py", line 132, in update_dict
self.update(label, pred)
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/metric.py", line 418, in update
pred_label = pred_label.asnumpy().astype('int32')
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
ctypes.c_size_t(data.size)))
File "/ec/fm/disks/nrv_algo_home01/shufanwu/pythonenv/py2.7_1/lib/python3.4/site-packages/mxnet-1.2.0-py3.4.egg/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(LIB.MXGetLastError()))
mxnet.base.MXNetError: [07:30:00] src/operator/tensor/./././elemwise_unary_op.h:302: Check failed: inputs[0].dptr_ == outputs[0].dptr_ (0x7ff485a590c0 vs. 0x7ff485b79100)

Stack trace returned 10 entries:
[bt] (0) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(dmlc::StackTrace()+0x3f) [0x7ff591c05edf]
[bt] (1) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x21) [0x7ff591c062b1]
[bt] (2) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(void mxnet::op::UnaryOp::IdentityCompute<mshadow::cpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x940) [0x7ff59238b6f0]
[bt] (3) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0xe9) [0x7ff5945baaf9]
[bt] (4) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(+0x2f40023) [0x7ff594584023]
[bt] (5) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x589) [0x7ff5945142a9]
[bt] (6) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x92) [0x7ff594524e62]
[bt] (7) /nfs/site/home/shufanwu/workspace/mxnet/v2/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x44) [0x7ff594521e54]
[bt] (8) /lib64/libstdc++.so.6(+0xb52b0) [0x7ff636eb22b0]
[bt] (9) /lib64/libpthread.so.0(+0x7e25) [0x7ff64372ae25]

@roywei
Member

roywei commented May 1, 2018

Able to reproduce on Ubuntu with gcc 5.4.0 and on macOS 10.12.6 with Apple LLVM version 9.0.0 (clang-900.0.38), tested with pip-installed packages. However, unlike @juliusshufan's observation (built from source, I guess), only the mxnet-mkl pip package causes this issue.
Packages were installed with:
pip install mxnet --pre
pip install mxnet-mkl --pre
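
(An aside, not from the original comment:) when both wheels have been installed into the same environment, a quick check of what Python actually imports is:

import mxnet as mx
# Print the version string and file path of the mxnet package that gets
# imported; both wheels install under the same package name, so this mainly
# confirms which environment/install is active rather than the wheel flavor.
print(mx.__version__)
print(mx.__file__)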

@sandeep-krishnamurthy could you help label this as Bug, Gluon, Symbol? Thanks!!

@juliusshufan
Contributor Author

@roywei Thanks for your comments and for double-checking. I retried the non-MKLDNN build from source as well as the GPU build, and symbolic mode works well with both. I think this aligns with your observation.
Sorry for the confusion. I have modified the issue description accordingly.

@juliusshufan
Contributor Author

@roywei I tried the PR from @zheng-da (#10651), which seems to address a similar issue and also resolves the issue I reported. @pengzhao-intel
As the corresponding PR is still pending merge, can we keep this issue open?

@piiswrong
Contributor

bug fix merged
