
float64 data backward error using gluon #9156

Closed
zhaoningning opened this issue Dec 20, 2017 · 10 comments

@zhaoningning

zhaoningning commented Dec 20, 2017

I wrote a custom loss in Gluon. With the float32 data type everything is OK, but when I change to float64 I get this error:
"include/mxnet/././tensor_blob.h:217: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type. Expected: 0 v.s. given 1"
This happens after the loss is calculated, when loss.backward() is executed.
MXNet version 1.0.0, Ubuntu 14.04, Python 2.7.

@sxjscience
Member

@zhaoningning You need to cast the type to float32 explicitly. Use arr.astype(np.float32) to cast the data type.
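A minimal sketch of the suggested cast (the array `arr` here is illustrative):

```python
import numpy as np
import mxnet as mx

# hypothetical float64 array; cast it to float32 before feeding it to the network
arr = mx.nd.ones((2, 3), dtype=np.float64)
arr32 = arr.astype(np.float32)   # returns a new NDArray with dtype float32
print(arr32.dtype)               # <class 'numpy.float32'>
```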

@zhaoningning
Author

@sxjscience But I use float64 data and float64 parameters; do I still need to cast the loss to float32? I have to use the float64 data type because the forward pass may generate very small values.

@sxjscience
Member

@zhaoningning You can try explicitly setting the dtype of all the NDArray weights/biases to float64. Also, is float64 a must? Most deep learning algorithms can run in float32.
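A sketch of that suggestion, assuming a small illustrative Dense block; `net.cast('float64')` converts every weight and bias parameter:

```python
import numpy as np
import mxnet as mx
from mxnet import gluon

# hypothetical small network, used only to illustrate casting parameters to float64
net = gluon.nn.Dense(1, in_units=8)
net.initialize()
net.cast('float64')                        # casts every weight/bias parameter to float64

x = mx.nd.ones((4, 8), dtype=np.float64)   # the input must match the parameter dtype
y = net(x)
print(y.dtype)
```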

@zhaoningning
Author

@sxjscience I have already cast all data to float64, so the forward pass is OK, but backward gives the error.
I have to use float64 because the loss calculation can produce very small values (e.g., 1e-60).
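For context, a hypothetical sketch of the kind of setup described here, with an illustrative custom loss (this is not the reporter's original code, which was not shared in the thread):

```python
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

# illustrative float64 network and custom loss; forward runs, backward is where the
# reporter saw the dtype-mismatch error
net = gluon.nn.Dense(1, in_units=4)
net.initialize()
net.cast('float64')

class TinyValueLoss(gluon.loss.Loss):
    """Hypothetical loss that can produce very small float64 values."""
    def __init__(self, **kwargs):
        super(TinyValueLoss, self).__init__(weight=None, batch_axis=0, **kwargs)

    def hybrid_forward(self, F, pred, label):
        # scale down to mimic values far below float32 precision
        return F.mean(F.square(pred - label) * 1e-30, axis=self._batch_axis, exclude=True)

loss_fn = TinyValueLoss()
x = mx.nd.ones((8, 4), dtype=np.float64)
y = mx.nd.ones((8, 1), dtype=np.float64)

with autograd.record():
    loss = loss_fn(net(x), y)
loss.backward()    # the error reported in this issue occurred at this step
```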

@Soonhwan-Kwon
Contributor

Soonhwan-Kwon commented Jan 29, 2018

Same error occurs when I use float16 and I'm not using gluon.
"mxnet.base.MXNetError: [05:42:23] include/mxnet/././tensor_blob.h:217: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type.Expected: 0 v.s. given 2"
It also happens during backward, while forward is fine.

@Soonhwan-Kwon
Contributor

Soonhwan-Kwon commented Jan 29, 2018

...
  File "/data/ecg_2018/train.py", line 93, in do_training
    module.forward_backward(data_batch)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/module/base_module.py", line 192, in forward_backward
    self.backward()
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/module/bucketing_module.py", line 444, in backward
    self._curr_module.backward(out_grads=out_grads)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/module/module.py", line 627, in backward
    self._exec_group.backward(out_grads=out_grads)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/module/executor_group.py", line 580, in backward
    exec_.backward(out_grads=out_grads_slice)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/executor.py", line 234, in backward
    ctypes.c_int(is_train)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [05:42:23] include/mxnet/././tensor_blob.h:217: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type.Expected: 0 v.s. given 2

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f03ecd9bcda]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f03ecd9c878]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(mshadow::half::half_t* mxnet::TBlob::dptr<mshadow::half::half_t>() const+0xd7) [0x7f03ecdb74a7]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(mshadow::Tensor<mshadow::gpu, 3, mshadow::half::half_t> mxnet::TBlob::get_with_shape<mshadow::gpu, 3, mshadow::half::half_t>(mshadow::Shape<3> const&, mshadow::Stream<mshadow::gpu>*) const+0x56c) [0x7f03ef94f84c]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(mxnet::op::SliceChannelOp<mshadow::gpu, mshadow::half::half_t>::Backward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x9a7) [0x7f03f0ab9be7]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(mxnet::op::OperatorState::Backward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x767) [0x7f03ef2f18a7]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool)+0x69) [0x7f03ef876429]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(+0x339a050) [0x7f03ef849050]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::NaiveEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x61) [0x7f03ef79ef61]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet-1.0.1-py2.7.egg/mxnet/libmxnet.so(mxnet::engine::NaiveEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)+0x4da) [0x7f03ef7aac8a]

@indhub
Contributor

indhub commented Apr 10, 2018

@Soonhwan-Kwon Could you please add a small example that reproduces the problem?

@sandeep-krishnamurthy
Contributor

@Soonhwan-Kwon / @zhaoningning - Can you please provide a small code sample to reproduce the issue?

@zhaoningning
Author

@sandeep-krishnamurthy Sorry, I have moved to other solutions for float64 training, and I can no longer reproduce this issue because the code was lost after such a long time.

@sandeep-krishnamurthy
Contributor

PR #12412 should fix using params other than FP32 in Gluon. Resolving. Please reopen if closed in error.
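For reference, a sketch of the all-float64 Gluon training pattern that fix is meant to enable (the network, loss, and optimizer here are illustrative):

```python
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

# illustrative end-to-end float64 training step with Gluon
net = gluon.nn.Dense(1, in_units=4)
net.initialize()
net.cast('float64')                      # float64 weights and biases

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss_fn = gluon.loss.L2Loss()

x = mx.nd.ones((8, 4), dtype=np.float64)
y = mx.nd.ones((8, 1), dtype=np.float64)

with autograd.record():
    loss = loss_fn(net(x), y)
loss.backward()
trainer.step(batch_size=8)
print(loss.mean().asscalar())
```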
