This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Floating point exception in mxnet.ndarray.Correlation #18942

Closed
leeyeetonn opened this issue Aug 16, 2020 · 3 comments · Fixed by #18997
Labels: Bug, C++, good first issue, Operator, v1.x

Comments

@leeyeetonn

Description

(A clear and concise description of what the bug is.)
mxnet.ndarray.Correlation crashes with a floating point exception when called with stride2=0. See the code under "To Reproduce" for an example.

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)

Floating point exception (core dumped)

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

import mxnet
import numpy as np
data1 = mxnet.nd.array(np.random.rand(1,1,1,1))
data2 = mxnet.nd.array(np.random.rand(1,1,1,1))
mxnet.ndarray.Correlation(data1=data1, data2=data2, stride2=0)

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Run the provided code in a Python interpreter or as a script.

What have you tried to solve it?

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

Got 404 when trying to get the script.

Some environment information:

  • OS: ubuntu 18.04
  • Python: 3.7.6
  • pip: 20.0.2
  • numpy: 1.18.5
  • mxnet: 1.6.0
@szha
Member

szha commented Aug 21, 2020

So here's the problem:

% DMLC_LOG_STACK_TRACE_DEPTH=150 MXNET_ENGINE_TYPE=NaiveEngine lldb python3.7 -- test_18942.py
(lldb) target create "python3.7"
Current executable set to 'python3.7' (x86_64).
(lldb) settings set -- target.run-args  "test_18942.py"
(lldb) run
Process 82919 launched: '/usr/local/bin/python3.7' (x86_64)
Process 82919 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100006000 dyld`_dyld_start
dyld`_dyld_start:
->  0x100006000 <+0>: popq   %rdi
    0x100006001 <+1>: pushq  $0x0
    0x100006003 <+3>: movq   %rsp, %rbp
    0x100006006 <+6>: andq   $-0x10, %rsp
(lldb) cont
Process 82919 resuming
[23:55:46] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[23:55:46] ../src/storage/storage.cc:198: Using Pooled (Naive) StorageManager for CPU
Process 82919 stopped
* thread #2, queue = 'com.apple.main-thread', stop reason = EXC_ARITHMETIC (code=EXC_I386_DIV, subcode=0x0)
    frame #0: 0x0000000112b06279 libmxnet.dylib`mxnet::op::CorrelationProp::InferShape(this=0x000000010056ffa0, in_shape=0x00007ffeefbfc1d0, out_shape=0x00000001005baeb0, aux_shape=0x00007ffeefbfc1b0) const at correlation-inl.h:219
   216 	     / static_cast<float>(stride1));
   217 	    top_height_ = std::ceil(static_cast<float>(paddedbottomheight - border_size_ * 2)\
   218 	     / static_cast<float>(stride1));
-> 219 	    neighborhood_grid_radius_ = param_.max_displacement / stride2;
   220 	    neighborhood_grid_width_ = neighborhood_grid_radius_ * 2 + 1;
   221 	    top_channels_ = neighborhood_grid_width_ * neighborhood_grid_width_;
   222 	    CHECK_GE(top_width_, 1U) <<

https://github.com/apache/incubator-mxnet/blob/9bdd4d6347c284770ee5bfe5ae98f1dabc283829/src/operator/correlation-inl.h#L219

The code needs to guard against a zero right operand of / (here stride2), and we should add a smoke test for this op to catch this kind of problem, similar to https://github.com/apache/incubator-mxnet/pull/18972/files
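
The failing line is plain integer division, so the crash can be illustrated without MXNet. The following is a minimal Python sketch, not the operator's actual code: `neighborhood_grid` is a hypothetical helper that mirrors lines 219 to 221 of the listing above, assuming non-negative integer inputs, with an explicit check added before the division. In the C++ code the division by zero surfaces as EXC_I386_DIV (a floating point exception on Linux); plain Python would raise ZeroDivisionError instead.

```python
def neighborhood_grid(max_displacement: int, stride2: int):
    """Hypothetical stand-in for the grid computation in InferShape."""
    if stride2 <= 0:
        # Guard added for illustration; the listed code divides unconditionally.
        raise ValueError(f"stride2 must be >= 1, got {stride2}")
    radius = max_displacement // stride2  # corresponds to line 219
    width = radius * 2 + 1                # corresponds to line 220
    channels = width * width              # corresponds to line 221
    return radius, width, channels
```

With this guard, `neighborhood_grid(1, 0)` fails with a readable error message rather than terminating the process.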

@ekdnam
Contributor

ekdnam commented Aug 24, 2020

@szha can I work on this issue? Would need some advice on how to solve it

@szha
Member

szha commented Aug 24, 2020

@ekdnam yes. I think the problem here is that neither stride1 nor stride2 should be allowed to be zero. The existing mechanism for declaring operator parameter ranges is in the declaration of the parameter object. Currently, the declaration only sets default values:
https://github.com/apache/incubator-mxnet/blob/0de7484884292eb028342b1e5669233792429af0/src/operator/correlation-inl.h#L57-L60
To fix it, we will need to do something similar to https://github.com/apache/incubator-mxnet/pull/18857/files
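
The idea of declaring a valid range next to each parameter, rather than checking at every use site, can be sketched in plain Python. This is only an illustration of the pattern, not MXNet's parameter machinery: `StrideParams` and `bounded` are hypothetical names, and the default of 1 is an assumption made for the example. The lower bound of 1 follows the fix described in this thread.

```python
from dataclasses import dataclass, field, fields


def bounded(default: int, lower: int):
    """Declare an int field with a default and an inclusive lower bound."""
    return field(default=default, metadata={"lower_bound": lower})


@dataclass
class StrideParams:
    # Hypothetical stand-in for the operator's parameter struct.
    # Default of 1 is assumed; lower bound of 1 mirrors the proposed fix.
    stride1: int = bounded(default=1, lower=1)
    stride2: int = bounded(default=1, lower=1)

    def __post_init__(self):
        # Validate every field that declares a lower bound.
        for f in fields(self):
            lower = f.metadata.get("lower_bound")
            value = getattr(self, f.name)
            if lower is not None and value < lower:
                raise ValueError(
                    f"{f.name}={value} is below the lower bound {lower}"
                )
```

Constructing `StrideParams(stride2=0)` is then rejected when the parameters are created, before any shape inference could divide by the value.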

ekdnam added a commit to ekdnam/incubator-mxnet that referenced this issue Aug 24, 2020
in issue: apache#18942, an error was occurring where strides became zero. to solve the issue, the lower bounds of stride1 and stride2 have been set to 1.
szha added a commit that referenced this issue Aug 31, 2020
* set_lower_bound(1) so that stride is not zero

in issue: #18942, an error was occurring where strides became zero. to solve the issue, the lower bounds of stride1 and stride2 have been set to 1.

* small typo

'use_unifrom' changed to 'use_uniform'

* add tests to check lower bound

checking that stride1 and stride2 are greater than zero

* Update test_operator.py

Co-authored-by: Sheng Zha
szha added a commit that referenced this issue Sep 2, 2020
* set_lower_bound(1) so that stride is not zero

in issue: #18942, an error was occurring where strides became zero. to solve the issue, the lower bounds of stride1 and stride2 have been set to 1.

* small typo

'use_unifrom' changed to 'use_uniform'

* add tests to check lower bound

checking that stride1 and stride2 are greater than zero

* Update test_operator.py

* documentation error

name is not a parameter. issue: #19001

* data is the parameter, not input

solves issue: #19000

* fix docs

threshold is not a parameter. issue: #18999

* fix docs

context 'ctx' is not a parameter. issue: #18990

* fix indentation

issue: #18988

Co-authored-by: Sheng Zha
3 participants