This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

add_n operator with MXNet-MKL producing wrong results when input count >4 #14858

Closed
sandeep-krishnamurthy opened this issue May 1, 2019 · 6 comments

Comments

@sandeep-krishnamurthy
Contributor

Problem:

With mxnet-mkl (1.4.0):
If the number of input symbols is greater than 4, performing add_n after an FC layer produces wrong results, i.e.,

data_0 -> fc_0  \
data_1 -> fc_1   \ 
data_2 -> fc_2      => add_n
data_3 -> fc_3  /
data_4 -> fc_4 /

Minimal reproducible code below.

First, run the full network:

import mxnet as mx

num_inp_symbols = 5
data_shape = (5,5)
hidden_layer_size = 8

input_symbols = [mx.sym.var('data_'+str(i)) for i in range(num_inp_symbols)]
fully_connected_symbols = [mx.sym.FullyConnected(data=input_symbols[i],
                                                 num_hidden=hidden_layer_size,
                                                 name='fc_' + str(i))
                           for i in range(num_inp_symbols)]

#Create final symbol
net = mx.sym.add_n(*fully_connected_symbols)
#Validate topology
#mx.viz.plot_network(net)

mod = mx.mod.Module(symbol=net, data_names=['data_0', 'data_1', 'data_2', 'data_3', 'data_4'], label_names=None)
mod.bind(for_training=False, data_shapes=[('data_0', data_shape), ('data_1', data_shape), ('data_2', data_shape), ('data_3', data_shape), ('data_4', data_shape)])
mod.init_params()  # randomly initialize the FC weights

mod.forward(mx.io.DataBatch([mx.nd.ones(data_shape), mx.nd.ones(data_shape), mx.nd.ones(data_shape), mx.nd.ones(data_shape), mx.nd.ones(data_shape)]))
print(mod.get_outputs()[0])

Output

[[ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]]
<NDArray 5x8 @cpu(0)>

However, let us now compute the output of each FC in the above network individually (fc0_output, fc1_output, ..., fc4_output). What I observe is that if I compute each FC output separately and sum the results, I do not get the same answer as running everything together.

constituent_fc0 = fully_connected_symbols[0]
print(constituent_fc0.get_internals().list_outputs())

mod_cons_fc0 = mx.mod.Module(symbol=constituent_fc0, data_names=['data_0'], label_names=None)
mod_cons_fc0.bind(for_training=False, data_shapes=[('data_0', data_shape)])
mod_cons_fc0.set_params(mod.get_params()[0], mod.get_params()[1], allow_extra=True)  # reuse the full network's weights; allow_extra skips the other FCs' params
mod_cons_fc0.forward(mx.io.DataBatch([mx.nd.ones(data_shape)]))
o1 = mod_cons_fc0.get_outputs()[0]

# and so on for fc1, fc2, fc3, fc4, and then:
print(mx.nd.add_n(o1, o2, o3, o4, o5))
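For completeness, a sketch of that loop (continuing the snippet above; mod, fully_connected_symbols, num_inp_symbols, and data_shape come from the repro):

individual_outputs = []
for i in range(num_inp_symbols):
    m = mx.mod.Module(symbol=fully_connected_symbols[i],
                      data_names=['data_' + str(i)], label_names=None)
    m.bind(for_training=False, data_shapes=[('data_' + str(i), data_shape)])
    # Reuse the full network's weights so the comparison is apples-to-apples
    arg_params, aux_params = mod.get_params()
    m.set_params(arg_params, aux_params, allow_extra=True)
    m.forward(mx.io.DataBatch([mx.nd.ones(data_shape)]))
    individual_outputs.append(m.get_outputs()[0])

# Expected to match the full network's output, but differs under mxnet-mkl
print(mx.nd.add_n(*individual_outputs))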

@ZhennanQin @pengzhao-intel - Can you please help debug this issue?
Please Note:

  1. All storage types are dense.
  2. The number of inputs is greater than 4.
  3. This happens only through the Module APIs, and only from mxnet-mkl 1.3.0 onwards (see the imperative sanity check below).
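As a sanity check for point 3, a minimal sketch showing that the purely imperative path gives the correct sum on five dense inputs:

import mxnet as mx

# Imperative add_n on 5 dense inputs matches a manual sum, so the bug
# appears isolated to the symbolic/Module execution path.
arrays = [mx.nd.random.uniform(shape=(5, 8)) for _ in range(5)]
manual_sum = arrays[0] + arrays[1] + arrays[2] + arrays[3] + arrays[4]
fused_sum = mx.nd.add_n(*arrays)
print((manual_sum - fused_sum).abs().max())  # ~0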
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@pengzhao-intel
Contributor

We are on public holiday this week :)
@rongzha1, please take a look at this issue after the holiday.

@rongzha1
Contributor

rongzha1 commented May 5, 2019

add_n's output memory overlaps with its input memory (due to FInplaceOption), but in the forward function ElementwiseSumContainsDnsImpl(), the output memory is zeroed, which also zeroes the aliased input: Kernel<set_zero, cpu>::Launch(s, out_data.Size(), out_data.dptr());
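To illustrate the hazard described above, a toy NumPy sketch (not the actual MXNet internals):

import numpy as np

def add_n_inplace(out, inputs):
    # Mirrors the structure described above: zero the output, then accumulate.
    out[:] = 0  # if out aliases inputs[k], inputs[k] is zeroed too
    for x in inputs:
        out += x
    return out

a, b = np.ones(3), np.ones(3)
print(add_n_inplace(np.zeros(3), [a, b]))  # [2. 2. 2.] -- correct
print(add_n_inplace(a, [a, b]))            # [1. 1. 1.] -- a's contribution lost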

In fact, this is not an MKLDNN bug.

@pengzhao-intel @zheng-da @TaoLv

@pengzhao-intel
Contributor

@rongzha1, thanks for the analysis. Would you mind filing a PR to fix this issue?

@pengzhao-intel
Contributor

btw, please add the example as a test case.
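A rough sketch of what such a test could look like (hypothetical test name; it compares the fused symbolic add_n against the sum of individually computed FC outputs):

import mxnet as mx
import numpy as np

def test_add_n_after_fc_many_inputs():
    # Regression sketch for this issue: add_n over more than 4 FC outputs.
    n, shape, hidden = 5, (5, 5), 8
    data = [mx.sym.var('data_%d' % i) for i in range(n)]
    fcs = [mx.sym.FullyConnected(data=data[i], num_hidden=hidden,
                                 name='fc_%d' % i) for i in range(n)]
    net = mx.sym.add_n(*fcs)

    mod = mx.mod.Module(symbol=net,
                        data_names=['data_%d' % i for i in range(n)],
                        label_names=None)
    mod.bind(for_training=False,
             data_shapes=[('data_%d' % i, shape) for i in range(n)])
    mod.init_params()
    mod.forward(mx.io.DataBatch([mx.nd.ones(shape)] * n))
    fused = mod.get_outputs()[0].asnumpy()

    # Reference: run each FC alone with the same weights, then sum.
    arg_params, aux_params = mod.get_params()
    outs = []
    for i in range(n):
        m = mx.mod.Module(symbol=fcs[i], data_names=['data_%d' % i],
                          label_names=None)
        m.bind(for_training=False, data_shapes=[('data_%d' % i, shape)])
        m.set_params(arg_params, aux_params, allow_extra=True)
        m.forward(mx.io.DataBatch([mx.nd.ones(shape)]))
        outs.append(m.get_outputs()[0])
    reference = mx.nd.add_n(*outs).asnumpy()

    np.testing.assert_allclose(fused, reference, rtol=1e-5, atol=1e-6)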

@pengzhao-intel
Contributor

Good catch and fixed now :)
