This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

add_n operator with MXNet-MKL producing wrong results when input count >4 #14858

Closed
sandeep-krishnamurthy opened this issue May 1, 2019 · 6 comments

Comments

@sandeep-krishnamurthy
Contributor

Problem:

With mxnet-mkl (1.4.0):
If the number of input symbols is greater than 4, performing add_n after an FC layer produces wrong results, i.e.,

data_0 -> fc_0  \
data_1 -> fc_1   \ 
data_2 -> fc_2      => add_n
data_3 -> fc_3  /
data_4 -> fc_4 /

Minimal reproducible code below.

First, run the full network:

import mxnet as mx

num_inp_symbols = 5
data_shape = (5,5)
hidden_layer_size = 8

input_symbols = [mx.sym.var('data_'+str(i)) for i in range(num_inp_symbols)]
fully_connected_symbols = [mx.sym.FullyConnected(data=input_symbols[i],
                                                 num_hidden=hidden_layer_size,
                                                 name='fc_' + str(i))
                           for i in range(num_inp_symbols)]

#Create final symbol
net = mx.sym.add_n(*fully_connected_symbols)
#Validate topology
#mx.viz.plot_network(net)

mod = mx.mod.Module(symbol=net, data_names=['data_0', 'data_1', 'data_2', 'data_3', 'data_4'], label_names=None)
mod.bind(for_training=False, data_shapes=[('data_0', data_shape), ('data_1', data_shape), ('data_2', data_shape), ('data_3', data_shape), ('data_4', data_shape)])
mod.init_params()  # randomly initialize the FC weights

mod.forward(mx.io.DataBatch([mx.nd.ones(data_shape), mx.nd.ones(data_shape), mx.nd.ones(data_shape), mx.nd.ones(data_shape), mx.nd.ones(data_shape)]))
print(mod.get_outputs()[0])

Output

[[ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]
 [ 2.2989948  -3.3271918   0.64880913  2.2778904   0.9859241   2.0046096
  -1.6065626   1.5986269 ]]
<NDArray 5x8 @cpu(0)>

However, let us now compute the output of each FC in the above network individually (fc0_output, fc1_output, ..., fc4_output). What I observe is that if I compute each FC output separately and sum the results, I do not get the same answer as running everything together.

constituent_fc0 = fully_connected_symbols[0]
print(constituent_fc0.get_internals().list_outputs())

mod_cons_fc0 = mx.mod.Module(symbol=constituent_fc0, data_names=['data_0'], label_names=None)
mod_cons_fc0.bind(for_training=False, data_shapes=[('data_0', data_shape)])
mod_cons_fc0.set_params(mod.get_params()[0], mod.get_params()[1], allow_extra=True)  # reuse the full network's weights; allow_extra skips the other FCs' params
mod_cons_fc0.forward(mx.io.DataBatch([mx.nd.ones(data_shape)]))
o1 = mod_cons_fc0.get_outputs()[0]

# and so on for fc1, fc2, fc3, fc4, and then:
print(mx.nd.add_n(o1, o2, o3, o4, o5))
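For completeness, a sketch of that loop (continuing the snippet above; mod, fully_connected_symbols, num_inp_symbols, and data_shape come from the repro):

individual_outputs = []
for i in range(num_inp_symbols):
    m = mx.mod.Module(symbol=fully_connected_symbols[i],
                      data_names=['data_' + str(i)], label_names=None)
    m.bind(for_training=False, data_shapes=[('data_' + str(i), data_shape)])
    # Reuse the full network's weights so the comparison is apples-to-apples
    arg_params, aux_params = mod.get_params()
    m.set_params(arg_params, aux_params, allow_extra=True)
    m.forward(mx.io.DataBatch([mx.nd.ones(data_shape)]))
    individual_outputs.append(m.get_outputs()[0])

# Expected to match the full network's output, but differs under mxnet-mkl
print(mx.nd.add_n(*individual_outputs))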

@ZhennanQin @pengzhao-intel - Can you please help debug this issue?
Please Note:

  1. All storage types are dense.
  2. The number of inputs is greater than 4.
  3. This happens only through the Module APIs, and only from mxnet-mkl 1.3.0 onwards (see the imperative sanity check below).
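As a sanity check for point 3, a minimal sketch showing that the purely imperative path gives the correct sum on five dense inputs:

import mxnet as mx

# Imperative add_n on 5 dense inputs matches a manual sum, so the bug
# appears isolated to the symbolic/Module execution path.
arrays = [mx.nd.random.uniform(shape=(5, 8)) for _ in range(5)]
manual_sum = arrays[0] + arrays[1] + arrays[2] + arrays[3] + arrays[4]
fused_sum = mx.nd.add_n(*arrays)
print((manual_sum - fused_sum).abs().max())  # ~0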
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@pengzhao-intel
Contributor

We are on public holiday this week :)
@rongzha1, please take a look at this issue after the holiday.

@rongzha1
Contributor

rongzha1 commented May 5, 2019

add_n's output memory overlaps with its input memory (due to FInplaceOption), but in the forward function ElementwiseSumContainsDnsImpl(), the output memory is zeroed, which also zeroes the aliased input: Kernel<set_zero, cpu>::Launch(s, out_data.Size(), out_data.dptr());
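To illustrate the hazard described above, a toy NumPy sketch (not the actual MXNet internals):

import numpy as np

def add_n_inplace(out, inputs):
    # Mirrors the structure described above: zero the output, then accumulate.
    out[:] = 0  # if out aliases inputs[k], inputs[k] is zeroed too
    for x in inputs:
        out += x
    return out

a, b = np.ones(3), np.ones(3)
print(add_n_inplace(np.zeros(3), [a, b]))  # [2. 2. 2.] -- correct
print(add_n_inplace(a, [a, b]))            # [1. 1. 1.] -- a's contribution lost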

In fact, this is not an MKLDNN bug.

@pengzhao-intel @zheng-da @TaoLv

@pengzhao-intel
Contributor

@rongzha1, thanks for the analysis. Would you mind filing a PR to fix this issue?

@pengzhao-intel
Contributor

btw, please add the example as a test case.
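A rough sketch of what such a test could look like (hypothetical test name; it compares the fused symbolic add_n against the sum of individually computed FC outputs):

import mxnet as mx
import numpy as np

def test_add_n_after_fc_many_inputs():
    # Regression sketch for this issue: add_n over more than 4 FC outputs.
    n, shape, hidden = 5, (5, 5), 8
    data = [mx.sym.var('data_%d' % i) for i in range(n)]
    fcs = [mx.sym.FullyConnected(data=data[i], num_hidden=hidden,
                                 name='fc_%d' % i) for i in range(n)]
    net = mx.sym.add_n(*fcs)

    mod = mx.mod.Module(symbol=net,
                        data_names=['data_%d' % i for i in range(n)],
                        label_names=None)
    mod.bind(for_training=False,
             data_shapes=[('data_%d' % i, shape) for i in range(n)])
    mod.init_params()
    mod.forward(mx.io.DataBatch([mx.nd.ones(shape)] * n))
    fused = mod.get_outputs()[0].asnumpy()

    # Reference: run each FC alone with the same weights, then sum.
    arg_params, aux_params = mod.get_params()
    outs = []
    for i in range(n):
        m = mx.mod.Module(symbol=fcs[i], data_names=['data_%d' % i],
                          label_names=None)
        m.bind(for_training=False, data_shapes=[('data_%d' % i, shape)])
        m.set_params(arg_params, aux_params, allow_extra=True)
        m.forward(mx.io.DataBatch([mx.nd.ones(shape)]))
        outs.append(m.get_outputs()[0])
    reference = mx.nd.add_n(*outs).asnumpy()

    np.testing.assert_allclose(fused, reference, rtol=1e-5, atol=1e-6)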

@pengzhao-intel
Contributor

Good catch and fixed now :)
