
[Activation] GELU precision mismatch between MXNet and PyTorch in the CPU version #18826

Closed
sxjscience opened this issue Jul 30, 2020 · 6 comments


@sxjscience
Member

The CPU version of mx.npx.leaky_relu(x, act_type='gelu') produces results that differ from PyTorch's GELU by more than typical float32 round-off.

The minimal reproducible example:

import mxnet as mx
import math
from numpy.testing import assert_allclose
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,))
b = mx.npx.leaky_relu(a, act_type='gelu')             # MXNet GELU on CPU
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))  # exact erf-based GELU for reference

import torch
a_torch = torch.from_numpy(a.asnumpy()).cuda()
b_torch = torch.nn.functional.gelu(a_torch)           # PyTorch GELU as the baseline
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)  # passes
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)  # fails

The GPU version has no issue:

import mxnet as mx
import math
from numpy.testing import assert_allclose
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,), ctx=mx.gpu())
b = mx.npx.leaky_relu(a, act_type='gelu')             # MXNet GELU on GPU
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))  # exact erf-based GELU for reference

import torch
a_torch = torch.from_numpy(a.asnumpy()).cuda()
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)  # passes
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)  # passes

@pengzhao-intel @ciyongch

Error:

<ipython-input-48-6f3377797f65> in <module>
      9 b_torch = torch.nn.functional.gelu(a_torch)
     10 assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
---> 11 assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)

~/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_allclose(actual, desired, rtol, atol, equal_nan, err_msg, verbose)
   1526     header = 'Not equal to tolerance rtol=%g, atol=%g' % (rtol, atol)
   1527     assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
-> 1528                          verbose=verbose, header=header, equal_nan=equal_nan)
   1529 
   1530 

~/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    838                                 verbose=verbose, header=header,
    839                                 names=('x', 'y'), precision=precision)
--> 840             raise AssertionError(msg)
    841     except ValueError:
    842         import traceback

AssertionError: 
Not equal to tolerance rtol=0.0001, atol=0.0001

Mismatched elements: 2258 / 10000 (22.6%)
Max absolute difference: 0.0004735
Max relative difference: 0.8255573
 x: array([ 0.684651,  0.508604, -0.165598, ...,  1.706593,  0.288036,
        1.006167], dtype=float32)
 y: array([ 0.68455 ,  0.508554, -0.165716, ...,  1.706508,  0.288026,
        1.005966], dtype=float32)
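
For scale, the mismatch above (max absolute difference around 4.7E-4) is what one would expect if the CPU path computed the tanh approximation of GELU instead of the exact erf form. A minimal NumPy sketch, independent of MXNet, comparing the two (the function names here are illustrative):

import math
import numpy as np

def gelu_erf(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Hendrycks & Gimpel tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * np.power(x, 3))
    return 0.5 * x * (1.0 + np.tanh(inner))

x = np.random.normal(0, 1, 10000).astype(np.float32)
print(np.abs(gelu_erf(x) - gelu_tanh(x)).max())  # a few 1E-4, same order as the mismatch above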
@TaoLv
Member

TaoLv commented Jul 30, 2020

@sxjscience Can you confirm whether the operator is dispatching to its MKL-DNN implementation?
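
(One way to check, assuming an MKL-DNN-enabled build that honors the verbose switch: set it before importing MXNet and watch stderr for primitive-execution lines when the op runs.)

import os
os.environ['MKLDNN_VERBOSE'] = '1'  # 'DNNL_VERBOSE' on newer oneDNN-based builds

import mxnet as mx
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,))
b = mx.npx.leaky_relu(a, act_type='gelu')
b.asnumpy()  # force execution; MKL-DNN primitives log verbose lines to stderr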

@sxjscience
Member Author

sxjscience commented Jul 30, 2020 via email

@TaoLv
Member

TaoLv commented Jul 30, 2020

In fact, I cannot run the reproducer as posted. I have attempted to fix the precision problem in #18827. Please let me know if it works for you. Thanks.

@sxjscience
Member Author

@TaoLv Sorry, I missed some imports in the original snippet. Here is a complete, CPU-only version:

import mxnet as mx
import math
from numpy.testing import assert_allclose
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,)) 
b = mx.npx.leaky_relu(a, act_type='gelu')
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))

import torch
a_torch = torch.from_numpy(a.asnumpy())
b_torch = torch.nn.functional.gelu(a_torch)
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)  # passes
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)  # fails before the fix

(Compiling MXNet takes a while on my machine, so it would be helpful if you could verify the fix...)
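
A quicker cross-check that avoids a rebuild, assuming this build honors the MXNET_MKLDNN_ENABLED switch: disable the MKL-DNN path before importing MXNet and rerun the comparison. If the fallback CPU kernel then matches PyTorch, the discrepancy is isolated to the MKL-DNN implementation.

import os
os.environ['MXNET_MKLDNN_ENABLED'] = '0'  # must be set before mxnet is imported

import mxnet as mx
import math
from numpy.testing import assert_allclose
mx.npx.set_np()
a = mx.np.random.normal(0, 1, (10000,))
b = mx.npx.leaky_relu(a, act_type='gelu')  # should now hit the non-MKL-DNN CPU kernel

import torch
b_torch = torch.nn.functional.gelu(torch.from_numpy(a.asnumpy()))
assert_allclose(b_torch.numpy(), b.asnumpy(), 1E-4, 1E-4)  # expected to pass if MKL-DNN is the culprit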

@pengzhao-intel
Contributor


Does the issue still exist after Tao's PR?

@sxjscience
Member Author

Yes, it's solved.
