
test_operator_gpu.test_embedding_with_type 'an illegal memory access was encountered' #17713

Closed · leezu opened this issue Feb 27, 2020 · 7 comments

leezu (Contributor) commented Feb 27, 2020

Description

The Embedding operator exercised by test_operator_gpu.test_embedding_with_type deterministically triggers an "illegal memory access" error on a G4 instance.
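
For context, a minimal sketch of the kind of Embedding call the test exercises (shapes and dtypes here are illustrative, not taken from the test itself):

import mxnet as mx

# Run the Embedding operator on the GPU; on an affected build the illegal
# memory access surfaces when the result is copied back to the host.
ctx = mx.gpu(0)
data = mx.nd.array([0, 1, 2, 3, 4], ctx=ctx)
weight = mx.nd.random.normal(shape=(5, 5), ctx=ctx)
out = mx.nd.Embedding(data=data, weight=weight, input_dim=5, output_dim=5)
out.asnumpy()  # synchronizes with the GPU stream; the failure shows up here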

Error Message

% nosetests --verbose --stop ../tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type
/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py:2402: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if req_dict['data'] is 'write':
/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py:2404: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if req_dict['grid'] is 'write':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3738: SyntaxWarning: "is" with a literal. Did you mean "=="?
  assert_almost_equal(exe.outputs[0], np_out, rtol=1e-2 if dtype is 'float16' else 1e-5, atol=1e-5)
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3897: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out = l1norm(in_data, i) if order is 1 else l2norm(in_data, i)
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3898: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out_backward = np.sign(in_data) if order is 1 else in_data/npy_out
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3914: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out = l1norm(in_data, (i, i+1)) if order is 1 else l2norm(in_data, (i, i+1))
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3915: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out_backward = np.sign(in_data) if order is 1 else in_data/npy_out
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2293: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if len(func_data) is 4:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2763: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis_size is 0:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2775: SyntaxWarning: "is" with a literal. Did you mean "=="?
  sections = 7 if shape[axis] is 0 else shape[axis]
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2816: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis_size is 0:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2836: SyntaxWarning: "is" with a literal. Did you mean "=="?
  sections = 7 if x.shape[axis] is 0 else random.randint(1,x.shape[axis])
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2871: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis_size is 0:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2888: SyntaxWarning: "is" with a literal. Did you mean "=="?
  sections = 7 if axis_size is 0 else axis_size
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:6194: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis is -1:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_random.py:579: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if len(x.shape) is 1:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_ndarray.py:487: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if default_context().device_type is 'gpu':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_ndarray.py:1043: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if default_context().device_type is 'gpu':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:363: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'default' and rhs_stype is 'row_sparse') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:363: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'default' and rhs_stype is 'row_sparse') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:364: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'default' and rhs_stype is 'csr') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:364: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'default' and rhs_stype is 'csr') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:365: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (rhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:365: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (rhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:387: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (lhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:387: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (lhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1272: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if default_context().device_type is 'gpu':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1797: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if dim is 2:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1805: SyntaxWarning: "is" with a literal. Did you mean "=="?
  stypes = ([all_stypes[pick_type]] if pick_side is 0 else []) + stypes + ([all_stypes[pick_type]] if pick_side is 1 else [])
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1805: SyntaxWarning: "is" with a literal. Did you mean "=="?
  stypes = ([all_stypes[pick_type]] if pick_side is 0 else []) + stypes + ([all_stypes[pick_type]] if pick_side is 1 else [])
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=94434600 to reproduce.
test_operator_gpu.test_embedding_with_type ... [23:05:48] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=457645343 to reproduce.
ERROR

======================================================================
ERROR: test_operator_gpu.test_embedding_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.8.2/lib/python3.8/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/common.py", line 215, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py", line 1639, in test_embedding_with_type
    test_embedding_helper(data_types, weight_types, 5, 5)
  File "/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py", line 1634, in test_embedding_helper
    check_consistency(sym, ctx_list, grad_req={'embedding_data': 'null','embedding_weight': 'write'},
  File "/home/ubuntu/src/mxnet-master/python/mxnet/test_utils.py", line 1572, in check_consistency
    assert_almost_equal(arr, gtarr, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/home/ubuntu/src/mxnet-master/python/mxnet/test_utils.py", line 602, in assert_almost_equal
    if output.asnumpy() == 1:
  File "/home/ubuntu/src/mxnet-master/python/mxnet/ndarray/ndarray.py", line 2558, in asnumpy
    check_call(_LIB.MXNDArraySyncCopyToCPU(
  File "/home/ubuntu/src/mxnet-master/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mshadow/./stream_gpu-inl.h", line 62
CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
-------------------- >> begin captured logging << --------------------
root: INFO: NumPy-shape semantics has been activated in your code. This is required for creating and manipulating scalar and zero-size tensors, which were not supported in MXNet before, as in the official NumPy library. Please DO NOT manually deactivate this semantics while using `mxnet.numpy` and `mxnet.numpy_extension` modules.
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=94434600 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=457645343 to reproduce.
--------------------- >> end captured logging << ---------------------

To Reproduce

nosetests --verbose tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type

Steps to reproduce

  1. On an EC2 g4 instance, compile MXNet with CUDA support, for example: CC=clang-9 CXX=clang++-9 cmake -GNinja -DUSE_MKLDNN=1 -DUSE_CUDA=ON .. ; ninja
  2. Run nosetests --verbose --stop ../tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type

Environment

----------Python Info----------
Version      : 3.8.2
Compiler     : GCC 7.4.0
Build        : ('default', 'Feb 27 2020 21:58:06')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.2.3
Directory    : /home/ubuntu/.pyenv/versions/3.8.2/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/src/mxnet-master/python/mxnet
Num GPUs     : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.15.0-1054-aws-x86_64-with-glibc2.27
system       : Linux
node         : ip-172-31-54-76
release      : 4.15.0-1054-aws
version      : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3223.515
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0017 sec, LOAD: 0.5014 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0002 sec, LOAD: 0.4109 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1224 sec, LOAD: 0.0909 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0020 sec, LOAD: 0.0959 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0077 sec, LOAD: 0.1124 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1721 sec, LOAD: 0.4875 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0021 sec, LOAD: 0.0662 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0020 sec, LOAD: 0.0678 sec.

leezu (Contributor, Author) commented Feb 28, 2020

This may be a bug in CUDA 10.0; I can't reproduce it on 10.1.
However, https://docs.nvidia.com/cuda/archive/10.1/cuda-toolkit-release-notes/index.html doesn't seem to list a related fix, so it may nevertheless be a bug in MXNet.

ptrendx (Member) commented Feb 28, 2020

@MoisesHer Could you take a look?

MoisesHer (Contributor) commented

Yes, will take a look ASAP

MoisesHer (Contributor) commented

We have investigated this issue and found a bug in the CUDA 10.0 compiler that affects only code generated for the NVIDIA Turing architecture, i.e. SM_75.
The NVIDIA compiler team confirmed that this bug is fixed in the CUDA 10.1 compiler and later.

We suggest removing the SM_75 architecture when building MXNet with CUDA Toolkit 10.0, while keeping SM_70. Code generated for SM_70 is compatible with Turing, so Turing GPUs can execute it without any problem.
This compatibility is documented here:
https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#turing-volta-compatibility
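
For example, a possible way to restrict the generated architectures at configure time (a sketch; MXNET_CUDA_ARCH is the variable used in MXNet's distribution configs, but the exact mechanism may differ across versions):

# Build with CUDA 10.0 but without SM_75 (Turing) device code; Turing GPUs
# run the SM_70 (Volta) binaries via binary compatibility.
cmake -GNinja -DUSE_CUDA=ON -DMXNET_CUDA_ARCH="3.0;5.0;6.0;7.0" ..
ninja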

leezu (Contributor, Author) commented Apr 29, 2020

@MoisesHer thanks for investigating the issue. Could you adapt https://github.com/apache/incubator-mxnet/blob/master/config/distribution/linux_cu100.cmake#L36 accordingly and add a comment inline?
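
For reference, the change would presumably amount to dropping 7.5 from the architecture list in that file (a sketch, assuming it sets MXNET_CUDA_ARCH like the other distribution configs):

# CUDA 10.0's compiler miscompiles SM_75 code (see above); ship SM_70
# binaries instead, which Turing GPUs can run via binary compatibility.
set(MXNET_CUDA_ARCH "3.0;5.0;6.0;7.0" CACHE STRING "Cuda architectures")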

szha commented (this comment has been minimized)

leezu (Contributor, Author) commented May 12, 2020

Closing as it's a CUDA bug.

leezu closed this as completed May 12, 2020