
Increase amp support for Bi-lstm and Concat operators in gluon #15716

Closed
fierceX opened this issue Aug 1, 2019 · 9 comments · Fixed by #15829

Comments

@fierceX
Contributor

fierceX commented Aug 1, 2019

Currently, AMP does not support the bi-LSTM and Concat operators in Gluon. I get the following error when converting a network that contains a bi-LSTM:

Traceback (most recent call last):
  File "predict.py", line 60, in <module>
    net = amp.convert_hybrid_block(model)
  File "/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/contrib/amp/amp.py", line 636, in convert_hybrid_block
    cast_optional_params=cast_optional_params)
  File "/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/contrib/amp/amp.py", line 505, in convert_symbol
    keys))
  File "/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator lstm0__rnn_param_concat0: [09:35:15] src/operator/nn/concat.cc:158: Not enough information to infer type in Concat.
Stack trace:
  [bt] (0) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4ee77b) [0x7f213625c77b]
  [bt] (1) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x8ff5d0) [0x7f213666d5d0]
  [bt] (2) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x297275e) [0x7f21386e075e]
  [bt] (3) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x297bafe) [0x7f21386e9afe]
  [bt] (4) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x297c54a) [0x7f21386ea54a]
  [bt] (5) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(MXReducePrecisionSymbol+0x1610) [0x7f213864e600]
  [bt] (6) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f21bc290ec0]
  [bt] (7) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f21bc29087d]
  [bt] (8) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f21c3dfeede]

Hardware and version information:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
Stepping:              4
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat spec_ctrl intel_stibp flush_l1d
----------Python Info----------
Version      : 3.6.9
Compiler     : GCC 7.3.0
Build        : ('default', 'Jul 30 2019 19:07:31')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet
Commit Hash   : 0f28f5b827c718dcab7bbf2617e16a59ad3f601c
Library      : ['/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
system       : Linux
node         : dp-prod-dc3-gpu01
release      : 3.10.0-862.14.4.el7.x86_64
version      : #1 SMP Wed Sep 26 15:12:11 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0033 sec, LOAD: 0.9312 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0025 sec, LOAD: 1.1405 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 2.9024 sec, LOAD: 2.5129 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 1.4263 sec, LOAD: 5.9768 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.1766 sec, LOAD: 5.7293 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 1.0143 sec, LOAD: 0.8343 sec.

The MXNet version is mxnet-cu100==1.6.0b20190730, and the GPU is an NVIDIA V100 16G.

@ptrendx
Member

ptrendx commented Aug 1, 2019

Hi @fierceX, do you have a small example that shows this problem? I will look into it.

@ptrendx
Member

ptrendx commented Aug 1, 2019

There seem to be two problems here. On one hand, the ConcatType function seems to be too strict about the information it thinks it needs to be correct (so that error should not occur in the first place, since the type could be inferred during a later stage of the InferType pass) - I will make a PR fixing that tomorrow. On the other hand, I don't quite see how you could end up in this situation just by adding AMP, so again, a small example would be really helpful.
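
For illustration, the symbolic API's partial type inference can already propagate a known dtype through Concat when other inputs are still unknown, which is the behavior being described here. A minimal sketch (illustrative only, not code from this thread; it assumes only the public mx.sym API):

import mxnet as mx

a = mx.sym.Variable('a')                   # dtype unknown
b = mx.sym.Variable('b', dtype='float16')  # dtype known
c = mx.sym.concat(a, b, dim=1)

# infer_type_partial tolerates missing information instead of raising,
# so b's float16 dtype should flow to both a and the concat output.
arg_types, out_types, _ = c.infer_type_partial()
print(arg_types, out_types)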

@vrakesh
Contributor

vrakesh commented Aug 1, 2019

@mxnet-label-bot add [Pending Requester Info]

@fierceX
Contributor Author

fierceX commented Aug 2, 2019

Hi @ptrendx, the following code should reproduce this error.

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn, rnn
from mxnet.contrib import amp

# A bidirectional LSTM followed by a Dense layer
model = nn.HybridSequential()
model.add(rnn.LSTM(hidden_size=10, num_layers=2, bidirectional=True))
model.add(nn.Dense(2))

model.initialize()
model.hybridize()
model(nd.ones((2, 3, 4)))

# Fails with "Not enough information to infer type in Concat"
new_model = amp.convert_hybrid_block(model)

@ptrendx
Member

ptrendx commented Aug 2, 2019

Thanks! I will look into this.

@ptrendx ptrendx self-assigned this Aug 2, 2019
@anirudh2290
Member

Thanks @ptrendx for looking at this. Let me know if I can help here.

@ptrendx
Member

ptrendx commented Aug 5, 2019

OK, so after applying PR #15740, I can successfully run the example when using amp.init:

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn, rnn
from mxnet.contrib import amp

# amp.init() patches operators for mixed precision at runtime
amp.init()

model = nn.HybridSequential()
model.add(rnn.LSTM(hidden_size=10, num_layers=2, bidirectional=True))
model.add(nn.Dense(2))

model.initialize(ctx=mx.gpu(0))
model.hybridize()
model(nd.ones((2, 3, 4), ctx=mx.gpu(0)))

# new_model = amp.convert_hybrid_block(model)  # still fails, see below

while amp.convert_hybrid_block still fails with the same error in Concat - @anirudh2290, could you take a look at this?
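
A possible interim workaround, sketched under the assumption that skipping the failing node actually sidesteps the type-inference error (excluded_sym_names is a real convert_hybrid_block parameter; the node name is taken from the traceback above, and whether this avoids the failure is untested):

# Untested sketch: ask AMP to leave the RNN parameter-concat node alone.
new_model = amp.convert_hybrid_block(
    model, excluded_sym_names=['lstm0__rnn_param_concat0'])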

@anirudh2290
Member

@ptrendx will take a look.

@anirudh2290
Member

@ptrendx @fierceX I have added a fix in #15829. Please help review.
