
Increase amp support for Bi-lstm and Concat operators in gluon #15716

Closed
fierceX opened this issue Aug 1, 2019 · 9 comments · Fixed by #15829

Comments

@fierceX
Contributor

fierceX commented Aug 1, 2019

Currently, AMP does not support the bi-LSTM and Concat operators in Gluon. I get the following error when converting a network that contains a bi-LSTM:

Traceback (most recent call last):
  File "predict.py", line 60, in <module>
    net = amp.convert_hybrid_block(model)
  File "/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/contrib/amp/amp.py", line 636, in convert_hybrid_block
    cast_optional_params=cast_optional_params)
  File "/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/contrib/amp/amp.py", line 505, in convert_symbol
    keys))
  File "/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator lstm0__rnn_param_concat0: [09:35:15] src/operator/nn/concat.cc:158: Not enough information to infer type in Concat.
Stack trace:
  [bt] (0) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4ee77b) [0x7f213625c77b]
  [bt] (1) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x8ff5d0) [0x7f213666d5d0]
  [bt] (2) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x297275e) [0x7f21386e075e]
  [bt] (3) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x297bafe) [0x7f21386e9afe]
  [bt] (4) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x297c54a) [0x7f21386ea54a]
  [bt] (5) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so(MXReducePrecisionSymbol+0x1610) [0x7f213864e600]
  [bt] (6) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f21bc290ec0]
  [bt] (7) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f21bc29087d]
  [bt] (8) /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f21c3dfeede]

Hardware and version information:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
Stepping:              4
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat spec_ctrl intel_stibp flush_l1d
----------Python Info----------
Version      : 3.6.9
Compiler     : GCC 7.3.0
Build        : ('default', 'Jul 30 2019 19:07:31')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet
Commit Hash   : 0f28f5b827c718dcab7bbf2617e16a59ad3f601c
Library      : ['/home/tiger/anaconda3/envs/mx1.6/lib/python3.6/site-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
system       : Linux
node         : dp-prod-dc3-gpu01
release      : 3.10.0-862.14.4.el7.x86_64
version      : #1 SMP Wed Sep 26 15:12:11 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0033 sec, LOAD: 0.9312 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0025 sec, LOAD: 1.1405 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 2.9024 sec, LOAD: 2.5129 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 1.4263 sec, LOAD: 5.9768 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.1766 sec, LOAD: 5.7293 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 1.0143 sec, LOAD: 0.8343 sec.

The MXNet version is mxnet-cu100==1.6.0b20190730, and the GPU is an NVIDIA V100 16G.

@ptrendx
Member

ptrendx commented Aug 1, 2019

Hi @fierceX, do you have a small example that shows this problem? I will look into it.

@ptrendx
Member

ptrendx commented Aug 1, 2019

There seem to be two problems here. On one hand, the ConcatType function seems to be too strict about the information it thinks it needs to be correct (so that error should not occur in the first place, since the type could be inferred during a later stage of the InferType pass) - I will make a PR fixing that tomorrow. On the other hand, I don't quite see how you could end up in this situation just by adding AMP, so again, a small example would be really helpful.
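
For illustration, the symbolic API's partial type inference can already propagate a known dtype through Concat when other inputs are still unknown, which is the behavior being described here. A minimal sketch (illustrative only, not code from this thread; it assumes only the public mx.sym API):

import mxnet as mx

a = mx.sym.Variable('a')                   # dtype unknown
b = mx.sym.Variable('b', dtype='float16')  # dtype known
c = mx.sym.concat(a, b, dim=1)

# infer_type_partial tolerates missing information instead of raising,
# so b's float16 dtype should flow to both a and the concat output.
arg_types, out_types, _ = c.infer_type_partial()
print(arg_types, out_types)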

@vrakesh
Contributor

vrakesh commented Aug 1, 2019

@mxnet-label-bot add [Pending Requester Info]

@fierceX
Contributor Author

fierceX commented Aug 2, 2019

Hi @ptrendx, the following code should reproduce this error.

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn, rnn
from mxnet.contrib import amp

# A bidirectional LSTM followed by a Dense layer
model = nn.HybridSequential()
model.add(rnn.LSTM(hidden_size=10, num_layers=2, bidirectional=True))
model.add(nn.Dense(2))

model.initialize()
model.hybridize()
model(nd.ones((2, 3, 4)))

# Fails with "Not enough information to infer type in Concat"
new_model = amp.convert_hybrid_block(model)

@ptrendx
Member

ptrendx commented Aug 2, 2019

Thanks! I will look into this.

@ptrendx ptrendx self-assigned this Aug 2, 2019
@anirudh2290
Member

Thanks @ptrendx for looking at this. Let me know if I can help here.

@ptrendx
Member

ptrendx commented Aug 5, 2019

OK, so after applying PR #15740, I can successfully run the example when using amp.init:

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn, rnn
from mxnet.contrib import amp

# amp.init() patches operators for mixed precision at runtime
amp.init()

model = nn.HybridSequential()
model.add(rnn.LSTM(hidden_size=10, num_layers=2, bidirectional=True))
model.add(nn.Dense(2))

model.initialize(ctx=mx.gpu(0))
model.hybridize()
model(nd.ones((2, 3, 4), ctx=mx.gpu(0)))

# new_model = amp.convert_hybrid_block(model)  # still fails, see below

while amp.convert_hybrid_block still fails with the same error in Concat - @anirudh2290, could you take a look at this?
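
A possible interim workaround, sketched under the assumption that skipping the failing node actually sidesteps the type-inference error (excluded_sym_names is a real convert_hybrid_block parameter; the node name is taken from the traceback above, and whether this avoids the failure is untested):

# Untested sketch: ask AMP to leave the RNN parameter-concat node alone.
new_model = amp.convert_hybrid_block(
    model, excluded_sym_names=['lstm0__rnn_param_concat0'])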

@anirudh2290
Member

@ptrendx will take a look.

@anirudh2290
Member

@ptrendx @fierceX I have added a fix in #15829. Please help review.
