
MaskRCNN unable to train with master, works with previous revisions #16675

Open
larroy opened this issue Oct 30, 2019 · 5 comments

larroy (Contributor) commented Oct 30, 2019

Description

I can't train Mask R-CNN with the latest revisions of MXNet, following this GluonCV tutorial:

https://gluon-cv.mxnet.io/build/examples_instance/train_mask_rcnn_coco.html

This revision works:

e9e267e - (Sat, 14 Sep 2019 09:33:08 -0700) reminisce - Fix remaining errors reported by D2L (#16157)

This doesn't:

86ed5f5 - (Mon, 28 Oct 2019 01:24:05 -0700) Huang, Gua.. - [NumPy][Operator] NumPy operator may_share_memory and shares_memory (#16533) (upstream/v1.6.x)

I see very low throughput, high CPU usage, and low GPU utilization, or training gets stuck completely.

This can be reproduced either from source or from the latest pip builds, so I don't think it's my environment or my build options.
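For reference, this is roughly how I reproduce it (a minimal sketch, assuming the train_mask_rcnn.py script and COCO setup from the linked tutorial; the flags below are illustrative, not my complete command line):

import subprocess

# train_mask_rcnn.py is the training script from the linked GluonCV tutorial;
# the flags are illustrative, not the full command.
cmd = [
    "python", "train_mask_rcnn.py",
    "--dataset", "coco",
    "--gpus", "0,1,2,3,4,5,6,7",
]

# A few minutes of logs are enough to compare samples/sec between builds,
# so kill the run after a timeout instead of training to completion.
try:
    subprocess.run(cmd, timeout=600)
except subprocess.TimeoutExpired:
    pass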

This is my build environment:

USE_CUDA: "ON" # Build with CUDA support
USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
USE_NCCL: "ON" # Use NVidia NCCL with CUDA
USE_OPENCV: "ON" # Build with OpenCV support
USE_OPENMP: "PLATFORM" # Build with Openmp support
USE_CUDNN: "ON" # Build with cuDNN support; one can set CUDNN_ROOT as a search path
USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
USE_F16C: "ON" # Build with x86 F16C instruction support; autodetects support if "ON"
USE_LAPACK: "ON" # Build with lapack support
USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
USE_JEMALLOC: "ON" # Build with Jemalloc support
USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
USE_CPP_PACKAGE: "OFF" # Build C++ Package
USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
USE_GPROF: "OFF" # Compile with gprof (profiling) flag
USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune); one can set VTUNE_ROOT as a search path
ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
INSTALL_EXAMPLES: "OFF" # Install the example source files.
USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
USE_TENSORRT: "OFF" # Enable inference optimization with TensorRT.
USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric output
CMAKE_BUILD_TYPE: "Release"
CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
CMAKE_C_COMPILER_LAUNCHER: "ccache"
CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
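These options map one-to-one onto cmake -D flags; a small sketch of the translation (the helper below is my own illustration, not part of MXNet, and only a subset of the options is shown):

# Sketch: translate the YAML-style options above into a cmake command line.
# Only a subset of the options is listed; the mapping is the same for the rest.
opts = {
    "USE_CUDA": "ON",
    "USE_NCCL": "ON",
    "USE_CUDNN": "ON",
    "USE_MKLDNN": "OFF",
    "USE_DIST_KVSTORE": "OFF",
    "CMAKE_BUILD_TYPE": "Release",
    "CMAKE_CXX_COMPILER_LAUNCHER": "ccache",
}
cmake_cmd = ["cmake"] + ["-D%s=%s" % (k, v) for k, v in opts.items()] + [".."]
print(" ".join(cmake_cmd))
# cmake -DUSE_CUDA=ON -DUSE_NCCL=ON ... -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..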

Diagnose

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3134.070
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
----------Python Info----------
Version      : 3.6.8
Compiler     : GCC 8.3.0
Build        : ('default', 'Oct  7 2019 12:59:55')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/piotr/mxnet/py3_venv/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/piotr/mxnet/python/mxnet
Commit hash file "/home/piotr/mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/piotr/mxnet/python/mxnet/../../build/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✔ F16C
✔ JEMALLOC
✔ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-4.15.0-1052-aws-x86_64-with-Ubuntu-18.04-bionic
system       : Linux
node         : 18-232-106-45
release      : 4.15.0-1052-aws
version      : #54-Ubuntu SMP Tue Oct 1 15:43:26 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.4104 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0190 sec, LOAD: 0.0444 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0222 sec, LOAD: 0.3929 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0184 sec, LOAD: 0.3812 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0017 sec, LOAD: 0.0803 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0063 sec, LOAD: 0.0893 sec.
----------Environment----------
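The build features listed above can also be checked programmatically; here is a small sketch using mxnet.runtime (the same module the diagnose script reads them from), assuming an MXNet 1.5+ install:

import mxnet as mx
from mxnet.runtime import Features

# Query the compile-time feature flags of the installed libmxnet.
features = Features()
for name in ("CUDA", "CUDNN", "NCCL", "MKLDNN", "OPENMP", "SIGNAL_HANDLER"):
    print("%-16s %s" % (name, "ON" if features.is_enabled(name) else "OFF"))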
larroy added the Bug label on Oct 30, 2019
samskalicky (Contributor) commented:

@zachgk assign @szha
@Jerryzcn @zhreshold is this a GluonCV issue?

zhreshold (Member) commented:

@larroy So are you fixing the version of GluonCV and only comparing the MXNet versions?

larroy (Contributor, Author) commented Nov 11, 2019

@zhreshold I'm comparing MXNet versions only. I think we should add training tests to the GluonCV CI, at least a quick run to check that the model trains. Where is the GluonCV CI hosted?

zhreshold (Member) commented:

@larroy The CI for GluonCV is hosted separately, alongside GluonNLP and GluonTS, for example.
So far we don't have nightly tests, and per-PR training tests are too expensive.

larroy (Contributor, Author) commented Nov 11, 2019

I suggested to @Jerryzcn that training could be run for just a few minutes to collect throughput and verify that it works; you don't need to train a full model.
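Something along these lines would already catch this kind of regression (a rough sketch using a small stand-in Gluon model instead of the full Mask R-CNN pipeline; a real smoke test would instead cap the number of batches in the GluonCV training script):

import time
import mxnet as mx
from mxnet import autograd, gluon, nd
from mxnet.gluon import nn

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# Tiny stand-in network; the point is only to time a few training iterations.
net = nn.HybridSequential()
net.add(nn.Conv2D(32, 3, padding=1), nn.Activation('relu'),
        nn.GlobalAvgPool2D(), nn.Dense(10))
net.initialize(ctx=ctx)
net.hybridize()

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

batch_size, n_batches = 8, 50
data = nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=ctx)
label = nd.random.randint(0, 10, shape=(batch_size,), ctx=ctx).astype('float32')

mx.nd.waitall()
start = time.time()
for _ in range(n_batches):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(batch_size)
mx.nd.waitall()
print('throughput: %.1f samples/sec' % (batch_size * n_batches / (time.time() - start)))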
