Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

getting segfault while running train_cifar10.py program in example directory #12800

Closed
Vikas-kum opened this issue Oct 11, 2018 · 9 comments
Closed

Comments

@Vikas-kum
Copy link
Contributor

Vikas-kum commented Oct 11, 2018

I am trying to run this command:
python example/image-classification/train_cifar10.py

And getting the segfault. It is very consistent.

INFO:root:Epoch[0] Batch [20]	Speed: 18.95 samples/sec	accuracy=0.164807
INFO:root:Epoch[0] Batch [40]	Speed: 20.34 samples/sec	accuracy=0.246875
INFO:root:Epoch[0] Batch [60]	Speed: 21.66 samples/sec	accuracy=0.308984
INFO:root:Epoch[0] Batch [80]	Speed: 22.15 samples/sec	accuracy=0.328125
INFO:root:Epoch[0] Batch [100]	Speed: 22.38 samples/sec	accuracy=0.362500
INFO:root:Epoch[0] Batch [120]	Speed: 21.52 samples/sec	accuracy=0.378125
INFO:root:Epoch[0] Batch [140]	Speed: 22.96 samples/sec	accuracy=0.417969
INFO:root:Epoch[0] Batch [160]	Speed: 20.04 samples/sec	accuracy=0.426563
INFO:root:Epoch[0] Batch [180]	Speed: 18.58 samples/sec	accuracy=0.430078
INFO:root:Epoch[0] Batch [200]	Speed: 21.39 samples/sec	accuracy=0.443750
INFO:root:Epoch[0] Batch [220]	Speed: 17.69 samples/sec	accuracy=0.469531
INFO:root:Epoch[0] Batch [240]	Speed: 18.59 samples/sec	accuracy=0.469141
INFO:root:Epoch[0] Batch [260]	Speed: 21.68 samples/sec	accuracy=0.470313
INFO:root:Epoch[0] Batch [280]	Speed: 21.05 samples/sec	accuracy=0.487891
INFO:root:Epoch[0] Batch [300]	Speed: 22.30 samples/sec	accuracy=0.503125
INFO:root:Epoch[0] Batch [320]	Speed: 20.98 samples/sec	accuracy=0.534766
INFO:root:Epoch[0] Batch [340]	Speed: 15.53 samples/sec	accuracy=0.526172
INFO:root:Epoch[0] Batch [360]	Speed: 13.88 samples/sec	accuracy=0.528516
Traceback (most recent call last):
  File "example/image-classification/train_cifar10.py", line 79, in <module>
    fit.fit(args, sym, data.get_rec_iter)
  File "/Users/vikumar/incubator-mxnet/example/image-classification/common/fit.py", line 333, in fit
    monitor=monitor)
  File "/Users/vikumar/incubator-mxnet/python/mxnet/module/base_module.py", line 563, in fit
    next_data_batch = next(data_iter)
  File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
    return self.next()
  File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 840, in next
    check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
  File "/Users/vikumar/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:58:31] src/io/recordio_split.cc:29: Check failed: (reinterpret_cast<size_t>(end) & 3UL) == 0U (1 vs. 0) 

Stack trace returned 10 entries:
[bt] (0) 0   libmxnet.so                         0x000000010ae2cf30 dmlc::StackTrace() + 272
[bt] (1) 1   libmxnet.so                         0x000000010ae2ccdf dmlc::LogMessageFatal::~LogMessageFatal() + 47
[bt] (2) 2   libmxnet.so                         0x000000010c68a01c dmlc::io::RecordIOSplitter::FindLastRecordBegin(char const*, char const*) + 444
[bt] (3) 3   libmxnet.so                         0x000000010c68d10c dmlc::io::InputSplitBase::ReadChunk(void*, unsigned long*) + 284
[bt] (4) 4   libmxnet.so                         0x000000010c68d230 dmlc::io::InputSplitBase::Chunk::Load(dmlc::io::InputSplitBase*, unsigned long) + 144
[bt] (5) 5   libmxnet.so                         0x000000010c6a0b8f dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()::operator()() const + 895
[bt] (6) 6   libmxnet.so                         0x000000010c6a071d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()> >(void*) + 45
[bt] (7) 7   libsystem_pthread.dylib             0x00007fff903c293b _pthread_body + 180
[bt] (8) 8   libsystem_pthread.dylib             0x00007fff903c2887 _pthread_body + 0
[bt] (9) 9   libsystem_pthread.dylib             0x00007fff903c208d thread_start + 13

Environment info

build command - make -j8 USE_DIST_KVSTORE=1
os: mac

What to do:
1. o/p of diagnose script:
88e9fe53272d:incubator-mxnet vikumar$ python /tmp/lk.py
----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build        : ('default', 'Apr 26 2018 08:42:37')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /Users/vikumar/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/vikumar/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
objc[59758]: Class CaptureDelegate is implemented in both /usr/local/opt/opencv/lib/libopencv_videoio.3.4.dylib (0x10ada5938) and /Users/vikumar/anaconda3/lib/python3.6/site-packages/cv2/cv2.cpython-36m-darwin.so (0x1a17065ce0). One of the two will be used. Which one is undefined.
Version      : 1.3.0
Directory    : /Users/vikumar/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : 88e9fe53272d.ant.amazon.com
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0115 sec, LOAD: 0.5897 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0183 sec, LOAD: 0.2578 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0227 sec, LOAD: 0.1714 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0186 sec, LOAD: 0.2626 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0141 sec, LOAD: 0.3084 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0148 sec, LOAD: 0.0534 sec.

I am trying to build and repro on linux machine.

@piyushghai
Copy link
Contributor

Thanks for reporting this issue @Vikas89
@mxnet-label-bot [Example, Bug]

@Vikas-kum
Copy link
Contributor Author

I think there was a fix made in relevant area of code: dmlc/dmlc-core@e3377de#diff-855ba648d1f4003608aa37ba3d060043

I will retry this on latest code and check if I can reproduce this again. I will keep this thread updated.

@Vikas-kum
Copy link
Contributor Author

I am not seeing this issue now, closing this

@Vikas-kum
Copy link
Contributor Author

I encountered the issue again, sorry for the confusion.

If I revert the changes made in this commit in dmlc_core submodule, the issue is gone
dmlc/dmlc-core@e3377de#diff-855ba648d1f4003608aa37ba3d060043

My changes in dmlc-core/src/io/input_split_base.cc which fixes the problem

+  if (nread != max_size) {
+    *size = nread;
+    return true;
+  } else {
+    const char *bptr = reinterpret_cast<const char*>(buf);
+    // return the last position where a record starts
+    const char *bend = this->FindLastRecordBegin(bptr, bptr + max_size);
+    *size = bend - bptr;
+    overflow_.resize(max_size - *size);
+    if (overflow_.length() != 0) {
+      std::memcpy(BeginPtr(overflow_), bend, overflow_.length());
+    }
+  
+    return true;
+  }
+    /* COMMENTOUT
   if (nread == olen) {  // add extra newline to handle files with NOEOL
     char *bufptr = reinterpret_cast<char*>(buf);
     bufptr[nread] = '\n';
     nread++;
+    LOG(INFO) << " Added extra line to the file";
   }
 
   const char *bptr = reinterpret_cast<const char*>(buf);
@@ -241,6 +258,7 @@ bool InputSplitBase::ReadChunk(void *buf, size_t *size) {
     std::memcpy(BeginPtr(overflow_), bend, overflow_.length());
   }
   return true;
+  */

@Vikas-kum Vikas-kum reopened this Oct 18, 2018
hcho3 added a commit to hcho3/dmlc-core that referenced this issue Oct 18, 2018
Adding newlines caused apache/mxnet#12800.

TODO: Add a test case
@hcho3
Copy link
Contributor

hcho3 commented Oct 19, 2018

@Vikas89 I was not able to reproduce the bug with dmlc-core hash 0a0e8addf92e1287fd7a25c6314016b8c0138dee. On the other hand, I was able to reproduce it when the dmlc-core submodule was brought up to the latest master. Which commit hash for dmlc-core were you using when running the example?

hcho3 added a commit to hcho3/dmlc-core that referenced this issue Oct 19, 2018
Adding newlines caused apache/mxnet#12800.

TODO: Add a test case
@hcho3
Copy link
Contributor

hcho3 commented Oct 19, 2018

@Vikas89 @piyushghai I created a minimal reproduction of the bug:

#include <dmlc/io.h>
#include <string>
#include <utility>
#include <vector>

int main(int argc, char** argv) {
  std::unique_ptr<dmlc::InputSplit> source(
    dmlc::InputSplit::Create("./cifar10_val.rec", 0, 1, "recordio"));

  source->BeforeFirst();
  dmlc::InputSplit::Blob rec;
  size_t sum = 0;
  while (source->NextRecord(&rec)) {
    sum += rec.size;
  }

  return 0;
}

It shows that the bug came into existence since dmlc/dmlc-core#452 is merged. For a running example, see https://github.com/hcho3/mxnet-issue12800-repro.

@hcho3
Copy link
Contributor

hcho3 commented Oct 19, 2018

Submitted dmlc/dmlc-core#471, along with a test case.

hcho3 added a commit to dmlc/dmlc-core that referenced this issue Oct 24, 2018
* Do not add newline for RecordIO InputSplit

Adding newlines caused apache/mxnet#12800.

TODO: Add a test case

* Add a test case

* Cache value of IsTextParser() in a tight loop to avoid virtual function call

* Address reviewer comment: use tiny example

* Fix typo
Vikas-kum pushed a commit to Vikas-kum/private-dmlc-core that referenced this issue Nov 5, 2018
* Do not add newline for RecordIO InputSplit

Adding newlines caused apache/mxnet#12800.

TODO: Add a test case

* Add a test case

* Cache value of IsTextParser() in a tight loop to avoid virtual function call

* Address reviewer comment: use tiny example

* Fix typo
@ankkhedia
Copy link
Contributor

@hcho3 Is this still an issue or has been fixed after your PR?

@Vikas-kum
Copy link
Contributor Author

This is fixed.

Vikas-kum pushed a commit to Vikas-kum/private-dmlc-core that referenced this issue Nov 16, 2018
* Do not add newline for RecordIO InputSplit

Adding newlines caused apache/mxnet#12800.

TODO: Add a test case

* Add a test case

* Cache value of IsTextParser() in a tight loop to avoid virtual function call

* Address reviewer comment: use tiny example

* Fix typo
ruslo pushed a commit to hunter-packages/dmlc-core that referenced this issue Mar 23, 2019
* Do not add newline for RecordIO InputSplit

Adding newlines caused apache/mxnet#12800.

TODO: Add a test case

* Add a test case

* Cache value of IsTextParser() in a tight loop to avoid virtual function call

* Address reviewer comment: use tiny example

* Fix typo
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants