
test_recordimage_dataset_with_data_loader_multiworker Segmentation fault on OSX #17774

Open
leezu opened this issue Mar 5, 2020 · 4 comments
leezu (Contributor) commented Mar 5, 2020

Description

test_recordimage_dataset_with_data_loader_multiworker fails due to a timeout on OSX CI. Maybe the CI machine has too few resources, or there is a bug.
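
For context, a minimal sketch of the pattern the test exercises (the real test lives in tests/python/unittest/test_gluon_data.py; the record file path below is hypothetical):

    from mxnet.gluon.data import DataLoader
    from mxnet.gluon.data.vision import ImageRecordDataset

    # Hypothetical RecordIO file; the real test prepares its own .rec data.
    dataset = ImageRecordDataset('data/images.rec')
    # Several forked worker processes decode and batch records in parallel.
    loader = DataLoader(dataset, batch_size=32, num_workers=4)
    for i, (x, y) in enumerate(loader):  # the loop seen in the traceback below
        pass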

Error Message

https://github.com/leezu/mxnet/commit/e32e82a6551da0bd5a8ec2847ae10556039c8bda/checks/484044519/logs

2020-03-04T05:07:55.1938710Z test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker ... 
2020-03-04T05:07:55.1964590Z Segmentation fault: 11
2020-03-04T05:07:55.2130340Z Segmentation fault: 11
2020-03-04T05:07:55.2131290Z Segmentation fault: 11
2020-03-04T05:07:55.2131740Z Segmentation fault: 11
2020-03-04T05:07:55.2132220Z Segmentation fault: 11
2020-03-04T05:07:55.3310610Z Segmentation fault: 11
2020-03-04T05:07:55.3336210Z Segmentation fault: 11
2020-03-04T05:07:55.3586550Z Segmentation fault: 11
2020-03-04T05:07:55.3588640Z Segmentation fault: 11
2020-03-04T05:07:55.3589470Z Segmentation fault: 11
2020-03-04T05:07:55.5661260Z Segmentation fault: 11
2020-03-04T05:42:33.0930710Z ERROR: test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker
2020-03-04T05:42:33.0931180Z ----------------------------------------------------------------------
2020-03-04T05:42:33.0931280Z Traceback (most recent call last):
2020-03-04T05:42:33.0931750Z   File "/Users/runner/Library/Python/3.7/lib/python/site-packages/nose/case.py", line 198, in runTest
2020-03-04T05:42:33.0931840Z     self.test(*self.arg)
2020-03-04T05:42:33.0931950Z   File "/Users/runner/runners/2.165.2/work/mxnet/mxnet/tests/python/unittest/common.py", line 215, in test_new
2020-03-04T05:42:33.0932040Z     orig_test(*args, **kwargs)
2020-03-04T05:42:33.0932140Z   File "/Users/runner/runners/2.165.2/work/mxnet/mxnet/tests/python/unittest/test_gluon_data.py", line 91, in test_recordimage_dataset_with_data_loader_multiworker
2020-03-04T05:42:33.0932230Z     for i, (x, y) in enumerate(loader):
2020-03-04T05:42:33.0932330Z   File "/Users/runner/runners/2.165.2/work/mxnet/mxnet/python/mxnet/gluon/data/dataloader.py", line 484, in __next__
2020-03-04T05:42:33.0932440Z     batch = pickle.loads(ret.get(self._timeout))
2020-03-04T05:42:33.0932540Z   File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 653, in get
2020-03-04T05:42:33.0932630Z     raise TimeoutError
2020-03-04T05:42:33.0932700Z multiprocessing.context.TimeoutError: 
2020-03-04T05:42:33.0933160Z -------------------- >> begin captured stdout << ---------------------
2020-03-04T05:42:33.0933280Z Worker timed out after 120 seconds. This might be caused by 
2020-03-04T05:42:33.0933330Z 
2020-03-04T05:42:33.0933780Z             - Slow transform. Please increase timeout to allow slower data loading in each worker.
2020-03-04T05:42:33.0934210Z             - Insufficient shared_memory if `timeout` is large enough.
2020-03-04T05:42:33.0934310Z             Please consider reduce `num_workers` or increase shared_memory in system.
2020-03-04T05:42:33.0934410Z             
2020-03-04T05:42:33.0934910Z 
2020-03-04T05:42:33.0935850Z --------------------- >> end captured stdout << ----------------------

To Reproduce

Steps to reproduce

  1. Make sure your fork of MXNet is up to date with latest master
  2. Create a new branch and add a commit changing
          python3 -m nose --with-timer --verbose tests/python/unittest/ --exclude-test=test_extensions.test_subgraph --exclude-test=test_extensions.test_custom_op --exclude-test=test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker

to

          python3 -m nose --with-timer --verbose tests/python/unittest/test_gluon_data.py -m test_recordimage_dataset_with_data_loader_multiworker

  3. Go to https://github.com/$YOURFORK/mxnet/actions and observe the build status.

leezu (Contributor, Author) commented Mar 5, 2020

Possibly related: test_gluon_data.test_multi_worker_dataloader_release_pool is also flaky on OSX:

2020-03-05T06:16:36.4620400Z test_gluon_data.test_multi_worker_dataloader_release_pool ... libc++abi.dylib: terminating with uncaught exception of type dmlc::Error: [06:16:36] ../src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) : 
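
For reference, a hedged sketch of what that test exercises (placeholder data; the real test is in test_gluon_data.py): a multi-worker DataLoader is created and released, and the Check failed: count >= 0 above fires in the shared-memory refcounting that backs the worker pool.

    import mxnet as mx
    from mxnet.gluon.data import ArrayDataset, DataLoader

    # Placeholder arrays; worker processes return batches through shared
    # memory managed by cpu_shared_storage_manager.h.
    dataset = ArrayDataset(mx.nd.zeros((100, 2)), mx.nd.zeros(100))
    loader = DataLoader(dataset, batch_size=10, num_workers=2)
    for batch in loader:
        pass
    del loader  # releasing the pool decrements the shared-storage refcount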

zhreshold (Member) commented:

Reproducible with a static build on OSX; not reproducible when built against dynamic OpenMP.

leezu (Contributor, Author) commented Mar 6, 2020

You mention the issue can't be reproduced when dynamically linking OpenMP.

Did you verify that the normal, non-static cmake build with 3rdparty/openmp can't reproduce this issue?

zhreshold (Member) commented Mar 11, 2020

@leezu Let me update the Mac test matrix with clang:

Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
| Static build | OpenMP | Unit test result |
| --- | --- | --- |
| Yes | 3rdparty/openmp | segfault |
| No | 3rdparty/openmp | segfault |
| Yes | delete 3rdparty/openmp | ok |
| No | delete 3rdparty/openmp | ok |

Note that the segfault only happens on the first run; subsequent runs are always ok.
