Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon DataLoader incorrectly terminates the process pool in 1.4 #15025

Closed
chandana1332 opened this issue May 21, 2019 · 13 comments · Fixed by #15195
Closed

Gluon DataLoader incorrectly terminates the process pool in 1.4 #15025

chandana1332 opened this issue May 21, 2019 · 13 comments · Fixed by #15195

Comments

@chandana1332
Copy link
Contributor

chandana1332 commented May 21, 2019

Description

Gluon DataLoader terminates the process pool early while _MultiWorkerIter is operating on the pool.
Cause: https://github.com/apache/incubator-mxnet/pull/13537/files
As seen in the patch, the process pool is terminated when DataLoader is garbage collected but the scope of the process pool goes beyond the DataLoader until _MultiWorkerIter

Environment info (Required)

----------Python Info----------
Version      : 3.7.3
Compiler     : Clang 9.0.0 (clang-900.0.39.2)
Build        : ('default', 'Mar 27 2019 09:23:32')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /usr/local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'
ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'
Version      : 1.4.0
Directory    : /usr/local/lib/python3.7/site-packages/mxnet
Commit Hash   : a03d59ed867ba334d78d61246a1090cd1868f5da
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : 
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Wed Apr 24 20:50:53 PDT 2019; root:xnu-3789.73.49~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS MD_CLEAR IBRS STIBP L1DF SSBD'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0241 sec, LOAD: 0.6201 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0415 sec, LOAD: 0.5400 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0237 sec, LOAD: 0.4507 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0229 sec, LOAD: 0.2132 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0175 sec, LOAD: 0.4395 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0216 sec, LOAD: 0.1202 sec.

I'm using Python 3.7

Error Message:

  File "/usr/local/lib/python3.7/site-packages/mxnet/gluon/data/dataloader.py", line 435, in __next__
    self._push_next()
  File "/usr/local/lib/python3.7/site-packages/mxnet/gluon/data/dataloader.py", line 430, in _push_next
    self._worker_fn, (r, self._batchify_fn, self._dataset))
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 362, in apply_async
    raise ValueError("Pool not running")
ValueError: Pool not running

Minimum reproducible example

dl = iter(
                    gluon.data.DataLoader(
                        dataset,
                        batchify_fn=func,
                        **self.loader_kwargs
                    )
                )

num_worker >0

What have you tried to solve it?

  1. Commented the following in dataloader.py (Hack)
  def __del__(self):
        if self._worker_pool:
            # manually terminate due to a bug that pool is not automatically terminated
            # https://bugs.python.org/issue34172
            assert isinstance(self._worker_pool, multiprocessing.pool.Pool)
            self._worker_pool.terminate()
  1. Use DataLoaderV1 (Hack)
  2. Move the worker pool init and termination logic to _MultiWorkerIter (Probable fix)
@mxnet-label-bot
Copy link
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Gluon, Bug

@zachgk
Copy link
Contributor

zachgk commented May 23, 2019

@zhreshold Can you take a look?

@zhreshold
Copy link
Member

A workaround is to let DataLoader and iterator have same scope.

The worker pool is managed by dataloader for sake of resource conservation in case users created hundreds of iterators out of the same dataloader.
One way to fix this problem is to use ref counting for how many iterators are using the worker pool.
Any suggestion? @chandana1332

@chandana1332
Copy link
Contributor Author

The worker pool is managed by dataloader for sake of resource conservation in case users created hundreds of iterators out of the same dataloader.One way to fix this problem is to use ref counting for how many iterators are using the worker pool.

Got it! I see why the worker pool init was moved inside DataLoader. Ref counting is a good option but it might be better if we let python handle scope of objects rather than managing it on our side.
One way to let python handle it is to pass a reference of DataLoader to the iterator. This way DataLoader will not go out of scope untill all the iterators have gone out of scope (example: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader)
What do you think?

@zhreshold
Copy link
Member

@chandana1332 Yep, this is actually better. Would you like to contribute a minor patch for it?

@chandana1332
Copy link
Contributor Author

chandana1332 commented May 28, 2019

Yeah, I can.

Do you have instructions on what/how to test, etc?
Also, would this patch be applied to 1.4?

@zhreshold
Copy link
Member

Test cases are located in https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_gluon_data.py and you can add the failure case you just posted.
1.5 is to be cut soon, so I am not sure if there will be a new release for 1.4.

@chandana1332
Copy link
Contributor Author

Okay, I can add a test there.

What is the release date for 1.5?

@zhreshold
Copy link
Member

not sure yet

@szha
Copy link
Member

szha commented May 29, 2019

@chandana1332 While we want to release 1.5.0 as soon as possible, we still have a couple of PRs left before cutting the release candidate. https://lists.apache.org/thread.html/b8719632a1d23da619349e91610223c090071c02658d5153fa8f0757@%3Cdev.mxnet.apache.org%3E

In the meantime if you'd like to post the PR soon, @roywei is overseeing the release and helping coordinate. Just let us know when the change will be ready.

@chandana1332
Copy link
Contributor Author

Yeah, I should have it ready soon. Is there a date by when I need to put the PR in?

@roywei
Copy link
Member

roywei commented Jun 7, 2019

Hi @chandana1332, sorry I just saw this. As you can see in the above dev list discussion and release tracker. We are aming to tag 1.5.0 by today(06/07/2019). Do you need this change in 1.5.0? Once you create a PR and merge it, it will be available through nightly pip packages (pip install mxnet --pre), and we can add in in a patch release (1.5.1)

@chandana1332
Copy link
Contributor Author

Hey Wei,

I should have the patch ready by tomorrow. Not sure what the turn around time for releasing a pull request is but from my side, I'm okay if it gets patched to 1.5.1.

Thank you for checking in.

chandana1332 pushed a commit to chandana1332/incubator-mxnet that referenced this issue Jun 10, 2019
Issue: apache#15025
Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
chandana1332 added a commit to chandana1332/incubator-mxnet that referenced this issue Jun 10, 2019
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
chandana1332 added a commit to chandana1332/incubator-mxnet that referenced this issue Jun 11, 2019
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
chandana1332 added a commit to chandana1332/incubator-mxnet that referenced this issue Jun 11, 2019
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
chandana1332 added a commit to chandana1332/incubator-mxnet that referenced this issue Jun 11, 2019
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
chandana1332 added a commit to chandana1332/incubator-mxnet that referenced this issue Jun 11, 2019
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
wkcn pushed a commit that referenced this issue Jun 12, 2019
* Fixed a bug in Gluon DataLoader.
    Issue: #15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator

* Fixed a bug in Gluon DataLoader.
    Issue: #15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator

* Fixed a bug in Gluon DataLoader.
    Issue: #15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator

* Fixed a bug in Gluon DataLoader.
    Issue: #15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this issue Jun 23, 2019
* Fixed a bug in Gluon DataLoader.
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator

* Fixed a bug in Gluon DataLoader.
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator

* Fixed a bug in Gluon DataLoader.
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator

* Fixed a bug in Gluon DataLoader.
    Issue: apache#15025
    Fix: Broadened the scope of worker pool to iterators. Passed a reference of dataloader to the multi worker iterator
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants