
flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker #13484

Open
zheng-da opened this issue Nov 30, 2018 · 12 comments · Fixed by #13531

@zheng-da
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13418/11/pipeline

======================================================================
ERROR: test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\test_gluon_data.py", line 86, in test_recordimage_dataset_with_data_loader_multiworker
    for i, (x, y) in enumerate(loader):
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\windows_package\python\mxnet\gluon\data\dataloader.py", line 279, in next
    return self.__next__()
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\windows_package\python\mxnet\gluon\data\dataloader.py", line 267, in __next__
    self.shutdown()
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\windows_package\python\mxnet\gluon\data\dataloader.py", line 298, in shutdown
    w.terminate()
  File "C:\Anaconda3\envs\py2\lib\multiprocessing\process.py", line 137, in terminate
    self._popen.terminate()
  File "C:\Anaconda3\envs\py2\lib\multiprocessing\forking.py", line 312, in terminate
    _subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied

@zheng-da
Contributor Author

another one here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13436/4/pipeline

It seems this flaky test happens pretty frequently.

@vrakesh
Contributor

vrakesh commented Nov 30, 2018

@mxnet-label-bot add [Flaky, Test, Python]

@zheng-da
Contributor Author

zheng-da commented Dec 1, 2018

@larroy
Contributor

larroy commented Dec 4, 2018

Guys, disabling the test is not the right thing to do. I added the MXNET_HOME variable (see docs/faq/env_var.md) to deal with these problems. The right thing to do is to fix the Windows CI run to set a different MXNET_HOME for each CI worker so there's no concurrent access from different processes.
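
For reference, a minimal sketch of that idea, assuming the CI job exposes a per-executor identifier (Jenkins' EXECUTOR_NUMBER is used purely as an illustration); this is not the actual change from #13531:

```python
import os
import tempfile

# Hypothetical sketch: give each CI worker its own MXNET_HOME so concurrent
# test runs do not race on the same dataset cache directory.
executor = os.environ.get("EXECUTOR_NUMBER", "0")  # per-executor id (assumed to exist)
os.environ.setdefault(
    "MXNET_HOME",
    os.path.join(tempfile.gettempdir(), "mxnet_home_{}".format(executor)),
)

import mxnet as mx  # MXNet resolves its data/cache directory from MXNET_HOME when downloading
```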

larroy added a commit to larroy/mxnet that referenced this issue Dec 4, 2018
larroy added a commit to larroy/mxnet that referenced this issue Dec 5, 2018
larroy added a commit to larroy/mxnet that referenced this issue Dec 5, 2018
marcoabreu pushed a commit that referenced this issue Dec 5, 2018
…denied due t… (#13531)

* Use MXNET_HOME in cwd in windows to prevent access denied due to concurrent data downloads

Fixes #13484

* Revert "Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (#13527)"

This reverts commit 3d499cb.
zhaoyao73 pushed a commit to zhaoyao73/incubator-mxnet that referenced this issue Dec 13, 2018
…denied due t… (apache#13531)

* Use MXNET_HOME in cwd in windows to prevent access denied due to concurrent data downloads

Fixes apache#13484

* Revert "Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (apache#13527)"

This reverts commit 3d499cb.
@ChaiBapchya
Contributor

Please reopen.
PR #16253 hit a timeout error:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/PR-16253/3/pipeline

======================================================================
ERROR: test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\test_gluon_data.py", line 91, in test_recordimage_dataset_with_data_loader_multiworker
    for i, (x, y) in enumerate(loader):
  File "C:\jenkins_slave\workspace\ut-python-cpu\windows_package\python\mxnet\gluon\data\dataloader.py", line 473, in __next__
    batch = pickle.loads(ret.get(self._timeout))
  File "C:\Python37\lib\multiprocessing\pool.py", line 653, in get
    raise TimeoutError
multiprocessing.context.TimeoutError:
-------------------- >> begin captured stdout << ---------------------
Worker timed out after 120 seconds. This might be caused by
            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=908339891 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker

@larroy
Contributor

larroy commented Oct 3, 2019

This looks more like an IO stall than a bug.

@marcoabreu
Contributor

120 seconds is quite some time. Considering everything is happening on a local volume, it's quite unlikely that the disk is that busy. Could something be stuck? I think it's worth investigating.

@marcoabreu marcoabreu reopened this Oct 3, 2019
@larroy
Contributor

larroy commented Oct 4, 2019

Any suggestions? Is it reproducible?

@Mauhing

Mauhing commented Jun 16, 2020

It may be a shared-memory problem. Check shm usage with df -h and look for the shm mount. If it is at 100%, your multiprocessing workers will just stall.

If you use Docker, launch the container with a larger shared-memory size, e.g. docker run --shm-size=1024m.

Why? gluon.data.DataLoader uses Python multiprocessing, and multiprocessing needs shared memory. The default shared-memory size in a Docker container is 64 MB. You can check shm usage with df -h and look for the shm mount.
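
A rough way to do the same check from Python, assuming shared memory is mounted at /dev/shm (the usual case on Linux and inside Docker containers):

```python
import shutil

# Assumption: shared memory is mounted at /dev/shm (typical on Linux/Docker).
total, used, _free = shutil.disk_usage("/dev/shm")
print("shm: {:.0f} MiB used of {:.0f} MiB ({:.0%} full)".format(
    used / 2**20, total / 2**20, used / total))
# If this approaches 100% while the DataLoader workers run, they stall
# exactly as described above.
```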

@karan6181
Contributor

karan6181 commented Aug 5, 2020

Any suggestions or fix? The issue still persists.

Worker timed out after 120 seconds. This might be caused by
            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.

@Mauhing

Mauhing commented Aug 6, 2020

Any suggestions or fix? The issue still persists.

Worker timed out after 120 seconds. This might be caused by
            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.

Try monitoring your shared-memory usage while the program runs. If the problem persists, increase the shared-memory size.
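
Along the same lines, the two knobs named in the timeout message are ordinary DataLoader arguments; below is a minimal sketch with a placeholder dataset and illustrative values, assuming the MXNet version in use exposes the timeout argument (the traceback above suggests it does):

```python
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# Placeholder dataset standing in for the real record/image dataset.
dataset = ArrayDataset(mx.nd.random.uniform(shape=(256, 3, 32, 32)),
                       mx.nd.arange(256))

# Fewer workers lowers shared-memory pressure; a larger timeout tolerates
# slower transforms. Both values here are illustrative, not recommendations.
loader = DataLoader(dataset, batch_size=32, num_workers=2, timeout=300)

for data, label in loader:
    pass  # consume the batches
```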
