
flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker #13484

Open
zheng-da opened this issue Nov 30, 2018 · 12 comments · Fixed by #13531

@zheng-da
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13418/11/pipeline

======================================================================
ERROR: test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\test_gluon_data.py", line 86, in test_recordimage_dataset_with_data_loader_multiworker
    for i, (x, y) in enumerate(loader):
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\windows_package\python\mxnet\gluon\data\dataloader.py", line 279, in next
    return self.__next__()
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\windows_package\python\mxnet\gluon\data\dataloader.py", line 267, in __next__
    self.shutdown()
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\windows_package\python\mxnet\gluon\data\dataloader.py", line 298, in shutdown
    w.terminate()
  File "C:\Anaconda3\envs\py2\lib\multiprocessing\process.py", line 137, in terminate
    self._popen.terminate()
  File "C:\Anaconda3\envs\py2\lib\multiprocessing\forking.py", line 312, in terminate
    _subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied

@zheng-da
Contributor Author

another one here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13436/4/pipeline

It seems this flaky test happens pretty frequently.

@vrakesh
Contributor

vrakesh commented Nov 30, 2018

@mxnet-label-bot add [Flaky, Test, Python]

@zheng-da
Contributor Author

zheng-da commented Dec 1, 2018

@larroy
Contributor

larroy commented Dec 4, 2018

Guys, disabling the test is not the right thing to do. I added the MXNET_HOME variable (see docs/faq/env_var.md) to deal with these problems. The right thing to do is to fix the Windows CI run to set a different MXNET_HOME for each CI worker so there's no concurrent access from different processes.
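
For reference, a minimal sketch of that idea, assuming the CI job exposes a per-executor identifier (Jenkins' EXECUTOR_NUMBER is used purely as an illustration); this is not the actual change from #13531:

```python
import os
import tempfile

# Hypothetical sketch: give each CI worker its own MXNET_HOME so concurrent
# test runs do not race on the same dataset cache directory.
executor = os.environ.get("EXECUTOR_NUMBER", "0")  # per-executor id (assumed to exist)
os.environ.setdefault(
    "MXNET_HOME",
    os.path.join(tempfile.gettempdir(), "mxnet_home_{}".format(executor)),
)

import mxnet as mx  # MXNet resolves its data/cache directory from MXNET_HOME when downloading
```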

larroy added a commit to larroy/mxnet that referenced this issue Dec 4, 2018
larroy added a commit to larroy/mxnet that referenced this issue Dec 5, 2018
larroy added a commit to larroy/mxnet that referenced this issue Dec 5, 2018
marcoabreu pushed a commit that referenced this issue Dec 5, 2018
…denied due t… (#13531)

* Use MXNET_HOME in cwd in windows to prevent access denied due to concurrent data downloads

Fixes #13484

* Revert "Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (#13527)"

This reverts commit 3d499cb.
zhaoyao73 pushed a commit to zhaoyao73/incubator-mxnet that referenced this issue Dec 13, 2018
…denied due t… (apache#13531)

* Use MXNET_HOME in cwd in windows to prevent access denied due to concurrent data downloads

Fixes apache#13484

* Revert "Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (apache#13527)"

This reverts commit 3d499cb.
@ChaiBapchya
Contributor

Please reopen.
PR #16253 hit a timeout error:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/PR-16253/3/pipeline

======================================================================
ERROR: test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\test_gluon_data.py", line 91, in test_recordimage_dataset_with_data_loader_multiworker
    for i, (x, y) in enumerate(loader):
  File "C:\jenkins_slave\workspace\ut-python-cpu\windows_package\python\mxnet\gluon\data\dataloader.py", line 473, in __next__
    batch = pickle.loads(ret.get(self._timeout))
  File "C:\Python37\lib\multiprocessing\pool.py", line 653, in get
    raise TimeoutError
multiprocessing.context.TimeoutError:
-------------------- >> begin captured stdout << ---------------------
Worker timed out after 120 seconds. This might be caused by
            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=908339891 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker

@larroy
Contributor

larroy commented Oct 3, 2019

This looks more like an IO stall than a bug.

@marcoabreu
Contributor

120 seconds is quite some time. Considering everything is happening on a local volume, it's quite unlikely that the disk is that busy. Could something be stuck? I think it's worth investigating.

@marcoabreu marcoabreu reopened this Oct 3, 2019
@larroy
Contributor

larroy commented Oct 4, 2019

Any suggestions? Is it reproducible?

@Mauhing

Mauhing commented Jun 16, 2020

It may be a shared-memory problem. Check shm usage with df -h and look for the shm mount. If it is at 100%, your multiprocessing workers will just stall.

If you use Docker, launch the container with a larger shared-memory size, e.g. docker run --shm-size=1024m.

Why? gluon.data.DataLoader uses Python multiprocessing, and multiprocessing needs shared memory. The default shared-memory size in a Docker container is 64 MB. You can check shm usage with df -h and look for the shm mount.
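
A rough way to do the same check from Python, assuming shared memory is mounted at /dev/shm (the usual case on Linux and inside Docker containers):

```python
import shutil

# Assumption: shared memory is mounted at /dev/shm (typical on Linux/Docker).
total, used, _free = shutil.disk_usage("/dev/shm")
print("shm: {:.0f} MiB used of {:.0f} MiB ({:.0%} full)".format(
    used / 2**20, total / 2**20, used / total))
# If this approaches 100% while the DataLoader workers run, they stall
# exactly as described above.
```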

@karan6181
Contributor

karan6181 commented Aug 5, 2020

Any suggestions or fix? The issue still persists.

Worker timed out after 120 seconds. This might be caused by
            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.

@Mauhing

Mauhing commented Aug 6, 2020

Any suggestions or fix? The issue still persists.

Worker timed out after 120 seconds. This might be caused by
            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.

Try monitoring your shared-memory usage while the program runs. If the problem persists, increase the shared-memory size.
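
Along the same lines, the two knobs named in the timeout message are ordinary DataLoader arguments; below is a minimal sketch with a placeholder dataset and illustrative values, assuming the MXNet version in use exposes the timeout argument (the traceback above suggests it does):

```python
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# Placeholder dataset standing in for the real record/image dataset.
dataset = ArrayDataset(mx.nd.random.uniform(shape=(256, 3, 32, 32)),
                       mx.nd.arange(256))

# Fewer workers lowers shared-memory pressure; a larger timeout tolerates
# slower transforms. Both values here are illustrative, not recommendations.
loader = DataLoader(dataset, batch_size=32, num_workers=2, timeout=300)

for data, label in loader:
    pass  # consume the batches
```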
