Flaky test test_gluon.test_export #11616

Closed
KellenSunderland opened this issue Jul 9, 2018 · 31 comments

Comments

@KellenSunderland
Contributor

Failed build:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1148/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1148/nodes/822/log/?start=0

Output:

======================================================================
ERROR: test_gluon.test_export
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\test_gluon.py", line 787, in test_export
    prefix='resnet', ctx=ctx, pretrained=True)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\pkg_vc14_cpu\python\mxnet\gluon\model_zoo\vision\resnet.py", line 406, in resnet18_v1
    return get_resnet(1, 18, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\pkg_vc14_cpu\python\mxnet\gluon\model_zoo\vision\resnet.py", line 390, in get_resnet
    root=root), ctx=ctx)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\pkg_vc14_cpu\python\mxnet\gluon\model_zoo\model_store.py", line 114, in get_model_file
    os.remove(zip_file_path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Windows\\system32\\config\\systemprofile\\.mxnet\\models\\resnet18_v1-a0666292.zip'

-------------------- >> begin captured stdout << ---------------------
Model file is not found. Downloading.
Downloading C:\Windows\system32\config\systemprofile\.mxnet\models\resnet18_v1-a0666292.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
--------------------- >> end captured stdout << ----------------------

-------------------- >> begin captured logging << --------------------
requests.packages.urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): apache-mxnet.s3-accelerate.dualstack.amazonaws.com
requests.packages.urllib3.connectionpool: DEBUG: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com:443 "GET /gluon/models/resnet18_v1-a0666292.zip HTTP/1.1" 200 43452554
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=767602412 to reproduce.
--------------------- >> end captured logging << ---------------------

@szha
Member

szha commented Jul 9, 2018

Looks like a CI problem @marcoabreu

@marcoabreu
Contributor

marcoabreu commented Jul 9, 2018

To me, it looks like a test problem. It seems like we are hardcoding paths, causing problems if multiple runs are triggered in parallel. The same could happen on a user system if they have multiple processes starting up inside the same container or operating system, so we should consider this an important bug.

Also, why do we write to system32?

@szha
Member

szha commented Jul 9, 2018

I was treating sandboxing as a CI problem, as I could hardly think of any circumstance in which we'd want tests from one Jenkins executor to interact with tests in another executor on the same host.

@szha
Member

szha commented Jul 9, 2018

Even if there is such a case, I imagine we'd want to manage it explicitly.

@szha
Member

szha commented Jul 9, 2018

That particular API uses $HOME/.mxnet as the default path, so the system path likely comes from $HOME. One simple solution for such a case is to override $HOME and point it to a managed temporary path.
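
A minimal sketch of that idea in the test harness, assuming a Python-level setup step (the exact CI mechanism may differ):

import os
import tempfile

# Sketch: point the home directory at a per-run scratch location *before*
# importing mxnet, so the model zoo's default $HOME/.mxnet cache is unique
# to this run. USERPROFILE is set as well since Python's expanduser may
# consult it on Windows.
scratch_home = tempfile.mkdtemp(prefix='mxnet-test-home-')
os.environ['HOME'] = scratch_home
os.environ['USERPROFILE'] = scratch_home

import mxnet as mx  # imported after the override so defaults pick it up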

@marcoabreu
Contributor

On production systems running a service, we can't necessarily assume that every single request is handled in a sandbox. On Ubuntu we actually run in a separate container for every single run, but that is not the case on Windows. The Windows use case seems to be pretty close to reality.

We don't want executors to interact with each other, but that's a side effect of not using unique paths. Just load the model zoo in two separate processes and this problem will pop up.

I'd recommend storing the files in a specific, non-unique directory and checking whether they exist (and match the hash) before downloading the archive from the internet. If a file doesn't exist, download the zip to a temporary directory, then take a read-write lock (shared read, exclusive write) on the non-unique target path and extract the archive into it. Right now, the code assumes it has exclusive access, especially the code that deletes the archive file, which might be in use by another process because its path is also non-unique (I'm on mobile, otherwise I'd post the code).
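
For illustration, a minimal sketch of that download flow (hypothetical names, not the actual get_model_file code, and the read-write lock is left out): reuse the shared file when its checksum matches, otherwise download to a unique temporary file and atomically rename it into place, so no process deletes a zip that another process may still be reading.

import hashlib
import os
import tempfile

import requests  # assumed available, as in MXNet's own download helper


def fetch_model_file(url, target_path, sha1_expected):
    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    # Reuse the shared (non-unique) file if it already matches the hash.
    if os.path.exists(target_path) and sha1_of(target_path) == sha1_expected:
        return target_path

    # Download into a unique temp file in the same directory so the final
    # rename stays on one filesystem and is atomic.
    os.makedirs(os.path.dirname(target_path), exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(target_path))
    try:
        with os.fdopen(fd, 'wb') as tmp:
            r = requests.get(url, stream=True)
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=1 << 20):
                tmp.write(chunk)
        if sha1_of(tmp_path) != sha1_expected:
            raise IOError('checksum mismatch for %s' % url)
        os.replace(tmp_path, target_path)  # atomic; nothing left to delete
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return target_path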

@marcoabreu
Contributor

Thanks for elaborating on the home variable. That makes sense because we run as administrator on Windows.

@szha
Member

szha commented Jul 9, 2018

This situation has nothing to do with a microservice...

@KellenSunderland
Contributor Author

I'd agree this is a sandboxing issue in the CI. We'll take a look at how we can improve this @szha.

@marcoabreu
Contributor

To clarify: do we expect our users to be able to run multiple instances of MXNet on the same host?

@larroy
Contributor

larroy commented Jul 10, 2018

I agree with Marco. You can't assume that writing to a global path or acquiring a global resource is OK, because that prevents running tests in parallel in different copies of MXNet, which should be possible, as developers often have several copies of the same repo to test different things.

So this test should write to a different location by means of a temporary directory.
Whether it's good practice for a unit test to write to disk at all, rather than avoiding I/O or using a memory buffer, is a broader discussion.

We SHOULD be able to run multiple instances of MXNet on the same host, and run tests independently, of course!!!
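
For the temporary-directory point above, a sketch of how the test could isolate its downloads, assuming the root keyword visible in the traceback (forwarded to get_model_file) is accepted by the model zoo constructors:

import tempfile

import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Sketch: give the model zoo a per-run root inside a temporary directory
# instead of the shared $HOME/.mxnet/models default, so parallel runs never
# touch the same zip file.
with tempfile.TemporaryDirectory() as tmpdir:
    net = vision.resnet18_v1(pretrained=True, ctx=mx.cpu(), root=tmpdir)
    net.hybridize()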

@marcoabreu
Contributor

marcoabreu commented Jul 10, 2018

Just to clarify: the problem here seems to be the function under test, python\mxnet\gluon\model_zoo\model_store.py:get_model_file. It's not the test itself causing the problem, but rather the tested function having trouble with multi-tenancy, which this test reveals.

@larroy
Contributor

larroy commented Jul 10, 2018

@marcoabreu clear, I'm working on a fix.

@larroy
Contributor

larroy commented Jul 10, 2018

The import/export tests are failing on my machine with latest master:


[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=233564263 to reproduce.
FAIL

======================================================================
FAIL: test_gluon.test_export
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/piotr/py3/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/piotr/devel/mxnet/mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/home/piotr/devel/mxnet/mxnet/tests/python/unittest/test_gluon.py", line 799, in test_export
    assert_almost_equal(out.asnumpy(), mod_out.asnumpy())
  File "/home/piotr/devel/mxnet/mxnet/python/mxnet/test_utils.py", line 494, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 5336385536.000000 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(0, 321), a=-8591.546875, b=0.161002
 a: array([[ -6051.294   ,   4307.4683  ,   4112.863   ,  -6575.2446  ,
         -2436.4683  ,   3050.015   ,  -8878.826   ,  -8480.262   ,
         -9142.191   ,   -310.93896 ,   -661.2363  ,  -5160.913   ,...
 b: array([[ -51.129765  ,   12.116906  ,   15.172363  ,  -63.08578   ,
         -44.068184  ,  -10.512493  ,  -66.99067   ,  -45.393295  ,
         -63.472206  ,  -63.23494   ,   12.72966   ,   65.423035  ,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=2102901580 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=233564263 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 1.050s

@szha
Member

szha commented Jul 10, 2018

Many design decisions are based on having a single instance of mxnet in a system. For example, the mxnet processes don't attempt to find and talk to each other to coordinate on memory allocation. You'd need to change more than the download logic and tests to make it work.

@szha
Member

szha commented Jul 10, 2018

Having this complexity in the code base just to do what containers should have done doesn't seem the right approach to me.

@larroy
Contributor

larroy commented Jul 11, 2018

@szha I see your point. Assuming that only one process is able to run on a system goes against 30 years of Unix and multi-user OS development. I think it's a bad assumption to make in general, even if it's the optimal case. Containers are tools that shouldn't be required for correctness.

@marcoabreu
Contributor

The fact that MXNet is not thread-safe and that the interface (the C API) is single-threaded (with a sticky thread), together with the assumption that you're only allowed to run a single instance of MXNet, puts us in quite a difficult position.

@larroy
Contributor

larroy commented Jul 11, 2018

You can run several instances of mxnet. Can you elaborate?

@marcoabreu
Contributor

This is the assumption stated above: "Many design decisions are based on having a single instance of mxnet in a system."

@szha
Member

szha commented Jul 11, 2018

@marcoabreu and to you, that statement equals "you are only allowed to run a single instance of MXNet"?

@szha
Member

szha commented Jul 11, 2018

We can also ask whether one test should affect another in any way, and if so, why, and why that isn't managed explicitly.

@marcoabreu
Contributor

Yes, because everything else is undefined behaviour, right? If we design a system to be used only in a specific way, we don't "guarantee" anything if people use it differently. How else would you interpret such a design statement?

@marcoabreu
Contributor

marcoabreu commented Jul 11, 2018

My stance would be that our tests should all be completely self-contained and thus have no side effects that prevent us from running them in parallel, multiple times in sequence, or in another order, aside from resource constraints like memory. What do you think?

@szha
Member

szha commented Jul 11, 2018

OK. Following this logic, we either have inter-process memory management for GPU or our CPU users can't run multi-process CPU inference. What guarantee does CI give, BTW?

@marcoabreu
Contributor

Every process can manage its own GPU memory; CUDA supports multi-tenancy out of the box. You're probably only going to run into out-of-memory problems if you run too many requests. Ideally we'd have proper memory management between processes, but as long as people keep their batch sizes and models in mind, they should be fine. This is definitely far from optimal and would basically be like over-provisioning. BTW, we've been looking into CUDA memory oversubscription to smooth out memory spikes without crashing the run.

I didn't get the part about multi-process CPU inference. Please elaborate.

CI guarantees correctness of our tests if they are run sequentially. On CPU, the test suites (different Jenkins jobs) run in parallel (up to 3x) in separate containers on Ubuntu, and up to 4x on a shared system on Windows (which revealed this problem). GPU tests (due to out-of-memory issues) always run in sequence and have exclusive access to the GPU (besides system processes), with no parallelization of runs.

Thus, the guarantee is basically correctness of single threaded sequential execution on CPU and GPU.

@szha
Member

szha commented Jul 11, 2018

Users don't directly deal with CUDA. They interact with CUDA through MXNet. MXNet has an internal memory pool that holds previously allocated memory, without any information about GPU memory held by other processes. Thus, running multiple processes on the same GPU will likely hit OOM errors for many workloads. You'd know if you tried.

The CPU inference part wasn't really important. I was just joking about "no inter-process memory management for GPU => no overall multi-tenancy guarantee for MXNet => even CPU users can't rely on multi-tenancy".

Since CI guarantees correctness of tests if run sequentially, and we won't find any problem with the test_export test if we indeed run sequentially on a host without other tests running, does it mean that running tests in parallel is breaking the guarantee?

@marcoabreu
Contributor

I tried, and what you described was exactly my observation: you run out of CUDA memory if you run MXNet in parallel. I didn't know we had the memory pool, but that certainly explains it. That problem is the reason all our GPU slaves have only one executor. Is there any possibility to tweak or disable that pool?

We're not in a state where we can give actual guarantees at all, that's why I had it in quotes :)

No, running the tests in parallel (which is only the case on Windows CPU) only reveals the problems we have when we don't actually run in sequence.

@marcoabreu
Contributor

marcoabreu commented Jul 11, 2018

But yeah, we regularly test MXNet CPU on Windows and don't seem to encounter many problems. I also wouldn't expect many issues besides I/O (like this case), because the OS automatically takes care of RAM and CPU. The only problematic part in our case seems to be that we run out of resources on GPU, apparently due to the pooling.

@szha
Member

szha commented Jul 11, 2018

export MXNET_GPU_MEM_POOL_RESERVE=100 should effectively disable the pool.
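
A usage sketch: in a test process this would need to take effect before MXNet allocates any GPU memory, for example

import os

# Reserve 100% of GPU memory for non-pool use, which should leave nothing
# for the pool to cache; set before importing mxnet so the engine sees it.
os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '100'

import mxnet as mx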

@larroy
Contributor

larroy commented Aug 6, 2018

My merged PR, in addition to @szha's env variable, should fix this issue; it can be closed.
