Flaky test test_gluon.test_export #11616

Closed
KellenSunderland opened this issue Jul 9, 2018 · 31 comments

Comments

@KellenSunderland
Contributor

Failed build:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1148/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1148/nodes/822/log/?start=0

Output:

======================================================================
ERROR: test_gluon.test_export
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\tests\python\unittest\test_gluon.py", line 787, in test_export
    prefix='resnet', ctx=ctx, pretrained=True)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\pkg_vc14_cpu\python\mxnet\gluon\model_zoo\vision\resnet.py", line 406, in resnet18_v1
    return get_resnet(1, 18, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\pkg_vc14_cpu\python\mxnet\gluon\model_zoo\vision\resnet.py", line 390, in get_resnet
    root=root), ctx=ctx)
  File "C:\jenkins_slave\workspace\ut-python-cpu@3\pkg_vc14_cpu\python\mxnet\gluon\model_zoo\model_store.py", line 114, in get_model_file
    os.remove(zip_file_path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Windows\\system32\\config\\systemprofile\\.mxnet\\models\\resnet18_v1-a0666292.zip'

-------------------- >> begin captured stdout << ---------------------
Model file is not found. Downloading.
Downloading C:\Windows\system32\config\systemprofile\.mxnet\models\resnet18_v1-a0666292.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
--------------------- >> end captured stdout << ----------------------

-------------------- >> begin captured logging << --------------------
requests.packages.urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): apache-mxnet.s3-accelerate.dualstack.amazonaws.com
requests.packages.urllib3.connectionpool: DEBUG: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com:443 "GET /gluon/models/resnet18_v1-a0666292.zip HTTP/1.1" 200 43452554
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=767602412 to reproduce.
--------------------- >> end captured logging << ---------------------

@szha
Member

szha commented Jul 9, 2018

Looks like a CI problem @marcoabreu

@marcoabreu
Contributor

marcoabreu commented Jul 9, 2018

To me, it looks like a test problem. It seems like we are hardcoding paths, causing problems if multiple runs are triggered in parallel. The same could happen on a user system if they have multiple processes starting up inside the same container or operating system, so we should consider this an important bug.

Also, why do we write to system32?

@szha
Member

szha commented Jul 9, 2018

I was treating sandboxing as a CI problem, as I could hardly think of any circumstance in which we'd want tests from one Jenkins executor to interact with tests in another executor on the same host.

@szha
Member

szha commented Jul 9, 2018

Even if there is such a case, I imagine we'd want to manage it explicitly.

@szha
Member

szha commented Jul 9, 2018

That particular API uses $HOME/.mxnet as the default path, so the system path likely comes from $HOME. One simple solution for such a case is to override $HOME and point it to a managed temporary path.
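
A minimal sketch of that idea in the test harness, assuming a Python-level setup step (the exact CI mechanism may differ):

import os
import tempfile

# Sketch: point the home directory at a per-run scratch location *before*
# importing mxnet, so the model zoo's default $HOME/.mxnet cache is unique
# to this run. USERPROFILE is set as well since Python's expanduser may
# consult it on Windows.
scratch_home = tempfile.mkdtemp(prefix='mxnet-test-home-')
os.environ['HOME'] = scratch_home
os.environ['USERPROFILE'] = scratch_home

import mxnet as mx  # imported after the override so defaults pick it up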

@marcoabreu
Contributor

On production systems running a service, we can't necessarily assume that every single request is handled in a sandbox. On Ubuntu we actually run in a separate container for every single run, but that is not the case on Windows. The Windows use case seems to be pretty close to reality.

We don't want executors to interact with each other, but that's a side effect of not using unique paths. Just load the model zoo in two separate processes and this problem will pop up.

I'd recommend storing the files in a specific, non-unique directory and checking whether they exist (and match the hash) before downloading the archive from the internet. If a file doesn't exist, download the zip to a temporary directory, then take a read-write lock (shared read, exclusive write) on the non-unique target path and extract the archive into it. Right now, the code assumes it has exclusive access, especially the code that deletes the archive file, which might be in use by another process because its path is also non-unique (I'm on mobile, otherwise I'd post the code).
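
For illustration, a minimal sketch of that download flow (hypothetical names, not the actual get_model_file code, and the read-write lock is left out): reuse the shared file when its checksum matches, otherwise download to a unique temporary file and atomically rename it into place, so no process deletes a zip that another process may still be reading.

import hashlib
import os
import tempfile

import requests  # assumed available, as in MXNet's own download helper


def fetch_model_file(url, target_path, sha1_expected):
    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    # Reuse the shared (non-unique) file if it already matches the hash.
    if os.path.exists(target_path) and sha1_of(target_path) == sha1_expected:
        return target_path

    # Download into a unique temp file in the same directory so the final
    # rename stays on one filesystem and is atomic.
    os.makedirs(os.path.dirname(target_path), exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(target_path))
    try:
        with os.fdopen(fd, 'wb') as tmp:
            r = requests.get(url, stream=True)
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=1 << 20):
                tmp.write(chunk)
        if sha1_of(tmp_path) != sha1_expected:
            raise IOError('checksum mismatch for %s' % url)
        os.replace(tmp_path, target_path)  # atomic; nothing left to delete
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return target_path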

@marcoabreu
Contributor

Thanks for elaborating on the home variable. That makes sense because we run as administrator on Windows.

@szha
Member

szha commented Jul 9, 2018

This situation has nothing to do with a microservice...

@KellenSunderland
Contributor Author

I'd agree this is a sandboxing issue in the CI. We'll take a look at how we can improve this @szha.

@marcoabreu
Contributor

To clarify: do we expect our users to be able to run multiple instances of MXNet on the same host?

@larroy
Contributor

larroy commented Jul 10, 2018

I agree with Marco. You can't assume that writing to a global path or acquiring a global resource is OK, because that prevents running tests in parallel in different copies of MXNet, which should be possible, as developers often have several copies of the same repo to test different things.

So this test should write to a different location by means of a temporary directory.
Whether it's good practice for a unit test to write to disk at all, rather than avoiding I/O or using a memory buffer, is a broader discussion.

We SHOULD be able to run multiple instances of MXNet on the same host, and run tests independently, of course!!!
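
For the temporary-directory point above, a sketch of how the test could isolate its downloads, assuming the root keyword visible in the traceback (forwarded to get_model_file) is accepted by the model zoo constructors:

import tempfile

import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Sketch: give the model zoo a per-run root inside a temporary directory
# instead of the shared $HOME/.mxnet/models default, so parallel runs never
# touch the same zip file.
with tempfile.TemporaryDirectory() as tmpdir:
    net = vision.resnet18_v1(pretrained=True, ctx=mx.cpu(), root=tmpdir)
    net.hybridize()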

@marcoabreu
Contributor

marcoabreu commented Jul 10, 2018

Just to clarify: the problem here seems to be the function under test, python\mxnet\gluon\model_zoo\model_store.py:get_model_file. It's not the test itself causing the problem, but rather the tested function having trouble with multi-tenancy, which this test reveals.

@larroy
Contributor

larroy commented Jul 10, 2018

@marcoabreu clear, I'm working on a fix.

@larroy
Contributor

larroy commented Jul 10, 2018

The import/export tests are failing on my machine with latest master:


[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=233564263 to reproduce.
FAIL

======================================================================
FAIL: test_gluon.test_export
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/piotr/py3/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/piotr/devel/mxnet/mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/home/piotr/devel/mxnet/mxnet/tests/python/unittest/test_gluon.py", line 799, in test_export
    assert_almost_equal(out.asnumpy(), mod_out.asnumpy())
  File "/home/piotr/devel/mxnet/mxnet/python/mxnet/test_utils.py", line 494, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 5336385536.000000 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(0, 321), a=-8591.546875, b=0.161002
 a: array([[ -6051.294   ,   4307.4683  ,   4112.863   ,  -6575.2446  ,
         -2436.4683  ,   3050.015   ,  -8878.826   ,  -8480.262   ,
         -9142.191   ,   -310.93896 ,   -661.2363  ,  -5160.913   ,...
 b: array([[ -51.129765  ,   12.116906  ,   15.172363  ,  -63.08578   ,
         -44.068184  ,  -10.512493  ,  -66.99067   ,  -45.393295  ,
         -63.472206  ,  -63.23494   ,   12.72966   ,   65.423035  ,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=2102901580 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=233564263 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 1.050s

@szha
Member

szha commented Jul 10, 2018

Many design decisions are based on having a single instance of mxnet in a system. For example, the mxnet processes don't attempt to find and talk to each other to coordinate on memory allocation. You'd need to change more than the download logic and tests to make it work.

@szha
Member

szha commented Jul 10, 2018

Having this complexity in the code base just to do what containers should have done doesn't seem the right approach to me.

@larroy
Contributor

larroy commented Jul 11, 2018

@szha I see your point. Assuming that only one process is able to run on a system goes against 30 years of Unix and multi-user OS development. I think it's a bad assumption to make in general, even if it's the optimal case. Containers are tools that shouldn't be required for correctness.

@marcoabreu
Contributor

The fact that MXNet is not thread-safe and that the interface (the C API) is single-threaded (with a sticky thread), together with the assumption that you're only allowed to run a single instance of MXNet, puts us in quite a difficult position.

@larroy
Contributor

larroy commented Jul 11, 2018

You can run several instances of mxnet. Can you elaborate?

@marcoabreu
Contributor

This is the assumption stated above: "Many design decisions are based on having a single instance of mxnet in a system."

@szha
Member

szha commented Jul 11, 2018

@marcoabreu and to you, that statement equals "you are only allowed to run a single instance of MXNet"?

@szha
Member

szha commented Jul 11, 2018

We can also ask whether one test should affect another in any way, and if so, why, and why that isn't managed explicitly.

@marcoabreu
Contributor

Yes, because everything else is undefined behaviour, right? If we design a system to be used only in a specific way, we don't "guarantee" anything if people use it differently. How else would you interpret such a design statement?

@marcoabreu
Contributor

marcoabreu commented Jul 11, 2018

My stance would be that our tests should all be completely self-contained and thus have no side effects that prevent us from running them in parallel, multiple times in sequence, or in another order, aside from resource constraints like memory. What do you think?

@szha
Member

szha commented Jul 11, 2018

OK. Following this logic, we either have inter-process memory management for GPU or our CPU users can't run multi-process CPU inference. What guarantee does CI give, BTW?

@marcoabreu
Contributor

Every process can manage its own GPU memory; CUDA supports multi-tenancy out of the box. You're probably only going to run into out-of-memory problems if you run too many requests. Ideally we'd have proper memory management between processes, but as long as people keep their batch sizes and models in mind, they should be fine. This is definitely far from optimal and would basically be like over-provisioning. BTW, we've been looking into CUDA memory oversubscription to smooth out memory spikes without crashing the run.

I didn't get the part about multi-process CPU inference. Please elaborate.

CI guarantees correctness of our tests if they are run sequentially. On CPU, the test suites (different Jenkins jobs) run in parallel (up to 3x) in separate containers on Ubuntu, and up to 4x on a shared system on Windows (which revealed this problem). GPU tests (due to out-of-memory issues) always run in sequence and have exclusive access to the GPU (besides system processes), with no parallelization of runs.

Thus, the guarantee is basically correctness of single threaded sequential execution on CPU and GPU.

@szha
Member

szha commented Jul 11, 2018

Users don't directly deal with CUDA. They interact with CUDA through MXNet. MXNet has an internal memory pool that holds previously allocated memory, without any information about GPU memory held by other processes. Thus, running multiple processes on the same GPU will likely hit OOM errors for many workloads. You'd know if you tried.

The CPU inference part wasn't really important. I was just joking about "no inter-process memory management for GPU => no overall multi-tenancy guarantee for MXNet => even CPU users can't rely on multi-tenancy".

Since CI guarantees correctness of tests if run sequentially, and we won't find any problem with the test_export test if we indeed run sequentially on a host without other tests running, does it mean that running tests in parallel is breaking the guarantee?

@marcoabreu
Contributor

I tried, and what you described was exactly my observation: you run out of CUDA memory if you run MXNet in parallel. I didn't know we had the memory pool, but that certainly explains it. That problem is the reason all our GPU slaves have only one executor. Is there any possibility to tweak or disable that pool?

We're not in a state where we can give actual guarantees at all, that's why I had it in quotes :)

No, running the tests in parallel (which is only the case on Windows CPU) only reveals the problems we have when we don't actually run in sequence.

@marcoabreu
Contributor

marcoabreu commented Jul 11, 2018

But yeah, we regularly test MXNet CPU on Windows and don't seem to encounter many problems. I also wouldn't expect many issues besides I/O (like this case), because the OS automatically takes care of RAM and CPU. The only problematic part in our case seems to be that we run out of resources on GPU, apparently due to the pooling.

@szha
Member

szha commented Jul 11, 2018

export MXNET_GPU_MEM_POOL_RESERVE=100 should effectively disable the pool.
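
A usage sketch: in a test process this would need to take effect before MXNet allocates any GPU memory, for example

import os

# Reserve 100% of GPU memory for non-pool use, which should leave nothing
# for the pool to cache; set before importing mxnet so the engine sees it.
os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '100'

import mxnet as mx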

@larroy
Contributor

larroy commented Aug 6, 2018

My merged PR, in addition to @szha's env variable, should fix this issue; it can be closed.
