Flaky test test_gluon.test_export #11616
Looks like a CI problem @marcoabreu
To me, it looks like a test problem. It seems like we are hardcoding paths, causing problems if multiple runs are triggered in parallel. The same could happen on a user system if they have multiple processes starting up inside the same container or operating system, so we should consider this an important bug. Also, why do we write to system32?
I was treating sandboxing as a CI problem, as I could hardly think of any circumstance in which we'd want tests from one Jenkins executor to interact with tests in another executor on the same host.
Even if there is such a case, I imagine we'd want to manage it explicitly.
That particular API uses
On production systems running a service, we can't necessarily assume that every single request is handled in a sandbox. On Ubuntu we actually run every run in a separate container, but this is not the case on Windows. The Windows use case seems to be pretty close to reality: we don't want executors to interact with each other, but they do as a side effect because we don't make the paths unique. Just load the model zoo in two separate processes and this problem will pop up.

I'd recommend storing the files in a specific, non-unique directory and checking whether they exist (and match the hash) before downloading the archive from the internet. If they don't exist, download the zip file to a temporary directory. Then put a read-write lock (shared read, exclusive write) onto the non-unique target path and extract the archive file into it. Right now, the code assumes that we have exclusive access - especially the code which tries to delete the archive file, which might be in use by a different process because its path is also non-unique (on mobile, otherwise I'd post the code).
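The download scheme above can be sketched in Python. This is a hypothetical helper, not the actual model_store.get_model_file; it also swaps the suggested read-write lock for a simpler atomic-rename approach (download to a unique temp file, verify the hash, then rename into place), which gives the same safety for whole-file artifacts without platform-specific locking:

```python
import hashlib
import os
import tempfile
import urllib.request

def fetch_model_file(url, target_dir, expected_sha1):
    """Download url into target_dir only if a valid copy is not already there.

    Safe for concurrent callers: each process downloads to its own unique
    temporary file and atomically renames it into the shared target path,
    so no process ever reads a half-written file.
    """
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(url))

    def _sha1_ok(path):
        h = hashlib.sha1()
        try:
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
        except FileNotFoundError:
            return False
        return h.hexdigest() == expected_sha1

    if _sha1_ok(target):
        return target  # another process already placed a valid copy

    # Download to a unique temp file in the same directory, so the final
    # rename stays on one filesystem and is atomic.
    fd, tmp = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, 'wb') as out, urllib.request.urlopen(url) as resp:
            out.write(resp.read())
        if not _sha1_ok(tmp):
            raise IOError('checksum mismatch for %s' % url)
        os.replace(tmp, target)  # atomic; last writer wins
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return target
```

With this shape, concurrent processes may redundantly download the same file, but none of them can observe a partially extracted or partially written copy, and no process deletes a temp file that belongs to another.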
Thanks for elaborating on the home variable. That makes sense, because we run as administrator on Windows.
This situation has nothing to do with a micro service...
I'd agree this is a sandboxing issue in the CI. We'll take a look at how we can improve this @szha.
To clarify: do we expect our users to be able to run multiple instances of MXNet on the same host?
I agree with Marco. You can't assume that writing to a global path or acquiring a global resource is OK, because this prevents running tests in parallel in different copies of MXNet, which should be possible, as developers often have several copies of the same repo to test different things. So this test should write to a different location by means of a temporary directory. We SHOULD be able to run multiple instances of MXNet on the same host, and run tests independently, of course!
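The temporary-directory idea can be wrapped in a small test helper. This is a sketch, not the actual fix: it assumes the code under test resolves its cache/download root from an MXNET_HOME-style environment variable (the env-variable fix is mentioned later in this thread), and it restores the variable afterwards so tests stay self-contained:

```python
import os
import tempfile

def run_isolated(test_fn):
    """Run test_fn with the model cache pointed at a fresh temporary
    directory, so parallel copies of the test suite never share files.

    Assumes (hypothetically) that the code under test honours an
    MXNET_HOME environment variable for its download location.
    """
    old = os.environ.get('MXNET_HOME')
    with tempfile.TemporaryDirectory() as tmp:
        os.environ['MXNET_HOME'] = tmp
        try:
            return test_fn()
        finally:
            # Restore the caller's environment so the isolation itself
            # has no side effects on later tests.
            if old is None:
                os.environ.pop('MXNET_HOME', None)
            else:
                os.environ['MXNET_HOME'] = old
```

Each invocation gets its own directory, so two developer checkouts (or two Jenkins executors) running the same test never collide on disk.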
Just to clarify: the problem here seems to be the function under test, python\mxnet\gluon\model_zoo\model_store.py:get_model_file. It's not the test itself causing the problem, but rather the tested function having trouble with multi-tenancy, which this test reveals.
@marcoabreu clear, I'm working on a fix.
The import export tests are failing on my machine with latest master:
Many design decisions are based on having a single instance of mxnet in a system. For example, the mxnet processes don't attempt to find and talk to each other to coordinate on memory allocation. You'd need to change more than the download logic and tests to make it work.
Having this complexity in the code base just to do what containers should have done doesn't seem the right approach to me.
@szha I see your point. Assuming that only one process is able to run on a system goes against 30 years of Unix and multi-user OS development. I think it's a bad assumption to make in general, even if it's the optimal case. Containers are tools that shouldn't be required for correctness.
Considering that MXNet is not thread-safe and the interface (C API) is single-threaded (with a sticky thread), together with the assumption that you are only allowed to run a single instance of MXNet, we are in a quite difficult position.
You can run several instances of mxnet. Can you elaborate?
This is the assumption stated above: "Many design decisions are based on having a single instance of mxnet in a system."
@marcoabreu and to you, that statement equals "you are only allowed to run a single instance of MXNet"?
We can also ask whether one test should affect another in any way, and if so, why, and why it isn't managed explicitly.
Yes, because everything else is undefined behaviour, right? If we design a system to be used only in a specific way, we don't "guarantee" anything if people use it differently. How else would you interpret such a design statement?
My stand on that would be that our tests should all be completely self-contained and thus have no side effects that forbid us from running them in parallel, multiple times in sequence, or in another order - besides resource constraints like memory. What do you think?
OK. Following this logic, we either have inter-process memory management for GPU, or our CPU users can't run multi-process CPU inference. What guarantee does CI have, BTW?
Every process can manage its own GPU memory - CUDA supports multi-tenancy out of the box. You're probably only going to run into out-of-memory problems if you run too many requests. Ideally we'd have proper memory management between processes, but as long as people keep their batch sizes and models in mind, they should be fine. This is definitely far from optimal and would basically be like over-provisioning. BTW, we've been looking into CUDA memory oversubscription to smooth out memory spikes without crashing the run.

I didn't get the part about multi-process CPU inference. Please elaborate.

CI guarantees correctness of our tests if run sequentially. On CPU, the test suites (different Jenkins jobs) are run in parallel (up to 3x) in separate containers on Ubuntu, and up to 4x on a shared system on Windows (which revealed this problem). GPU tests (due to our out-of-memory issues) are always run in sequence, have exclusive access to the GPU (besides system processes), and no parallelization of runs happens there. Thus, the guarantee is basically correctness of single-threaded sequential execution on CPU and GPU.
Users don't directly deal with CUDA; they interact with CUDA through MXNet. MXNet has an internal memory pool that holds previously allocated memory, without information about GPU memory held by other processes. Thus, running multiple processes on the same GPU will likely hit OOM errors for many workloads. You'd know if you tried. The CPU inference part wasn't really important - I was just joking: "no inter-process memory management for GPU => no overall multi-tenancy guarantee for MXNet => even CPU users can't rely on multi-tenancy". Since CI guarantees correctness of tests run sequentially, and we won't find any problem with the test_export test if we indeed run sequentially on a host without other tests running, does it mean that running tests in parallel is breaking the guarantee?
I tried, and what you described was exactly my observation: you run out of CUDA memory if you run MXNet in parallel. I didn't know we had the memory pool, but that certainly explains it. That problem is the reason all our GPU slaves have only one executor. Is there any possibility to tweak or disable that pool? We're not in a state where we can give actual guarantees at all; that's why I had it in quotes :) No, running the tests in parallel (which is only the case on Windows CPU) only reveals the problems we're having when we actually don't run in sequence.
But yeah, we regularly test MXNet CPU on Windows and don't seem to encounter many problems. I also wouldn't expect many causes besides I/O (like this case), because the OS automatically takes care of RAM and CPU. The only problematic part in our case seems to be that we run out of resources on GPU - apparently due to the pooling.
My merged PR, in addition to @szha's env variable, should fix this issue; this can be closed.
Failed build:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1148/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1148/nodes/822/log/?start=0
Output: