When predicting, does mxnet provide thread-safe interface? #3946
Comments
I have the same question.
The engine is not thread-safe, so there is no way to use multiple threads to push computation to the engine. However, the engine already does threading and scheduling for you, so you really shouldn't need to.
BTW, Python multithreading doesn't really work due to the GIL. Try multiprocessing instead.
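For what it's worth, here is a minimal multiprocessing sketch of that suggestion (the checkpoint prefix, input shape, and worker count are placeholders, and the Module calls assume the 1.x Python API). Each process loads its own private copy of the model, so this trades memory for GIL-free parallelism:

```python
import multiprocessing as mp
import numpy as np

MODEL_PREFIX = 'mymodel'   # hypothetical checkpoint prefix
EPOCH = 0

_mod = None  # per-process Module, created by the pool initializer

def _init_worker():
    # Runs once in each worker process. Importing mxnet here (and not in the
    # parent) keeps the engine state private to each process.
    global _mod
    import mxnet as mx
    sym, arg_params, aux_params = mx.model.load_checkpoint(MODEL_PREFIX, EPOCH)
    _mod = mx.mod.Module(symbol=sym, label_names=None, context=mx.cpu())
    _mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
    _mod.set_params(arg_params, aux_params)

def predict(array):
    import mxnet as mx
    _mod.forward(mx.io.DataBatch(data=[mx.nd.array(array)]), is_train=False)
    return _mod.get_outputs()[0].asnumpy()

if __name__ == '__main__':
    batches = [np.zeros((1, 3, 224, 224), dtype=np.float32) for _ in range(8)]
    with mp.Pool(processes=4, initializer=_init_worker) as pool:
        results = pool.map(predict, batches)
```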
@piiswrong On the same card, if I bind multiple executors on the same set of weight ndarrays and run them from different threads simultaneously, is that ok?
I have the same problem. Is it possible to provide a multi-threaded prediction service using mxnet? In the caffe case, we have to copy the whole net as many times as the thread count. Is mxnet any better? My trained model size is normally around 1GB, so memory consumption is a big issue.
Any progress here?
I have the same question. Any progress here?
I think we should separate the "data" from the model weights. When predicting, each thread can share the model weights and use different data. In this way, we have much less memory consumption.
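As a rough illustration of that idea with the 1.x Symbol API (the toy network, names, and shapes below are made up, and whether it is safe to call forward() on these executors from different threads at the same time is exactly the open question in this issue), several executors can be bound against the same parameter NDArrays while each keeps its own data buffer:

```python
import mxnet as mx

# a toy network; names and shapes are illustrative only
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10, name='fc')

# parameters loaded/initialized once and shared by every executor
shared_params = {
    'fc_weight': mx.nd.random.normal(shape=(10, 100)),
    'fc_bias': mx.nd.zeros((10,)),
}

def make_executor(batch_size=1):
    args = dict(shared_params)
    args['data'] = mx.nd.zeros((batch_size, 100))  # private input buffer
    return net.bind(ctx=mx.cpu(), args=args, grad_req='null')

exec_a = make_executor()
exec_b = make_executor()  # reuses the very same weight NDArrays

exec_a.arg_dict['data'][:] = mx.nd.random.uniform(shape=(1, 100))
out_a = exec_a.forward(is_train=False)[0]
```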
Tl;dr: if you use a high-performance Python web server like gunicorn, you'll get what you want. You'll have lots of cores running in parallel working on different requests, and you won't need an in-memory copy of the model for each worker.

Getting a complex codebase to be thread-safe is no small task, so this won't be "resolved" any time soon. Fortunately it's not necessary here for most of what you want. Your answer lies in the Unix fork() call. If you want to understand it, go read: https://en.wikipedia.org/wiki/Fork_(system_call)

The magic of fork is copy-on-write memory semantics, whereby each forked worker has its own virtual memory address space, but they all share the same physical memory (until one of them writes to the shared memory, in which case a private copy of that memory block is made in that process's virtual address space -- thus "copy-on-write"). So even though it's not multi-threaded, fork() and pre-fork worker servers like gunicorn let you accomplish almost the same thing with multiple processes instead of threads. Forked processes are somewhat more heavyweight than threads, but they're nowhere near as expensive as running the same command multiple times.
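A minimal sketch of that pre-fork setup, assuming a Flask app served by gunicorn; the checkpoint prefix, input shape, and route are placeholders, and whether the MXNet engine tolerates being initialized before fork() can depend on the version, so treat this as illustrative rather than a recipe:

```python
# app.py
import mxnet as mx
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Loaded at import time. With `gunicorn --preload`, this runs once in the
# master process before fork(), so workers share the weight pages
# copy-on-write instead of each holding a private ~1GB copy.
sym, arg_params, aux_params = mx.model.load_checkpoint('mymodel', 0)
mod = mx.mod.Module(symbol=sym, label_names=None, context=mx.cpu())
mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
mod.set_params(arg_params, aux_params)

@app.route('/predict', methods=['POST'])
def predict():
    x = np.array(request.get_json()['data'], dtype=np.float32)
    mod.forward(mx.io.DataBatch(data=[mx.nd.array(x)]), is_train=False)
    return jsonify(mod.get_outputs()[0].asnumpy().tolist())
```

Started with something like `gunicorn --workers 4 --preload app:app`, each default (sync) worker handles one request at a time, so no two requests touch the same executor concurrently.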
Good points made by piiswrong@ and leopd@. Based on my experience of using fork() over many years, I would say that calling fork() from a multi-threaded process is typically NOT recommended. If you do, you will need to understand the details of how it works to use it correctly: for example, only the thread that called fork() exists in the child, and any locks held by other threads at fork time remain locked there with no owner to release them.
phoenixbai> As in the caffe case, we have to copy the whole net as many times as the thread count. Is mxnet any better? My trained model size is normally around 1GB, so memory consumption is a big issue.

I would suggest saving the model in a file and accessing it as a memory-mapped file from multiple threads/processes. This will significantly reduce the memory requirements of your solution.
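A small sketch of what memory-mapping the weights could look like in Python; the file name and layout are invented, and wiring the mapped buffer back into MXNet NDArrays (which normally copy on load) is extra plumbing that this snippet does not attempt:

```python
import numpy as np

WEIGHT_FILE = 'model_weights.f32'  # hypothetical flat float32 dump of the parameters

# mode='r' maps the file read-only; every process (or thread) that maps it this
# way shares the same physical pages via the OS page cache, so N workers do not
# need N private copies of a ~1GB parameter blob.
weights = np.memmap(WEIGHT_FILE, dtype=np.float32, mode='r')

# slices are views into the mapping, not copies
fc1_weight = weights[:1024 * 512].reshape(1024, 512)
```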
@piiswrong Can you expand on why you are suggesting that the executors should be accessed from the same thread? This means that for the lifetime of the process, only one thread will be responsible for interacting with MXNet.
@piiswrong @eric-haibin-lin Maybe not related. Can mxnet find two independent ops in a computation graph and execute them in parallel on two cores of one CPU? Or can mxnet know how many cores there are and give the first half to the first operator and the second half to the second operator?
If a graph has two parallel paths, MXNet can detect that and execute them in parallel if it has enough WORKER_THREADS: https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md#set-the-number-of-threads
For CPU we rely on OpenMP for parallelization. We may give a hint to OpenMP, but there's no guarantee on how many threads actually execute a single operator. @cjolivier01 works on the CPU performance tuner and may have more comments on this.
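For reference, these knobs are usually set before mxnet is imported, since the engine reads them when it starts up; the values below are arbitrary examples, not recommendations:

```python
import os

# worker threads in the MXNet engine (independent operators run in parallel
# across these) and OpenMP threads used inside a single operator
os.environ['MXNET_CPU_WORKER_NTHREADS'] = '2'
os.environ['OMP_NUM_THREADS'] = '8'

import mxnet as mx  # must come after the environment variables are set
```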
Currently, it will run the ops in parallel, using (possibly) several OMP threads on each one independently. OMP threads for a given operator will tend to be on separate physical cores; however, there is currently no coordination of OMP thread/core allocation across concurrently executing operators, so they may overlap for some period of their execution. It's actually kind of a tricky thing, because you want to allocate threads across operators in an ideal way (like you mentioned), but the operators usually aren't going to run perfectly in parallel, so there will be windows of time where the CPU wouldn't be fully utilized (when they aren't overlapping). This would be especially apparent when the operators aren't the same. How best to handle this is currently under discussion and input is welcome.
@eric-haibin-lin @cjolivier01 Thanks for the information. I am just curious how mxnet deals with model parallelism, op parallelism, and the parallelism inside an op. If I have 40 cores and 2 independent ops, I can create 40 threads, give the first 20 threads to the first op and the other 20 threads to the second op, and execute the two ops concurrently on the CPU. But maybe that's not as efficient as executing the two ops sequentially, where each op can leverage all 40 threads.
@gold-mango Have you found any solutions?
Confirmed with @piiswrong offline that the dependency engine in C++ is actually thread-safe.
@yanhn Are you using Python for inference? The MXNet engine has a limited number of worker threads: https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md#set-the-number-of-threads
Does this mean that it is safe to have multiple threads calling the engine concurrently? Specifically, is it OK to create one executor per thread and run them simultaneously? (I am asking this in the context of multithreaded inference on CPU with the C/C++ API.) This seems to contradict the statement that the push APIs are not thread-safe, as stated in https://mxnet.incubator.apache.org/architecture/overview.html, and also the discussion at https://discuss.mxnet.io/t/fixing-thread-safety-issues-in-scala-library/236. Has something changed since then? Some clarification on this would be highly appreciated!
@eric-haibin-lin
@junrushao1994 Could you check @hqucms's comment on the thread safety of the engine's Push API? Is this true?
@hqucms @eric-haibin-lin I am not sure why our document says "Push APIs are not thread-safe" (https://mxnet.incubator.apache.org/architecture/overview.html). @tqchen Could you help confirm this?
Any progress here, regarding "Push APIs are not thread-safe"?
Can I just create multiple infer handles in different threads? I've tried it this way and it doesn't work.
@loadwiki Can you provide an example to reproduce? Thx
Here is a very reasonable proposal.
The parallel utility in gluonnlp may be useful for some use cases: https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/utils/parallel.py#L66-L77
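For context, a rough sketch of how that utility is used, loosely based on the docstring at the linked lines; the class names come from gluonnlp but the exact signatures may differ between versions, and the toy network, loss, and contexts are placeholders. It is aimed at multi-device training loops rather than serving:

```python
import mxnet as mx
from mxnet import gluon, autograd
from gluonnlp.utils.parallel import Parallel, Parallelizable

class ParallelNet(Parallelizable):
    """Wraps a network and loss so Parallel can run forward/backward per device."""
    def __init__(self, net, loss):
        self._net = net
        self._loss = loss

    def forward_backward(self, x):
        data, label = x
        with autograd.record():
            out = self._net(data)
            loss = self._loss(out, label)
        loss.backward()
        return loss

ctxs = [mx.cpu(0), mx.cpu(1)]            # illustrative contexts
net = gluon.nn.Dense(2, in_units=4)
net.initialize(ctx=ctxs)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
parallel = Parallel(len(ctxs), ParallelNet(net, loss_fn))

data = mx.nd.random.uniform(shape=(8, 4))
label = mx.nd.zeros((8,))
for shard, lshard in zip(gluon.utils.split_and_load(data, ctxs),
                         gluon.utils.split_and_load(label, ctxs)):
    parallel.put((shard, lshard))
losses = [parallel.get() for _ in ctxs]
```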
When deploying online, multi-threading is usually required. If each thread loads its own model, the memory cost is high, so is there a thread-safe interface that shares model parameters?