
[RFC] Integrate TVM into Apache MXNet #15465

Open
yzhliu opened this issue Jul 5, 2019 · 35 comments
Labels
RFC Post requesting for comments

Comments

@yzhliu
Member

yzhliu commented Jul 5, 2019

Problem Statement

Currently in MXNet we implement operator kernels in C++. Developers need to specify the detailed logic of each computation, which slows down the development process. Given that we are moving toward NumPy compatibility[1], a large number of operators need to be implemented. Moreover, we have various backends to support, including CPUs and GPUs from AMD, ARM, Intel, Nvidia, etc. It requires great effort to implement efficient kernels for each of these hardware targets, as well as to write test cases for each operator+backend combination.

Proposal

I would therefore like to propose integrating Apache TVM into Apache MXNet, so that we can leverage its ability to easily implement high-performance operator kernels in Python. Here are some of the advantages:

  1. We devised a new approach to implementing MXNet kernels in Python (see PR [WIP] tvm op support #15345). For example, to implement broadcast add, people can write pure Python:
@defop(name="vadd", target="cpu", auto_broadcast=True,
       dtype=AllTypes, ndim=list(range(6)))
def vadd(dtype, ndim):
    A = tvm.placeholder(shape=[tvm.var() for _ in range(ndim)], name='A', dtype=dtype)
    B = tvm.placeholder(shape=[tvm.var() for _ in range(ndim)], name='B', dtype=dtype)
    C = tvm.compute(shape=[tvm.var() for _ in range(ndim)],
                    lambda *index: A[index] + B[index], name='C')
    s = tvm.create_schedule(C.op)
    axes = [axis for axis in C.op.axis]
    fused = s[C].fuse(*axes)
    s[C].parallel(fused)
    return s, [A, B, C]

The code above will be compiled to binary and linked into MXNet as a regular function. Note that the same compute definition can be shared across multiple backends (CPU, GPU, etc.). Because it is much more concise than the equivalent C++, it lowers the bar for implementing high-performance kernels and improves the development experience; we expect people to develop more efficiently with this approach.
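For illustration, here is a minimal sketch of how the same compute definition could be scheduled for a GPU, assuming the defop infrastructure from PR #15345 also accepts a gpu target (this variant is hypothetical; the point is that only the schedule changes while the compute stays identical):

# Hypothetical GPU variant: same compute definition, schedule binds the fused
# loop to CUDA blocks/threads instead of parallelizing over CPU threads.
@defop(name="vadd", target="gpu", auto_broadcast=True,
       dtype=AllTypes, ndim=list(range(6)))
def vadd_gpu(dtype, ndim):
    A = tvm.placeholder(shape=[tvm.var() for _ in range(ndim)], name='A', dtype=dtype)
    B = tvm.placeholder(shape=[tvm.var() for _ in range(ndim)], name='B', dtype=dtype)
    C = tvm.compute([tvm.var() for _ in range(ndim)],
                    lambda *index: A[index] + B[index], name='C')
    s = tvm.create_schedule(C.op)
    fused = s[C].fuse(*C.op.axis)
    bx, tx = s[C].split(fused, factor=64)
    s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
    s[C].bind(tx, tvm.thread_axis("threadIdx.x"))
    return s, [A, B, C]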

  2. Operators in current TVM are already NumPy-compatible, so we can leverage the efforts there to help our NumPy project.

Approach

  • Build and link TVM dynamic library into MXNet.
  • Build infrastructure to write operator kernels in Python, including:
    • The approach for registering TVM operators into MXNet
    • Modifying CI to enable TVM operator builds, and packing the operator library together with the release binary. We can enable automatic performance tuning[2] later to further improve performance.

Apache TVM is already integrated as a 3rdparty repository, and some of the NNVM header files are included in the MXNet source code.

FAQ

Q: Does it increase the binary size of the MXNet release?
A: libtvm_runtime.so is roughly 750 KB, which is fairly small compared to libmxnet.so (60 MB for CPU, ~500 MB for GPU).

Q: Are TVM operators going to replace current operators in MXNet?
A: No. It is an alternative way to write kernels. For new operators that are easy to write with this approach, we can benefit from the advantages mentioned above.

Q: Are there any license problems?
A: TVM is provided under the Apache 2.0 License, and it is currently incubating at the Apache Software Foundation: https://tvm.ai/2019/03/18/tvm-apache-announcement

Background of TVM

Apache TVM is an open deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between the productivity-focused deep learning frameworks, and the performance- or efficiency-oriented hardware backends.

  • TVM has demonstrated its ability to deliver decent performance not only on end-to-end neural networks but also on single kernels. Recent papers and benchmarks show it can outperform even vendors' acceleration libraries: https://tvm.ai/2018/10/03/auto-opt-all
  • TVM supports a large number of backends, including Intel, AMD, and ARM CPUs and GPUs, as well as Nvidia GPUs and FPGAs. Reusing TVM's optimized kernels could significantly benefit MXNet's backend support.
  • TVM provides a flexible and convenient mechanism to route data structures and function calls between C++ and other frontend languages, which is helpful for general purposes (a minimal sketch follows below). See https://docs.tvm.ai/dev/runtime.html#tvm-node-and-compiler-stack for reference.
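For example, a minimal sketch of that mechanism using TVM's global-function registry (the registered name here is made up for illustration):

import tvm

# Register a Python function into TVM's global function table. The same
# registry is visible from C++ and every other frontend via PackedFunc.
@tvm.register_func("mxnet.demo.add_one")
def add_one(x):
    return x + 1

# Look the function up through the runtime and call it as a PackedFunc.
f = tvm.get_global_func("mxnet.demo.add_one")
assert f(10) == 11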

Reference

[1] [RFC] Introducing NumPy-compatible coding experience into MXNet - #14253
[2] [TVM] Automatic Kernel Optimization for Deep Learning on All Hardware Platforms - https://tvm.ai/2018/10/03/auto-opt-all

@yzhliu yzhliu added the RFC Post requesting for comments label Jul 5, 2019

@ZhennanQin
Contributor

Good to see we have another option to implement operators with high performance! Here's a quick question:
How do we deal with TVM's dependence on LLVM? LLVM is an important code generator for many CPU backends. If we integrate TVM into MXNet, shall we also integrate LLVM into MXNet? If the answer is yes, then we have a binary size issue. If not, how do we make sure TVM works with whatever LLVM is installed on the host?

@yzhliu
Member Author

yzhliu commented Jul 5, 2019

@ZhennanQin This is a good point. LLVM is required at compile time, while the binary release only needs the TVM runtime (not the TVM compiler plus LLVM). We don't do JIT.

@ZhennanQin
Contributor

Oh, that's a bit different from what I thought. If we don't do JIT, then TVM can't be used to implement custom ops without rebuilding MXNet.
Also, since TVM-generated kernels are compiled along with MXNet, AVX-512 instructions won't be generated in the release version of MXNet, so the integrated TVM can't provide the best performance compared with standalone TVM.

JIT seems to be the solution to the above problems, so why don't we support it?

@yzhliu
Member Author

yzhliu commented Jul 5, 2019

I actually like the idea. But since JIT requires major surgery on MXNet's engine, I suggest we do AOT first and design JIT separately later.

@apeforest
Contributor

@yzhliu Nice proposal. I look forward to this integration of TVM into the MXNet backend. It would be great if you could also add a section in the project plan on performance comparison with the existing backend, so we have a more quantitative measure.

@junrushao
Member

@ZhennanQin It would be possible to make this an easy plugin for MXNet. Imagine the following scenario: MXNet itself is a small library with not many operators, and it can be extended by downloading extra TVM ops, specialized for different CPUs, without re-compilation.

By detecting the feature set of the CPU in use (whether it supports AVX-512, etc.), we could point users to the most efficient operator library.
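A hypothetical sketch of such a feature check (Linux-only; the library names are placeholders, not part of this proposal):

import platform

# Hypothetical sketch: detect AVX-512 support and pick an operator library.
def has_avx512():
    if platform.system() != "Linux":
        return False
    with open("/proc/cpuinfo") as f:
        return "avx512f" in f.read()

oplib = "libtvmop_avx512.so" if has_avx512() else "libtvmop_generic.so"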

@junrushao
Member

@ZhennanQin btw, JIT requires a runtime external dependency, plus extra effort for compiling/caching/... We may consider it later after AOT is done :-)

@cjolivier01
Member

dumb question: Are operators written this way available in any way to non-python languages such as lua, scala, etc?

@junrushao
Member

@cjolivier01 yep! The TVM compiler generates LLVM IR, which is compiled to a binary and loaded as a C++ module, so it can be used from all frontend languages.

@ZhennanQin
Contributor

@junrushao1994 The plugin idea sounds fantastic to me. Of course, if we can achieve that, the integrated TVM will be able to provide very good performance.

Here's another question, about threading management. As you may know, MXNet uses OpenMP for threading. The threaded engine can also create many workers to execute operators in parallel, so the total thread count in the threaded engine is (total_workers * OMP_NUM_THREADS). AFAIK, the TVM runtime has its own thread pool. Will TVM switch to OpenMP for threading after integration? Can we run two TVM-generated operators in different workers at the same time?

@cjolivier01
Member

That python code looks pretty intense compared to JAX/XLA, which looks like normal python/numpy code. What’s the difference between the end results/use cases of these packages?

https://github.com/google/jax/blob/master/README.md

@junrushao
Member

@cjolivier01 basically you can think of TVM as serving a similar function to XLA. JAX doesn't seem to be relevant in this case.

@junrushao
Member

@yzhliu I am not super familiar with the threading management in TVM. Could you answer @ZhennanQin's question on your side?

@samskalicky
Contributor

@yzhliu If TVM operators have to be compiled and then recompiled together with MXNet, have you thought about treating TVM as an "accelerator" that could be used with the Accelerator API proposal: https://cwiki.apache.org/confluence/display/MXNET/Bring+your+own+Accelerator ? Maybe using this interface we could build the "auto-download the appropriate operator set" feature, and keep the operator libraries compiled separately from MXNet.

@yzhliu
Member Author

yzhliu commented Jul 9, 2019

@ZhennanQin TVM will still use its own thread pool, and that pool is shared across the workers. I don't see this being a problem, since the maximum parallelism we can achieve is the number of physical cores anyway.

@apeforest unfortunately I don't have performance numbers for the newly written operators for now.

@samskalicky Yes, this could be an option, though it's a bit unclear to me how infer-shape/dtype/etc. would be registered with MXNet under that approach. Maybe we can keep these use cases in mind when designing the interface, and port TVM as an "accelerator" later once it has been implemented.

@samskalicky
Contributor

Thanks @yzhliu for the answers. How will users control the number of threads used by TVM in MXNet? They already have a few knobs to turn, and some use cases like MMS set minimum thread pool sizes to prioritize throughput over latency. Are we overcomplicating things by adding this additional thread pool?

The question about multiple OMP libs still stands though. Will TVM compile with the same OMP as MXNet?

@ZhennanQin
Contributor

@yzhliu Actually this will be a performance problem. OMP threads spin after a parallel block completes, which helps avoid switching threads off before the next parallel block executes. If a TVM operator follows an OMP operator, there will be 2x threads active: half are TVM threads and the other half are the spinning OMP threads. This causes contention for physical cores and results in low performance for the TVM operator.

@junrushao
Member

@ZhennanQin Does it make sense to run two CPU operators in parallel when each of them uses all the threads?

@ZhennanQin
Contributor

@junrushao1994 No, it doesn't make sense. But currently we can specify OMP_NUM_THREADS = physical_core_num / 2 for two parallel operators so that each of them uses half of the physical cores.

@junrushao
Member

@ZhennanQin If we could somehow instruct TVM's threading backend how many threads to use, would it be something similar? Or what if we could instruct TVM to use OMP?

@ZhennanQin
Contributor

@junrushao1994 Let me clarify my concerns about the TVM thread pool.

1>
By default, the threaded engine creates only a single worker for normal CPU operators. In this scenario, the user specifies OMP_NUM_THREADS = physical_core_num to maximize the CPU's computing capacity. An OMP operator followed by a TVM operator will hit the OMP thread-spinning issue, causing the TVM operator to execute slowly. This is a major issue to address because it's a typical scenario in MXNet.

2>
When the user specifies MXNET_CPU_WORKER_NTHREADS=2, two CPU workers are activated, and it's possible to run two independent CPU operators in parallel. In this case, the user can set OMP_NUM_THREADS = physical_core_num / 2 to let each worker use half of the physical cores (see the sketch after this comment). I want to know whether TVM can do the same thing.

3>
Even if we resolve the above problems, there's a thread-switching penalty from maintaining two different thread pools. The penalty can't be ignored for latency-sensitive tasks (e.g. BS=1 inference).

From my perspective, mixing two kinds of thread pool isn't a good idea. It's hard to manage thread binding for multiple workers, and there will be thread-switching overhead between different thread pools. It's better to use the same threading management as MXNet.
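A minimal sketch of the configuration described in (2), just to make the arithmetic concrete (assuming OMP_NUM_THREADS is also honored by whichever thread pool ends up being used; the core count here is an example value):

import os

# Scenario (2): two engine workers, each restricted to half of the cores.
# The environment variables must be set before mxnet is imported.
physical_cores = 16          # example value; query the machine in practice
workers = 2

os.environ["MXNET_CPU_WORKER_NTHREADS"] = str(workers)
os.environ["OMP_NUM_THREADS"] = str(physical_cores // workers)

import mxnet as mx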

@tqchen
Member

tqchen commented Jul 9, 2019

Technically it is easy to drop in OpenMP as a thread pool backend for TVM; it is just not the default. Historically TVM defaulted to OpenMP and then evolved its own thread pool for better performance. The current thread pool already accepts the OMP environment variable (OMP_NUM_THREADS), making it consistent with the OMP settings. Most of the operators touched so far are single-threaded, so they won't be affected by this discussion.

So, to summarize the answers to @ZhennanQin's points: (1) and (2) likely won't be an issue because TVM's thread-count configuration and core settings are consistent with OMP (OMP_NUM_THREADS).

(3) might be an issue when consecutive operators are both compute-intensive and use different backends. Given that the initial operators to be supported are likely single-threaded, memory-intensive ones, it won't be an issue. In the case where compute-intensive operators are provided by TVM, we need benchmarks to decide which option is better, mainly because, according to @yidawang, using TVM's current pool gives around 15% gains over OMP. This gain might outweigh the latency hit we get in the beginning. Note that existing libraries MXNet uses, such as NNPACK, have their own thread pools. Nevertheless, if necessary, the TVM runtime could switch to OpenMP for threading.

Given that this is a technical issue that may or may not affect how we get the best speed from the newly added features, I would recommend keeping it in mind and doing benchmarks to decide the best option when we find it necessary.

@yidawang

yidawang commented Jul 9, 2019

We would never want to use two thread pools simultaneously. We should use the one that gives the best performance to date. Technically it is not difficult for TVM to switch among different thread pool implementations, so we just need to benchmark them and decide which one to use. As @tqchen said, our benchmarking results show that the current TVM thread pool is better than the ones using OpenMP APIs. Please check out this paper for details.

@ZhennanQin
Contributor

@tqchen I agree with you that this is a technical issue and shouldn't block this RFC. Since thread management should be considered as part of the high-level design, I brought this topic up here.

(1) and (2) likely won't be an issue because: TVM's thread number configurations and core settings are consistent with OMP(OMP_NUM_THREADS).

OMP_NUM_THREADS can't help with either of them. For (1), consider the execution sequence: OMP operator -> TVM operator. When the OMP operator finishes, the OMP threads don't sleep immediately, but spin for a while. Then, when the TVM operator starts, there will be 2x threads active, so the TVM threads can't fully occupy all physical cores. You can check OMP_WAIT_POLICY for details.
For (2), it's about how threads are managed. OMP manages threads per engine worker. If TVM uses a single global thread pool, then it will have problems with multi-worker usage.

Another thing worth mentioning: since OMP is MXNet's thread management protocol, users may use many OMP environment variables to change their threading policy. For example, KMP_AFFINITY and OMP_WAIT_POLICY are used frequently. I doubt the TVM thread pool can support them.

It's good to know TVM can support OMP as a thread pool backend. Then let's let the performance data decide whether the integrated TVM should switch to OMP. Also, thanks @yidawang for the explanation.

@yidawang

@ZhennanQin I guess we should first make sure that we don't use more than one thread pool implementation in the runtime. Also, KMP_AFFINITY only works for Intel CPUs, and we have many other platforms to support.

@ZhennanQin
Contributor

@yidawang If we all agree that we should avoid using more than one thread pool implementation in the runtime, then I guess OMP is the only choice, is that true?

@yidawang

@yidawang If we all agree that we should avoid using more than one thread pool implementation in the runtime, then I guess OMP is the only choice, is that true?

Sorry for the late reply, I've been traveling these days. In this case, I think we should also benchmark and compare the performance of the different scenarios to decide. Theoretically, if a model runs both TVM ops and original MXNet ops on CPUs, I agree that using OpenMP may be a short-term solution.

@larroy
Contributor

larroy commented Jul 11, 2019

Can it fuse different operators in the graph? I think this would be badly needed for us.

@yzhliu
Member Author

yzhliu commented Jul 12, 2019

@larroy In this proposal it cannot. That requires deeper integration with TVM/Relay to run graph-level optimization, which @junrushao1994 is planning.

@larroy
Contributor

larroy commented Jul 12, 2019

Could we add more information on how it will impact the build systems and dependencies? While integrating TVM is exciting, I think it's fair to gauge the increase in complexity and the build and release implications. Would it be possible to load operators compiled with TVM dynamically? This would open a path to modularizing MXNet and reducing the library size / targeting an inference-only version in a more streamlined way than with amalgamation hacks.

@yzhliu
Member Author

yzhliu commented Jul 15, 2019

Sure,

  • Build system
    • Download a pre-built LLVM
    • Compile 3rdparty/tvm
    • Use the compiled TVM to compile the TVM operator kernels and generate libtvmop.so
  • Runtime
    • libtvm_runtime.so is linked with libmxnet.so
    • MXNet loads libtvmop.so at runtime (a sort of dynamic loading; see the sketch at the end of this comment)

While for "fully" supported dynamic loading, the major problem is how to register op metadata, basically InferShape/InferDtype/etc. It'll be great if the accelerator API solves the problem

@yzhliu
Member Author

yzhliu commented Jul 15, 2019

Thanks everyone for the comments on the RFC. To summarize:

  • We agree to integrate TVM into MXNet
  • The first step is to use TVM to write operator kernels and compile them ahead of time.
  • There might be a performance issue if we mix TVM's thread pool and OpenMP in MXNet. We need to benchmark, and if it does turn out to be a problem, we can switch TVM to OpenMP as well.
  • Later: whether and how to integrate this approach with the Accelerator API so that TVM operators can be dynamically loaded.

@larroy
Contributor

larroy commented Jul 16, 2019

Would it be possible to keep it decoupled, say the infrastructure to load the TVM operators is included in MXNet but we don't need to couple with LLVM and TVM (even though we are already using it through NNVM)? I think that, irrespective of the technical merit and benefits of TVM, which are undeniable, less coupling makes software easier to maintain. @samskalicky knows more about the Accelerator APIs.

@samskalicky
Contributor

If we're doing this on a per-operator basis and treating TVM operators like CPU operators then maybe we should be considering something similar to the dynamic customOp proposal instead of the accelerator API.

@yzhliu how are you thinking about presenting TVM ops here? Will they run on CPU or GPU (or both)?
