[RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead #17097

reminisce · 2019-12-17T06:02:12Z

MXNet imperative operator invocation overhead is as large as 30-60us, which is significant compared to the official NumPy operators with ~600ns overhead. This has negatively impacted the performance of applying MXNet to the models where many operators' kernel runtime duration is short, especially in the area of classic machine learning. We plan to address the problem in two steps:

Short term: Use pybind11 to replace Python op API and ctypes/c api. Preliminary experiments show that the pure Python-C++ turnaround time by using Pybind is between 400-600ns, while the current Python op API using ctypes/c api costs more than 10us. We believe with the correct implementation, we can reduce the op invocation overhead to 2us including the time on FFI and engine.
Long term: Adopt Python's C extension interface. NumPy did this by developing its own C API. This provides considerably less overhead compared to other solutions. However, it would cost much more engineering efforts by integrating this with our existing operator workflow in C++.

@hzfan @hgt312

wkcn · 2019-12-18T04:10:42Z

Hi @reminisce , I have several questions.

Does the time of asynchronous execution cover that of invocation overhead?
MXNet Imperative mode creates a computational graph, but NumPy computes operations directly. How long will they take?

To my knowledge, if using pybind11 or Python’s C extension, it need to re-build interfaces to support other version of Python.

chinakook · 2019-12-18T12:17:21Z

I have a good idea, I think we can build our own custom ctypes system or some other middleware to solve it.
We need a unified solution to solve all language bindings.
https://docs.microsoft.com/en-us/visualstudio/python/working-with-c-cpp-python-in-visual-studio?view=vs-2019

Approach	Vintage	Representative user(s)	Pro(s)	Con(s)
C/C++ extension modules for CPython	1991	Standard Library	Extensive documentation and tutorials. Total control.	Compilation, portability, reference management. High C knowledge.
PyBind11 (Recommended for C++)	2015		Lightweight, header-only library for creating Python bindings of existing C++ code. Few dependencies. PyPy compatibility.	Newer, less mature. Heavy use of C++11 features. Short list of supported compilers (Visual Studio is included).
Cython (Recommended for C)	2007	gevent, kivy	Python-like. Highly mature. High performance.	Compilation, new syntax, new toolchain.
Boost.Python	2002		Works with just about every C++ compiler.	Large and complex suite of libraries; contains many workarounds for old compilers.
ctypes	2003	oscrypto	No compilation, wide availability.	Accessing and mutating C structures cumbersome and error prone.
SWIG	1996	crfsuite	Generate bindings for many languages at once.	Excessive overhead if Python is the only target.
cffi	2013	cryptography, pypy	Ease of integration, PyPy compatibility.	Newer, less mature.
cppyy	2017		Similar to cffi using C++.	Newer, may have some issues with VS 2017.

leezu · 2019-12-19T05:06:12Z

@wkcn that can be done as part of setup.py locally on the users machine, or would you see any problem?

eric-haibin-lin · 2019-12-19T05:54:55Z

How much engineering efforts does it take to do (1)?

sxjscience · 2019-12-19T17:44:45Z

I think we can first replace the code-gen here with pybind11. https://github.com/apache/incubator-mxnet/blob/521c477ad32864d887481abf6c53acae3b717cf6/python/mxnet/ndarray/register.py#L115-L269

tqchen · 2019-12-19T18:21:01Z

Here is another candidate that I highly recommend: adopt TVM's FFI convention.

The historical problem of MXNet FFI was the blowing amount of the C API bindings as we add new features. This creates a huge amount of maintenance burden.

The real problem was not really about which FFI system to adopt(cython and pybind are fine in that end, except for the cost of compilation), but more of the cost to maintain the FFI. MXNet used to have a fast cython binding, but that was abandoned because we keep add new APIs we cannot keep up both ctypes and cython.

When developing TVM we learnt from the lesson and restrict the API to a limited set of runtime APIs that does not change, and have a stable cython, ctypes binding for them. The runtime support a type-erased function(PackedFunc), which can be efficiently called from any of the frontend language, and all the APIs are exposed through the PackedFunc. On the python side an additional wrapping is created for better documentation and call into the PackedFunc. See more in https://docs.tvm.ai/dev/runtime.html The system works great for over a few years now.

Of course I understand there has been legacy issues in MXNet that is why I did not bring this proposal up. But given this is a proposal for 2.0, I would encourage everyone to give a serious thought about this possibility.

eric-haibin-lin · 2019-12-19T19:18:37Z

@tqchen thanks for sharing this! Is there any reference benchmark result that stress tests the ffi overhead?

tqchen · 2019-12-19T19:21:10Z

I don't have any benchmarks at hand, but would be great if someone can help creating one with cython enabled.

tqchen · 2019-12-19T20:21:28Z

OK here is a script to quickly check. Note that it is important to compile TVM with Cython by typing make cython3 in the root folder. TVM by default uses cython if it is available, but we use the TVM_FFI env variable to make sure it is the case.

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)

$ TVM_FFI=cython python3 test.py
9.749500000000299e-08
$ TVM_FFI=ctypes python3 test.py
7.226789000000011e-06

junrushao · 2019-12-20T02:34:56Z

@tqchen can you also copy the results you posted in #14883 to here? This shows convincing benchmark results that tvm ffi is comparable with numpy.

tqchen · 2019-12-20T03:40:10Z

I think interested folks can follow the last post here #14883 (comment) the general takeaway seems to be that the overhead is quite close to numpy's one

reminisce · 2019-12-20T05:37:23Z

@tqchen Thanks for sharing the benchmark results. We did consider using TVM FFI as a candidate and I strongly agree with your suggestion on making a limited set of runtime API for sustainable maintainability. The Python op API overhead in general is more obvious when passing Python native data structures to the backend, such as lists, tuples, etc. I modified your script by passing a Python tuple as an argument and the overhead is around 19us with Cython enabled, while pybind is normally less than 400ns, and numpy comes with even much lower overhead. Did I make any mistakes in coding up the test script or it's something that can be addressed in TVM FFI? Thanks.

import timeit

setup = """
import tvm
#x = tvm.nd.array([0])
#y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((1, 2, 3, 4))')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)

tqchen · 2019-12-20T07:20:21Z

First of all, it would really be great if the issue of native data structure could been bought up at the first place in the RFC :)

Every design has a tradeoff, and most of the TVM FFI's design started as an artifact of of Amdahl's law. In particular, it is interesting to ask how much native python data structure we need. e.g. most of the fast path(e.g. add/sub) and common operators the problem does not involve tuple as an argument. As a layman example, I don't care if my save to json function takes 100us to finish(because that is not going to be bottleneck of my program).

When tuple and strings are involved, the typical operators (such as conv2d) are usually large. Moreover, when such operators are constructed in a two phase manner, or passed through an partially specialized annotator that translates the program, the cost of the constant tuple are only constructed once rather than in a loop (just like the following program).

class Module:
      def __init__(self):
         ## constructed once
          self.conv2d = conv2d(kernel=(2,2))
      
      def forward(self, x):
         # run multiple times
          return self.conv2d(x)

# hybridize can transform the code below to the above form.
@hybridize
def myfunc(x):
     y = conv2d(x, kernel=(2,2))

Of course, if we really say that "hey, we want the int tuple case to be fast", just like the case of copy in #14883 . We can introduce native tuple objects into the TVM FFI to improve the performance of the passing tuple as arguments.

The general design philosophy is we only need make the things that needs to be fast fast. The FFI can handle objects that it recognizes efficiently, while have a reasonable slow fallback for cases that are not necessarily the bottleneck.

As I said in the beginning of the post, the spectrum of "fast argument" should really be discussed in the RFC at the first place. It would be great if we can have a more constructive discussion about these use cases, then add technical reasoning, before reaching a verdict.

sxjscience · 2019-12-20T07:55:13Z

I think we need to support tuple if we are going to adopt the PackedFunction approach. It is used in some functions like split, hsplit, reshape, transpose, a[(1, 3)].

reminisce · 2019-12-20T18:23:06Z

@tqchen Thanks for explaining things inside out. Please know that I'm not against TVM FFI design. In fact, it's great to know that you think passing Python native data structures can be accelerated by engineering through TVM FFI. This is vital for keeping the future MXNet runtime API in a limited set for scalable and sustainable maintainability.

Putting the design decision aside, I want to share that there has been extremely strong motivation and need of squeezing out the latency of passing Python native data structures in op interface. Since MXNet is embracing NumPy compatibility, we want to get the op invocation performance on par with NumPy to be appealing to classic machine learning community. We have compared the performance between MXNet and NumPy using a bunch of classic machine learning models and found that optimizing passing Python native data structures is critical for MXNet to be on par with NumPy. Even for deep learning itself, this is also important in some applications. In #16716, reset_arrays was corroborated to get rid of passing tuple objects too many times in mx.nd.zeros so that the training performance can be increased by 5%.

With that being said, please allow me to summarize what we have reached here. I think we are aligned on exploring TVM FFI to have a clear engineering view of accelerating passing Python native data structures as arguments. We can start from tuples, and extend the findings to lists and strings later.

Thank everyone for a great discussion and sorry for the late responses since I have been on vacation this week. I didn't expect this task item of PoC to become a full-fledged RFC that has involved this many interested folks. I will be sure to make the post more descriptive and self-explanatory next time. Have a nice holiday! :)

tqchen · 2019-12-20T18:25:57Z

To followup on this thread about brining native support to tvm ffi, I will first discuss the ways to address tuple, and then discuss some of the pros and cons.

First of all, we all know that at a time point we are going to translate the python data structure we know into C++. The main question is where that translation can happen. In the pybind case, the translation happens in the c++ side by directly passing pyobject to the c++. In the case of cython, the translation happens in the C API level. In the case of TVM FFI, the translation can happen at a python wrapping.

The following code gives two examples of such translation(myempty0, and myempty1), the support for myempty1 requires one additional commit in this branch. I also added a str argument as as str support was mentioned.

In the first approach(myempty0), we directly unpack the tuple as positional arguments, and encode the data structure as a flattened argument. In the second approach(myempty1), it first creates a IntTuple object that the C++ side can recognize and then pass it to the C++ side. Note that at the moment all the operations are done through python, if there is a concern in terms of wrapping, we can certainly bring some of them into cython.

We could introduce a third approach(myempty2): which directly passes a PyObject* to the c++ side, then the developer can additional packages(e.g. pybind or related) to process the tuple. This third approach would achieve the same perf as pybind. However, there are some trade-offs, see discussion section.

import timeit
import tvm

nop = tvm._api_internal._nop

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((1,2,1))')
timer.timeit(1)
num_repeat = 1000
print("tvm.nowrap:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
"""

timer = timeit.Timer(setup=setup,
                     stmt='np.empty((1,2,1))')
timer.timeit(1)
print("numpy.emmpty:", timer.timeit(num_repeat) / num_repeat)


def myempty0(shape):
    return nop(*shape)

def myempty1(shape):
    return nop(tvm.container.IntTuple(*shape))


setup = """
import numpy as np
import tvm
from __main__ import myempty0, myempty1
"""

timer = timeit.Timer(setup=setup,
                     stmt='myempty0((1,2,1))')
timer.timeit(1)
print("tvm.myempty0:", timer.timeit(num_repeat) / num_repeat)

timer = timeit.Timer(setup=setup,
                     stmt='myempty1((1,2,1))')
timer.timeit(1)
print("tvm.myempty1:", timer.timeit(num_repeat) / num_repeat)


setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)

Here are results on my computer:

$ TVM_FFI=cython python test.py
tvm.nowrap: 1.3379121999999994e-05
numpy.emmpty: 2.849200000000218e-07
tvm.myempty0: 1.7188400000001102e-07
tvm.myempty1: 9.54876000000021e-07
tvm.str_arg: 2.1269199999998656e-07
$ TVM_FFI=ctypes python test.py
tvm.nowrap: 5.141489199999999e-05
numpy.emmpty: 3.3954099999999874e-07
tvm.myempty0: 7.372958999999985e-06
tvm.myempty1: 1.3045717999999983e-05
tvm.str_arg: 5.962782e-06

As we can see, the myempty0 was quite fast. myempty1 was a bit slower due to the creation of the object(but still within the order of magnitude that we can use). Note that in the case of async exec and gradient tapping, we will need to create object to book-keeping the arguments, and myempty1 may not be a bad approach as the data structure is already created in the FFI level.

Discussion

As explained in the beginning, the real question was where should the wrapping happen.

In the case of TVM, usually the wrapping happens at the native language(python) level, because we know there is a need of the python side wrapper for better code, type checking and docs. The translation forces the python arguments into arguments that can be recognized by the runtime.

In the case of pybind, the translation happens at the C++ level, by calling into the python C API(the myempty2 approach is similar to this one).

The advantage of exposing PyObject and related operations to the c++ level is certainly the deferred marshaling of data structures. On the other hand, such approach directly ties the FFI with python. It means other language frontends can no longer take benefit of the new FFI. On a similar direction, if we want to package some of the functions into a minimum runtime that is independent of python, we can no longer do that. This is why while in theory we could bring PyObject(or a related Proxy) to TVM runtime, we have not done so far.

Of course this is an interesting tradeoff, and everyone is welcomed to discuss their thoughts. Here is a summary of points that can be discussed:

How shall we deal with the FFI of other languages, shall we have a separate set of FFI for those?
Is it worth the burden of the additional (perhaps unnatural) wrapping in the case of myempty0 and myempty1
We certainly want to be python first, how much advantage does a specific wrapper and PyObject in C++ would bring.
What are the collections of fast path objects that we need in the FFI.

tqchen · 2019-12-20T18:34:01Z

Some additional thoughts along the line:

It would be great to discuss what are the collection of "fast path" data structures for numpy compatibility.
Given that this looks like a proposal, perhaps it should be tagged with an RFC so that it can be mirrored to dev@

ptrendx · 2019-12-20T18:49:03Z

@reminisce Do you have data on what is the time split between the FFI itself and engine scheduling inside MXNet backend?

sxjscience · 2019-12-20T18:51:27Z

@tqchen For the "fast path" structures that we need to support, I'm considering the following:

py_slice, Ellipsis for basic indexing.
Also, in some scenarios, numpy scalar will be involved:

import mxnet.numpy as mnp
import numpy as np
a = mnp.array(np.array(1))
b = a + np.array(1)

Also, there are two scenarios when list will be involved:

For concatenate, the arguments may be wrapped in a python list.
In the case of split and hsplit, the return value can be a list.

tqchen · 2019-12-20T19:02:05Z

@sxjscience here are some quick thoughts (of course passing pyobject kind of "solves" the problem, so I am discussing the wrapping that can be done through the tvm ffi).

concatenate seems can be achieved through the python side wrapping concat-> concat_internal(*args)
return list value can be addressed by introducing a Tuple object to tvm runtime, we recently introduced an ADT object that can serve the purpose.
numpy native data structure: one way to deal with it is to convert numpy native structure to something related to DLTensor (e.g. dlpack). It may not be as fast as directly treat it as a scalar though

The py_slice is the most tricky case, my guess is that it could be accelerated through a cython layer that translate the slice into flattened representation, of course it is not too ideal and pybind maybe better for this case if we only want to handle it through c++.

reminisce · 2019-12-21T04:40:58Z

@ptrendx I benchmarked mx.nd.zeros((1,), out=out) in async engine. The total time including FFI and engine scheduling is 40us, and FFI itself takes 21us.

ptrendx · 2019-12-21T15:38:43Z

Ok, so it is about 50:50 split. Is there also work underway to profile what is the reason of the time spent in the engine?

reminisce · 2019-12-22T05:21:08Z

@ptrendx Yes, there is an effort of profiling engine code flow using VTune. We hope the exercise can pinpoint the hotspots that contribute to the most part of latency. Further time split for pure C++ part between setup code (shape/type inference, memory allocation, dependency setup) and op scheduling is also around 50% vs. 50%.

For the "fast path" data structures, I'm summarizing the items as follows (including the ones suggested by @sxjscience):

tuple and list since they can be interchangeable in NumPy semantics to represent shapes and axes.
str because einsum has this parameter and the op can be intensively used in transformer models.
py_slice, Ellipsis, None for basic indexing. We can do one step further by moving the whole indexing dispatch logic to backend.
np scalars.
mx.context.Context. One call of mx.cpu() can be as large as 600ns using ctypes. One thought is do it in the pybind way by creating a Python binding for the backend Context class.
np.dtype. Similar to Context.

tqchen · 2019-12-22T05:32:25Z

Thanks for discussions so far, to clarify the techinal questions and discuss tradeoffs further.

The following fast-path can be addressed in the TVM FFI:

tuple, list via translation in python/cython side (see benchmark above)
str is already fast (seem benchmark above)
Context can be quite fast if the object is a tvm object, around same mangitude as passing NDArray
np.dtype can get around by by str conversion, or introduce a type structure, TVM FFI support DLDataType natively
None: natively supported by FFI

The following items needs to be discussed

py_slice, Ellipsis Can be supported by adding pyobject support via pybind, however that introduces python C API dependency into the FFI layer(making the function not accessible to other language frontends). Would be interesting to discuss alternatives

Of course, all of the above cases can be solved by pybind, or a mix of pybind and TVM FFI. It would certainly be interesting to discuss the possible engineering path.

Technical Choices and Tradeoffs

The main techinque trade-off points that influences the decision are as follows:

Do everything in c++ vs some python layer translation: pybind is definitely good for former, tvm ffi requires translation in python(which could be good for docs readability).
How to resolve FFI for other languages(e.g. scala, clojure), tvm FFI directly exposes to these languages, pybind works great for python, but we need to re-expose to other languages(via another FFI)
Compatibility of future compilers, e.g. allow callback into those numpy functions from compiled code. TVM FFI convention directly enables that, while pybind needs some additional works(to create a interpolation layer from tvm ffi into pybind).
Whether we are interested in a mixed approach(e.g. allow passing pyobject to backend for certain cases, while use tvm ffi for most cases), note again need to re-expose the same API to other languages in another way that does not uses pyobject, but likely only need to do it for limited cases.

tqchen · 2019-12-23T01:34:45Z

After some thoughts along the direction, I find a better and fun answer to the above question to support tuple/ellipsis/slice in tvm ffi effectively.

I hacked up a POC in https://github.com/tqchen/tvm/tree/poc-pyffi (lastest commit) that supports the following benchmark script(disclaimer: it is only a POC so not intended for use or fully optimized, but it demonstrates all the technical flows necessary to make a fully functioning FFI).

import timeit
import tvm
nop = tvm._api_internal._nop

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                                  stmt='nop((None,..., slice(0, 100, 2)))')
timer.timeit(1)
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
"""

timer = timeit.Timer(setup=setup,
                                  stmt='np.empty((1,2,1))')
timer.timeit(1)
print("numpy.emmpty:", timer.timeit(num_repeat) / num_repeat)

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                                  stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)

On my laptop(macbook 13inch), the results are as follows

$ TVM_FFI=cython python benchmark_ffi.py
tvm.tuple_slice_ellipsis_combo: 4.615739999999924e-07
numpy.emmpty: 2.7016599999998834e-07
tvm.str_arg: 2.3390799999997714e-07

What is Implemented in the POC

In the POC, we introduced specific objects for Ellipsis, Slice and Tuple(already supported in ADT). During a PackedFunc call, a python tuple/ellipsis/slice was converted into the object that is supported by the backend. We implemented a cython version(the previous recursive conversion was in python) to back it up.

The reason that we are able to create Object in the cython side is because all TVM object has been recently converted to be POD-C compatible, so the object can be created in the cython side without crossing DLL boundary and passed to the c++ backend.

We can see from the benchmark that the cost of such deep-copy was at a reasonable level. We also only used the default memory allocator, so there could be space for further improvements.

Technical Choices and Tradeoffs

Please also see tradeoff discussions in the last post. As we can see, the main difference here is where to do the conversion, and whether do we do lazy/deep copy:

In the case of pybind: conversion is happened in the c++ side, data structures are lazily created.
In the case of the POC: conversion is happened in cython, data structures are deeply translated into another in-memory format.

The laziness certainly avoids a copy in cases where we do not necessarily need to book-keep the created argument. On the other hand, supporting a common data structure in the c++ side means the binding can potentially be reused by other language frontends.

reminisce · 2019-12-23T19:58:40Z

Thank @tqchen for sharing the PoC code within such a short timeframe. :) The numbers look promising even with Python native objects deeply copied. Pybind performs deep copy by default unless the receiving object in C++ end is marked as opaque so that the Python object passed by reference. That is often used for propagating large object changes from C++ to Python. In our op invocation use cases, there has been no such urgency of introducing this level of complexity so far since the Python objects are small and parameter passing is a one-way trip. The 300ns overhead should give us a good start to squeeze the total overhead into the 2us range. If there is really a need of passing PyObjects in the future, we can always add that with a compile flag option. I think it's worth following this branch to integrate TVM FFI with MXNet op invocation flow to get more comprehensive benchmark results.

hzfan · 2019-12-24T11:33:48Z

Following this branch I made a simple POC on MXNet side (code here). It turns out that passing a python tuple and receiving it as TShape takes around 200ns.

tqchen · 2019-12-24T16:45:39Z

@hzfan thanks for implementing a poc:) However, these is a subtle but important difference which worth discussing in here :) I will use cython-ffi to refer to the above poc, and tvm-ffi to refer to tvm's poc

In cython-ffi, both data structure and functions are exposed, this means a in order to grow the set of functions, we need to expand the set of C API. In another words, we need to grow the set of FFI API as we add more functions
In the tvm-ffi, the set of C API is fixed, and only data structure constructions are exposed to the cython side, given that a set of supported data structures are also fixed(tuple, int, slice, ellipsis). In this way, we do not have to grow the set of FFI API as we add functions.
Another subtle point is that we are passing data structure across dll boundaries. In the case of cython-ffi, it could be a c++ container(Tuple). TVM's object structure is designed to be C ABI compatible. This property allows us to construct an object in one dll and pass to another. However it is not necessarily true for all c++ classes. There is a potential danger when passing c++ container across DLL boundaries(when two dll has different allocator, calling push_back in another dll could cause error).

The difference again boils down to the design point of what is a clear cut of FFI conventions. Ideally, it would be: a stable set of C API and object structures that does not change over time.

larroy · 2019-12-26T21:21:16Z

What's the point of having an API if you type erase it? Then you might as well have a single function API with a type erased callback name to select the function to call. In the end you move the burden away from the API to the callers and inside the API to the dispatchers. For going this route of uber-clever template tricks to generate code, I think it's better to just put in place proper code generation for maintainability. Could you provide a bit more details about tradeoffs? Everything has tradeoffs, I don't believe any solution which is sold as a panacea, there's no silver bullet.

…

On Thu, Dec 19, 2019 at 10:21 AM Tianqi Chen ***@***.***> wrote: I have another candidate that would highly recommend: adopt TVM's FFI convention. The historical problem of MXNet FFI was the blowing amount of the C API bindings as we add new features. This creates a huge amount of maintenance burden. The real problem was not really about which FFI system to adopt(cython and pybind are fine in that end, except for the cost of compilation), but more of the cost to maintain the FFI. MXNet used to have a fast cython binding, but that was abandoned because we keep add new APIs we cannot keep up both ctypes and cython. When developing TVM we learnt from the lesson and restrict the API to a limited set of runtime APIs that does not change, and have a stable cython, ctypes binding for them. The runtime support a type-erased function(PackedFunc), which can be efficiently called from any of the frontend language, and all the APIs are exposed through the PackedFunc. On the python side an additional wrapping is created for better documentation and call into the PackedFunc. See more in https://docs.tvm.ai/dev/runtime.html The system works great for over a few years now. Of course I understand there has been legacy issues in MXNet that is why I did not bring this proposal up. But given this is a proposal for 2.0, I would encourage everyone to give a serious thought about this possibility. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#17097>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAHCV2NICGEG3VMB2JTNHW3QZO3RLANCNFSM4J3WHPZQ> .

larroy · 2019-12-26T21:25:08Z

Pybind is nice, I used Boost python many years ago, which I think is based on. The problem with this is the hourglass C bindings, you have to go from Python to C++ / Pybind, down to C and to the engine, this seems like a lot of boilerplate.

…

On Mon, Dec 16, 2019 at 10:02 PM reminisce ***@***.***> wrote: MXNet imperative operator invocation overhead is as large as 30-60us, which is significant compared to the official NumPy operators with ~600ns overhead. This has negatively impacted the performance of applying MXNet to the models where many operators' kernel runtime duration is short, especially in the area of classic machine learning. We plan to address the problem in two steps: 1. Short term: Use pybind11 to replace Python op API and ctypes/c api. Preliminary experiments show that the pure Python-C++ turnaround time by using Pybind is between 400-600ns, while the current Python op API using ctypes/c api costs more than 10us. We believe with the correct implementation, we can reduce the op invocation overhead to 2us including the time on FFI and engine. 2. Long term: Adopt Python's C extension interface. NumPy did this by developing its own C API. This provides considerably less overhead compared to other solutions. However, it would cost much more engineering efforts by integrating this with our existing operator workflow in C++. @hzfan <https://github.com/hzfan> @hgt312 <https://github.com/hgt312> — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#17097>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAHCV2J7KYUKKTFZTKN6TB3QZBTOZANCNFSM4J3WHPZQ> .

tqchen · 2019-12-26T21:59:47Z

@larroy indeed every solution has trade-offs, and these tradeoffs are discussed in the above posts when we compare solutions, and they are backed by benchmarks :) it would be great if you can also suggest potential tradeoffs here.

When you expose an API from typed language(c++) to a dynamic language(python), you have to type erase it, given that the python functions don't have the type, and you have to pass the information along.

The only difference is where you do the type checking(that the python type corresponds to the right c++ type), and translation(translating to the c++ type).

For example, in the case of pybind, the erasure is done implicitly when you call the python function, then checking and translation happens when you call into the c++ function.

In the case of creating a C API for each feature and wrap things in the python side, the type checking is done in the python side, and translation as well.

In the case of tvm ffi, the type translation is done in the python/cython side, while the type checking is done in the c++.

To dive deeper into the tradeoffs for PackedFunc calling convention. The convention erases the type by having the type code stored into the arguments. This brings additional cost of passing arguments into heap, as opposed to registers. So they might not be designed for inline functions that needs to happen at the order of 1e-9s, however, for API functions that needs to run around 1e-7 or even 1e-8 level, this convention is pretty good.

In terms of the calling cost, it really depends on whether the caller and callee are strongly typed.

If caller is strongly typed, then assigning type code is O(1)
If caller is a dynamic type(like python) then we need to have a dispatcher to dispatch and select the right type code
If callee is strongly typed, then the cost of checking is O(1) by just check the code to be the correct one
If the callee is dynamic type, then a dispatching need to happen, which have another level of hashtable lookup O(1)

As we can see, the only place where dispatching is necessary is the dynamic type handling case. Even in these cases, if there is a strong need of specialization, we can directly force the type by running checking on the caller, and pass in the right type code (the engineering burden is the same as wrapping the C API). However, the benchmark suggests that the dynamic dispatching cost is reasonable, and satisfies the API speed.

Coming back to the tradeoff, the main tradeoff here is the engineering burden to keep an hourglass design(with fixed set of API) vs efficiency. While my post did not suggest that TVM's ffi is a silver bullet, it does works pretty well for our use cases. hope it helps

szha · 2019-12-27T19:55:10Z

Thanks for the explanation. I'm not so concerned about complexity of dispatching. If I understood you correctly the main benefit that you explain for the TVM project was not having to change the C API, but still you need to do type checking in both ends, or at least on the receiving end of the API, correct? I think we have discussed similar things in the past and we might have different views on strongly typed vs dynamic typed. A priori I prefer to see an API which can be evolved and changed, I find it more explicit and clearer that what I think you do with PackedFun which I have looked at briefly but not used extensively. If one is going to call into the C API using pybind, does it make sense to layer a C++ API on top of the C API for this? Also these microbenchmarks are nice, but we also need to consider the overhead in typical workloads and see if it's still significant. CFFI is also another alternative. I couldn't access your pointers like: https://github.com/tqchen/tvm/tree/pyffi

…

On Thu, Dec 26, 2019 at 2:00 PM Tianqi Chen ***@***.***> wrote: @larroy indeed every solution has trade-offs, and these tradeoffs are discussed in the above posts when we compare solutions, and they are backed by benchmarks :) it would be great if you can also suggest potential tradeoffs here. When you expose an API from typed language(c++) to a dynamic language(python), you have to type erase it, given that the python functions don't have the type, and you have to pass the information along. The only difference is where you do the type checking(that the python type corresponds to the right c++ type), and translation(translating to the c++ type). For example, in the case of pybind, the erasure is done implicitly when you call the python function, then checking and translation happens when you call into the c++ function. In the case of creating a C API for each feature and wrap things in the python side, the type checking is done in the python side, and translation as well. In the case of tvm ffi, the type translation is done in the python/cython side, while the type checking is done in the c++. To dive deeper into the tradeoffs for PackedFunc calling convention. The convention erases the type by having the type code stored into the arguments. This brings additional cost of passing arguments into heap, as opposed to registers. So they might not be designed for inline functions that needs to happen at the order of 1e-9s, however, for API functions that needs to run around 1e-7 or even 1e-8 level, this convention is pretty good. In terms of the calling cost, it really depends on whether the caller and callee are strongly typed. - If caller is strongly typed, then assigning type code is O(1) - If caller is a dynamic type(like python) then we need to have a dispatcher to dispatch and select the right type code - If callee is strongly typed, then the cost of checking is O(1) by just check the code to be the correct one - If the callee is dynamic type, then a dispatching need to happen, which have another level of hashtable lookup O(1) As we can see, the only place where dispatching is necessary is the dynamic type handling case. Even in these cases, if there is a strong need of specialization, we can directly force the type by running checking on the caller, and pass in the right type code (the engineering burden is the same as wrapping the C API). However, the benchmark suggests that the dynamic dispatching cost is reasonable, and satisfies the API speed. Coming back to the tradeoff, the main tradeoff here is the engineering burden to keep an hourglass design(with fixed set of API) vs efficiency. While my post did not suggest that TVM's ffi is a silver bullet, it does works pretty well for our use cases. hope it helps -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: #17097 (comment)

larroy · 2019-12-27T22:44:51Z

LOL, the last one was my comment, not @szha :-D

szha · 2019-12-27T22:46:44Z

Test On Fri, Dec 27, 2019 at 11:54 AM Pedro Larroy <[email protected]> wrote:

…

Thanks for the explanation. I'm not so concerned about complexity of dispatching. If I understood you correctly the main benefit that you explain for the TVM project was not having to change the C API, but still you need to do type checking in both ends, or at least on the receiving end of the API, correct? I think we have discussed similar things in the past and we might have different views on strongly typed vs dynamic typed. A priori I prefer to see an API which can be evolved and changed, I find it more explicit and clearer that what I think you do with PackedFun which I have looked at briefly but not used extensively. If one is going to call into the C API using pybind, does it make sense to layer a C++ API on top of the C API for this? Also these microbenchmarks are nice, but we also need to consider the overhead in typical workloads and see if it's still significant. CFFI is also another alternative. I couldn't access your pointers like: https://github.com/tqchen/tvm/tree/pyffi On Thu, Dec 26, 2019 at 2:00 PM Tianqi Chen ***@***.***> wrote: > @larroy indeed every solution has trade-offs, and these tradeoffs are > discussed in the above posts when we compare solutions, and they are backed > by benchmarks :) it would be great if you can also suggest potential > tradeoffs here. > > When you expose an API from typed language(c++) to a dynamic > language(python), you have to type erase it, given that the python > functions don't have the type, and you have to pass the information along. > > The only difference is where you do the type checking(that the python > type corresponds to the right c++ type), and translation(translating to the > c++ type). > > For example, in the case of pybind, the erasure is done implicitly when > you call the python function, then checking and translation happens when > you call into the c++ function. > > In the case of creating a C API for each feature and wrap things in the > python side, the type checking is done in the python side, and translation > as well. > > In the case of tvm ffi, the type translation is done in the python/cython > side, while the type checking is done in the c++. > > To dive deeper into the tradeoffs for PackedFunc calling convention. The > convention erases the type by having the type code stored into the > arguments. This brings additional cost of passing arguments into heap, as > opposed to registers. So they might not be designed for inline functions > that needs to happen at the order of 1e-9s, however, for API functions that > needs to run around 1e-7 or even 1e-8 level, this convention is pretty good. > > In terms of the calling cost, it really depends on whether the caller and > callee are strongly typed. > - If caller is strongly typed, then assigning type code is O(1) > - If caller is a dynamic type(like python) then we need to have a > dispatcher to dispatch and select the right type code > - If callee is strongly typed, then the cost of checking is O(1) by just > check the code to be the correct one > - If the callee is dynamic type, then a dispatching need to happen, which > have another level of hashtable lookup O(1) > > As we can see, the only place where dispatching is necessary is the > dynamic type handling case. Even in these cases, if there is a strong need > of specialization, we can directly force the type by running checking on > the caller, and pass in the right type code (the engineering burden is the > same as wrapping the C API). However, the benchmark suggests that the > dynamic dispatching cost is reasonable, and satisfies the API speed. > > Coming back to the tradeoff, the main tradeoff here is the engineering > burden to keep an hourglass design(with fixed set of API) vs efficiency. > While my post did not suggest that TVM's ffi is a silver bullet, it does > works pretty well for our use cases. hope it helps > > > -- > You are receiving this because you are subscribed to this thread. > Reply to this email directly or view it on GitHub: > > #17097 (comment)

tqchen · 2019-12-28T04:03:42Z

re the need for explicit type checking code in TVM FFI.

Actually there is no explicit code for type checking as they are generated automatically via template expansion(on the receiving end), also we also have a "strong typed" signature that wraps the packed function interface, which gives you compile time type checking https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/packed_func.h#L191

For dynamic language side(python) the exposed function is still type erased(as python is a dynamic language).

Note that the view dynamic vs static typed language does not really apply to this case, because the main goal(exposing to python) means type-erasure(as python is dynamically typed). The main goal would be how to reduce the number of abstraction layers.

Also these microbenchmarks are nice, but we also need to consider the
overhead in typical workloads and see if it's still significant.

If we apply reasoning, most API cost is going to be FFI cost + exec cost, and I think the conclusion so far is we want FFI cost to be around 1e-7s to 1e-6s, which is the limit of any cost .

hzfan · 2020-01-24T06:44:09Z

I created a follow-up design proposal on cwiki. TVM FFI works well with MXNet and the overhead for np.zeros gets greatly reduced. Any feedback is appreciated.

tqchen · 2020-01-25T23:00:12Z

Thanks @hzfan I would also high recommending taking a close look at the TVM's object protocol, and try to push most of the things through the Object eventually(Create temporary support for legacy cases like TShape is fine, but eventually pushing things as object will have a greater uniformity, and brings benefit such as putting everything into a container)

hzfan · 2020-02-02T07:53:55Z

@tqchen Thanks for sharing this. I don’t know if I understand correctly. For now arguments except primitives pass through FFI via Object (like ADTObj). It is then converted to TShape in backend and TShape is not involved in FFI directly.

As you said, Object allows me to conveniently put various kinds of things into a container (ADTObj), without losing their types. For example, now a tuple of tuples like ((2, 2), (2, 2)) is allowed.

Also sorry for the late reply. I have been on a vacation this week :)

eric-haibin-lin mentioned this issue Dec 19, 2019

[Discussion] Overhead in MXNet Execution #14883

Open

tqchen closed this as completed Dec 19, 2019

tqchen reopened this Dec 19, 2019

reminisce added the RFC Post requesting for comments label Dec 22, 2019

reminisce changed the title ~~[mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead~~ [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead Dec 22, 2019

szha mentioned this issue Dec 31, 2019

[RFC] Apache MXNet 2.0 Roadmap #16167

Open

szha assigned hzfan Jan 24, 2020

hzfan mentioned this issue Feb 6, 2020

MXNet FFI for Operator Imperative Invocation #17510

Merged

7 tasks

szha mentioned this issue Feb 24, 2020

[RFC] MXNet 2.0 API Deprecation #17676

Open

sxjscience mentioned this issue May 13, 2020

Speed difference between mxnet.numpy and numpy #18302

Open

anko-intel mentioned this issue May 25, 2020

DOT product too slow on CPU and GPU compared to np and pytorch #17971

Closed

ptrendx mentioned this issue Aug 17, 2020

[RFC] GPU performance improvements in MXNet engine #18951

Open

szha unassigned hzfan Oct 7, 2020

saudet mentioned this issue Dec 8, 2020

[RFC] MXNet 2.0 JVM Language development #17783

Open

[RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead #17097

[RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead #17097

Comments

reminisce commented Dec 17, 2019

wkcn commented Dec 18, 2019

chinakook commented Dec 18, 2019 • edited Loading

leezu commented Dec 19, 2019

eric-haibin-lin commented Dec 19, 2019

sxjscience commented Dec 19, 2019 • edited Loading

tqchen commented Dec 19, 2019 • edited Loading

eric-haibin-lin commented Dec 19, 2019

tqchen commented Dec 19, 2019

tqchen commented Dec 19, 2019 • edited Loading

junrushao commented Dec 20, 2019

tqchen commented Dec 20, 2019

reminisce commented Dec 20, 2019

tqchen commented Dec 20, 2019 • edited Loading

sxjscience commented Dec 20, 2019 • edited Loading

reminisce commented Dec 20, 2019

tqchen commented Dec 20, 2019 • edited Loading

Discussion

tqchen commented Dec 20, 2019 • edited Loading

ptrendx commented Dec 20, 2019

sxjscience commented Dec 20, 2019 • edited Loading

tqchen commented Dec 20, 2019 • edited Loading

reminisce commented Dec 21, 2019

ptrendx commented Dec 21, 2019

reminisce commented Dec 22, 2019

tqchen commented Dec 22, 2019 • edited Loading

Technical Choices and Tradeoffs

tqchen commented Dec 23, 2019 • edited Loading

What is Implemented in the POC

Technical Choices and Tradeoffs

reminisce commented Dec 23, 2019

hzfan commented Dec 24, 2019

tqchen commented Dec 24, 2019 • edited Loading

larroy commented Dec 26, 2019 via email

larroy commented Dec 26, 2019 via email

tqchen commented Dec 26, 2019

szha commented Dec 27, 2019 via email

larroy commented Dec 27, 2019 • edited Loading

szha commented Dec 27, 2019 via email

tqchen commented Dec 28, 2019

hzfan commented Jan 24, 2020

tqchen commented Jan 25, 2020

hzfan commented Feb 2, 2020

chinakook commented Dec 18, 2019 •

edited

Loading

sxjscience commented Dec 19, 2019 •

edited

Loading

tqchen commented Dec 19, 2019 •

edited

Loading

tqchen commented Dec 19, 2019 •

edited

Loading

tqchen commented Dec 20, 2019 •

edited

Loading

sxjscience commented Dec 20, 2019 •

edited

Loading

tqchen commented Dec 20, 2019 •

edited

Loading

tqchen commented Dec 20, 2019 •

edited

Loading

sxjscience commented Dec 20, 2019 •

edited

Loading

tqchen commented Dec 20, 2019 •

edited

Loading

tqchen commented Dec 22, 2019 •

edited

Loading

tqchen commented Dec 23, 2019 •

edited

Loading

tqchen commented Dec 24, 2019 •

edited

Loading

larroy commented Dec 27, 2019 •

edited

Loading