Optimise calling machinery #359
On my machine with default settings:
With PyObjC_FAST_BUT_INEXACT:
With LTO and -O3:
With experimental lookup cache:
With experimental lookup cache and PyObjC_FAST_BUT_INEXACT :
|
This option enables caching the result of _type_lookup in classes. Disabled for now because this slightly changes semantics (super calls on arbitrary classes can behave differently). I'm also not 100% convinced this new option is safe to use with "normal" Python application code. It is clearly faster for lookups of inherited methods (just like PyObjC_FAST_BUT_INEXACT).
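As a rough illustration of the trade-off, here is a minimal pure-Python sketch of a per-class lookup cache. It is not PyObjC's implementation: `CachedLookup`, `mro_walks` and `invalidate` are hypothetical names. The staleness it demonstrates is a general hazard of such caches, not the specific super-call issue mentioned above.

```python
class CachedLookup:
    """Hypothetical per-class cache for name lookups along the MRO."""

    def __init__(self, cls):
        self.cls = cls
        self.cache = {}
        self.mro_walks = 0  # number of slow-path lookups, for illustration

    def lookup(self, name):
        # Fast path: the result of an earlier MRO walk.
        try:
            return self.cache[name]
        except KeyError:
            pass
        # Slow path: walk the MRO and remember the result.
        self.mro_walks += 1
        for klass in self.cls.__mro__:
            namespace = vars(klass)
            if name in namespace:
                value = namespace[name]
                self.cache[name] = value
                return value
        raise AttributeError(name)

    def invalidate(self):
        # Must be called whenever a class in the MRO changes.
        self.cache.clear()


class Base:
    def greet(self):
        return "base"


class Child(Base):
    pass


original = Base.__dict__["greet"]
lookup = CachedLookup(Child)
first = lookup.lookup("greet")
second = lookup.lookup("greet")  # served from the cache

# Replacing the method on the base class makes the cache stale ...
Base.greet = lambda self: "patched"
stale = lookup.lookup("greet")

# ... until the cache is explicitly invalidated.
lookup.invalidate()
fresh = lookup.lookup("greet")
```

The inherited method is found with a single MRO walk, but any change to a class in the MRO has to invalidate the cache, which is where the semantic risk comes from.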
A large part of the libffi support code is shared between objc.function and objc.selector, and the former is easier to isolate. My current plan is therefore to add some benchmarks for objc.function, then optimise the libffi support bits, and then finally port that to objc.selector (and objc.IMP). After that I'll look into selector-specific optimisations as well (such as caching IMPs, although that requires some research to ensure that the cache is invalidated as needed). At first glance this may not be needed: the assembly code for a function calling an ObjC method just calls objc_msgSend and doesn't seem to contain some kind of cache. |
As a quick experiment I've implemented vectorcall support for objc.function and that appears to improve performance by a couple of percent (for a single argument function). With this change (not yet committed): python function call : 0.026 Without vectorcall the objc function call was about 0.175, the new version is about 5% faster (and basically for free). Next up is a stripped down version of PyObjCFFI_ParseArguments (for simple enough functions) and a variant of the vectorcall support function that uses it. |
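A minimal sketch of this kind of micro benchmark, using only the standard library. This is not the Tools/pyobjcbench.py script; `wrapped_sin` is a hypothetical stand-in for a bridged single-argument function, and `bench` is an assumed helper name.

```python
import math
import timeit


def wrapped_sin(x):
    # Hypothetical stand-in for a bridged call: the extra Python-level
    # call and the float() conversion play the role of bridge overhead.
    return math.sin(float(x))


def bench(func, arg, number=50_000, repeat=3):
    # Take the minimum over several runs to reduce noise.
    # The lambda adds the same small cost to every measurement.
    return min(timeit.repeat(lambda: func(arg), number=number, repeat=repeat))


baseline = bench(math.sin, 1.0)
wrapped = bench(wrapped_sin, 1.0)
overhead_ratio = wrapped / baseline

print(f"builtin call : {baseline:.4f}")
print(f"wrapped call : {wrapped:.4f}")
print(f"ratio        : {overhead_ratio:.2f}")
```

Reporting the ratio to a builtin call, rather than the absolute time, makes results easier to compare between machines.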
The absolute best we can reach: python3.9 Tools/pyobjcbench.py (master)pyobjc This cuts out all overhead by adding a vectorcall implementation variant for functions with signature "double(*)(double)" and selecting that for functions with that signature in their typestring. It then hardcodes argument conversion and calls the function without using libffi. This makes it clear that we have significant overhead in argument parsing and/or the libffi code. This won't end up as such in the final implementation for objc.function, but may get used for objc.selector (which is used a lot more, so the additional code would be more acceptable for common enough signatures). |
This is more realistic:
This is with a change that recognises simple enough functions (limited number of arguments, limited total size of arguments, no pass by reference, no blocks, no functions) and replaces the default vectorcall implementation by a simpler one. That implementation is both simpler and avoids a number of calls to PyMem_Malloc. There's still room for further improvements, but this reduces the overhead of objc.function w.r.t. a builtin function by over 50%. I'm still not committing: this patch needs further testing (and double checking that the bookkeeping is correct). I also have to think about restructuring some of PyObjC's internal testing because a lot of the tests for function calling will now exclusively use the shortcut path and not the full implementation [either add a build or runtime option to avoid using the optimised versions, or duplicate tests with a variant that uses the slow path due to a pass-by-reference input argument]. Most of the improvement is from the simpler implementation; restoring the PyMem_Malloc call for the argument buffer results in only slightly worse performance (but that's just one of several calls to PyMem_Malloc that are removed in the fast path):
I need to check this, but expect that the simpler implementation does not use more stack than the full implementation, even with the argument buffer on the stack. |
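The selection of a simplified caller for "simple enough" functions can be sketched in Python as follows. This only models the decision, not the C implementation. `ArgInfo`, `is_simple`, `select_caller` and both caller functions are hypothetical names, and `MAX_FAST_BYTES` is an assumed value; the limit of 8 arguments is the one mentioned later in this thread.

```python
from dataclasses import dataclass

MAX_FAST_ARGS = 8    # limit mentioned later in this thread
MAX_FAST_BYTES = 64  # assumed value, for illustration only


@dataclass(frozen=True)
class ArgInfo:
    """Hypothetical description of one argument of a signature."""
    size: int
    by_reference: bool = False
    is_block: bool = False
    is_function: bool = False


def is_simple(args):
    # Limited number of arguments.
    if len(args) > MAX_FAST_ARGS:
        return False
    # Limited total size of arguments.
    if sum(arg.size for arg in args) > MAX_FAST_BYTES:
        return False
    # No pass by reference, no blocks, no function pointers.
    return not any(
        arg.by_reference or arg.is_block or arg.is_function for arg in args
    )


def full_call(func, values):
    # Placeholder for the general (slow) calling path.
    return func(*values)


def simple_call(func, values):
    # Placeholder for the simplified calling path.
    return func(*values)


def select_caller(args):
    # Decide once, when the callable is created, which path to use.
    return simple_call if is_simple(args) else full_call


one_double = [ArgInfo(size=8)]
with_out_param = [ArgInfo(size=8), ArgInfo(size=8, by_reference=True)]
too_many = [ArgInfo(size=8)] * 9
```

Making the choice once per callable rather than on every call keeps the check itself off the hot path. A test suite can force the slow path by using a signature with a pass-by-reference argument, as suggested above.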
See also #362 |
"-flto" gives a tiny improvement:
"-fvisibility=hidden" instead of "-fvisibility=protected" doesn't help, but would avoid exposing internals. |
Dropping use of libffi (just for a specific test case):
The difference is probably not large enough to bother investigating further at this time. |
It might be interesting to play around with a variant that supports even fewer features (for example only a limited subset of types), which might allow inlining parts of objc_support.m. |
Current plan is to experiment with stuffing often used bits from objc_support.m into method-signature.m, that way the big case statements don't have to be used (at a cost of slightly higher memory usage). It is far from clear at this point that this will actually help though. |
With vectorcall for native selectors and an implementation for "simple" signatures:
The improvement is less than I'd hoped for, I guess I need to add some testing code to determine "native" calling speed. Interestingly enough I get slightly better performance by not dropping the GIL:
And
The improved performance for not dropping the GIL might be interesting enough to introduce metadata for it (not dropping the GIL is not safe in general; some calls will effectively block for a long time). |
Thanks for working on this. Would it make sense to add a flag that allows switching to a very fast but less safe code path? In my case, those methods are called (a lot) in the drawRect: of the main view, so every millisecond counts. |
Currently the fast path is for methods and functions with a limited number of arguments (max 8) and where none of them (or the result) require special handling (no blocks, no pass-by-reference arguments, ...). One of the things I want to add later on is an option to record statistics about signatures to help me pick a set of signatures that are worthwhile to further optimise.
I might do that, but preferably in a way that allows me to specify in metadata which methods are safe w.r.t. such shortcuts and apply them automatically. But at this time I'm still hoping that I can avoid that. Note that all statistics in this issue are for calling ObjC from Python. Once I've merged a first set of optimisations for that I'll do something similar for calling from ObjC to Python, which would help your drawRect: use case. |
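Recording statistics about signatures, as mentioned above, could look roughly like this. It is a sketch, not an existing PyObjC option; `record_call` and `signature_counts` are hypothetical names, and the sample data is made up.

```python
from collections import Counter

# Hypothetical global tally of how often each signature is called.
signature_counts = Counter()


def record_call(typestring):
    # Would be invoked by the calling machinery on every call.
    signature_counts[typestring] += 1


# Made-up sample: Objective-C type encodings of called methods.
sample_calls = [b"v@:", b"@@:", b"v@:", b"d@:d", b"@@:", b"v@:"]
for typestring in sample_calls:
    record_call(typestring)

# The most common signatures are candidates for a dedicated fast path.
top_two = signature_counts.most_common(2)
print(top_two)
```

With real data from an application, the head of this list would show which signatures are common enough to justify specialised code.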
Function indirection through pointers is clearly somewhat expensive:
This is clearly faster and the change w.r.t. previous attempts is to introduce a second vectorcall implementation for selectors that is hardcoded to call the "simple" libffi caller instead of calling it through a function pointer. I might end up with 3 vectorcall variants:
|
I think my case can benefit from improving both call directions.
And I call this from objC:
This is a very simplified example. And some of the callbacks have a |
|
One thing I notice is that my problems are much worse on Python 3. Are there changes in Python 3 that would explain behavior like this, or might it be caused by the way I init the runtime? |
Changing the argument list builder in the method stub for calling from ObjC to Python to preallocate a tuple of the right size, instead of incrementally growing a list and converting that to a tuple, is marginally faster (about 4% for that micro benchmark). Not committing this right now because I get a crash when testing blocks. |
This gives a couple of percent speedup in calling methods from ObjC. Issue #359
This is for python 3.9 only and speeds up calling by a couple of percent (due to vectorcall itself being more efficient, and by avoiding dynamic memory allocation). Issue #359
I don't know, the code for Python 2 and 3 was pretty much the same for PyObjC. The major difference is slightly stricter type checking in some places, none of which should be on the fast path. That said, I've never paid much attention to speed because the bridge was fast enough for what I do with it. |
These don't have a clear improvement, but should help a little. Issue #359
The current simple changes to the method stub increase performance in the micro benchmark for calling methods from ObjC by about 8% compared to 7.3 (on my M1 laptop with Python 3.9). |
I've started merging work into the repository; the current improvement is shown below (Python 3.10, x86_64 VM running Big Sur). The VM is a fairly noisy environment, but this looks promising. Merged are:
|
I've looked into PGO for a couple of minutes, but need to find a tutorial for that. My first attempt resulted in a failed build using the profile data because the profile data was claimed to be out-of-date. |
One thing I want to look into soonish is the code that walks the MRO looking for methods. I have two, currently disabled, options in the PyObjC code that should speed things up significantly, but at the cost that the I need to determine if that affects correctness of Python code that doesn't introspect The goal is to minimise the amount of times that the code has to look at the objc runtime after an initial lookup. The results below show that enabling these options is worth it performance wise, although there's still a pretty large difference with looking up names in regular python classes. FAST_BUT_INEXACT:
LOOKUP_CACHE:
Both options:
|
One thing to look into for speeding up calling from ObjC to Python: imp_implementationWithBlock. This function is available from macOS 10.7, which means it can be used without compile-time or runtime guards. This could be used to create IMPs bound to a Python function without going through libffi. This can be used for a number of common method signatures to remove the overhead of libffi (assuming this function is more efficient). |
This introduces two special method stubs for calling from ObjC to Python for selectors with no arguments and either a void or 'id' result. This reduces the overhead for these calls, but less than I had hoped.
The performance difference between the "classic" call_from_objc API (using libFFI) and the alternative using imp_implementationWithBlock is not quite clear: the latter appears to be slightly faster, but the difference is very small. The new API has the advantage of not requiring changes to the framework bindings. In the old API the method implementation has no access to the PyObjCMethodSignature for the selector, and that needs to change to be able to handle APIs returning an "id" correctly (due to "already_retained" and "already_cfretained"). Fixing that is possible, but so far it doesn't seem worthwhile to make that change. |
Either helpers-simple-methods, or the earlier introduced block-as-imp.[hm] will survive. I'm still experimenting with which one is more efficient.
The machinery for calling methods can be optimised. See also #350.
Python -> Objective-C:
- avoid calling PyObjCClass_CheckMethodList when not necessary, this is an expensive operation and is not necessary when the attribute is already in the class dict (PyObjCClass_... is no longer expensive, that was in an earlier version of PyObjC; avoiding that call doesn't change performance and complicates the code)
- Libffi_caller creates and destroys libffi context on every call, cache those in the callable object (this is a layering violation, but might help in performance) [for objc.function the ffi_cif is already cached, for objc.selector this might be harder due to bound selectors; vectorcall might help there]
Objective-C to Python:
- method_stub should use vectorcall on Python 3.9:
  - Don't use PyList_* APIs to build argument vector, but use a C array instead (stack allocated)
  - Use PyObject_Vectorcall instead of PyObject_Call
- For 3.8 and earlier: Allocate a correctly sized PyTuple directly instead of first building a list and then converting to tuple
- The closure/stub in ObjC classes currently dynamically looks up the Python method on every call, that's not really necessary. This might need some code to change the Python object stored in the closure when the class is updated at runtime (which shouldn't happen a lot)
Generic
[DONE]
- Enable LTO (couple of percent faster)
- Look into PGO as well (using test suite to collect profiling information?)
- Move more information into PyObjCMethodSignature objects, in particular
This could remove most uses of the helper functions for this (except for structs, arrays and the like), and
hopefully that helps to remove some overhead and hence increase performance. But that needs to be tested!
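The idea from the list above of caching the call context in the callable object, instead of creating and destroying it on every call, can be sketched in Python as follows. This is an analogy only: `CallContext` is a hypothetical stand-in for a prepared libffi call interface, and `CachingCallable` is a hypothetical wrapper, not a PyObjC type.

```python
class CallContext:
    """Hypothetical stand-in for a prepared libffi call interface."""

    built = 0  # counts how often a context is constructed

    def __init__(self, signature):
        CallContext.built += 1
        # signature is (return type, argument type, ...).
        self.return_type = signature[0]
        self.arg_types = tuple(signature[1:])


class CachingCallable:
    """Hypothetical callable that builds its call context only once."""

    def __init__(self, func, signature):
        self.func = func
        self.signature = signature
        self._context = None

    def __call__(self, *args):
        context = self._context
        if context is None:
            # First call: build the context and keep it on the callable.
            context = self._context = CallContext(self.signature)
        if len(args) != len(context.arg_types):
            raise TypeError(
                f"expected {len(context.arg_types)} arguments, got {len(args)}"
            )
        return self.func(*args)


half = CachingCallable(lambda x: x / 2, ("double", "double"))
results = [half(2.0), half(4.0), half(6.0)]
```

Three calls construct the context once. As noted in the list, bound selectors make this harder in practice because the object that would hold the cache may be short-lived.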