bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893
Number of calls to these functions:
Would it be worth it to add a _PyObject_CallOneArg() static inline variant, similar to _PyObject_CallNoArgs()?
I added PyObject_CallNoArgs() to reduce stack memory usage. I'm not convinced that the static inline function is really required. For me, the main benefit of PyObject_CallOneArg() is its convenient API: no need to build an array, just pass a regular

I'm not sure why we currently have so many static inline functions. Does it help to optimize Python when Python is built without LTO (on macOS)? Is it really worth it?
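To illustrate the convenience described above (no argument array at the call site), here is a minimal sketch in plain C. It is a simplified model, not CPython code: `obj_t`, `vectorcall_fn`, `sum_values()`, `call_one_arg()` and `call_no_args()` are hypothetical names invented for this example, and the real vectorcall protocol carries more state (thread state, keyword names, argument offset flag).

```c
#include <stddef.h>

/* Hypothetical stand-in for an object type; NOT a CPython type. */
typedef struct { long value; } obj_t;

/* Array-based calling convention (assumed for this sketch):
   the callee receives a C array of arguments plus a count. */
typedef long (*vectorcall_fn)(obj_t *const *args, size_t nargs);

/* Example callee: sums the values of all its arguments. */
static long sum_values(obj_t *const *args, size_t nargs)
{
    long total = 0;
    for (size_t i = 0; i < nargs; i++) {
        total += args[i]->value;
    }
    return total;
}

/* Convenience wrapper: the caller passes a single argument directly.
   The one-element array lives in the wrapper's stack frame, so call
   sites do not have to build it themselves. */
long call_one_arg(vectorcall_fn func, obj_t *arg)
{
    obj_t *args[1] = { arg };
    return func(args, 1);
}

/* Zero-argument variant: no array is needed at all. */
long call_no_args(vectorcall_fn func)
{
    return func(NULL, 0);
}
```

With these definitions, a call site reads `call_one_arg(sum_values, &x)` instead of declaring a temporary array first. Whether the wrapper is a regular function or a static inline one then decides whose stack frame holds that array, which is the trade-off being questioned in this thread.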
@markshannon @pablogsal @serhiy-storchaka @methane: Would you mind reviewing this change? I'm trying to hide implementation details from the public C API and move "Tstate" variants to the internal C API.
Most changes in this PR are related to moving _PyObject_CallNoArgs() to pycore_call.h. I created PR #28895 just for that and I rebased this PR on top of it: you can focus the review on the second short commit.
I ran a microbenchmark on PyObject_CallOneArg() using test_bench.patch and bench.py of https://bugs.python.org/issue45439 on Linux. I built Python without PGO and without LTO: only with -O3 to prevent PyObject_CallOneArg() from being inlined in _testinternalcapi.test_bench(). I checked with gdb that PyObject_CallOneArg() is not inlined with this PR:
Result:
With this PR, PyObject_CallOneArg() is 1.08x faster... Maybe it's just noise in my benchmark. I didn't use CPU isolation and values are between 26 and 28 ms. I expected that the current code (static inline) would be faster!? At least, my PR doesn't seem to make PyObject_CallOneArg() slower.
Currently, PyObject_CallOneArg() is inlined as the following machine code. There are multiple
I updated my PR to use _PyThreadState_GET():
Microbenchmark comparing PyObject_CallNoArgs() "public" to _PyObject_CallNoArgs() "inline" on this PR with bench_no_args.patch, bench_no_args_inline.py and bench_no_args_public.py attached to https://bugs.python.org/issue45439. I used the same method as before: add test code to the _testinternalcapi module, use gcc -O3, no PGO, no LTO, no CPU isolation. Result:
The public PyObject_CallNoArgs() (opaque function call) is 1.27x faster than the internal inline _PyObject_CallNoArgs() flavor (code inlined in the _testinternalcapi module).
If this PR is merged, it becomes possible to add PyObject_CallOneArg() to the limited C API and the stable ABI.
Another PyObject_CallOneArg() benchmark using LTO+PGO:
PyObject_CallOneArg() is still faster with this PR.
I used bench2.py and test_bench2.patch attached to https://bugs.python.org/issue45439 for this benchmark. I added test_bench() to the sys module to help the compiler optimize the code: avoid PLT indirection.
Another microbenchmark on PyObject_CallOneArg() using test_bench2.patch, gcc -O3 and CPU isolation. Comparing the PR (call) to main (inline), this PR is faster:
EDIT: I started to comment the assembly code, but I posted incomplete code by mistake, and then I removed all the assembly since I got lost while following the code flow :-D
I reverted the change in PyObject_CallOneArg() to use PyThreadState_Get() again.
With this revert, the PR has basically the same performance (gcc -O3, CPU isolation):
* Move _PyObject_VectorcallTstate() and _PyObject_FastCallTstate() to pycore_call.h (internal C API).
* Convert PyObject_CallOneArg(), PyObject_Vectorcall(), _PyObject_FastCall() and PyVectorcall_Function() static inline functions to regular functions.
* Add _PyVectorcall_FunctionInline() static inline function.
* PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs() and PyObject_CallOneArg() now call _PyThreadState_GET() rather than PyThreadState_Get().
PR rebased on top of the latest commits. In the merged commit 7cdc2a0, I modified _PyObject_CallNoArgs() to call _PyThreadState_GET() rather than calling PyThreadState_Get().
I decided to merge my PR to address the initial issue of https://bugs.python.org/issue45439: "[C API] Move usage of tp_vectorcall_offset from public headers to the internal C API". In recent years, I added

About the impact on performance: well, it's really hard to draw a clear conclusion. Inlining, LTO and PGO give different results on runtime performance and stack memory usage. IMO the fact that public C API functions are now regular functions should not prevent us from continuing to (micro) optimize Python. We can always add a variant to the internal C API using a slightly different API (e.g. add

The unclear part is whether PyObject_CallOneArg() (regular function call) is faster than _PyObject_CallOneArg() (static inline function, inlined). The performance may depend on whether it's called in the Python executable or in a dynamic library (PLT indirection which may be avoided by

Well, happy hacking and let's continue continuously benchmarking Python!
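The layering discussed in this thread (a public regular function that hides implementation details, plus an internal static inline variant that takes the state explicitly) can be sketched in plain C as follows. This is a simplified model under stated assumptions, not the CPython implementation: `state_t`, `get_state()`, `call_with_state()`, `call()` and `twice()` are hypothetical names made up for the example.

```c
#include <stddef.h>

/* Hypothetical per-thread state; in a real project its layout would be
   kept out of the public headers. NOT a CPython type. */
typedef struct { int calls; } state_t;

static state_t global_state = { 0 };

/* Accessor for the current state. In this sketch it simply returns
   the address of a global. */
state_t *get_state(void)
{
    return &global_state;
}

/* "Internal" variant: takes the state explicitly and is static inline,
   so internal callers that already hold the state skip the lookup. */
static inline int call_with_state(state_t *state, int (*func)(int), int arg)
{
    state->calls++;            /* bookkeeping that needs the state */
    return func(arg);
}

/* "Public" variant: a regular (non-inline) function. Callers only see
   this prototype, so the state layout stays an implementation detail. */
int call(int (*func)(int), int arg)
{
    return call_with_state(get_state(), func, arg);
}

/* Example callee used to exercise the two entry points. */
static int twice(int x)
{
    return 2 * x;
}
```

In this model, code outside the library would call `call(twice, 21)`, while internal code that already has a `state_t *` could call `call_with_state()` directly. As the benchmarks above suggest, which of the two is faster in practice depends on the compiler, on LTO and PGO, and on where the caller lives.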
https://bugs.python.org/issue45439