bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

vstinner · 2021-10-11T23:04:49Z

Move _PyObject_VectorcallTstate() and _PyObject_FastCallTstate() to
pycore_call.h (internal C API).
Convert PyObject_CallOneArg(), PyObject_Vectorcall(),
_PyObject_FastCall() and PyVectorcall_Function() static inline
functions to regular functions.
Add _PyVectorcall_FunctionInline() static inline function.
PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs()
and PyObject_CallOneArg() now call _PyThreadState_GET() rather
than PyThreadState_Get().

https://bugs.python.org/issue45439

vstinner · 2021-10-11T23:14:18Z

_ssl, _sqlite and _testcapi extensions now call the public PyObject_CallNoArgs() function, rather than _PyObject_CallNoArgs().

_ssl uses it in _password_callback(): it's no a performance critical function.
_sqlite uses it in progress_callback() and step_callback(): it would be nice to use _PyObject_CallNoArgs(), but currently _sqlite doesn't use the internal C API. IMO it's a good practice to not abuse the internal C API in the Python stdlib.

Convert PyObject_CallOneArg(), PyObject_Vectorcall(), _PyObject_FastCall() and PyVectorcall_Function() static inline functions to regular functions.

Number of calls to these functions:

PyObject_CallOneArg(): 100
_PyObject_FastCall(): 10
PyVectorcall_Function(): 0
PyVectorcall_Function(): 0

Would it be worth it to add a _PyObject_CallOneArg() static inline variant, similar to _PyObject_CallNoArgs()?

vstinner · 2021-10-11T23:17:14Z

I added PyObject_CallNoArgs() to reduce the usage of the stack memory. I'm not convinced that the static inline function is really required.

For me, the main benefit of PyObject_CallOneArg() is that it has a convenient API: no need to build an array, just pass a regular PyObject* object as argument.

I'm not sure why we have so many static inline functions currently.

Does it help to optimize Python when Python is built without LTO (on macOS)? Is it really worth it?

vstinner · 2021-10-11T23:18:41Z

@markshannon @pablogsal @serhiy-storchaka @methane: Would you mind to review this change?

I'm trying to hide implementation details from the public C API and move "Tstate" variants to the internal C API.

vstinner · 2021-10-12T01:09:46Z

Most changes in this PR are related to _PyObject_CallNoArgs() moved to pycore_call.h. I created PR #28895 just for that and I rebased this PR on top of it: you can focus the review on the second short commit.

vstinner · 2021-10-12T07:05:32Z

I ran a microbenchmark on PyObject_CallOneArg() using test_bench.patch and bench.py of https://bugs.python.org/issue45439 on Linux. I built Python without PGO and without LTO: only with -O3 to prevent PyObject_CallOneArg() from being inlined in _testinternalcapi.test_bench().

I checked with gdb that PyObject_CallOneArg() is not inlined with this PR:

$ gdb -args env/bin/python bench.py --worker --loops 2^29 --warmups 3
(gdb) run
^C
Program received signal SIGINT, Interrupt.

(gdb) disassemble test_bench
Dump of assembler code for function test_bench:
...
   0x00007ffff7fb582a <+747>:	call   0x7ffff7fb5080 <PyObject_CallOneArg@plt>
...

Result:

Mean +- std dev: [ref] 28.5 ns +- 1.2 ns -> [change] 26.3 ns +- 0.5 ns: 1.08x faster

With this PR, PyObject_CallOneArg() is 1.08x faster... Maybe it's just noise in my benchmark. I didn't use CPU isolation and values are between 26 and 28 ms.

I expected that the current code (static inline) would be faster!? At least, my PR doesn't seem to make PyObject_CallOneArg() slower.

vstinner · 2021-10-12T07:13:56Z

Currently, PyObject_CallOneArg() is inlined as the following machine code.

There are multiple @plt function calls. Maybe this PLT indirection explains why this PR makes PyObject_CallOneArg() faster. With this PR, a single PLT indirection is needed to call PyObject_CallOneArg(), but inside PyObject_CallOneArg() there are no more PLT indirections.

...
   0x00007ffff7fb5ce8 <+120>:	call   0x7ffff7fb5130 <PyThreadState_Get@plt>
   0x00007ffff7fb5ced <+125>:	mov    r14,rax
   0x00007ffff7fb5cf0 <+128>:	test   r13,r13
   0x00007ffff7fb5cf3 <+131>:	je     0x7ffff7fb5de6 <test_bench+374>
   0x00007ffff7fb5cf9 <+137>:	mov    r15,QWORD PTR [r13+0x8]
   0x00007ffff7fb5cfd <+141>:	test   BYTE PTR [r15+0xa9],0x8
   0x00007ffff7fb5d05 <+149>:	je     0x7ffff7fb5d98 <test_bench+296>
   0x00007ffff7fb5d0b <+155>:	mov    rdi,r13
   0x00007ffff7fb5d0e <+158>:	call   0x7ffff7fb51c0 <PyCallable_Check@plt>
   0x00007ffff7fb5d13 <+163>:	test   eax,eax
   0x00007ffff7fb5d15 <+165>:	je     0x7ffff7fb5e43 <test_bench+467>
   0x00007ffff7fb5d1b <+171>:	mov    rax,QWORD PTR [r15+0x38]
   0x00007ffff7fb5d1f <+175>:	test   rax,rax
   0x00007ffff7fb5d22 <+178>:	jle    0x7ffff7fb5e24 <test_bench+436>
   0x00007ffff7fb5d28 <+184>:	mov    rax,QWORD PTR [r13+rax*1+0x0]
   0x00007ffff7fb5d2d <+189>:	test   rax,rax
   0x00007ffff7fb5d30 <+192>:	je     0x7ffff7fb5d98 <test_bench+296>
   0x00007ffff7fb5d32 <+194>:	lea    rsi,[rsp+0x28]
   0x00007ffff7fb5d37 <+199>:	xor    ecx,ecx
   0x00007ffff7fb5d39 <+201>:	mov    rdx,rbp
   0x00007ffff7fb5d3c <+204>:	mov    rdi,r13
   0x00007ffff7fb5d3f <+207>:	call   rax
   0x00007ffff7fb5d41 <+209>:	mov    rdi,r14
   0x00007ffff7fb5d44 <+212>:	xor    ecx,ecx
   0x00007ffff7fb5d46 <+214>:	mov    rsi,r13
   0x00007ffff7fb5d49 <+217>:	mov    rdx,rax
   0x00007ffff7fb5d4c <+220>:	call   0x7ffff7fb5170 <_Py_CheckFunctionResult@plt>
...

vstinner · 2021-10-12T07:35:07Z

I updated my PR to use _PyThreadState_GET():

PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs()
and PyObject_CallOneArg() now call _PyThreadState_GET() rather
than PyThreadState_Get().

vstinner · 2021-10-12T07:46:37Z

Microbenchmark comparing PyObject_CallNoArgs() "public" to _PyObject_CallNoArgs() "inline" on this PR with bench_no_args.patch, bench_no_args_inline.py and bench_no_args_public.py attached to https://bugs.python.org/issue45439

I used the same method than previously: add test code to the _testinternalcapi, use gcc -O3, no PGO, no LTO, no CPU isolation.

Result:

Mean +- std dev: [public] 7.33 ns +- 0.06 ns -> [inline] 9.33 ns +- 0.66 ns: 1.27x slower

The public PyObject_CallNoArgs() "public" (opaque function call) is 1.27x faster than the internal inline _PyObject_CallNoArgs() flavor (inline the code in the _testinternalcapi module).

vstinner · 2021-10-12T07:51:22Z

Convert PyObject_CallOneArg() (...) to regular functions.

If this PR is merged, it becomes possible to add PyObject_CallOneArg() to the limited C API and the stable ABI.

vstinner · 2021-10-12T09:54:02Z

Another PyObject_CallOneArg() benchmark using LTO+PGO:

Mean +- std dev: [ref3] 24.4 ns +- 0.4 ns -> [change] 20.9 ns +- 0.4 ns: 1.16x faster

PyObject_CallOneArg() is still faster with this PR.

vstinner · 2021-10-12T20:45:14Z

Another PyObject_CallOneArg() benchmark using LTO+PGO:

I used bench2.py and test_bench2.patch attached to https://bugs.python.org/issue45439 for this benchmark.

I added test_bench() to the sys module to help the compiler to optimize the code: avoid PLT indirection.

vstinner · 2021-10-12T21:13:19Z

Another microbenchmark on PyObject_CallOneArg() using test_bench2.patch, gcc -O3 and CPU isolation.

Compare the PR (call) to main (inline), this PR is faster:

Mean +- std dev: [inline] 31.2 ns +- 0.5 ns -> [call] 29.4 ns +- 0.5 ns: 1.06x faster

EDIT: I started to comment assembly code, but I posted incomplete code by mistake, and then I removed all assembly since I was lost while following the code flow :-D

vstinner · 2021-10-12T21:20:17Z

I reverted the change in PyObject_CallOneArg() to use again PyThreadState_Get()

-    PyThreadState *tstate = _PyThreadState_GET();
+    PyThreadState *tstate = PyThreadState_Get();

With this revert, the PR has basically the same performance (gcc -O3, CPU isolation):

Mean +- std dev: [inline] 31.2 ns +- 0.5 ns -> [call2] 31.5 ns +- 0.4 ns: 1.01x slower

* Move _PyObject_VectorcallTstate() and _PyObject_FastCallTstate() to pycore_call.h (internal C API). * Convert PyObject_CallOneArg(), PyObject_Vectorcall(), _PyObject_FastCall() and PyVectorcall_Function() static inline functions to regular functions. * Add _PyVectorcall_FunctionInline() static inline function. * PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs() and PyObject_CallOneArg() now call _PyThreadState_GET() rather than PyThreadState_Get().

vstinner · 2021-10-13T12:18:51Z

PR rebased on top on latest commits.

In the merged commit 7cdc2a0, I modified _PyObject_CallNoArgs() to call _PyThreadState_GET() rather than calling PyThreadState_Get().

vstinner · 2021-10-14T19:59:23Z

I decided to merge my PR to address https://bugs.python.org/issue45439 initial issue: "[C API] Move usage of tp_vectorcall_offset from public headers to the internal C API".

Last years, I added tstate parameters to internal C functions. The agreement was that only internal functions should use it, and indirectly that this tstate parameter should be hidden. I'm now sure exactly, but tstate started to pop up in Include/cpython/abstract.h around "call" functions. This PR fix this issue.

About the impact on performances: well, it's really hard to draw a clear conclusion. Inlining, LTO and PGO give different results on runtime performance and stack memory usage.

IMO the fact that public C API functions are now regular functions should not prevent us to continue (micro) optimizing Python. We can always add a variant to the internal C API using an API a little bit different (e.g. add tstate parameter) or defined as a static inline function, rather than a regular function.

The unclear part is if PyObject_CallOneArg() (regular function call) is faster than _PyObject_CallOneArg() (static inline function, inlined). The performance may depend if it's called in the Python executable or in a dynamic library (PLT indirection which may be avoided by gcc -fno-semantic-interposition).

Well, happy hacking and let's continue continuous benchmarking Python!

vstinner added the skip news label Oct 11, 2021

vstinner requested review from 1st1, berkerpeksag, ethanfurman, lysnikolaou, markshannon, methane, pablogsal, rhettinger and tiran as code owners October 11, 2021 23:04

bedevere-bot added the awaiting core review label Oct 11, 2021

the-knights-who-say-ni added the CLA signed label Oct 11, 2021

vstinner merged commit 3cc56c8 into python:main Oct 14, 2021

bedevere-bot removed the awaiting core review label Oct 14, 2021

vstinner deleted the internal_call_tstate branch October 14, 2021 19:53

vstinner mentioned this pull request Apr 10, 2022

[C API] Move usage of tp_vectorcall_offset from public headers to the internal C API #89602

Closed

davidhewitt mentioned this pull request Jan 15, 2025

pyo3-ffi: expose PyObject_Vectorcall(Method) on the stable abi on 3.12+ PyO3/pyo3#4853

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

vstinner commented Oct 11, 2021 •

edited

Loading

vstinner commented Oct 11, 2021

vstinner commented Oct 11, 2021

vstinner commented Oct 11, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021 •

edited

Loading

vstinner commented Oct 12, 2021 •

edited

Loading

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021 •

edited

Loading

vstinner commented Oct 12, 2021

vstinner commented Oct 13, 2021

vstinner commented Oct 14, 2021

bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

Conversation

vstinner commented Oct 11, 2021 • edited Loading

vstinner commented Oct 11, 2021

vstinner commented Oct 11, 2021

vstinner commented Oct 11, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021 • edited Loading

vstinner commented Oct 12, 2021 • edited Loading

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021

vstinner commented Oct 12, 2021 • edited Loading

vstinner commented Oct 12, 2021

vstinner commented Oct 13, 2021

vstinner commented Oct 14, 2021

vstinner commented Oct 11, 2021 •

edited

Loading

vstinner commented Oct 12, 2021 •

edited

Loading

vstinner commented Oct 12, 2021 •

edited

Loading

vstinner commented Oct 12, 2021 •

edited

Loading