Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

Merged
merged 1 commit into from
Oct 14, 2021
Merged

bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h #28893

merged 1 commit into from
Oct 14, 2021

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Oct 11, 2021

  • Move _PyObject_VectorcallTstate() and _PyObject_FastCallTstate() to
    pycore_call.h (internal C API).
  • Convert PyObject_CallOneArg(), PyObject_Vectorcall(),
    _PyObject_FastCall() and PyVectorcall_Function() static inline
    functions to regular functions.
  • Add _PyVectorcall_FunctionInline() static inline function.
  • PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs()
    and PyObject_CallOneArg() now call _PyThreadState_GET() rather
    than PyThreadState_Get().

https://bugs.python.org/issue45439

@vstinner
Copy link
Member Author

_ssl, _sqlite and _testcapi extensions now call the public PyObject_CallNoArgs() function, rather than _PyObject_CallNoArgs().

  • _ssl uses it in _password_callback(): it's no a performance critical function.
  • _sqlite uses it in progress_callback() and step_callback(): it would be nice to use _PyObject_CallNoArgs(), but currently _sqlite doesn't use the internal C API. IMO it's a good practice to not abuse the internal C API in the Python stdlib.

Convert PyObject_CallOneArg(), PyObject_Vectorcall(), _PyObject_FastCall() and PyVectorcall_Function() static inline functions to regular functions.

Number of calls to these functions:

  • PyObject_CallOneArg(): 100
  • _PyObject_FastCall(): 10
  • PyVectorcall_Function(): 0
  • PyVectorcall_Function(): 0

Would it be worth it to add a _PyObject_CallOneArg() static inline variant, similar to _PyObject_CallNoArgs()?

@vstinner
Copy link
Member Author

I added PyObject_CallNoArgs() to reduce the usage of the stack memory. I'm not convinced that the static inline function is really required.

For me, the main benefit of PyObject_CallOneArg() is that it has a convenient API: no need to build an array, just pass a regular PyObject* object as argument.

I'm not sure why we have so many static inline functions currently.

Does it help to optimize Python when Python is built without LTO (on macOS)? Is it really worth it?

@vstinner
Copy link
Member Author

@markshannon @pablogsal @serhiy-storchaka @methane: Would you mind to review this change?

I'm trying to hide implementation details from the public C API and move "Tstate" variants to the internal C API.

@vstinner
Copy link
Member Author

Most changes in this PR are related to _PyObject_CallNoArgs() moved to pycore_call.h. I created PR #28895 just for that and I rebased this PR on top of it: you can focus the review on the second short commit.

@vstinner
Copy link
Member Author

vstinner commented Oct 12, 2021

I ran a microbenchmark on PyObject_CallOneArg() using test_bench.patch and bench.py of https://bugs.python.org/issue45439 on Linux. I built Python without PGO and without LTO: only with -O3 to prevent PyObject_CallOneArg() from being inlined in _testinternalcapi.test_bench().

I checked with gdb that PyObject_CallOneArg() is not inlined with this PR:

$ gdb -args env/bin/python bench.py --worker --loops 2^29 --warmups 3
(gdb) run
^C
Program received signal SIGINT, Interrupt.

(gdb) disassemble test_bench
Dump of assembler code for function test_bench:
...
   0x00007ffff7fb582a <+747>:	call   0x7ffff7fb5080 <PyObject_CallOneArg@plt>
...

Result:

Mean +- std dev: [ref] 28.5 ns +- 1.2 ns -> [change] 26.3 ns +- 0.5 ns: 1.08x faster

With this PR, PyObject_CallOneArg() is 1.08x faster... Maybe it's just noise in my benchmark. I didn't use CPU isolation and values are between 26 and 28 ms.

I expected that the current code (static inline) would be faster!? At least, my PR doesn't seem to make PyObject_CallOneArg() slower.

@vstinner
Copy link
Member Author

vstinner commented Oct 12, 2021

Currently, PyObject_CallOneArg() is inlined as the following machine code.

There are multiple @plt function calls. Maybe this PLT indirection explains why this PR makes PyObject_CallOneArg() faster. With this PR, a single PLT indirection is needed to call PyObject_CallOneArg(), but inside PyObject_CallOneArg() there are no more PLT indirections.

...
   0x00007ffff7fb5ce8 <+120>:	call   0x7ffff7fb5130 <PyThreadState_Get@plt>
   0x00007ffff7fb5ced <+125>:	mov    r14,rax
   0x00007ffff7fb5cf0 <+128>:	test   r13,r13
   0x00007ffff7fb5cf3 <+131>:	je     0x7ffff7fb5de6 <test_bench+374>
   0x00007ffff7fb5cf9 <+137>:	mov    r15,QWORD PTR [r13+0x8]
   0x00007ffff7fb5cfd <+141>:	test   BYTE PTR [r15+0xa9],0x8
   0x00007ffff7fb5d05 <+149>:	je     0x7ffff7fb5d98 <test_bench+296>
   0x00007ffff7fb5d0b <+155>:	mov    rdi,r13
   0x00007ffff7fb5d0e <+158>:	call   0x7ffff7fb51c0 <PyCallable_Check@plt>
   0x00007ffff7fb5d13 <+163>:	test   eax,eax
   0x00007ffff7fb5d15 <+165>:	je     0x7ffff7fb5e43 <test_bench+467>
   0x00007ffff7fb5d1b <+171>:	mov    rax,QWORD PTR [r15+0x38]
   0x00007ffff7fb5d1f <+175>:	test   rax,rax
   0x00007ffff7fb5d22 <+178>:	jle    0x7ffff7fb5e24 <test_bench+436>
   0x00007ffff7fb5d28 <+184>:	mov    rax,QWORD PTR [r13+rax*1+0x0]
   0x00007ffff7fb5d2d <+189>:	test   rax,rax
   0x00007ffff7fb5d30 <+192>:	je     0x7ffff7fb5d98 <test_bench+296>
   0x00007ffff7fb5d32 <+194>:	lea    rsi,[rsp+0x28]
   0x00007ffff7fb5d37 <+199>:	xor    ecx,ecx
   0x00007ffff7fb5d39 <+201>:	mov    rdx,rbp
   0x00007ffff7fb5d3c <+204>:	mov    rdi,r13
   0x00007ffff7fb5d3f <+207>:	call   rax
   0x00007ffff7fb5d41 <+209>:	mov    rdi,r14
   0x00007ffff7fb5d44 <+212>:	xor    ecx,ecx
   0x00007ffff7fb5d46 <+214>:	mov    rsi,r13
   0x00007ffff7fb5d49 <+217>:	mov    rdx,rax
   0x00007ffff7fb5d4c <+220>:	call   0x7ffff7fb5170 <_Py_CheckFunctionResult@plt>
...

@vstinner
Copy link
Member Author

I updated my PR to use _PyThreadState_GET():

PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs()
and PyObject_CallOneArg() now call _PyThreadState_GET() rather
than PyThreadState_Get().

@vstinner
Copy link
Member Author

Microbenchmark comparing PyObject_CallNoArgs() "public" to _PyObject_CallNoArgs() "inline" on this PR with bench_no_args.patch, bench_no_args_inline.py and bench_no_args_public.py attached to https://bugs.python.org/issue45439

I used the same method than previously: add test code to the _testinternalcapi, use gcc -O3, no PGO, no LTO, no CPU isolation.

Result:

Mean +- std dev: [public] 7.33 ns +- 0.06 ns -> [inline] 9.33 ns +- 0.66 ns: 1.27x slower

The public PyObject_CallNoArgs() "public" (opaque function call) is 1.27x faster than the internal inline _PyObject_CallNoArgs() flavor (inline the code in the _testinternalcapi module).

@vstinner
Copy link
Member Author

Convert PyObject_CallOneArg() (...) to regular functions.

If this PR is merged, it becomes possible to add PyObject_CallOneArg() to the limited C API and the stable ABI.

@vstinner
Copy link
Member Author

Another PyObject_CallOneArg() benchmark using LTO+PGO:

Mean +- std dev: [ref3] 24.4 ns +- 0.4 ns -> [change] 20.9 ns +- 0.4 ns: 1.16x faster

PyObject_CallOneArg() is still faster with this PR.

@vstinner
Copy link
Member Author

Another PyObject_CallOneArg() benchmark using LTO+PGO:

I used bench2.py and test_bench2.patch attached to https://bugs.python.org/issue45439 for this benchmark.

I added test_bench() to the sys module to help the compiler to optimize the code: avoid PLT indirection.

@vstinner
Copy link
Member Author

vstinner commented Oct 12, 2021

Another microbenchmark on PyObject_CallOneArg() using test_bench2.patch, gcc -O3 and CPU isolation.

Compare the PR (call) to main (inline), this PR is faster:

Mean +- std dev: [inline] 31.2 ns +- 0.5 ns -> [call] 29.4 ns +- 0.5 ns: 1.06x faster

EDIT: I started to comment assembly code, but I posted incomplete code by mistake, and then I removed all assembly since I was lost while following the code flow :-D

@vstinner
Copy link
Member Author

I reverted the change in PyObject_CallOneArg() to use again PyThreadState_Get()

-    PyThreadState *tstate = _PyThreadState_GET();
+    PyThreadState *tstate = PyThreadState_Get();

With this revert, the PR has basically the same performance (gcc -O3, CPU isolation):

Mean +- std dev: [inline] 31.2 ns +- 0.5 ns -> [call2] 31.5 ns +- 0.4 ns: 1.01x slower

* Move _PyObject_VectorcallTstate() and _PyObject_FastCallTstate() to
  pycore_call.h (internal C API).
* Convert PyObject_CallOneArg(), PyObject_Vectorcall(),
  _PyObject_FastCall() and PyVectorcall_Function() static inline
  functions to regular functions.
* Add _PyVectorcall_FunctionInline() static inline function.
* PyObject_Vectorcall(), _PyObject_FastCall(), _PyObject_CallNoArgs()
  and PyObject_CallOneArg() now call _PyThreadState_GET() rather
  than PyThreadState_Get().
@vstinner
Copy link
Member Author

PR rebased on top on latest commits.

In the merged commit 7cdc2a0, I modified _PyObject_CallNoArgs() to call _PyThreadState_GET() rather than calling PyThreadState_Get().

@vstinner vstinner merged commit 3cc56c8 into python:main Oct 14, 2021
@vstinner vstinner deleted the internal_call_tstate branch October 14, 2021 19:53
@vstinner
Copy link
Member Author

I decided to merge my PR to address https://bugs.python.org/issue45439 initial issue: "[C API] Move usage of tp_vectorcall_offset from public headers to the internal C API".

Last years, I added tstate parameters to internal C functions. The agreement was that only internal functions should use it, and indirectly that this tstate parameter should be hidden. I'm now sure exactly, but tstate started to pop up in Include/cpython/abstract.h around "call" functions. This PR fix this issue.

About the impact on performances: well, it's really hard to draw a clear conclusion. Inlining, LTO and PGO give different results on runtime performance and stack memory usage.

IMO the fact that public C API functions are now regular functions should not prevent us to continue (micro) optimizing Python. We can always add a variant to the internal C API using an API a little bit different (e.g. add tstate parameter) or defined as a static inline function, rather than a regular function.

The unclear part is if PyObject_CallOneArg() (regular function call) is faster than _PyObject_CallOneArg() (static inline function, inlined). The performance may depend if it's called in the Python executable or in a dynamic library (PLT indirection which may be avoided by gcc -fno-semantic-interposition).

Well, happy hacking and let's continue continuous benchmarking Python!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants