Profile Native Extensions #2

benfred · 2018-08-22T23:04:06Z

Right now we are figuring out the call stack of the Python program by inspecting the PyInterpreterState/PyThreadState/PyFrameObject variables found in the target python program. This means we don't get information about time spent in non-python threads and time spent in native extensions to python (like extension code written in Cython or C++)

It could be useful to profile these native extensions in addition to the Python call stacks. It might be possible to use something like libunwind to augment the Python call stack to get this information.

edit: There is a prerelease here pip install py-spy==0.2.0.dev4 that can profile native extensions of 64-bit Linux/Windows.

The text was updated successfully, but these errors were encountered:

anntzer · 2018-09-05T23:36:03Z

You may(?) want to compare with https://github.com/vmprof/vmprof-python, which also uses libunwind.

okapies · 2018-11-22T11:27:24Z

I also eager to have the native backtrace support in py-spy! It seems this is a secret sauce of vmprof. It depends on libbacktrace which uses libunwind internally.

https://github.com/vmprof/vmprof-python/blob/bb40a6873669ed414817f87b4a0cde5d5b90af82/src/symboltable.c#L213-L267

nikhilm · 2019-01-17T03:57:42Z

This is something I've mostly thought about in the context of a native sampling profiler 1 systematically detecting Python code while unwinding, and interleaving python info at that point, but not actually tried.

The big problem is reconciling the state of the world from the interpreter, to that of the native stack, as frames are created and destroyed. The naive way is to rescan the interpreter "tree" every time one encounters a PyEval function on the stack, and then literally counting the frame list and the unwind stack to find which frame corresponds to this native function. The better way is to know the PyFrameObject directly. It is basically impossible under x86_64 to retrieve the PyFrameObject from a PyEval_EvalFrameEx call while unwinding because it is not required to be in a non-clobbered register. On x86 (32-bit), you can scan the stack and match again the tree inferred from the interpreter state, but this is slow/error prone.

PEP 523 (Py 3.6 onwards) introduced the ability to replace PyEval_EvalFrameDefault with a custom function. This can be done by simply loading a custom python extension in the interpreter. Once you have this, you are free to stash the frame pointer in a non-clobbered register (callee/caller-saved? I'm always confused), before continuing frame execution. At profile collection time, when you are unwinding the stacks, I conjecture that there is enough unwind info to retrieve this register content. If you do try this out, let me know. I'm very curious to know if this works :). vmprof doesn't use it because it is a relatively new thing, and it does require that you ship/use a python extension with your intepreter that injects itself to replace this function.

nikhilm · 2019-01-17T05:02:41Z

vmprof stops at the first PyEvalFrame call and switches entirely to Python
https://github.com/vmprof/vmprof-python/blob/e9c52d9131645879a3bcdd2abcbea70b384fe544/src/vmp_stack.c#L295
Their documentation seems to state there was an earlier implementation using PEP 523 where available and a custom trampoline at other times. https://vmprof.readthedocs.io/en/latest/native.html#technical-design
I have not searched the historical code.

benfred · 2019-01-17T19:05:51Z

Thanks for the pointers @nikhilm and @anntzer! I hadn't seen PEP 523 before.

Just as a quick update on this, I’ve basically got this working for 64bit Linux and OSX right now. It needs a bunch more cleanup, but I should have a initial version posted here in the next little bit. I think windows should be also relatively straightforward to do with the StackWalk64 API but I haven't started on that yet.

I don't want to modify the target program and also want to support profiling older versions of Python - so while using PEP 523 it is a cool idea I don't think it would work here. Like @nikhilm, I also found it tricky to reliably get the PyFrameObject pointer from the call stack with x86_64 calling conventions (it’s passed in to PyEval_EvalFrame* on the RDI register which isn’t unwound =(.

So instead I'm getting both the interpreter stack and the native stack - and then merging the two stack's together by replacing the native PyEval_EvalFrame.* calls with the corresponding python frame.

Like given this python stack:

Thread 0x7FFF98D36380 (active)
fit (implicit/als.py:163)
calculate_similar_artists (lastfm.py:87)
<module> (lastfm.py:186)

And this corresponding native stack with native and Cython code:

Thread 2563
0x00007fff34dfacf0 _cblas_saxpy (/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib)
0x00007fff34def7b0 _SAXPY (/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib)
0x000000010b0d8a9a __ZL44__pyx_pf_8implicit_4_als_30_least_squares_cgP7_object18__Pyx_memviewsliceS1_S1_S1_S1_fii._omp_fn.9 (implicit/_als.cpp:15554)
0x000000010b47be5f _GOMP_parallel (/usr/local/opt/gcc/lib/gcc/8/libgomp.1.dylib)
0x000000010b10c0b6 __ZL58__pyx_fuse_1_0__pyx_pw_8implicit_4_als_31_least_squares_cgP7_objectS0_S0_ (implicit/_als.cpp:15812)
0x000000010b0e5b23 __ZL24__pyx_FusedFunction_callP7_objectS0_S0_ (implicit/_als.cpp:43560)
0x000000010b0d3acc __ZL19__Pyx_PyObject_CallP7_objectS0_S0_ (implicit/_als.cpp:40769)
0x000000010b0ea96a __ZL42__pyx_pw_8implicit_4_als_5least_squares_cgP7_objectS0_S0_ (implicit/_als.cpp:11621)
0x0000000108f1d12b __PyCFunction_FastCallDict (/Users/ben/anaconda3/bin/python)
0x0000000108e9b8c3 __PyObject_FastCallDict (/Users/ben/anaconda3/bin/python)
0x00000001090a3026 _partial_call (/Users/ben/anaconda3/bin/python)
0x0000000108e9b821 __PyObject_FastCallDict (/Users/ben/anaconda3/bin/python)
0x0000000108ea3a67 __PyObject_FastCallKeywords (/Users/ben/anaconda3/bin/python)
0x0000000108ff4048 _call_function (/Users/ben/anaconda3/bin/python)
0x0000000108ff1b6f __PyEval_EvalFrameDefault (/Users/ben/anaconda3/bin/python)
0x0000000108fe5209 __PyEval_EvalCodeWithName (/Users/ben/anaconda3/bin/python)
0x0000000108ff43ba _fast_function (/Users/ben/anaconda3/bin/python)
0x0000000108ff3fac _call_function (/Users/ben/anaconda3/bin/python)
0x0000000108ff1abf __PyEval_EvalFrameDefault (/Users/ben/anaconda3/bin/python)
0x0000000108fe5209 __PyEval_EvalCodeWithName (/Users/ben/anaconda3/bin/python)
0x0000000108ff43ba _fast_function (/Users/ben/anaconda3/bin/python)
0x0000000108ff3fac _call_function (/Users/ben/anaconda3/bin/python)
0x0000000108ff1b6f __PyEval_EvalFrameDefault (/Users/ben/anaconda3/bin/python)
0x0000000108fe5209 __PyEval_EvalCodeWithName (/Users/ben/anaconda3/bin/python)
0x000000010903dd4c _PyRun_FileExFlags (/Users/ben/anaconda3/bin/python)
0x000000010903d224 _PyRun_SimpleFileExFlags (/Users/ben/anaconda3/bin/python)
0x0000000109063d66 _Py_Main (/Users/ben/anaconda3/bin/python)
0x0000000108e93929 _main (/Users/ben/anaconda3/bin/python)
0x00007fff60233015 _start (/usr/lib/system/libdyld.dylib)

This code merges them together to produce output that looks like

Thread 0x7FFF98D36380/2563 (active)
_cblas_saxpy (libBLAS.dylib)
_SAXPY (libBLAS.dylib)
_least_squares_cg (implicit/_als.pyx:200)
_GOMP_parallel (libgomp.1.dylib)
_least_squares_cg (implicit/_als.pyx:156)
least_squares_cg (implicit/_als.pyx:135)
fit (implicit/als.py:163)
calculate_similar_artists (lastfm.py:87)
<module> (lastfm.py:186)

There are a bunch of details I’m skipping over (matching the OS thread id and the python thread id, demangling Cython etc), and a bunch of other things that still need done - but I don’t see any blocking issues at this stage. I’ll probably write a blog post about this at some point with more details.

Also, I’m using libunwind only as a fallback, and instead using gimli to unwind (which seems to be about 10x faster).

nikhilm · 2019-01-17T19:24:18Z

Looks awesome! Exactly what I was thinking.
If PEP-523 isn't viable, and this approach is slow, then you may want to do what vmprof used to do, which is to replace the first few instructions of PyEval_EvalFrameEx with a custom piece of code, move the existing instructions into a different memory page and then stash rdi into rbx or similar. I believe pyflame has some examples of arbitrary code execution in the remote process using ptrace, which is Linux only.

I'm curious about libunwind vs gimli. Are you using the libunwind-sys APIs directly, or the higher level libunwind? I wasn't aware that gimli had a direct unwinding API, i thought it was just a parser. One thing to keep in mind is that macOS often uses compact unwind encoding for unwind information, which afaik is not DWARF.

As for using StackWalk64, it may not be the fastest for a profiler, as it was always made for debugging, and tends to want to lookup symbols. I believe this is why Firefox doesn't use it as the first choice in their profiler. On x86_64, there is RtlUnwind on windows, which is supposed to be much faster. On x86, you can do direct rbp/rsp based unwinding, with some considerations for leaf functions.

benfred · 2019-01-17T19:37:24Z

For OSX, I wrote my own compact unwind code in Rust - and am using gimli for the dwarf unwinding part.

For Linux, gimli doesn't provide a unwind API directly but it's pretty easy to build one on top of it (it's only a couple of hundred lines of code). I'm using libunwind-ptrace for a fallback in case the gimli unwinder fails - I've enabled caching w/ the _Ux86_64_set_caching_policy call, but it is still pretty slow compared to gimli. I'll post the code here soon - I'd be interested in your feedback on it.

I hadn't seen RtlUnwind for windows yet- thanks for the heads up!

nikhilm · 2019-01-17T19:41:35Z

I'm glad you have a lot of free time! :D

nikhilm · 2019-01-17T19:46:06Z

Ah! Ok, I've mostly been thinking of in-process profiling, where everything is much faster :)
Yes, in that case you'd have to write your own for OSX or use PLCrashReporter's code 1 since libunwind on OSX doesn't support remote unwinding. I am also not sure if RtlUnwindEx and co will work directly then.

benfred · 2019-01-28T20:03:39Z

I've got a first draft of this here: #85 . For windows, I went with the StackWalk64 api since I don't think that RtlUnwindEx will work with a remote process.

It still needs some work, but since it doesn't really impact existing functionality (only is called when you pass --native from commandline) I'm leaning towards pushing out a dev release so that people can start playing around with it.

edit: you can install the dev release on linux/osx by going pip install py-spy==0.2.0.dev0

belm0 · 2019-02-26T04:36:36Z

observations from using --native on dev build:

py-spy gave warning about libunwind, but I already have it installed

Failed to load libunwind-ptrace, you may have an elevated error rate
You can install libunwind-ptrace on ubuntu by going 'sudo apt install libunwind-dev'

$ apt-cache policy libunwind-dev
libunwind-dev:
  Installed: 1.1-4.1
  Candidate: 1.1-4.1

error on py-spy start

ERROR 2019-02-26T04:34:10Z: remoteprocess::linux::symbolication: failed to parse file for symbolication /.../lib/python3.7/site-packages/numpy/core/multiarray.cpython-37m-x86_64-linux-gnu.so: "Could not parse ELF header"

high sampling error rate (~30%)
errors: NoUnwindInfoForAddress, Failed to merge native and python frames (Have 0 native and 12 python)

Update to latest version of gimli/goblin and others. This will fix an issue we were seeing 'failed to parse file for symbolication' (#2 (comment)) that was caused by the object crate using an older version of goblin.

okapies · 2019-04-16T02:38:33Z

Any progress in this topic? I'm still facing Error: Failed to merge native and python frames when using --native. Please tell me if there is anything that I can help with.

benfred · 2019-07-07T01:59:54Z

There is a new dev release here pip install py-spy==0.2.0.dev2 that should solve some of the issues. In particular the libunwind linking issue has been resolved (#119), and we are now also falling back to using libunwind in the case where the we failed to merge the native/python stacks (#123)

benfred · 2019-07-14T21:50:36Z

So there is one more dev release here: pip install py-spy==0.2.0.dev3 that as far as I'm aware fixes all the remaining issues with profiling native extensions on x86_64 linux.

This includes some changes to fix issues on how native/python frames are merged, a small lru cache for symbol lookup to speed things up a bit, and updating the gimli/goblin dependencies to the latest versions which fix errors on failing to parse ELF files (like seen by @belm0)

Anyways - I'd appreciate it if people could give it a go, and let me know if they have any problems. I'm probably going to push this to a stable release for linux/windows in a week or so.

ashwinvis · 2019-07-16T15:02:21Z

I tried to profile a Pythran extension.

pip install pythran
pip install --pre py-spy
git clone https://github.com/fluiddyn/fluidfft
cd fluidfft/bench/pythran_implicit_loop
make

create a file prof.py for simplicity

from bench import (
    myfunc_default as func,
    f2d as arr
)

for i in range(1000):
    func(arr)

cProfile output

$ python -m cProfile -s tottime prof.py| head -n 10
         58759 function calls (56447 primitive calls) in 4.963 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    4.683    0.005    4.683    0.005 {built-in method mymod_default.myfunc}
        6    0.119    0.020    0.119    0.020 {method 'rand' of 'mtrand.RandomState' objects}
        1    0.039    0.039    0.276    0.276 bench.py:2(<module>)
      131    0.019    0.000    0.019    0.000 {built-in method marshal.loads}
    23/21    0.013    0.001    0.016    0.001 {built-in method _imp.create_dynamic}

py-spy output:

Collecting samples from 'python prof.py' (python v3.7.3)
Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%    4.77s     4.77s   <module> (prof.py:7)
  0.00%   0.00%   0.000s    0.040s   <module> (numpy/core/multiarray.py:12)
  0.00%   0.00%   0.000s    0.030s   <module> (sre_compile.py:14)
  0.00%   0.00%   0.010s    0.010s   get_data (<frozen importlib._bootstrap_external>:917)

py-spy native mode

Total Samples 500
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 29.00%  29.00%    1.16s     1.16s   0x7ff530b356e3 (mymod_default.so)
  9.00%   9.00%   0.310s    0.310s   0x7ff530b35661 (mymod_default.so)
  9.00%   9.00%   0.290s    0.290s   0x7ff530b356ce (mymod_default.so)
  7.00%   7.00%   0.350s    0.350s   0x7ff530b35698 (mymod_default.so)
  7.00%   7.00%   0.220s    0.220s   0x7ff530b35674 (mymod_default.so)
  6.00%   6.00%   0.300s    0.300s   0x7ff530b35649 (mymod_default.so)
  6.00%   6.00%   0.370s    0.370s   0x7ff530b356df (mymod_default.so)
  5.00%   5.00%   0.260s    0.260s   0x7ff530b356bc (mymod_default.so)
  5.00%   5.00%   0.220s    0.220s   0x7ff530b356a9 (mymod_default.so)
  5.00%   5.00%   0.300s    0.300s   0x7ff530b356f0 (mymod_default.so)
  3.00%   3.00%   0.220s    0.220s   0x7ff530b35686 (mymod_default.so)
  2.00%   2.00%   0.060s    0.060s   0x7ff530b3564f (mymod_default.so)
  1.00%   1.00%   0.040s    0.040s   0x7ff530b356eb (mymod_default.so)
  1.00%   1.00%   0.060s    0.060s   0x7ff530b3569c (mymod_default.so)
  1.00%   1.00%   0.030s    0.030s   0x7ff530b356c1 (mymod_default.so)
  1.00%   1.00%   0.050s    0.050s   0x7ff530b356ae (mymod_default.so)
  1.00%   1.00%   0.060s    0.060s   0x7ff530b35678 (mymod_default.so)
  1.00%   1.00%   0.080s    0.080s   0x7ff530b35654 (mymod_default.so)
  1.00%   1.00%   0.010s    0.010s   <module> (prof.py:6)
  0.00%   0.00%   0.000s    0.370s   _find_and_load_unlocked (<frozen importlib._bootstrap>:967)
  0.00%   0.00%   0.000s    0.080s   exec_module (<frozen importlib._bootstrap_external>:724)

With pythran there is also a possibilty to make an intermediate C++ file by running:

❯❯❯ pythran -E  mymod.py

And then compiling the extensions using gcc and pythran-config command. Right now it is hard to make sense of the results :) Do have a look. Anyways really nice project, lightweight and easy to use. I might be using it in the future.

ashwinvis · 2019-07-16T15:41:25Z

I discussed with Pythran's author, and apparently pythran by default strips away symbols. Using an option in ~/.pythranrc

[compiler]
ldflags=

which fixes it. Here is the result

 ❯❯❯ py-spy --native -- python prof.py                                                                                                                  (py-spy)


Collecting samples from 'python prof.py' (python v3.7.3)
Total Samples 1100
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 34.00%  34.00%    2.43s     2.43s   std::complex<double>& std::complex<double>::operator+=<double>(std::complex<double> const&) (complex:577)
 16.00%  16.00%   0.920s    0.920s   square<double> (square.hpp:32)
 10.00%  10.00%    1.40s     1.40s   square<double> (square.hpp:30)
 10.00%  10.00%    1.11s     1.11s   get (numpy_iexpr.hpp:475)
  6.00%  97.00%   0.960s     9.94s   __copy_m<(anonymous namespace)::pythonic::types::const_nditerator<(anonymous namespace)::pythonic::types::numpy_expr<(anonymous namespace)::pythonic::operator_::f
  6.00%   6.00%   0.710s    0.710s   std::complex<double> std::operator*<double>(std::complex<double> const&, std::complex<double> const&) (complex:799)
  4.00%   4.00%   0.610s    0.610s   std::complex<double>& std::complex<double>::operator+=<double>(std::complex<double> const&) (complex:576)
  3.00%   3.00%   0.500s    0.500s   std::complex<double> std::operator/<double>(std::complex<double> const&, double const&) (complex:922)
  3.00%   3.00%   0.640s    0.640s   __copy_m<(anonymous namespace)::pythonic::types::const_nditerator<(anonymous namespace)::pythonic::types::numpy_expr<(anonymous namespace)::pythonic::operator_::f
  2.00%   2.00%   0.580s    0.580s   square<double> (square.hpp:33)
  2.00%   2.00%   0.110s    0.110s   std::complex<double> std::operator*<double>(std::complex<double> const&, std::complex<double> const&) (complex:802)
  2.00%   2.00%   0.300s    0.300s   square<double> (square.hpp:31)
  1.00%   1.00%   0.150s    0.150s   std::complex<double>::operator+= (complex:545)
  1.00%   1.00%   0.110s    0.110s   std::complex<double> std::operator*<double>(std::complex<double> const&, std::complex<double> const&) (complex:801)
  0.00%   0.00%   0.000s    0.060s   <module> (bench.py:2)

benfred · 2019-07-20T02:46:53Z

@ashwinvis Thanks for trying this out! I will make sure that I call out the need for symbols for best results when I release this. If no symbols are present, we try to figure out what the function is from the shared library exports - but in your case none of those functions would have been exported.

benfred · 2019-09-23T03:54:29Z

This has been added to v0.2.0 for x86_64 linux and windows - will write a blog post in the next couple of days with some more details

benfred · 2019-09-27T14:45:50Z

Blog post is here: https://www.benfrederickson.com/profiling-native-python-extensions-with-py-spy/ - describes the approach and some of the limitations

ihnorton · 2019-09-27T15:01:28Z

@benfred setting emit_linenums=True when calling cythonize will generate #line annotations in the output C/C++ code specifying the originating source location in the .pyx file (which overrides the actual line number in the debug symbols of the compiled library). Of course, this doesn't help with pre-compiled Cython extensions built without the option.

benfred · 2019-09-27T16:34:38Z

@ihnorton thanks for the heads up on the emit_linenums option! That's a much better solution,

I will update the post to point that as a the recommended way to get the correct line numbers from cython - I think that py-spy will just pull those in automatically if the option is set, but I will verify

Update to latest version of gimli/goblin and others. This will fix an issue we were seeing 'failed to parse file for symbolication' (benfred/py-spy#2 (comment)) that was caused by the object crate using an older version of goblin.

benfred added the enhancement New feature or request label Aug 22, 2018

ogrisel mentioned this issue Sep 7, 2018

Feature request: making it easier to profile nested jitted functions / methods numba/numba#3254

Open

benfred mentioned this issue Sep 28, 2018

Mark time spent in 'poll()' as idle #13

Closed

benfred mentioned this issue Mar 18, 2019

Update dependencies #103

Merged

benfred mentioned this issue Mar 20, 2019

show os thread id's in --dump #57

Closed

benfred mentioned this issue Mar 30, 2019

Weird profile on asyncio code #105

Closed

benfred closed this as completed Sep 23, 2019

SyamGadde mentioned this issue Dec 31, 2019

Native profiling (cython) -- merge error #210

Open

vinsonlu mentioned this issue Nov 17, 2023

Seg fault on py-spy record --subprocesses #638

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Profile Native Extensions #2

Profile Native Extensions #2

benfred commented Aug 22, 2018 •

edited

Loading

anntzer commented Sep 5, 2018

okapies commented Nov 22, 2018

nikhilm commented Jan 17, 2019

nikhilm commented Jan 17, 2019

benfred commented Jan 17, 2019

nikhilm commented Jan 17, 2019

benfred commented Jan 17, 2019

nikhilm commented Jan 17, 2019

nikhilm commented Jan 17, 2019 •

edited

Loading

benfred commented Jan 28, 2019 •

edited

Loading

belm0 commented Feb 26, 2019

okapies commented Apr 16, 2019

benfred commented Jul 7, 2019

benfred commented Jul 14, 2019 •

edited

Loading

ashwinvis commented Jul 16, 2019

ashwinvis commented Jul 16, 2019

benfred commented Jul 20, 2019

benfred commented Sep 23, 2019

benfred commented Sep 27, 2019

ihnorton commented Sep 27, 2019 •

edited

Loading

benfred commented Sep 27, 2019 •

edited

Loading

Profile Native Extensions #2

Profile Native Extensions #2

Comments

benfred commented Aug 22, 2018 • edited Loading

anntzer commented Sep 5, 2018

okapies commented Nov 22, 2018

nikhilm commented Jan 17, 2019

nikhilm commented Jan 17, 2019

benfred commented Jan 17, 2019

nikhilm commented Jan 17, 2019

benfred commented Jan 17, 2019

nikhilm commented Jan 17, 2019

nikhilm commented Jan 17, 2019 • edited Loading

benfred commented Jan 28, 2019 • edited Loading

belm0 commented Feb 26, 2019

okapies commented Apr 16, 2019

benfred commented Jul 7, 2019

benfred commented Jul 14, 2019 • edited Loading

ashwinvis commented Jul 16, 2019

cProfile output

py-spy output:

py-spy native mode

ashwinvis commented Jul 16, 2019

benfred commented Jul 20, 2019

benfred commented Sep 23, 2019

benfred commented Sep 27, 2019

ihnorton commented Sep 27, 2019 • edited Loading

benfred commented Sep 27, 2019 • edited Loading

benfred commented Aug 22, 2018 •

edited

Loading

nikhilm commented Jan 17, 2019 •

edited

Loading

benfred commented Jan 28, 2019 •

edited

Loading

benfred commented Jul 14, 2019 •

edited

Loading

ihnorton commented Sep 27, 2019 •

edited

Loading

benfred commented Sep 27, 2019 •

edited

Loading