
Further overhead reductions #203

Open
wants to merge 21 commits into base: main
Conversation

@Theelx Theelx commented Feb 25, 2023

This PR changes the data structures used by the profiler to lower overhead. Specifically, a PreshMap (pre-hashed map) is used to check whether a thread has been seen before, which is significantly faster than calling threading.get_ident() every time. Additionally, a significant source of overhead is removed by eliminating one of the two hpTimer() calls per event: each call can take up to 50 cycles at nanosecond resolution (on Linux), and together the two calls accounted for about half of the overhead. Removing one of them has not measurably reduced accuracy, since the callback is now much faster than it was when hpTimer() was originally added. The last major change is dropping the STL unordered_map container, which caused significant overhead when hashing keys; the parallelized replacement also takes better advantage of multiple cores. preshed and cymem are included as submodules because installing them from PyPI breaks under a pyenv virtual environment (at least on my system).
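To illustrate the seen-thread check, here is a minimal pure-Python sketch of the idea (names are hypothetical; the actual PR stores pre-hashed keys in preshed's PreshMap at the Cython level, which avoids Python-object overhead entirely):

```python
import threading

# Hypothetical sketch: remember which threads the profiler has already
# set up state for, so repeated events on the same thread stay cheap.
_seen_threads = set()

def ensure_thread_registered():
    """Return True the first time the calling thread is seen, else False."""
    ident = threading.get_ident()  # unique among currently-live threads
    if ident not in _seen_threads:
        _seen_threads.add(ident)   # first sight: register the thread
        return True
    return False                   # already registered: fast path

first = ensure_thread_registered()
again = ensure_thread_registered()
```

In the real code the registration path would also allocate per-thread profiling state; the point of the pre-hashed map is that the "already registered" fast path does no Python-level hashing at all.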

Another fairly minor change: with the release of Cython 3.0.0b1, cdef functions now propagate Python exceptions by default. This imposes a roughly 3x speed penalty, raising overhead to unacceptable levels, so I added noexcept to our cdef function signatures, since they don't raise Python exceptions.
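For illustration, the change amounts to appending `noexcept` to the affected cdef signatures (the function below is a made-up example, not one of the PR's actual functions):

```cython
# Under Cython 3.0.0b1+, a plain `cdef double f(...)` implicitly
# propagates Python exceptions, adding an exception check on every
# call. Declaring the function `noexcept` restores the pre-3.0
# behavior for functions that cannot raise.
cdef double elapsed(long long start, long long stop) noexcept:
    return <double>(stop - start)
```

This is only safe for functions that genuinely never raise; a Python exception escaping a `noexcept` function is printed and swallowed rather than propagated.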

Here's a table showing the current state of overhead. Here's a comment explaining jsonpickle.

| Kernprof Version | jsonpickle Slowdown | Worst-case Slowdown |
| --- | --- | --- |
| 3.5.1 | 3x | 60x |
| 4.0.2 | ~1.9x | 30x |
| This PR | ~1.5x | 10x |

Note: Please test this on async code! I haven't had a chance to, and I don't know if the threading handling is entirely accurate with async code.

Theelx commented Feb 26, 2023

There seem to be a decent number of test failures. Please consider this a draft pull request for now, until I can get them all fixed (hopefully within the next week or so).


ta946 commented Apr 12, 2023

Hey guys, any updates on this?
I just wrote some recursive functions with heavily nested but light for loops, e.g. dict-update stuff (I know, worst-case scenario. Also, loops in Python 😭), and saw a 10x slowdown compared to manual timing.

I might create a new branch in the autoprofiler fork and merge the current state of this PR to reduce overhead until you have some free time to spare for this.
