Perf Regression From Threading #202
FYI: Adding a type to
Can you use the C API? Tracing through the CPython source code, I see get_ident is defined here, and there seems to be a C-level equivalent. Not sure if there is any speed to be gained by doing that.
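For concreteness, here's a rough sketch of what calling that from Cython could look like. This is hedged: it is not the actual _line_profiler.pyx code, and it assumes PyThread_get_thread_ident() from pythread.h, the C function that _thread.get_ident() wraps.

```cython
# cython: language_level=3
# Minimal sketch (not the real _line_profiler.pyx): fetch the thread id via
# the C API instead of calling threading.get_ident() from the trace callback.
cdef extern from "pythread.h":
    unsigned long PyThread_get_thread_ident() nogil

cdef inline unsigned long current_thread_id() nogil:
    # Same value threading.get_ident() returns, but with no Python-level call,
    # attribute lookup, or int-object allocation on the hot path.
    return PyThread_get_thread_ident()
```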
Actually I've been looking at that over the past half hour, and I can't really see any way to integrate that C source in. I'll keep thinking though!
Update: I've gotten it working, plus I added a custom hash map for some extra speed. It should now have ~1.75x overhead over plain code instead of ~2x without these fixes. I'll try to push it in the morning, because it's 2am for me and there's still code to clean up from the past ~8 hours of trying to make it work.
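To give a flavour of what such a custom map might look like, here's a toy sketch. It's hypothetical and far simpler than anything real: a small fixed-size, open-addressing table keyed on the thread ident, so a hit costs a handful of integer operations rather than a full Python dict lookup.

```cython
# cython: language_level=3
# Toy open-addressing table keyed by thread ident (hypothetical sketch).
# A zero key marks an empty slot; collisions probe linearly. A real version
# would have to grow or error when the table fills instead of looping.
cdef unsigned long _keys[64]        # zero-initialized static storage
cdef long long _hit_counts[64]      # stand-in for the real per-thread payload

cdef inline int _slot_for(unsigned long ident):
    cdef int i = <int>(ident & 63)  # table size is 64, so mask instead of modulo
    while _keys[i] != 0 and _keys[i] != ident:
        i = (i + 1) & 63            # linear probing on collision
    _keys[i] = ident                # claim the slot on first use
    return i
```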
I'm actually going to keep working on it a bit more. I got it down to a ~1.6x overhead for my most realistic test case (jsonpickle, a project I'm a maintainer on), which is the one that had a ~1.75x overhead last night, and a ~10x slowdown in the worst case. The overhead differs based on what percent of the lines executed are profiled, and how long each line takes. For example, with my jsonpickle test cases, I run the jsonpickle_benchmarks.py file 500 times with two fairly hot functions profiled. The worst case, on the other hand, is a script like this:

```python
@profile
def main():
    for x in range(10000000):
        y = x


if __name__ == "__main__":
    main()
```

This has around a 10x slowdown because each line is much faster normally (so kernprof's operations take up a much larger portion of the normal run time), and virtually every line in the program is profiled. Here's a table comparing slowdowns:
Edit: I also found out that the hash map's item lookup, while better than that of Python dicts, still causes most of the overhead. I'm going to look into other approaches that may not have O(1) lookup, but have a smaller constant factor, so that the average overhead can be reduced.
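One idea along those lines (a hypothetical sketch, not committed code): since most programs spend nearly all their time on a single thread, cache the most recently seen ident so the common case is one integer comparison, and only fall back to the full map when the thread actually changes. A plain dict stands in for the real per-thread map here.

```cython
# cython: language_level=3
# Hypothetical fast path: remember the last thread seen and its entry, and
# only do the full lookup when the ident changes.
cdef dict _per_thread = {}
cdef unsigned long _last_ident = 0
cdef object _last_entry = None

cdef inline object entry_for_thread(unsigned long ident):
    global _last_ident, _last_entry
    if ident == _last_ident and _last_entry is not None:
        return _last_entry              # common case: same thread as last event
    entry = _per_thread.get(ident)      # slow path: full map lookup
    if entry is None:
        entry = {}
        _per_thread[ident] = entry
    _last_ident = ident
    _last_entry = entry
    return entry
```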
@Erotemic @ta946 Here's the repo with the changes: https://github.com/Theelx/line_profiler/tree/383f4484bd136fc7b9b893cab7163dc4d7f0870c
I don't have access to non-Linux machines outside of the CI. If you submit a PR I'll run it on CI.
couldn't get pip to install the package from local due to
ran:
pip error log
looking at the first error:
if i look in
which leads me to believe it's installed correctly. if i run this code in python:
i get
but if i specifically import maps
i tried adding this to _line_profiler.pyx but it didn't make a difference. admittedly i've never worked with cython so idk what i'm doing
Got it, thanks for the info! I'll try to push something to fix it in a few minutes. @ta946, can you test again when you see a new commit to my fork?
ran:
then the same commands as above. a new error popped up:
pip error log
Should be fixed again!
the build succeeded, but when i came to run the profiler
@ta946 Part 4 is ready!
worked! |
benchmarked using the following code
i normally use the mintty terminal as it doesn't slow down the code with prints like cmd (conhost) does, so i tested both just to make sure there is no discrepancy. the numbers are in seconds, and the one without profile next to it is the plain python run.
Ok, so with normal profile, my edits make the overhead ~2.5x lower. When your autoprofile is merged, I'll test that too. Can you also test this code:

```python
import time


def main():
    start = time.perf_counter()
    for x in range(10000000):
        y = x
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    print(f"Time Elapsed profiling: {elapsed_ms}ms")
    return elapsed_ms


if __name__ == "__main__":
    e1 = main()
    main = profile(main)
    e2 = main()
    print(f"Slowdown from profiling: {e2-e1}ms, {round(e2/e1, 2)}x")
```
Awesome!
that's an impressive optimization. i know it's the worst case, but 10x still worried me, so i thought i'd test out a script that only has a few lines but is computationally intensive
somehow ran faster 😂 i approve!
i still get timings for cv2. i mainly use line_profiler to optimise it and numpy, so if it doesn't profile native code then i've wasted a lot of hours 😅
Ah, I meant that it doesn't get line timings for lines written in C within numpy/cv2. It still gets the total time taken; it just can't tell which line inside the C code took how much time. I may be wrong though, so if you've gotten it to work on those, please tell me how!
no such magic i'm afraid. although this SO question sent me down a rabbit hole. i tried to profile it, so there's the question of getting the relevant file and line numbers of the cython function, which i'm not sure is even possible? maybe we can use something like decompyle3 to decompile the .pyd/.so file and follow the imports to the function? assuming that can be solved, printing
I'm not sure when line_profiler stopped reporting results inside Cython code; maybe this was related to the 4.0 changes from @Theelx? But in any case, I think any reported measurement in Cython code would be wrong, so it's probably better if it doesn't do it at all. If we do decide to support profiling Cython code, it will be important to have an MWE that is quick to test and demonstrates the cases where it does / doesn't work. In xdoctest I support extracting doctests from binaries, and to test this I have a utility that builds a small C extension and then runs tests on it; that might be useful here. https://github.com/Erotemic/xdoctest/blob/main/tests/test_binary_ext.py
My best guess is that this issue appeared in 3.5.1. I think cython code generally runs too fast to get an accurate measurement, so I'd recommend against spending too much time working on profiling it, especially because the built-in annotation tool helps users trying to speed it up.
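For what it's worth, if someone does want to experiment with this, Cython has opt-in line tracing that emits the events settrace-based tools listen for, but it has to be enabled at compile time. A hedged sketch of a tiny test module (not something line_profiler ships; fib is just a placeholder function):

```cython
# cython: language_level=3
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
# With the linetrace directive and the CYTHON_TRACE* macro defined at build
# time, this module generates per-line trace events that settrace-based
# profilers can observe, at a significant speed cost.

def fib(int n):
    cdef unsigned long a = 0, b = 1
    cdef int i
    for i in range(n):
        a, b = b, a + b
    return a
```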
In abcf021, it seems the performance of line_profiler has been severely impacted by the call to threading.get_ident() in the tracing callback. I'm measuring a doubling of overhead from this compared to 4.0.0 before that fix, as the Python interaction it causes is extremely slow. Does anyone know another way to preserve thread data that doesn't require Python interaction, or at least a way that would let us move it out of the tracing callback?