Rework the random functions to reduce allocations #19
Conversation
Hey @Kerollmops 👋 thank you for the PR. Could you tell me in which version you saw the 17% perf gain? I usually test:

```
Tue Aug 17 17:16:10 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts

real	0m33.049s
user	0m30.617s
sys	0m2.227s

Tue Aug 17 17:16:44 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts

real	0m30.889s
user	0m43.786s
sys	0m3.783s
```
Also, what would be the benefit of this?
That's strange because on my side I tested with
Being able to compile by only using
Makes sense, thank you!
We are now dealing with a 'works on my machine' bug :P Let me do the before-and-after tests and post the numbers here. There is also a PR (#18) which is trying to add the CircleCI setup; that should help resolve the conflicts.
Here is the time it takes to run
And here is the time it takes to run this branch on my M1 MacBook Pro. In short, I have removed 18.02% of the computing time.
CIs are really not made for that: most of the time CI runners are simple VMs that are not stable enough in terms of CPU, and many neighbor processes pollute the CPU.
So I restarted my machine, killed active processes and ran it again.

Master:

```
Tue Aug 17 18:04:24 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts

real	0m32.645s
user	0m30.203s
sys	0m2.247s

Tue Aug 17 18:04:58 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts

real	0m30.543s
user	0m43.470s
sys	0m3.792s
```

This PR:
This is phenomenal, but I am not sure why I am not getting the same perf. I will try to spend more time on it and figure it out.
True! But this would help in setting up a baseline. Right now people have to run the code, do before-and-after benchmarks, and then post the numbers here; CI would solve all of that and improve the developer experience as well. I would still run the code locally before merging anyway. Another idea is to reduce the number of insertions in the CI from 100M to 10M, so that it runs quickly.
After rerunning it on my iMac just to be sure, here is the result of the master branch:
And here is the result of my branch, a 15.13% gain:
I can see some gains on Linux (ubuntu-20.04), though less than 15%. This is a laptop with an i7-10610U. Also, the max RSS for the threaded version is reduced almost by half.

Edit: adding another run on Debian/unstable with a Ryzen 4800U. Again, a small improvement. I guess the macOS allocator is less efficient, so this optimization has a greater effect there.
I am trying to get a different machine to run the tests; I should have one by next week. Sorry for the long silence.
Thank you for the PR, and I apologise for the delay. I got a different machine at the start of November, but around the same time I started my batch at Recurse Center and have been very busy with it. I finally set up the machine and ran the benchmarks today, and I do see improved numbers.

New machine specs: MacBook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4

Master:

```
Mon Nov 22 18:55:46 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts

real	0m30.814s
user	0m28.546s
sys	0m2.187s

Mon Nov 22 18:56:18 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts

real	0m28.466s
user	0m41.418s
sys	0m3.314s
```

PR:
This is awesome!! 🥳🥳 The non-threaded version takes 22% less time and the threaded version 18% (I am going to add this to the leaderboard). Any theories on why both the threaded and non-threaded versions are taking almost the same amount of time?
No worries ;)
This is probably because the allocator is not good at synchronizing allocation requests between threads. Not sure!
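The contention theory above can be sketched with a small, self-contained example (not this repo's actual code; `format_rows` and the row format are illustrative). The idea is that each worker thread allocates one `String` buffer once and then reuses it, so the hot loop never touches the global allocator and threads cannot contend on it there:

```rust
use std::fmt::Write;
use std::thread;

// Illustrative sketch: each worker reuses a single String buffer instead of
// calling `format!` (a fresh heap allocation) per row, so threads rarely hit
// the global allocator inside the hot loop and cannot contend on it there.
fn format_rows(thread_id: usize, rows: usize) -> usize {
    let mut buf = String::with_capacity(32); // one allocation per thread
    let mut bytes = 0;
    for i in 0..rows {
        buf.clear(); // keeps the capacity; no new allocation
        write!(buf, "row-{}-{}", thread_id, i).unwrap();
        bytes += buf.len();
    }
    bytes
}

fn main() {
    let handles: Vec<_> = (0..4usize)
        .map(|t| thread::spawn(move || format_rows(t, 1_000)))
        .collect();
    for h in handles {
        println!("bytes formatted: {}", h.join().unwrap());
    }
}
```

If the allocator does serialize cross-thread allocation requests, this pattern removes that serialization point, which could explain why the threaded and non-threaded versions converge once per-row allocations are gone.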
But on my old machine, the threaded version was 3s faster!
I have merged this, but I would still love to discuss more if we can figure out this behaviour!
This pull request globally improves the performance of the program by removing most of the allocations at run time. Basically, the changes are in the `lib.rs` file:
- Moved the `common.rs` module to be the `lib.rs` file; this way cargo will not try to compile it like a binary file.
- Removed the `Vec` creation in the `get_random_age` function.
- Replaced the `format!` calls with an `AreaCode` new type that wraps a `TinyStr8`. Using this instead of strings helps keep memory locality by directly storing the formatted string in a pointer-sized integer (64 bits), without any indirection to read the `str` itself, and also removes any possible allocation.

On my side, the performance gain is around 17.6%.
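The newtype idea can be sketched without the `tinystr` dependency. This is a minimal stdlib-only approximation, not the PR's actual code: the field layout, the `new` constructor, and the three-digit zero-padded format are assumptions made for illustration. Like a `TinyStr8`, it keeps the formatted digits inline in eight bytes on the stack instead of returning a heap-allocated `String` from `format!`:

```rust
// Hypothetical sketch of an allocation-free area-code newtype. A real
// `TinyStr8` packs the bytes into a u64; here a [u8; 8] stands in for it.
#[derive(Clone, Copy)]
struct AreaCode {
    bytes: [u8; 8], // inline storage, no heap pointer to chase
    len: usize,
}

impl AreaCode {
    /// Formats `n` as three zero-padded ASCII digits, e.g. 42 -> "042".
    fn new(n: u16) -> Self {
        debug_assert!(n < 1000);
        let bytes = [
            b'0' + (n / 100) as u8,
            b'0' + (n / 10 % 10) as u8,
            b'0' + (n % 10) as u8,
            0, 0, 0, 0, 0,
        ];
        AreaCode { bytes, len: 3 }
    }

    fn as_str(&self) -> &str {
        // Only ASCII digits are ever written, so this cannot fail.
        std::str::from_utf8(&self.bytes[..self.len]).unwrap()
    }
}

fn main() {
    let code = AreaCode::new(42);
    println!("{}", code.as_str()); // formatted with zero heap allocations
}
```

Because the whole value fits in a couple of machine words and is `Copy`, it can be passed around and formatted in the hot loop without ever touching the allocator, which is where the gain over `format!`-produced `String`s comes from.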