Profile-Guided Optimization (PGO) and LLVM BOLT results #827
Comments
Thanks for running these numbers! iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.
If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.
Well, it's partially true. Yes, BOLT can perform some optimizations even without a runtime profile. However, BOLT performs most of its optimizations only when a runtime profile is available. The runtime profile could be collected with Linux's
Sure! I have multiple examples of PGO and/or BOLT integration into different projects:
The examples are not limited to Rust-based projects. I hope they help.
Thanks for pydantic-core, that is exactly what I was looking for! The next question is what is a minimal reasonable use case to profile. We're already going to be blowing up our build times with this and I'd like to not make it worse, particularly because our GitHub Action has a race condition where if you specify
I have some (hopefully helpful) thoughts on that:
That was my expectation. Even still, build times are an impact because we have a gap between
So to verify, the code doesn't need to match 1:1, and PGO handles skew between the profile and the code being built? Where can I read more about this so I understand the technical limitations?
Yeah. I think for the actions that depend on HEAD you can use a Release build without PGO, and simply not make PGO builds for the HEAD version.
That's an excellent question! Unfortunately, I have no related resources regarding PGO profile skew handling in
I don't think that
However, even if you reprofile the binary on every release workflow, I don't think that the CI cost would have to be so large. I think that running on some input that takes ~30s in CI should be enough for this project. So you'd have to pay for one additional (re)build of the crate + 30s-1m of profile gathering. You could try to use
By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).
Huh, I had thought those were on. I enabled CGU=1 because it offered a big gain but didn't enable ThinLTO because it slowed down compile times (iirc) for little gain. See 1250609
With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?
I think so, since CGU=1 and PGO enable different sets of optimizations. And enabling CGU=1 with LTO is a good thing to do before enabling PGO.
There are trade-offs with this. What I'm trying to weigh is how much of a gain there is going from CGU=1 to CGU=1 + PGO compared to any analysis time we have to do as part of our release pipeline.
The only way to estimate the benefits is to test CGU=1 vs CGU=1 + PGO in the benchmarks :)
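One way to run such a comparison is a small timing harness that executes each build several times on the same input and compares the medians. The sketch below is only an illustration under stated assumptions: the function names `median_wall_time` and `compare` are made up here, and the two binaries are whatever paths you pass on the command line (for instance a CGU=1 build and a CGU=1 + PGO build). The arguments mirror the single-threaded `-q --threads 1 llvm_project` invocation from the issue description.

```python
import statistics
import subprocess
import sys
import time


def median_wall_time(cmd, runs=5):
    """Run `cmd` `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False: a non-zero exit code is not treated as a failure here,
        # the run is still timed.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline_cmd, candidate_cmd, runs=5):
    """Return (baseline_seconds, candidate_seconds, speedup of candidate)."""
    base = median_wall_time(baseline_cmd, runs)
    cand = median_wall_time(candidate_cmd, runs)
    return base, cand, base / cand


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: compare.py BASELINE_BINARY CANDIDATE_BINARY")
    else:
        # Workload arguments taken from the benchmark command in the issue.
        args = ["-q", "--threads", "1", "llvm_project"]
        base, cand, speedup = compare(
            [sys.argv[1], *args], [sys.argv[2], *args]
        )
        print(f"baseline:  {base:.3f}s")
        print(f"candidate: {cand:.3f}s")
        print(f"speedup:   {speedup:.2f}x")
```

On Linux, prefixing both commands with `taskset -c 0`, as in the issue's own runs, should reduce scheduler noise. A median over a handful of runs is a deliberately simple choice; a dedicated benchmarking tool would give tighter statistics.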
Hi!
I recently ran a lot of Profile-Guided Optimization (PGO) benchmarks on different kinds of software. All currently available results are located at https://github.com/zamazan4ik/awesome-pgo. According to the tests, PGO usually helps with achieving better performance. That's why testing PGO would be a good idea for Typos. I did some benchmarks on my local machine and want to share my results.
Test environment
`master` branch (commit `da2759161fbf9ac2840d6955f120bc3c6f24405f`)
Test workload
As a test scenario, I used LLVM sources from https://github.com/llvm/llvm-project on commit `11db162db07d6083b79f4724e649a8c2c69913e1`. All runs are performed on the same hardware, operating system, and the same background workload. The command to run `typos` is `taskset -c 0 ./typos -q --threads 1 llvm_project`. One thread was used to reduce the influence of the multi-threading scheduler on the results. All PGO optimizations are done with cargo-pgo.
Results
Here are the results. I also posted Instrumentation results so you can estimate how slow `typos` is in Instrumentation mode. The results are in `time` utility format.
48,86s user 3,44s system 99% cpu 52,628 total
30,09s user 3,23s system 99% cpu 33,616 total
128,16s user 3,55s system 99% cpu 2:12,23 total
92,05s user 3,60s system 99% cpu 1:36,08 total
29,09s user 3,16s system 98% cpu 32,585 total
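To read relative differences off these numbers, a small sketch like the following could parse the `time` lines and compare each run's wall-clock total against the first one. The helper names `parse_total_seconds` and `relative_to_first` are invented for illustration, and rows are referred to by position only, since the lines above carry no configuration labels.

```python
import re

# The five result lines as posted (comma is the decimal separator).
RESULTS = [
    "48,86s user 3,44s system 99% cpu 52,628 total",
    "30,09s user 3,23s system 99% cpu 33,616 total",
    "128,16s user 3,55s system 99% cpu 2:12,23 total",
    "92,05s user 3,60s system 99% cpu 1:36,08 total",
    "29,09s user 3,16s system 98% cpu 32,585 total",
]


def parse_total_seconds(line):
    """Return the wall-clock 'total' field in seconds.

    Accepts both the 'SS,mmm' and the 'M:SS,mm' forms seen above.
    """
    match = re.search(r"(?:(\d+):)?(\d+),(\d+) total$", line)
    if match is None:
        raise ValueError(f"unrecognized time line: {line!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(f"{match.group(2)}.{match.group(3)}")
    return minutes * 60 + seconds


def relative_to_first(lines):
    """Return each run's total divided by the first run's total."""
    totals = [parse_total_seconds(line) for line in lines]
    return [total / totals[0] for total in totals]


if __name__ == "__main__":
    ratios = relative_to_first(RESULTS)
    for index, (line, ratio) in enumerate(zip(RESULTS, ratios), start=1):
        total = parse_total_seconds(line)
        print(f"row {index}: {total:8.3f}s total, {ratio:.2f}x of row 1")
```

A ratio below 1 means the run finished faster than the first row; a ratio above 1 means it was slower.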
Some conclusions
`typos` performance
Further steps
I suggest doing the following: