Profile-Guided Optimization (PGO) benchmark results #9507
---
Thanks for doing this @zamazan4ik, cool stuff.
---
Thank you @zamazan4ik -- I agree this is really cool. I wonder what we should do with this excellent information? Some options I can think of:
In general, I don't think many organizations will set up PGO with their own workload (they often don't have an easy-to-run, representative benchmark available during builds, or perhaps don't want to slow down their build process by running benchmarks at the same time).
---
Hi!
I evaluate Profile-Guided Optimization (PGO) performance improvements for applications and libraries in different software domains - all current results can be found here. According to those tests, PGO helps improve performance in many of them. I decided to run PGO benchmarks on the `arrow-datafusion` library too, since some library users may be interested in improving its performance (e.g. @Dandandan suggested testing PGO on `arrow-datafusion` here: apache/datafusion-sqlparser-rs#1163 (comment)). I did some benchmarks and here are the results.
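For context, here is a minimal sketch of what a PGO workflow looks like at the rustc level (this is roughly what cargo-pgo automates, using rustc's standard `-Cprofile-generate`/`-Cprofile-use` flags); the profile directory is just a placeholder, and the TPCH invocation mirrors the commands shown further below:

```bash
# 1. Build an instrumented release binary that emits profile data at runtime
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release --bin tpch

# 2. Run a representative workload once to collect profiles (the training phase)
./target/release/tpch benchmark datafusion --iterations 1 --path ./data/tpch_sf1 -m --format parquet

# 3. Merge the raw .profraw files into a single profile
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/*.profraw

# 4. Rebuild with the collected profile applied
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release --bin tpch
```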
Test environment
`arrow-datafusion`: `main` branch on commit `767760b84c71b9af8b2d8cdee9cd5c099b57b462`
Benchmark
For benchmark purposes, I use these benchmarks. Since the suite consists of many scenarios, I ran each scenario with its own dedicated training phase. This means that for the Clickbench benchmark only Clickbench is used during the PGO training phase, for TPCH only TPCH, and so on. I made this choice since I don't know how different these benchmarks are internally (OLTP vs OLAP, maybe other large differences).
The actual benchmark commands are taken from bench.sh with the exact parameters like iteration count, memory mode (for TPCH), etc. The PGO training phase is done with `--iterations 1`, since there is no need to perform multiple runs during training. All PGO-related routines are done with cargo-pgo.
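For anyone who wants to reproduce this, the basic cargo-pgo setup is just a couple of commands (a sketch based on my understanding of the cargo-pgo README; the concrete benchmark invocations I used are listed further below):

```bash
# Install the cargo-pgo helper
cargo install cargo-pgo

# cargo-pgo relies on llvm-profdata for merging profiles;
# the rustup llvm-tools-preview component is one way to provide it
rustup component add llvm-tools-preview
```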
All benchmarks are done on the same machine, with the same hardware/software during runs, with the same background "noise" (as much as I can guarantee, of course).
Some of the command-line invocations, for a better understanding of the process:
```bash
cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./data/tpch_sf1 -m --format parquet -o ./results/tpch_sf1_mem_release.json
cargo pgo run -- --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data/tpch_sf1 -m --format parquet -o ./results/tpch_sf1_mem_instrumented.json
cargo pgo optimize run -- --release --bin tpch -- benchmark datafusion --iterations 5 --path ./data/tpch_sf1 -m --format parquet -o ./results/tpch_sf1_mem_optimized.json
```
Exactly the same process was used for all the other benchmarks.
Results
As a result, I am posting several artifacts - the raw benchmark results and a comparison made with the compare.py script.
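For reference, a single comparison can be produced with an invocation along these lines (an assumption on my side: that compare.py takes a baseline and a comparison results file as positional arguments; the exact path and arguments may differ):

```bash
# Hypothetical invocation: compare the plain release run (baseline) with the PGO-optimized run
python3 benchmarks/compare.py ./results/tpch_sf1_mem_release.json ./results/tpch_sf1_mem_optimized.json
```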
Results (comparison Release (left) vs PGO-optimized (right)):
Raw results are attached as a `results.zip` archive (raw results for the Release, PGO-instrumented and PGO-optimized measurements): results.zip
At least in the benchmarks provided by the project, there are measurable improvements in many cases.
Maybe mentioning these results somewhere in the README (or other user-facing documentation) would be a good idea. Also, I think it would be worth performing benchmarks in more scenarios.
Please do not treat this as a bug report or anything like that - it's just a benchmark report.