Profile-Guided Optimization (PGO) benchmark results #9507
---
Thanks for doing this @zamazan4ik, cool stuff.
---
Thank you @zamazan4ik -- I agree this is really cool. I wonder what we should do with this excellent information? Some options I can think of:
In general, I don't think many organizations will set up PGO with their own workload (they often don't have an easy-to-run, representative benchmark available during builds, or perhaps don't want to slow down their build process by running benchmarks at the same time).
---
Hi!
I evaluate Profile-Guided Optimization (PGO) performance improvements for applications and libraries in different software domains - all current results can be found here. According to those tests, PGO helps improve performance in many of them. I decided to run PGO benchmarks on the `arrow-datafusion` library too, since some library users may be interested in improving its performance (e.g. @Dandandan suggested testing PGO on `arrow-datafusion` here: apache/datafusion-sqlparser-rs#1163 (comment)). I did some benchmarks and here are the results.
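For context, here is a minimal sketch of what a PGO workflow looks like at the rustc level (this is roughly what cargo-pgo automates, using rustc's standard `-Cprofile-generate`/`-Cprofile-use` flags); the profile directory is just a placeholder, and the TPCH invocation mirrors the commands shown further below:

```bash
# 1. Build an instrumented release binary that emits profile data at runtime
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release --bin tpch

# 2. Run a representative workload once to collect profiles (the training phase)
./target/release/tpch benchmark datafusion --iterations 1 --path ./data/tpch_sf1 -m --format parquet

# 3. Merge the raw .profraw files into a single profile
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data/*.profraw

# 4. Rebuild with the collected profile applied
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release --bin tpch
```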
Test environment
`arrow-datafusion`: `main` branch on commit `767760b84c71b9af8b2d8cdee9cd5c099b57b462`
Benchmark
For benchmark purposes, I use these benchmarks. Since the suite consists of many scenarios, I ran each scenario with its own dedicated training phase. This means that for the Clickbench benchmark only Clickbench is used during the PGO training phase, for TPCH only TPCH, and so on. I made this choice since I don't know how different these benchmarks are internally (OLTP vs OLAP, maybe other large differences).
The actual benchmark commands are taken from bench.sh with the exact parameters like iteration count, memory mode (for TPCH), etc. The PGO training phase is done with `--iterations 1`, since there is no need to perform multiple runs during training. All PGO-related routines are done with cargo-pgo.
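For anyone who wants to reproduce this, the basic cargo-pgo setup is just a couple of commands (a sketch based on my understanding of the cargo-pgo README; the concrete benchmark invocations I used are listed further below):

```bash
# Install the cargo-pgo helper
cargo install cargo-pgo

# cargo-pgo relies on llvm-profdata for merging profiles;
# the rustup llvm-tools-preview component is one way to provide it
rustup component add llvm-tools-preview
```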
All benchmarks are done on the same machine, with the same hardware/software during runs, with the same background "noise" (as much as I can guarantee, of course).
Some of the command-line invocations, for a better understanding of the process:
```bash
cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./data/tpch_sf1 -m --format parquet -o ./results/tpch_sf1_mem_release.json
cargo pgo run -- --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data/tpch_sf1 -m --format parquet -o ./results/tpch_sf1_mem_instrumented.json
cargo pgo optimize run -- --release --bin tpch -- benchmark datafusion --iterations 5 --path ./data/tpch_sf1 -m --format parquet -o ./results/tpch_sf1_mem_optimized.json
```
Exactly the same process was used for all the other benchmarks.
Results
As a result, I am posting several artifacts - the raw benchmark results and a comparison made with the compare.py script.
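For reference, a single comparison can be produced with an invocation along these lines (an assumption on my side: that compare.py takes a baseline and a comparison results file as positional arguments; the exact path and arguments may differ):

```bash
# Hypothetical invocation: compare the plain release run (baseline) with the PGO-optimized run
python3 benchmarks/compare.py ./results/tpch_sf1_mem_release.json ./results/tpch_sf1_mem_optimized.json
```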
Results (comparison Release (left) vs PGO-optimized (right)):
Raw results are attached as a `results.zip` archive (raw results for the Release, PGO-instrumented and PGO-optimized measurements): results.zip
At least in the benchmarks provided by the project, there are measurable improvements in many cases.
Maybe mentioning these results somewhere in the README (or other user-facing documentation) would be a good idea. Also, I think it would be worth performing benchmarks in more scenarios.
Please do not treat this as a bug report or anything like that - it's just a benchmark report.