Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #1099

zamazan4ik · 2023-10-17T06:04:46Z

Hi!

Recently I ran many Profile-Guided Optimization (PGO) benchmarks on multiple projects; the results are available here. Based on them, I think it's worth trying PGO on Czkawka. I have already run some benchmarks and want to share the results here.

Test environment

Fedora 38
Linux kernel 6.5.5
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.73
Czkawka version: the latest for now from the master branch on commit 99277b9ea50f2e08dab7343e5cf3b89afa23b769
Disabled Turbo boost

Benchmark setup

For benchmarking purposes, I use the command czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project. The benchmark directory is a full clone of the LLVM project repo. I test only the CLI version since it's easier to run over an SSH connection to the server; the results should be almost the same for the GUI version.

In this benchmark, I use 4 build configurations:

Default Release build
Release + codegen-units=1 + lto = fat build
Release + codegen-units=1 + lto = fat + PGO build
Release + codegen-units=1 + lto = fat + PGO + BOLT build

The Release build is done with cargo build --release; the PGO- and BOLT-optimized builds are done with cargo-pgo. PGO and BOLT profiles are collected from the benchmark workload itself.
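For readers who want to reproduce the four configurations, here is a minimal sketch of how they could be produced. It is not the exact script used for this post: the -p czkawka_cli package selection, the target triple, the copied file names, and the use of Cargo's CARGO_PROFILE_RELEASE_* environment overrides (instead of editing Cargo.toml) are all assumptions. Set DRY_RUN=1 to print the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: package name, paths, and target triple are assumptions.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

BENCH_DIR="${BENCH_DIR:-$HOME/open_source/llvm-project}"
# cargo-pgo output is expected under target/<triple>/release;
# the triple used here is an assumption for an x86-64 Linux host.
PGO_DIR="${PGO_DIR:-target/x86_64-unknown-linux-gnu/release}"

# The parentheses run the body in a subshell, so `set -e` and the
# exported variables do not leak into the calling shell.
build_all() (
    set -e

    # 1. Default Release build.
    run cargo build --release -p czkawka_cli
    run cp target/release/czkawka_cli release_czkawka_cli

    # 2. Release + codegen-units=1 + fat LTO, set through Cargo's
    #    environment overrides for [profile.release].
    export CARGO_PROFILE_RELEASE_LTO=fat
    export CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1
    run cargo build --release -p czkawka_cli
    run cp target/release/czkawka_cli release_lto_czkawka_cli

    # 3. PGO: build instrumented, run the workload, rebuild with the profile.
    #    The workload's exit status is ignored; only the profile matters.
    run cargo pgo build -- -p czkawka_cli
    run "$PGO_DIR/czkawka_cli" dup -d "$BENCH_DIR" || true
    run cargo pgo optimize -- -p czkawka_cli
    run cp "$PGO_DIR/czkawka_cli" optimized_lto_czkawka_cli

    # 4. BOLT on top of PGO: instrument, run the workload, optimize.
    run cargo pgo bolt build --with-pgo -- -p czkawka_cli
    run "$PGO_DIR/czkawka_cli-bolt-instrumented" dup -d "$BENCH_DIR" || true
    run cargo pgo bolt optimize --with-pgo -- -p czkawka_cli
    run cp "$PGO_DIR/czkawka_cli-bolt-optimized" bolt_optimized_czkawka_cli
)
```

Calling DRY_RUN=1 build_all lists the steps without touching the tree; calling build_all runs them and requires cargo-pgo, llvm-profdata, and llvm-bolt to be installed.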

All benchmarks are run multiple times on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course). Between runs, Czkawka's cache (/home/zamazan4ik/.cache/czkawka) is cleared.
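As a side note, the cache clearing can also be delegated to Hyperfine's --prepare option, which runs a command before each timing run without counting it in the measurement. The sketch below is an alternative, not the command used for the numbers in this post (there, rm -rf is part of the timed command); the bench helper and the default paths are assumptions. Hyperfine discards the benchmarked command's output by default, so no redirect to /dev/null is needed.

```shell
#!/bin/sh
# Sketch only: helper name and default paths are assumptions.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

BENCH_DIR="${BENCH_DIR:-$HOME/open_source/llvm-project}"
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/czkawka}"

# bench BINARY... : compare the given binaries on the same workload.
bench() {
    # Rewrite each binary in "$@" into a full benchmark command.
    for bin in "$@"; do
        set -- "$@" "$bin dup -d $BENCH_DIR"
        shift
    done
    # --prepare clears Czkawka's cache before each timing run, untimed.
    run hyperfine --warmup 20 --min-runs 200 \
        --prepare "rm -rf $CACHE_DIR" "$@"
}
```

For example, bench ./release_czkawka_cli ./optimized_lto_czkawka_cli compares two of the builds; prefix the call with DRY_RUN=1 to see the resulting hyperfine command line first.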

Results

Builds:

release_czkawka_cli - Default Release build
release_lto_czkawka_cli - Release + codegen-units=1 + lto = fat build
optimized_lto_czkawka_cli - Release + codegen-units=1 + lto = fat + PGO build
bolt_optimized_czkawka_cli - Release + codegen-units=1 + lto = fat + PGO + BOLT build

I got the following results with Hyperfine:

hyperfine --warmup 20 --min-runs 200 './release_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka' './release_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka' './optimized_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka' './bolt_optimized_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka'
Benchmark 1: ./release_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
  Time (mean ± σ):     139.4 ms ±   4.4 ms    [User: 200.2 ms, System: 702.4 ms]
  Range (min … max):   133.0 ms … 157.0 ms    200 runs

Benchmark 2: ./release_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
  Time (mean ± σ):     136.9 ms ±   4.3 ms    [User: 185.6 ms, System: 710.0 ms]
  Range (min … max):   129.7 ms … 154.7 ms    200 runs

Benchmark 3: ./optimized_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
  Time (mean ± σ):     133.9 ms ±   4.6 ms    [User: 171.7 ms, System: 693.7 ms]
  Range (min … max):   126.6 ms … 153.6 ms    200 runs

Benchmark 4: ./bolt_optimized_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
  Time (mean ± σ):     133.3 ms ±   4.2 ms    [User: 163.9 ms, System: 703.0 ms]
  Range (min … max):   126.3 ms … 147.5 ms    200 runs

Summary
  ./bolt_optimized_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka ran
    1.00 ± 0.05 times faster than ./optimized_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
    1.03 ± 0.05 times faster than ./release_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
    1.05 ± 0.05 times faster than ./release_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka

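According to these tests, PGO gives an improvement of several percent, at least in the benchmark above: the mean time drops from 139.4 ms (default Release) to 133.9 ms (LTO + PGO), about 4%. BOLT adds almost nothing on top of PGO in wall-clock time (133.3 ms), although user CPU time still drops from 171.7 ms to 163.9 ms.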
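The ratios in the Hyperfine summary can be rechecked from the mean times with a few lines of shell. This is only a convenience sketch; the speedup helper is a hypothetical name, and the inputs are the mean wall-clock times copied from the output above.

```shell
#!/bin/sh
# speedup BASE_MS NEW_MS: print BASE/NEW with three decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.3f\n", base / new }'
}

# Mean wall-clock times (ms) from the Hyperfine output in this post.
echo "LTO vs Release:        $(speedup 139.4 136.9)"
echo "LTO+PGO vs Release:    $(speedup 139.4 133.9)"
echo "PGO+BOLT vs Release:   $(speedup 139.4 133.3)"
```

The same helper works for the instrumented builds, for example speedup 362.2 171.2 for the BOLT-instrumented versus the PGO-instrumented binary.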

Also, for reference, here are the results of the same benchmark for the PGO- and BOLT-instrumented builds (so you can estimate how much slower Czkawka is in instrumentation mode):

hyperfine --warmup 10 --min-runs 50 './instrumented_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka' './bolt_instrumented_czkafka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka'
Benchmark 1: ./instrumented_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
  Time (mean ± σ):     171.2 ms ±   4.5 ms    [User: 1081.6 ms, System: 652.1 ms]
  Range (min … max):   162.3 ms … 186.6 ms    50 runs

Benchmark 2: ./bolt_instrumented_czkafka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka
  Time (mean ± σ):     362.2 ms ±   5.9 ms    [User: 1562.3 ms, System: 739.7 ms]
  Range (min … max):   349.0 ms … 379.7 ms    50 runs

Summary
  ./instrumented_lto_czkawka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka ran
    2.12 ± 0.07 times faster than ./bolt_instrumented_czkafka_cli dup -d /home/zamazan4ik/open_source/llvm-project >> /dev/null && rm -rf /home/zamazan4ik/.cache/czkawka

where:

instrumented_lto_czkawka_cli - Release + codegen-units=1 + lto = fat + PGO instrumentation build
bolt_instrumented_czkafka_cli - Release + codegen-units=1 + lto = fat + PGO optimized + BOLT instrumented build

Binary sizes for all binaries, as reported by the size command:

size release_czkawka_cli release_lto_czkawka_cli optimized_lto_czkawka_cli bolt_optimized_czkawka_cli instrumented_lto_czkawka_cli bolt_instrumented_czkafka_cli
   text	   data	    bss	    dec	    hex	filename
12885070	 480912	 594632	13960614	 d505a6	release_czkawka_cli
10045712	 367800	 594600	11008112	 a7f870	release_lto_czkawka_cli
8998580	 387600	 594600	9980780	 984b6c	optimized_lto_czkawka_cli
10406147	 387600	 594600	11388347	 adc5bb	bolt_optimized_czkawka_cli
25542866	5692984	 604080	31839930	1e5d6ba	instrumented_lto_czkawka_cli
25046509	2300604	 594600	27941713	1aa5b51	bolt_instrumented_czkafka_cli
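To compare the sizes at a glance, the text column can be expressed as a percentage of the first binary. The sketch below assumes the Berkeley-style layout shown above (header row, then text, data, bss, dec, hex, filename); size_report is a hypothetical helper name, and the rows in the demo are copied from the table in this post.

```shell
#!/bin/sh
# size_report: read `size` output on stdin and print each file's text
# size as a percentage of the first file listed.
size_report() {
    awk 'NR == 1 { next }        # skip the header row
         NR == 2 { base = $1 }   # the first binary is the baseline
         { printf "%s %.1f%%\n", $6, 100 * $1 / base }'
}

# Demo with rows copied from the table in this post.
size_report <<'EOF'
text data bss dec hex filename
12885070 480912 594632 13960614 d505a6 release_czkawka_cli
10045712 367800 594600 11008112 a7f870 release_lto_czkawka_cli
8998580 387600 594600 9980780 984b6c optimized_lto_czkawka_cli
10406147 387600 594600 11388347 adc5bb bolt_optimized_czkawka_cli
EOF
```

With the real binaries at hand, the same report comes from piping the command directly: size release_czkawka_cli optimized_lto_czkawka_cli | size_report.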

Further steps

I can suggest the following things to do:

Evaluate PGO and BOLT applicability to Czkawka in more scenarios.
If PGO helps achieve better performance, add a note about it to Czkawka's documentation (probably somewhere in the README file). That way, users and maintainers will be aware of another optimization opportunity for Czkawka.
Provide PGO integration into the build scripts. It can help users and maintainers easily apply PGO for their own workloads.
Optimize prebuilt binaries with PGO.

Here are some examples of how PGO is already integrated into other projects' build scripts:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
