Evaluate using additional optimizations like LTO and PGO #22

zamazan4ik · 2024-09-12T21:18:26Z

Hi!

As I have done many times before, I decided to test the Profile-Guided Optimization (PGO) technique to optimize the application's performance. For reference, results for other projects are available at https://github.com/zamazan4ik/awesome-pgo . Since PGO has helped a lot with many other applications, I applied it to this project to see whether a performance win can be achieved here too. Here are my benchmark results.

This information can be interesting for anyone who wants to achieve more performance with the library in their use cases.

Test environment

Fedora 40
Linux kernel 6.10.7
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.79.0
tex-fmt version: main branch on commit f2689ac7e2c713cfb6106220c09a44141770a638
Disabled Turbo boost

Benchmark

For benchmarking, I use the benchmarks built into the project. For PGO, I use the cargo-pgo tool. For all measurements I used the same command with different binaries: taskset -c 0 tex_fmt tests/source/* tests/target/*.

taskset -c 0 is used for reducing the OS scheduler's influence on the results. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).
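
The workflow described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands used for the measurements: the target triple x86_64-unknown-linux-gnu, the binary name tex-fmt, and the helper names run and pgo_workflow are assumptions made for illustration. With DRY_RUN=1 the helper only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of a cargo-pgo workflow for tex-fmt (names below are assumptions).

# Assumed target triple and binary name; adjust for your machine and checkout.
TRIPLE="x86_64-unknown-linux-gnu"
BIN="target/$TRIPLE/release/tex-fmt"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

pgo_workflow() {
    # 1. Build an instrumented binary. Requires `cargo install cargo-pgo`
    #    and `rustup component add llvm-tools-preview`.
    run cargo pgo build
    # 2. Training run: format the test files once, pinned to CPU 0 to
    #    reduce the scheduler's influence.
    run taskset -c 0 "$BIN" tests/source/* tests/target/*
    # 3. Rebuild the binary using the collected profiles.
    run cargo pgo optimize
}
```

Calling DRY_RUN=1 pgo_workflow from the project root lists the three steps without running them; calling pgo_workflow without DRY_RUN executes them. Since the formatter is run directly on the test files, it may be worth working on a copy of them, as the --prepare step of the hyperfine command in the Results section does.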

Results

I got the following results in hyperfine's format:

hyperfine --warmup 25 --min-runs 100 --prepare "cp -r ../tests/* tests" "taskset -c 0 ./tex_fmt_release tests/source/* tests/target/*" "taskset -c 0 ./tex_fmt_lto tests/source/* tests/target/*" "taskset -c 0 ./tex_fmt_optimized tests/source/* tests/target/*" "taskset -c 0 ./tex_fmt_instrumented tests/source/* tests/target/*"

Benchmark 1: taskset -c 0 ./tex_fmt_release tests/source/* tests/target/*
  Time (mean ± σ):      92.3 ms ±   1.2 ms    [User: 72.6 ms, System: 8.5 ms]
  Range (min … max):    90.6 ms …  98.6 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.

Benchmark 2: taskset -c 0 ./tex_fmt_lto tests/source/* tests/target/*
  Time (mean ± σ):      87.3 ms ±   1.0 ms    [User: 67.5 ms, System: 8.6 ms]
  Range (min … max):    85.5 ms …  91.1 ms    100 runs

Benchmark 3: taskset -c 0 ./tex_fmt_optimized tests/source/* tests/target/*
  Time (mean ± σ):      80.1 ms ±   0.6 ms    [User: 60.2 ms, System: 9.1 ms]
  Range (min … max):    78.3 ms …  81.2 ms    100 runs

Benchmark 4: taskset -c 0 ./tex_fmt_instrumented tests/source/* tests/target/*
  Time (mean ± σ):     133.0 ms ±   1.6 ms    [User: 110.6 ms, System: 9.8 ms]
  Range (min … max):   131.0 ms … 139.4 ms    100 runs

Summary
  taskset -c 0 ./tex_fmt_optimized tests/source/* tests/target/* ran
    1.09 ± 0.01 times faster than taskset -c 0 ./tex_fmt_lto tests/source/* tests/target/*
    1.15 ± 0.02 times faster than taskset -c 0 ./tex_fmt_release tests/source/* tests/target/*
    1.66 ± 0.02 times faster than taskset -c 0 ./tex_fmt_instrumented tests/source/* tests/target/*

where (binary sizes are included, since they matter in some cases too):

tex_fmt_release - default Release profile, 2.6 MiB
tex_fmt_lto - default Release profile + LTO, 2.4 MiB
tex_fmt_optimized - default Release profile + LTO + PGO optimized, 2.4 MiB
tex_fmt_instrumented - default Release profile + LTO + PGO instrumented, 4.5 MiB

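According to the results, LTO and PGO measurably improve the application's performance: LTO alone lowers the mean run time from 92.3 ms to 87.3 ms, and LTO + PGO lowers it further to 80.1 ms (1.15 times faster than the default Release build).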
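
To put the hyperfine means into percentages, the relative reduction in mean wall-clock time can be computed as in the small sketch below. The helper name pct_faster is hypothetical; the numbers are the mean times reported above.

```shell
#!/bin/sh
# Relative reduction in mean wall-clock time, using the hyperfine means above.

# pct_faster BASE NEW prints the percentage by which NEW is lower than BASE.
pct_faster() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

echo "LTO vs Release:       $(pct_faster 92.3 87.3)%"
echo "LTO + PGO vs Release: $(pct_faster 92.3 80.1)%"
echo "LTO + PGO vs LTO:     $(pct_faster 87.3 80.1)%"
```

This prints 5.4%, 13.2% and 8.2% respectively, which is the same comparison as the "times faster" ratios in the hyperfine summary, expressed as time saved.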

Further steps

As a first easy step, I suggest enabling LTO only for the Release builds, so as not to sacrifice the developer experience while working on the project, since LTO adds to the compilation time. If you think that a regular Release build should not be affected by such a change either, then I suggest adding a separate release-lto profile that adds LTO on top of the regular release optimizations. Such a change simplifies life for maintainers and anyone else interested in the project who wants to build the most performant version of the application. Using ThinLTO should also help keep the build-time cost down.
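
A separate profile of this kind could look roughly like the sketch below. The profile name release-lto comes from the suggestion above; the helper name print_release_lto_profile and the choice of "fat" LTO are assumptions, not something the project has adopted.

```shell
#!/bin/sh
# Print a Cargo profile that adds LTO on top of the regular release settings.
print_release_lto_profile() {
    cat <<'EOF'

[profile.release-lto]
inherits = "release"
lto = "fat"        # or "thin" for ThinLTO, which usually links faster
EOF
}

# Usage (not executed here; run from the project root):
#   print_release_lto_profile >> Cargo.toml
#   cargo build --profile release-lto
#   # the binary is then placed under target/release-lto/
```

Keeping LTO in its own profile leaves the default cargo build --release untouched, so day-to-day builds stay as fast as before.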

Also, Post-Link Optimization (PLO) can be tested after PGO. It can be done by applying tools like LLVM BOLT to tex-fmt.

Thank you.

P.S. It's just an idea, not an actual issue. Possibly, Ideas in GitHub's Discussions is a better place to discuss such proposals.


WGUNDERWOOD · 2024-09-12T22:23:05Z

Thanks for this, it is very interesting. I will try to replicate these results myself and will definitely consider including this in release binaries if it makes a noticeable improvement.

WGUNDERWOOD · 2024-09-13T08:32:49Z

Working on this in the lto-pgo branch, see cfdb644.

WGUNDERWOOD · 2024-09-13T16:11:21Z

I have implemented this in extra/build.sh and extra/perf.sh -- could you take a look and let me know if it aligns with what you were thinking? I do get some improvement, but not as much as you seem to attain. This may be because I'm not using taskset (it seems to slow down the benchmark), or just a hardware difference.

Benchmark 1: tex-fmt
  Time (mean ± σ):      93.6 ms ±   0.7 ms    [User: 87.1 ms, System: 6.2 ms]
  Range (min … max):    92.6 ms …  96.3 ms    50 runs

Benchmark 1: tex-fmt (no PGO)
  Time (mean ± σ):      97.9 ms ±   1.6 ms    [User: 92.0 ms, System: 5.4 ms]
  Range (min … max):    96.3 ms … 105.3 ms    50 runs

zamazan4ik · 2024-09-18T22:42:03Z

Sorry for the late response.

Thank you a lot for implementing it in the scripts! Yep, that's exactly what I was thinking about. I have a few small notes:

https://github.com/WGUNDERWOOD/tex-fmt/blob/lto-pgo/extra/build.sh#L9 - I think it should be enough to run once during the training phase instead of 210 times - the optimization result should be the same. I see that in 9942c41 you increased the run count - what was the reason for that? Were there no performance wins from PGO with fewer training runs?
https://github.com/WGUNDERWOOD/tex-fmt/blob/lto-pgo/extra/perf.sh#L33 - maybe I missed something, but here we seem to have the same binary as the usual Release build without PGO. Am I overlooking something, or do we need to do something more here?
Just to double-check: is PGO applied to the binaries that are published on GitHub?

I do get some improvement, but not as much as you seem to attain. This may be because I'm not using taskset (it seems to slow down the benchmark), or just a hardware difference.

I guess it's just a hardware difference. Anyway, we still have a nice improvement in user time (system time cannot be improved with PGO).

WGUNDERWOOD · 2024-09-19T08:04:55Z

Thanks for the response!

I am running the script several times before optimization as I think this is what's recommended in the cargo-pgo README -- it says to run the binary for "at least a minute".
perf.sh#L33 is a comparison test without PGO -- the test with PGO begins at L19.
PGO is not currently applied for GitHub binaries at the moment, though I will include it if consistent performance gains are apparent.

zamazan4ik · 2024-09-20T01:32:12Z

I am running the script several times before optimization as I think this is what's recommended in the cargo-pgo README -- it says to run the binary for "at least a minute".

Oh, I see. No worries - in your case you should be able to ignore it. This recommendation applies to larger applications. For example, suppose we want to optimize a large application, like a database that internally has a lot of different subsystems. Running the workload (such as database benchmarks) for a minute increases the chances that all (or almost all) subsystems are executed at least once. tex-fmt is a bit different :)

PGO is not currently applied for GitHub binaries at the moment, though I will include it if consistent performance gains are apparent.

Yep, sounds good

WGUNDERWOOD · 2024-10-09T09:34:37Z

There have been some substantial improvements to the performance of tex-fmt over the last few releases, and I'm no longer seeing any advantages when using PGO. As such, I'm going to close this issue for now. Thank you very much for your help and for discussing this; I am more than happy to reopen the issue in the future if necessary.

zamazan4ik mentioned this issue Sep 22, 2024 in Kobzol/cargo-pgo#60: A bit misleading "run your binary for at least a minute or more" wording (Closed)

WGUNDERWOOD closed this as completed Oct 9, 2024
