Evaluate enabling additional optimization options like LTO, PGO and Post-Link Optimization (PLO) #67

zamazan4ik · 2023-11-25T16:49:58Z

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are available here. According to the tests, PGO helps achieve better performance in many cases for many applications. Given this, I think optimizing Vtracer with PGO is worth trying. I also found that Vtracer does not use LTO for some reason; enabling it would be a good idea too.

I already did some benchmarks and want to share my results here.

Test environment

Fedora 39
Linux kernel 6.5.12
AMD Ryzen 9 5900x
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler: Rustc 1.74
Vtracer version: the latest master branch at the time of writing, commit 74f2a04a17d8c246d80c439fb162780160a7c3e9
Turbo Boost disabled

Benchmark

For benchmarking, I use the vtracer --input input.jpg --output output.svg command from the README file. For PGO, I use the cargo-pgo tool. The same command was used for the PGO training phase. The PGO-instrumented Vtracer is built with cargo pgo build; the PGO-optimized version is built with cargo pgo optimize build.
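The three PGO phases described above (instrumented build, training run, optimized rebuild) can be sketched as a small driver. This is a minimal illustration, not the exact procedure used for the measurements: the binary path assumes that cargo-pgo places artifacts under target/<host-triple>/release/ and that the binary is named vtracer, and pgo_steps and run are hypothetical helper names. By default it only prints the commands.

```python
import shlex
import subprocess

# Assumption: cargo-pgo builds for an explicit target triple, so the binary
# ends up under target/<host-triple>/release/. Adjust the triple and the
# binary name to match your machine and workspace layout.
BINARY = "target/x86_64-unknown-linux-gnu/release/vtracer"
TRAINING_ARGS = ["--input", "input.jpg", "--output", "output.svg"]


def pgo_steps(binary=BINARY, training_args=TRAINING_ARGS):
    """Return the three PGO phases as argv lists, in execution order."""
    return [
        ["cargo", "pgo", "build"],              # 1. build the instrumented binary
        [binary, *training_args],               # 2. training run that collects profiles
        ["cargo", "pgo", "optimize", "build"],  # 3. rebuild using the collected profiles
    ]


def run(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for argv in steps:
        print("+", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run(pgo_steps())  # dry run: prints the commands without executing them
```

Calling run(pgo_steps(), dry_run=False) from the repository root would execute the steps for real, provided cargo-pgo is installed and the training input exists.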

Unfortunately, due to a bug in the Rustc compiler, PGO currently cannot be enabled together with LTO for Vtracer. So I compare three Vtracer versions here: "Vtracer Release", "Vtracer Release with LTO" and "Vtracer Release with PGO". Once the bug is fixed, we can apply LTO and PGO to Vtracer at the same time, and it should work well. LTO for Vtracer is enabled with the following addition to the root Cargo.toml:

[profile.release]
codegen-units = 1
debug         = false
lto           = true
opt-level     = 3
strip         = true

All tests are run on the same machine, multiple times (with hyperfine), and with the same background "noise" (as far as I can guarantee, of course).

As a test input, I use the 5 MiB sample JPG from https://sample-videos.com/download-sample-jpg-image.php .

Results

I got the following results (in hyperfine format):

hyperfine --warmup 2 --min-runs 10 './vtracer_release_no_lto --input input_5mib.jpg --output output.svg' './vtracer_release_with_lto --input input_5mib.jpg --output output.svg' './vtracer_optimized_no_lto --input input_5mib.jpg --output output.svg'
Benchmark 1: ./vtracer_release_no_lto --input input_5mib.jpg --output output.svg
  Time (mean ± σ):      7.779 s ±  0.032 s    [User: 7.160 s, System: 1.199 s]
  Range (min … max):    7.743 s …  7.821 s    10 runs

Benchmark 2: ./vtracer_release_with_lto --input input_5mib.jpg --output output.svg
  Time (mean ± σ):      7.427 s ±  0.020 s    [User: 6.833 s, System: 1.217 s]
  Range (min … max):    7.406 s …  7.463 s    10 runs

Benchmark 3: ./vtracer_optimized_no_lto --input input_5mib.jpg --output output.svg
  Time (mean ± σ):      7.289 s ±  0.040 s    [User: 6.683 s, System: 1.212 s]
  Range (min … max):    7.242 s …  7.344 s    10 runs

Summary
  ./vtracer_optimized_no_lto --input input_5mib.jpg --output output.svg ran
    1.02 ± 0.01 times faster than ./vtracer_release_with_lto --input input_5mib.jpg --output output.svg
    1.07 ± 0.01 times faster than ./vtracer_release_no_lto --input input_5mib.jpg --output output.svg

where:

vtracer_release_no_lto - usual Release
vtracer_release_with_lto - Release with LTO
vtracer_optimized_no_lto - Release with PGO

According to the tests above, LTO and PGO improve Vtracer performance.
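The relative speeds in the hyperfine summary can be re-derived from the means and standard deviations reported above. The sketch below assumes hyperfine applies standard first-order error propagation for the ratio of two independent measurements; relative_speed is a hypothetical helper name, not part of hyperfine.

```python
import math


def relative_speed(mean, sd, ref_mean, ref_sd):
    """Ratio mean / ref_mean and its propagated standard deviation."""
    ratio = mean / ref_mean
    # First-order propagation: relative errors add in quadrature.
    sigma = ratio * math.sqrt((sd / mean) ** 2 + (ref_sd / ref_mean) ** 2)
    return ratio, sigma


# Mean and sigma in seconds, copied from the 5 MiB results above.
pgo = (7.289, 0.040)
others = {
    "Release with LTO": (7.427, 0.020),
    "usual Release": (7.779, 0.032),
}

for name, (mean, sd) in others.items():
    ratio, sigma = relative_speed(mean, sd, *pgo)
    print(f"PGO build is {ratio:.2f} ± {sigma:.2f} times faster than {name}")
```

With the numbers above this gives 1.02 ± 0.01 and 1.07 ± 0.01, matching the summary. The differences between builds (0.14 s and 0.49 s) are several times larger than the run-to-run deviations (0.02 s to 0.04 s).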

Some may wonder how LTO and PGO perform when the training workload differs from the evaluation workload (a common question in the ML world). So I did a simple single-run measurement with another file from https://sample-videos.com/download-sample-jpg-image.php (the 30 MiB sample). The results are as follows:

hyperfine --min-runs 1 './vtracer_release_no_lto --input input_30mib.jpg --output output.svg' './vtracer_release_with_lto --input input_30mib.jpg --output output.svg' './vtracer_optimized_no_lto --input input_30mib.jpg --output output.svg'
Benchmark 1: ./vtracer_release_no_lto --input input_30mib.jpg --output output.svg
  Time (abs ≡):        2341.102 s               [User: 2291.374 s, System: 42.931 s]

Benchmark 2: ./vtracer_release_with_lto --input input_30mib.jpg --output output.svg
  Time (abs ≡):        2089.145 s               [User: 2040.707 s, System: 42.652 s]

Benchmark 3: ./vtracer_optimized_no_lto --input input_30mib.jpg --output output.svg
  Time (abs ≡):        2193.480 s               [User: 2143.885 s, System: 43.255 s]

Summary
  ./vtracer_release_with_lto --input input_30mib.jpg --output output.svg ran
    1.05 times faster than ./vtracer_optimized_no_lto --input input_30mib.jpg --output output.svg
    1.12 times faster than ./vtracer_release_no_lto --input input_30mib.jpg --output output.svg
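To compare the two workloads on equal footing, the wall-clock times can be converted into time saved relative to the usual Release build. This is a sketch only: time_saved_pct is a hypothetical helper, and the 30 MiB figures come from a single run each, so they carry no error estimate.

```python
def time_saved_pct(baseline, optimized):
    """Percentage of wall-clock time saved relative to the baseline."""
    return 100.0 * (1.0 - optimized / baseline)


# Wall-clock times in seconds, copied from the hyperfine outputs above.
runs = {
    "5 MiB (PGO training input)": {"release": 7.779, "lto": 7.427, "pgo": 7.289},
    "30 MiB (unseen input)": {"release": 2341.102, "lto": 2089.145, "pgo": 2193.480},
}

for label, t in runs.items():
    lto = time_saved_pct(t["release"], t["lto"])
    pgo = time_saved_pct(t["release"], t["pgo"])
    print(f"{label}: LTO saves {lto:.1f}%, PGO saves {pgo:.1f}%")
```

By this arithmetic PGO saves about 6.3% on both inputs, so the profile collected on the 5 MiB image appears to carry over to the larger one, while the LTO saving grows from about 4.5% to about 10.8%.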

Just for reference, here is the Vtracer slowdown in PGO instrumentation mode (during the PGO training phase):

hyperfine --warmup 5 --min-runs 10 './vtracer_instrumented_no_lto --input input_5mib.jpg --output output.svg'
Benchmark 1: ./vtracer_instrumented_no_lto --input input_5mib.jpg --output output.svg
  Time (mean ± σ):     15.312 s ±  0.051 s    [User: 65.179 s, System: 1.317 s]
  Range (min … max):   15.234 s … 15.413 s    10 runs

Further steps

I can suggest the following action points:

Perform more PGO benchmarks on Vtracer. If they show improvements, add a note to the documentation about the possible performance gains from PGO.
Provide an easier way to build Vtracer with PGO (e.g. a build option or build scripts). This would help end users and maintainers optimize Vtracer for their own workloads.
Optimize the pre-built Vtracer binaries.

Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with the usual LTO and PGO.

Here are some examples of how PGO is integrated in other projects:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag

tyt2y3 · 2023-11-25T17:49:04Z

Maintainer

Hi there, thank you for the interest. There are a few (relatively) low-hanging fruits at the system level and the algorithm level for vtracer.
For example, partial parallelization of the clustering algorithm and full parallelization of the curve fitting stage.

2 replies

zamazan4ik Nov 25, 2023
Author

I agree that picking the low-hanging fruit is a good thing to do. But such fixes do not replace LTO and PGO; they complement them. Algorithmic optimizations are always worth doing, but even algorithms are optimized by the compiler during compilation. LTO and PGO just give the compiler more information to optimize your awesome algorithm further. They help with lower-level optimizations such as better inlining, better hot/cold code splitting, and better I-cache utilization. All of these can of course be done manually, but why would we do that if the compiler can do it? It would be much better to spend human time on algorithms and leave as much of the "dirty" work as possible to the compiler.

Enabling LTO is definitely simple: it is just an additional switch in Cargo.toml. PGO can be more difficult to enable (the opening post lists examples of how it is integrated in other projects, which can help estimate the effort).

tyt2y3 Nov 25, 2023
Maintainer

If you can curate a benchmark with a few cargo tricks, I would be happy to merge it!
