
Evaluate using additional optimizations like LTO and PGO #93

Open
zamazan4ik opened this issue Sep 14, 2024 · 1 comment

@zamazan4ik

Hi!

I have several ideas for improving Dune's performance - perhaps some of them will be interesting enough for you to try ;)

Firstly, I noticed that Link-Time Optimization (LTO) is not enabled in the project's Cargo.toml. I suggest enabling it, since it will reduce the binary size (always a good thing to have) and will likely improve the application's performance a bit (not critical in this case, but still worthwhile).

I suggest enabling LTO only for Release builds, so as not to hurt the developer experience while working on the project, since LTO adds extra compilation time. If you think even the regular Release build should not be affected by such a change, then I suggest adding a separate dist or release-lto profile where LTO is enabled on top of the regular release optimizations. Such a change makes life easier for maintainers and for anyone else who wants to build the most performant version of the application. Using ThinLTO should also help. If we enable it at the Cargo profile level, users who install the application with cargo install will get the LTO-optimized version "automatically".
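As a minimal sketch, this could look like the following in Cargo.toml (the `release-lto` profile name and the `codegen-units` tweak are just suggestions, not something the project already has):

```toml
# Enable LTO for release builds; dev builds stay fast.
[profile.release]
lto = true           # "fat" LTO; use lto = "thin" for ThinLTO (faster builds,
                     # usually most of the benefit)
codegen-units = 1    # optional: better optimization at the cost of build time

# Alternative: keep the regular release profile untouched and add a
# dedicated profile for maximally optimized distribution builds.
[profile.release-lto]
inherits = "release"
lto = true
```

With the second variant, the optimized binary is built via `cargo build --profile release-lto`, while plain `cargo build --release` keeps its current compile times.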

As a next step, I kindly suggest evaluating Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO). These techniques optimize an application based on runtime statistics. I have collected as much information about PGO and PLO as I can (including many benchmarks) here: https://github.com/zamazan4ik/awesome-pgo . However, I don't know how critical performance questions are for Dune (I did at least find one related commit in the history).
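For PGO on a Cargo project, one convenient route is the third-party cargo-pgo helper; a rough sketch of the workflow is below. The target triple in the binary path and the workload command are assumptions - replace them with whatever actually exercises Dune realistically:

```shell
# One-time setup: the helper plus the LLVM profiling tools.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build an instrumented binary that writes profile data when it runs.
cargo pgo build

# 2. Run a representative workload to collect profiles
#    (hypothetical example; use real Dune sessions/scripts here).
./target/x86_64-unknown-linux-gnu/release/dune some_representative_script

# 3. Rebuild the binary using the collected profiles.
cargo pgo optimize
```

The same effect can be achieved manually with `RUSTFLAGS="-Cprofile-generate=..."` and `-Cprofile-use=...`, but the helper handles the profile merging step for you.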

Thank you.

P.S. This is more an improvement idea than a bug report. I created an issue only because Discussions are disabled for the repo for now.

@zamazan4ik
Author

I just ran quick tests on the latest commit. My test environment is simple: Fedora 40 on the latest kernel, rustc 1.81, with background noise reduced as much as I can.

Here are my results:

Release:

```
tokenize prelude        time:   [687.96 us 689.87 us 692.12 us]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

parse prelude           time:   [2.2194 ms 2.2262 ms 2.2363 ms]
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
```

PGO optimized compared to Release:

```
Benchmarking tokenize prelude
Benchmarking tokenize prelude: Warming up for 3.0000 s
Benchmarking tokenize prelude: Collecting 100 samples in estimated 5.0731 s (45k iterations)
Benchmarking tokenize prelude: Analyzing
tokenize prelude        time:   [111.67 us 111.70 us 111.73 us]
                        change: [-83.783% -83.759% -83.739%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) high mild
  8 (8.00%) high severe

Benchmarking parse prelude
Benchmarking parse prelude: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
Benchmarking parse prelude: Collecting 100 samples in estimated 7.2064 s (5050 iterations)
Benchmarking parse prelude: Analyzing
parse prelude           time:   [1.4257 ms 1.4272 ms 1.4297 ms]
                        change: [-36.160% -35.847% -35.621%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
```

(just for reference) PGO instrumented compared to Release:

```
Benchmarking tokenize prelude
Benchmarking tokenize prelude: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.9s, enable flat sampling, or reduce sample count to 50.
Benchmarking tokenize prelude: Collecting 100 samples in estimated 6.9070 s (5050 iterations)
Benchmarking tokenize prelude: Analyzing
tokenize prelude        time:   [1.3578 ms 1.3585 ms 1.3597 ms]
                        change: [+97.257% +97.650% +98.143%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

Benchmarking parse prelude
Benchmarking parse prelude: Warming up for 3.0000 s
Benchmarking parse prelude: Collecting 100 samples in estimated 5.1235 s (1300 iterations)
Benchmarking parse prelude: Analyzing
parse prelude           time:   [3.9306 ms 3.9417 ms 3.9563 ms]
                        change: [+76.088% +77.056% +77.926%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
```

Even though the benchmark suite is quite small, at least in these scenarios I see measurable performance improvements from PGO. (The instrumented numbers are expected to regress - that build exists only to collect profiles, not to be fast.)
