Experiment with LLVM BOLT binary optimizer #224
A: CPython BOLT experiment (No PGO + No LTO)

Instruction Order

Benchmark

Benchmark hidden because not significant (4): float, go, pickle_list, xml_etree_parse

Binary Compression

Heatmap

ICache Miss

```
$ perf stat -e instructions,L1-icache-misses -- python -m pyperformance run
```
B: CPython BOLT experiment (PGO vs PGO + BOLT)

Environment

Binary Size

Benchmark

Benchmark hidden because not significant (4): pathlib, python_startup, python_startup_no_site, richards

ICache Miss

Heatmap
Experiment | instructions | L1-icache-misses | ratio (misses / instructions)
---|---|---|---
PGO + LTO | 6,685,195,147,835 | 49,473,656,139 | 0.7%
PGO + LTO + BOLT | 7,070,332,908,269 | 32,415,421,302 | 0.4%
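The ratio column is just L1-icache-misses divided by instructions. Recomputing it from the raw counters (shown here to two decimals, where the table above rounds to one):

```sh
# ratio = L1-icache-misses / instructions
awk 'BEGIN { printf "%.2f%%\n", 100 * 49473656139 / 6685195147835 }'   # 0.74%  (PGO + LTO)
awk 'BEGIN { printf "%.2f%%\n", 100 * 32415421302 / 7070332908269 }'   # 0.46%  (PGO + LTO + BOLT)
```

So the BOLT build executes about 6% more instructions overall, but misses the i-cache roughly 35% less often.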
Benchmark
Benchmark | pgo_lto | pgo_lto_bolt
---|---|---
2to3 | 300 ms | 302 ms: 1.01x slower |
chaos | 74.3 ms | 82.9 ms: 1.12x slower |
deltablue | 4.22 ms | 5.02 ms: 1.19x slower |
fannkuch | 396 ms | 455 ms: 1.15x slower |
float | 88.4 ms | 98.0 ms: 1.11x slower |
go | 151 ms | 177 ms: 1.17x slower |
hexiom | 6.71 ms | 7.86 ms: 1.17x slower |
json_dumps | 12.9 ms | 13.3 ms: 1.03x slower |
json_loads | 27.3 us | 30.6 us: 1.12x slower |
logging_format | 6.47 us | 6.98 us: 1.08x slower |
logging_silent | 113 ns | 136 ns: 1.21x slower |
logging_simple | 5.92 us | 6.30 us: 1.06x slower |
meteor_contest | 110 ms | 116 ms: 1.05x slower |
nbody | 108 ms | 109 ms: 1.01x slower |
nqueens | 86.7 ms | 99.0 ms: 1.14x slower |
pathlib | 21.4 ms | 22.4 ms: 1.05x slower |
pickle | 11.0 us | 11.7 us: 1.06x slower |
pickle_dict | 30.9 us | 32.5 us: 1.05x slower |
pickle_pure_python | 337 us | 385 us: 1.14x slower |
pidigits | 208 ms | 231 ms: 1.11x slower |
pyflate | 455 ms | 529 ms: 1.16x slower |
python_startup | 10.1 ms | 10.5 ms: 1.04x slower |
python_startup_no_site | 7.36 ms | 7.60 ms: 1.03x slower |
raytrace | 326 ms | 371 ms: 1.14x slower |
regex_compile | 137 ms | 156 ms: 1.13x slower |
regex_dna | 221 ms | 203 ms: 1.09x faster |
regex_effbot | 3.17 ms | 3.29 ms: 1.04x slower |
regex_v8 | 25.6 ms | 26.0 ms: 1.01x slower |
richards | 52.8 ms | 62.1 ms: 1.18x slower |
scimark_fft | 324 ms | 373 ms: 1.15x slower |
scimark_lu | 117 ms | 133 ms: 1.14x slower |
scimark_monte_carlo | 73.4 ms | 76.9 ms: 1.05x slower |
scimark_sor | 127 ms | 140 ms: 1.10x slower |
scimark_sparse_mat_mult | 4.54 ms | 5.17 ms: 1.14x slower |
spectral_norm | 107 ms | 118 ms: 1.10x slower |
sqlite_synth | 3.61 us | 3.70 us: 1.02x slower |
unpack_sequence | 44.4 ns | 47.6 ns: 1.07x slower |
unpickle | 14.7 us | 17.4 us: 1.18x slower |
unpickle_list | 5.04 us | 4.96 us: 1.01x faster |
unpickle_pure_python | 256 us | 301 us: 1.18x slower |
xml_etree_parse | 156 ms | 169 ms: 1.09x slower |
xml_etree_iterparse | 106 ms | 112 ms: 1.06x slower |
xml_etree_generate | 86.9 ms | 95.6 ms: 1.10x slower |
xml_etree_process | 60.9 ms | 67.2 ms: 1.10x slower |
Geometric mean | (ref) | 1.09x slower |
Benchmark hidden because not significant (2): pickle_list, telco
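A side note on reading the summary row: pyperf reports the geometric mean of the per-benchmark ratios (pgo_lto_bolt time / pgo_lto time). A minimal sketch of that calculation, using just the first five ratios from the table above for illustration:

```sh
# Geometric mean = exp(mean(log(ratio_i))); the full table works out to ~1.09x.
awk 'BEGIN {
  n = split("1.01 1.12 1.19 1.15 1.11", r, " ")
  for (i = 1; i <= n; i++) s += log(r[i])
  printf "%.2fx slower\n", exp(s / n)    # prints 1.11x slower for these five rows
}'
```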
So which do you recommend?
I am investigating why PGO + LTO + BOLT became slower than PGO + BOLT, looking into the following things.
So, if we can get great results by tuning those things, and if I keep leaving experiment notes on this issue, do you think it would be too noisy?
I love reading your notes, that's what I do.
After investigating BOLT itself and Pyston's use case through several experiments, I decided to get a profile from pyperformance, for the following reasons.
I am going to compare the benchmark results by applying pyperformance profile data. But due to a vendoring issue, I may need to provide a tool for the BOLT optimizer under the Tools/ directory.
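For anyone who wants to reproduce this, here is a minimal sketch of BOLT's instrumentation-based workflow with pyperformance as the training workload. The file paths are illustrative, and the set of optimization flags varies between llvm-bolt versions, so treat this as an outline rather than the exact recipe used here:

```sh
# 1. Instrument the binary; running the instrumented copy emits a profile.
llvm-bolt python -instrument -o python.instrumented \
    -instrumentation-file=/tmp/python.fdata

# 2. Run the training workload (profiles from multiple processes can be
#    combined afterwards with merge-fdata).
./python.instrumented -m pyperformance run

# 3. Rewrite the original binary using the collected profile.
llvm-bolt python -o python.bolt -data=/tmp/python.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort -dyno-stats
```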
C: CPython BOLT experiment (PGO + LTO vs PGO + LTO + BOLT)

Environment

Benchmark
Benchmark hidden because not significant (10): deltablue, fannkuch, go, json_dumps, nqueens, python_startup, python_startup_no_site, scimark_lu, unpickle_list, xml_etree_iterparse
Because I passed the wrong optimization option (--with-optimizations) :(
D: CPython BOLT experiment (PGO + LTO + BOLT + profiling with -m test vs PGO + LTO + BOLT + profiling with pyperformance)

Environment

Benchmark
Benchmark hidden because not significant (30): deltablue, fannkuch, float, logging_silent, meteor_contest, nbody, pickle, pickle_dict, pickle_pure_python, pidigits, pyflate, raytrace, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlite_synth, telco, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process
I gathered all the benchmark data for BOLT. I am not sure it is worth providing a BOLT optimization pass for a 1% performance gain.
Thanks for running these extensive tests. It does look like it's not worth making the build process even more complex. I wonder if a more fruitful approach would be to come up with a better set of training code for PGO? (That belongs in a different issue. :-)
I think so too :) Let's close the issue; if someone wants to reopen it, I will welcome that too.
* Add support for the BOLT post-link binary optimizer

  Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt) provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks. It is gated behind an `--enable-bolt` configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt). Compared to [a previous attempt](faster-cpython/ideas#224), this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE flags which enable much better optimizations from bolt. The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior which seems to be a bit more variable between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance I got a slightly lower speedup (1%/3%). The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache. This change uses the existing pgo profiling task (`python -m test --pgo`), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing pgo task.

* Simplify the build flags
* Add a NEWS entry
* Update Makefile.pre.in (Co-authored-by: Dong-hee Na <[email protected]>)
* Update configure.ac (Co-authored-by: Dong-hee Na <[email protected]>)
* Add myself to ACKS
* Add docs
* Other review comments
* Fix tab/space issue
* Make it more clear that --enable-bolt is experimental
* Add link to bolt's github page

Co-authored-by: Dong-hee Na <[email protected]>
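With this merged, building a BOLT-optimized CPython looks roughly like the following sketch. It assumes a Linux x86_64 machine with llvm-bolt and merge-fdata (for example from LLVM 14.0.6) on PATH, since configure checks for both:

```sh
# PGO + LTO as before, plus the new experimental BOLT post-link step.
./configure --enable-optimizations --with-lto --enable-bolt
# The build instruments the linked python, runs the training task
# (python -m test --pgo), and then rewrites the binary with llvm-bolt.
make -j"$(nproc)"
```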
Related discussion: #184
bpo: https://bugs.python.org/issue46378
Since these experiments are time-consuming work, I am going to leave a record of each experiment here.
If the experiments work out, my final target is to provide a BOLT optimization option for the CPython project.
cc @gvanrossum @vstinner