Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monolithic lto link time very slow when building with several codegen units #48025

Closed
matthiaskrgr opened this issue Feb 6, 2018 · 5 comments · Fixed by #48163
Closed

monolithic lto link time very slow when building with several codegen units #48025

matthiaskrgr opened this issue Feb 6, 2018 · 5 comments · Fixed by #48163
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times.

Comments

@matthiaskrgr
Copy link
Member

matthiaskrgr commented Feb 6, 2018

I noticed my project https://github.com/matthiaskrgr/cargo-cache took noticeably longer to build with "lto=true" at some point.
I benchmarked link time (touch src/main.rs; cargo build --release) which appear to blow up as soon as we have several codegen units.

link times:

release stable: 1.23 beta: 1.24 nightly: 1.25
profile.release, lto=true, codegen-units=1 45s 65s 48s
profile.release, lto=true 46s 278s 244s
profile.release (default) 6.3s 6.39 6.43s

So it seem codegen-units=1 makes for bad parallelism while building but speeds up lto because less translation units need to be merged.

Maybe it is possible to dynamically use as little codegen units as possible (but still enough to utilize all cores) when building? //cc #46910

@retep998
Copy link
Member

retep998 commented Feb 6, 2018

Or maybe just don't use multiple codegen units by default when monolithic LTO is enabled.

@retep998 retep998 added the I-compiletime Issue: Problems and improvements with respect to compile times. label Feb 6, 2018
@matthiaskrgr
Copy link
Member Author

I think the optimal "balance" for linktime/buildtime with monolithic lto would be to

  1. compile all crates with codgen-units=1 as long as we can compile several crates in parallel.
  2. if we have to compile a single crate at a time, use ${thread_number} codegen units.

However this will probably introduce indeterministic builds depending on the number of threads a host machine has. So a single core machine would optimize better than a 8 core machine when building the same crate with monolithic lto. :/

@TimNN TimNN added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Feb 6, 2018
@matthiaskrgr
Copy link
Member Author

I made some benchmarks (cpu: i5-2540M dual core), rustc 1.25.0-nightly (b8398d9 2018-02-11)

[profile.release]
lto = true
opt-level = 3
codegen-units = x
codegen-units total build time
1 7:29
2 7:56
4 7:45
8 8:28
16 10:00

@alexcrichton
Copy link
Member

I believe this is fixed in #48163

@alexcrichton
Copy link
Member

er, didn't mean to close yet

alexcrichton added a commit to alexcrichton/rust that referenced this issue Feb 12, 2018
This commit updates our Fat LTO logic to tweak our custom wrapper around LLVM's
"link modules" functionality. Previously whenever the
`LLVMRustLinkInExternalBitcode` function was called it would call LLVM's
`Linker::linkModules` wrapper. Internally this would crate an instance of a
`Linker` which internally creates an instance of an `IRMover`. Unfortunately for
us the creation of `IRMover` is somewhat O(n) with the input module. This means
that every time we linked a module it was O(n) with respect to the entire module
we had built up!

Now the modules we build up during LTO are quite large, so this quickly started
creating an O(n^2) problem for us! Discovered in rust-lang#48025 it turns out this has
always been a problem and we just haven't noticed it. It became particularly
worse recently though due to most libraries having 16x more object files than
they previously did (1 -> 16).

This commit fixes this performance issue by preserving the `Linker` instance
across all links into the main LLVM module. This means we only create one
`IRMover` and allows LTO to progress much speedier.

From the `cargo-cache` project in rust-lang#48025 a **full build** locally when from
5m15s to 2m24s. Looking at the timing logs each object file was linked in in
single-digit millisecond rather than hundreds, clearly being a nice improvement!

Closes rust-lang#48025
kennytm added a commit to kennytm/rust that referenced this issue Feb 13, 2018
…kruppe

rustc: Persist LLVM's `Linker` in Fat LTO

This commit updates our Fat LTO logic to tweak our custom wrapper around LLVM's
"link modules" functionality. Previously whenever the
`LLVMRustLinkInExternalBitcode` function was called it would call LLVM's
`Linker::linkModules` wrapper. Internally this would crate an instance of a
`Linker` which internally creates an instance of an `IRMover`. Unfortunately for
us the creation of `IRMover` is somewhat O(n) with the input module. This means
that every time we linked a module it was O(n) with respect to the entire module
we had built up!

Now the modules we build up during LTO are quite large, so this quickly started
creating an O(n^2) problem for us! Discovered in rust-lang#48025 it turns out this has
always been a problem and we just haven't noticed it. It became particularly
worse recently though due to most libraries having 16x more object files than
they previously did (1 -> 16).

This commit fixes this performance issue by preserving the `Linker` instance
across all links into the main LLVM module. This means we only create one
`IRMover` and allows LTO to progress much speedier.

From the `cargo-cache` project in rust-lang#48025 a **full build** locally went from
5m15s to 2m24s. Looking at the timing logs each object file was linked in in
single-digit millisecond rather than hundreds, clearly being a nice improvement!

Closes rust-lang#48025
kennytm added a commit to kennytm/rust that referenced this issue Feb 14, 2018
…kruppe

rustc: Persist LLVM's `Linker` in Fat LTO

This commit updates our Fat LTO logic to tweak our custom wrapper around LLVM's
"link modules" functionality. Previously whenever the
`LLVMRustLinkInExternalBitcode` function was called it would call LLVM's
`Linker::linkModules` wrapper. Internally this would crate an instance of a
`Linker` which internally creates an instance of an `IRMover`. Unfortunately for
us the creation of `IRMover` is somewhat O(n) with the input module. This means
that every time we linked a module it was O(n) with respect to the entire module
we had built up!

Now the modules we build up during LTO are quite large, so this quickly started
creating an O(n^2) problem for us! Discovered in rust-lang#48025 it turns out this has
always been a problem and we just haven't noticed it. It became particularly
worse recently though due to most libraries having 16x more object files than
they previously did (1 -> 16).

This commit fixes this performance issue by preserving the `Linker` instance
across all links into the main LLVM module. This means we only create one
`IRMover` and allows LTO to progress much speedier.

From the `cargo-cache` project in rust-lang#48025 a **full build** locally went from
5m15s to 2m24s. Looking at the timing logs each object file was linked in in
single-digit millisecond rather than hundreds, clearly being a nice improvement!

Closes rust-lang#48025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants