Enable -Zshare-generics for inline(never) functions #123244

Open
wants to merge 4 commits into
base: master
from

Conversation

Mark-Simulacrum
Member

@Mark-Simulacrum Mark-Simulacrum commented Mar 30, 2024

Where possible, this avoids inlining cross-crate generic items that are
already marked inline(never), since that marking implies the author does
not intend for the function to be inlined by callers. As such, having a
local copy may make it easier for LLVM to optimize but mostly just adds to
binary bloat and codegen time. In practice our benchmarks indicate this is
indeed a win for larger compilations, where the extra cost of dynamically
linking to these symbols is small compared to the advantage of having
fewer copies that need optimizing in each binary.

It might also make sense to expand this with other heuristics (e.g.,
#[cold]) in the future, but this seems like a good starting point.

FWIW, I expect that cleaning up where we make the decision about
what should/shouldn't be shared is also a good idea. Way too
much code needed to be tweaked to check this. But I'm hoping
to leave that for a follow-up PR rather than blocking this on it.
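
As a rough illustration of the kind of code the description is about, here is a minimal sketch of a generic hot path that delegates to a generic `#[inline(never)]` cold path. The module name `upstream` and the functions `push_checked` and `report_full` are hypothetical and are not taken from the PR; the module only stands in for a separate library crate so the sketch stays self-contained.

```rust
// Stand-in for an upstream library crate. In a real build this would be a
// separate crate; a module is used here only to keep the sketch runnable.
mod upstream {
    /// Hot path: small, generic, and fine to inline into callers.
    #[inline]
    pub fn push_checked<T>(buf: &mut Vec<T>, limit: usize, value: T) -> Result<(), String> {
        if buf.len() < limit {
            buf.push(value);
            Ok(())
        } else {
            Err(report_full::<T>(buf.len(), limit))
        }
    }

    /// Cold path: generic, but the author explicitly asks that it not be
    /// inlined. This is the kind of item the PR description talks about.
    #[inline(never)]
    pub fn report_full<T>(len: usize, limit: usize) -> String {
        let ty = std::any::type_name::<T>();
        format!("buffer of {ty} is full: len {len}, limit {limit}")
    }
}

fn main() {
    let mut v: Vec<u32> = Vec::new();
    assert!(upstream::push_checked(&mut v, 1, 7).is_ok());
    // The second push exceeds the limit and goes through the cold path.
    let err = upstream::push_checked(&mut v, 1, 8).unwrap_err();
    println!("{err}");
}
```

If `upstream` were a real crate, every downstream crate calling `push_checked::<u32>` would be a candidate for getting its own copy of `report_full::<u32>`; the change described above is about reusing an existing upstream instantiation instead, when one is available.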

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 30, 2024
@Mark-Simulacrum
Member Author

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 30, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 30, 2024
…enerics, r=<try>

Enable -Zshare-generics for inline(never) functions

Where possible, this avoids inlining cross-crate generic items that are already marked inline(never), since that marking implies the author does not intend for the function to be inlined by callers. As such, having a local copy may make it easier for LLVM to optimize but mostly just adds to binary bloat and codegen time (in theory, TBD on in practice).

It might also make sense to expand this with other heuristics (e.g., #[cold]).

FWIW, I expect that cleaning up where we make the decision about what should/shouldn't be shared is also a good idea. Way too much code needed to be tweaked to check this.

r? `@Mark-Simulacrum` for perf at first
@bors
Contributor

bors commented Mar 30, 2024

⌛ Trying commit 5702d83 with merge 1f2a5ec...

@bors
Contributor

bors commented Mar 30, 2024

☀️ Try build successful - checks-actions
Build commit: 1f2a5ec (1f2a5ecd17d6b0415946e70d192b88b566dc73f8)

@rust-timer
Collaborator

Finished benchmarking commit (1f2a5ec): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.4% | [0.3%, 18.5%] | 203 |
| Regressions ❌ (secondary) | 2.7% | [0.3%, 16.2%] | 207 |
| Improvements ✅ (primary) | -1.2% | [-4.1%, -0.3%] | 28 |
| Improvements ✅ (secondary) | -3.9% | [-6.8%, -2.7%] | 4 |
| All ❌✅ (primary) | 1.1% | [-4.1%, 18.5%] | 231 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.5% | [0.6%, 3.9%] | 16 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.7% | [-4.5%, -1.0%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.7% | [-4.5%, 3.9%] | 19 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.6% | [0.9%, 11.5%] | 73 |
| Regressions ❌ (secondary) | 4.5% | [1.2%, 14.3%] | 119 |
| Improvements ✅ (primary) | -1.8% | [-3.8%, -0.9%] | 12 |
| Improvements ✅ (secondary) | -4.4% | [-7.8%, -2.7%] | 4 |
| All ❌✅ (primary) | 2.0% | [-3.8%, 11.5%] | 85 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.0%, 0.3%] | 15 |
| Regressions ❌ (secondary) | 0.2% | [0.0%, 0.3%] | 38 |
| Improvements ✅ (primary) | -0.7% | [-2.3%, -0.0%] | 77 |
| Improvements ✅ (secondary) | -8.7% | [-18.9%, -0.3%] | 14 |
| All ❌✅ (primary) | -0.6% | [-2.3%, 0.3%] | 92 |

Bootstrap: 667.994s -> 656.525s (-1.72%)
Artifact size: 315.79 MiB -> 314.40 MiB (-0.44%)
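
For readers checking the two summary lines above, the percentages are plain relative changes, (after - before) / before. A minimal sketch follows; the helper name `percent_change` is an assumption made for illustration and is not part of the perf tooling.

```rust
/// Relative change from `before` to `after`, in percent.
/// Hypothetical helper for illustration only.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Figures copied from the summary lines above.
    println!("Bootstrap: {:.2}%", percent_change(667.994, 656.525));
    println!("Artifact size: {:.2}%", percent_change(315.79, 314.40));
}
```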

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Mar 31, 2024
@Mark-Simulacrum Mark-Simulacrum added the perf-regression-triaged The performance regression has been triaged. label Mar 31, 2024
@Mark-Simulacrum
Member Author

It looks like the majority of the additional cost comes from additional indirection in calls to standard library functions, e.g., a diff like this:

-  f61064:       e8 73 f4 49 03          call   44004dc <_RINvNtCsavh3npScQaX_5alloc7raw_vec11finish_growNtNtB4_5alloc6GlobalECs1UKmS5rlRwk_21rustc_trait_selection>
+  ffe57a:       ff 15 a0 24 03 03       call   *0x30324a0(%rip)        # 4030a20 <_ZN5alloc7raw_vec11finish_grow17h78aea5cebcfaa28aE@Base>

This means more work, particularly for short-lived compilations, since the symbol needs to get resolved at runtime now. That cost should be eliminated with #122362, which might take some time to land but is making progress now. Most downstream programs also don't pay it since they're not linking std dynamically.
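
To make the difference between the two lines of that diff concrete, here is a small sketch that decodes just these two x86-64 call encodings and recomputes the addresses shown in the disassembly. The type `Call` and the function `decode_call` are hypothetical names for illustration; only the `e8 rel32` and `ff 15 disp32` forms are handled, and the statement that the slot is filled in at load time is an assumption based on the "resolved at runtime" remark above.

```rust
/// The two x86-64 call encodings that appear in the diff above.
#[derive(Debug, PartialEq)]
enum Call {
    /// `e8 rel32`: direct call; the target is fixed at link time.
    Direct { target: u64 },
    /// `ff 15 disp32`: indirect call through a pointer stored at `slot`
    /// (RIP-relative). Assumed here to be a slot filled in at runtime.
    ViaSlot { slot: u64 },
}

/// Decodes one call instruction located at `addr`. Only the two forms from
/// the diff are handled; anything else yields `None`.
fn decode_call(addr: u64, bytes: &[u8]) -> Option<Call> {
    // Displacements are little-endian, signed, and relative to the address
    // of the next instruction (addr + instruction length).
    let rel = |d: &[u8], len: u64| {
        let disp = i32::from_le_bytes([d[0], d[1], d[2], d[3]]) as i64;
        (addr + len).wrapping_add(disp as u64)
    };
    match bytes {
        [0xe8, d @ ..] if d.len() == 4 => Some(Call::Direct { target: rel(d, 5) }),
        [0xff, 0x15, d @ ..] if d.len() == 4 => Some(Call::ViaSlot { slot: rel(d, 6) }),
        _ => None,
    }
}

fn main() {
    // Addresses and bytes copied from the two lines of the diff.
    let before = decode_call(0xf61064, &[0xe8, 0x73, 0xf4, 0x49, 0x03]);
    let after = decode_call(0xffe57a, &[0xff, 0x15, 0xa0, 0x24, 0x03, 0x03]);
    println!("{before:x?}");
    println!("{after:x?}");
}
```

The first form jumps straight to the local copy of the function, while the second has to load the callee's address from memory before calling it, which is the extra indirection discussed here.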

A little of the extra cost seems to be due to these non-inlined functions now being codegen'd with frame pointers (due to #122646). That doesn't seem like something worth worrying about, since we already accepted some regression from those changes.

My sense is that something like this is probably still a good idea despite the regressions. We do see good improvements in binary sizes (including >1MB of librustc_driver.so), and bootstrap times are significantly reduced. That suggests that this is a pretty good win for the larger crates while being a slight loss for smaller crates (instruction count timings are down for some of the larger primary crates as well, e.g., ripgrep and cranelift). That seems consistent with the loss due to additional indirection due to librustc_driver dynamically linking with the standard library.

Going to mark as ready for review as such.

r? compiler

@rustbot rustbot assigned fmease and unassigned Mark-Simulacrum Mar 31, 2024
@Mark-Simulacrum Mark-Simulacrum marked this pull request as ready for review March 31, 2024 13:33
@fmease
Member

fmease commented Apr 3, 2024

r? compiler

@rustbot rustbot assigned wesleywiser and unassigned fmease Apr 3, 2024
@Mark-Simulacrum
Member Author

cc #14527

@Mark-Simulacrum
Member Author

Poking @wesleywiser as it's been ~3 weeks here.

@saethlin
Member

saethlin commented May 25, 2024

It's been another 3 weeks, and this looks really interesting.

r? saethlin

@rustbot rustbot assigned saethlin and unassigned wesleywiser May 25, 2024
Comment on lines +22 to +23
// This is generic, but it's only instantiated with a u32 argument and that instantiation is present
// in the local crate (see F above).
Member

Is this testing what we want? I was expecting to see a cross-crate call to a #[inline(never)] generic function in a test, because I think the point of this PR is to change the behavior for such calls, right?

Member Author

I think this is indeed checking that behavior, though it's rather obscure. As per the comment just above ("These should not contribute...") we are implicitly checking that these functions are absent from the mono-items of the downstream crate (tests/codegen-units/partitioning/extern-generic.rs), since they're not listed as MONO-ITEM declarations in that file. They are referenced, though, via the foo function above, which has to get codegen'd in the downstream crate since it is generic and never called in this one.

I suppose we could try to write a more direct test case? But this also tests that the property holds transitively (i.e., we don't codegen the called function twice with inline(never) even if it's not directly being called).

IIRC, before my changes, this would fail.
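
The layout described in this reply can be sketched roughly as follows. This is not the actual test file: the module `upstream` stands in for the auxiliary crate, and `bar` and `uses_bar_u32` are hypothetical names; only `foo` is a name mentioned in the discussion.

```rust
// Hypothetical stand-in for the upstream (auxiliary) crate.
mod upstream {
    /// Generic and never called inside `upstream`, so any instantiation
    /// has to be generated by whichever crate calls it.
    pub fn foo<T>(value: T) -> T {
        // Always calls the `u32` instantiation, regardless of `T`.
        bar(0u32);
        value
    }

    /// Generic, but only ever instantiated with `u32`.
    #[inline(never)]
    pub fn bar<T: Copy>(value: T) -> T {
        value
    }

    /// Non-generic user that makes `bar::<u32>` get instantiated upstream.
    pub fn uses_bar_u32() -> u32 {
        bar(1u32)
    }
}

// Stand-in for the downstream crate.
fn main() {
    // `foo::<u64>` is instantiated here. Its body references `bar::<u32>`,
    // which already exists upstream, so the expectation being tested is
    // that no second copy of `bar::<u32>` is generated on this side.
    let x = upstream::foo(5u64);
    println!("{x} {}", upstream::uses_bar_u32());
}
```

In this shape the downstream side reaches the `#[inline(never)]` function only transitively, through `foo`, which is the property the reply says the test covers.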

Member

Oh I see! This is very interconnected, and I should really be used to that by now.

Member

Also I checked and yes the test does fail without the rest of this PR.

@Mark-Simulacrum
Member Author

Sampling a few cases:

  • helloworld-check-Full -- essentially 100% of the regression is extra dynamic symbol lookup. This seems likely to be the "base" case, probably going to get mitigated by the static linking of libstd.
  • serde-derive-1.0.136-check-Full -- regressions here are mostly due to alloc::raw_vec::finish_grow (+53 million instructions; 30/50 are in finish_grow, and I suspect the remaining 20 are harder to track down but at least partially due to that function. It seems like a bunch of functions probably got inline-shuffled because they now call finish_grow rather than partially inlining it.)
  • html5ever-check-Full -- finish_grow and dl-lookup
  • ripgrep-check-Full -- finish_grow and dl-lookup

I don't think we're going to make any real dent in dl-lookup, though it might look better in the future with static linking and potentially with a newer glibc, which could optimize that more.

finish_grow was "the" initial thing which drove me to push on this thread. Right now it gets codegen'd in leaf crates, and I suspect much of our binary size win on rustc comes from not having those local copies. Let's try without the inline(never) on it and see what impact that has. @bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 9, 2024
@bors
Contributor

bors commented Jun 9, 2024

⌛ Trying commit 74af8ba with merge 38d8997...

bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 9, 2024
…enerics, r=<try>

@bors
Contributor

bors commented Jun 9, 2024

☀️ Try build successful - checks-actions
Build commit: 38d8997 (38d8997653d314a96e1c58c74c13ca78b8c6ecf3)

@rust-timer
Collaborator

Finished benchmarking commit (38d8997): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.5% | [0.2%, 16.8%] | 154 |
| Regressions ❌ (secondary) | 2.7% | [0.1%, 15.0%] | 210 |
| Improvements ✅ (primary) | -0.9% | [-2.8%, -0.3%] | 23 |
| Improvements ✅ (secondary) | -3.8% | [-6.1%, -2.9%] | 5 |
| All ❌✅ (primary) | 1.2% | [-2.8%, 16.8%] | 177 |

Max RSS (memory usage)

Results (primary 1.5%, secondary 3.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.2% | [1.1%, 4.4%] | 5 |
| Regressions ❌ (secondary) | 3.3% | [3.2%, 3.3%] | 2 |
| Improvements ✅ (primary) | -2.6% | [-2.7%, -2.4%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.5% | [-2.7%, 4.4%] | 7 |

Cycles

Results (primary 2.3%, secondary 4.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.7% | [0.8%, 12.4%] | 21 |
| Regressions ❌ (secondary) | 5.6% | [2.0%, 17.9%] | 73 |
| Improvements ✅ (primary) | -1.6% | [-2.7%, -0.8%] | 8 |
| Improvements ✅ (secondary) | -6.6% | [-9.5%, -3.2%] | 10 |
| All ❌✅ (primary) | 2.3% | [-2.7%, 12.4%] | 29 |

Binary size

Results (primary -0.3%, secondary -2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.0%, 0.6%] | 22 |
| Regressions ❌ (secondary) | 0.3% | [0.1%, 0.3%] | 37 |
| Improvements ✅ (primary) | -0.4% | [-1.6%, -0.0%] | 69 |
| Improvements ✅ (secondary) | -8.4% | [-18.6%, -0.2%] | 14 |
| All ❌✅ (primary) | -0.3% | [-1.6%, 0.6%] | 91 |

Bootstrap: missing data
Artifact size: 319.71 MiB -> 318.41 MiB (-0.41%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 9, 2024
@Mark-Simulacrum
Member Author

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 9, 2024
@bors
Contributor

bors commented Jun 9, 2024

⌛ Trying commit a14d42e with merge 29a8b2d...

bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 9, 2024
…enerics, r=<try>

@bors
Contributor

bors commented Jun 9, 2024

☀️ Try build successful - checks-actions
Build commit: 29a8b2d (29a8b2de729aa4db6332c99732870a877483da75)

@rust-timer
Collaborator

Finished benchmarking commit (29a8b2d): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.5% | [0.2%, 17.5%] | 150 |
| Regressions ❌ (secondary) | 2.8% | [0.1%, 15.5%] | 205 |
| Improvements ✅ (primary) | -1.0% | [-2.8%, -0.3%] | 24 |
| Improvements ✅ (secondary) | -3.6% | [-6.0%, -2.8%] | 5 |
| All ❌✅ (primary) | 1.2% | [-2.8%, 17.5%] | 174 |

Max RSS (memory usage)

Results (primary -1.6%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.5% | [2.2%, 4.7%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.1% | [-7.8%, -2.0%] | 4 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -1.6% | [-7.8%, 4.7%] | 6 |

Cycles

Results (primary 2.9%, secondary 5.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 4.1% | [1.1%, 13.6%] | 23 |
| Regressions ❌ (secondary) | 5.7% | [2.0%, 15.0%] | 74 |
| Improvements ✅ (primary) | -1.8% | [-2.6%, -1.2%] | 6 |
| Improvements ✅ (secondary) | -4.2% | [-7.1%, -2.8%] | 4 |
| All ❌✅ (primary) | 2.9% | [-2.6%, 13.6%] | 29 |

Binary size

Results (primary -0.3%, secondary -2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.3%] | 21 |
| Regressions ❌ (secondary) | 0.3% | [0.1%, 0.3%] | 37 |
| Improvements ✅ (primary) | -0.4% | [-1.6%, -0.0%] | 68 |
| Improvements ✅ (secondary) | -8.4% | [-18.7%, -0.2%] | 14 |
| All ❌✅ (primary) | -0.3% | [-1.6%, 0.3%] | 89 |

Bootstrap: missing data
Artifact size: 319.77 MiB -> 317.92 MiB (-0.58%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 10, 2024
@saethlin
Member

saethlin commented Jun 21, 2024

The list of affected benchmarks in terms of cycles is now way shorter, which is nice.

I'm still in favor of this PR because I think it makes the behavior of #[inline(never)] more intuitive; whether it's worth looking into other tuning (such as #126793) doesn't negate that.

The list of commits looks like this is still in a WIP state. I'm happy to approve this if you fix up the commits.

@saethlin saethlin added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 25, 2024
@saethlin
Member

Whoops I never changed the labels back to author. Did that now.

@Zoxc
Contributor

Zoxc commented Aug 12, 2024

You could do another perf run now that #122362 has landed.

@Dylan-DPC
Member

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Aug 12, 2024
@bors
Contributor

bors commented Aug 12, 2024

🔒 Merge conflict

This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again.

How do I rebase?

Assuming self is your fork and upstream is this repository, you can resolve the conflict following these steps:

  1. git checkout share-inline-never-generics (switch to your branch)
  2. git fetch upstream master (retrieve the latest master)
  3. git rebase upstream/master -p (rebase on top of it)
  4. Follow the on-screen instruction to resolve conflicts (check git status if you got lost).
  5. git push self share-inline-never-generics --force-with-lease (update this PR)

You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial.

Please avoid the "Resolve conflicts" button on GitHub. It uses git merge instead of git rebase which makes the PR commit history more difficult to read.

Sometimes step 4 will complete without asking for resolution. This is usually due to difference between how Cargo.lock conflict is handled during merge and rebase. This is normal, and you should still perform step 5 to update this PR.

Error message
Auto-merging tests/codegen/avr/avr-func-addrspace.rs
Auto-merging library/std/src/panicking.rs
Auto-merging library/alloc/src/vec/mod.rs
Auto-merging library/alloc/src/raw_vec.rs
CONFLICT (content): Merge conflict in library/alloc/src/raw_vec.rs
Auto-merging compiler/rustc_monomorphize/src/partitioning.rs
Auto-merging compiler/rustc_middle/src/ty/instance.rs
Auto-merging compiler/rustc_middle/src/ty/context.rs
Auto-merging compiler/rustc_middle/src/mir/mono.rs
Auto-merging compiler/rustc_codegen_ssa/src/back/symbol_export.rs
Auto-merging compiler/rustc_codegen_llvm/src/callee.rs
Auto-merging Cargo.lock
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your merge.renamelimit variable to at least 1968 and retry the command.
Automatic merge failed; fix conflicts and then commit the result.

Labels
perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-perf Status: Waiting on a perf run to be completed. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.