Enable -Zshare-generics for inline(never) functions #123244

Mark-Simulacrum · 2024-03-30T19:56:39Z

When possible, this avoids inlining cross-crate generic items that are
already marked inline(never), since that marking implies the author does
not intend the function to be inlined by callers. Having a local copy
may make it easier for LLVM to optimize but mostly just adds to binary
bloat and codegen time. In practice our benchmarks indicate this is
indeed a win for larger compilations, where the extra cost of dynamically
linking to these symbols is outweighed by having fewer copies that
need optimizing in each binary.

It might also make sense to expand this with other heuristics (e.g.,
#[cold]) in the future, but this seems like a good starting point.

FWIW, I expect that cleaning up where we decide what
should/shouldn't be shared is also a good idea. Way too
much code needed to be tweaked to check this. But I'm hoping
to leave that for a follow-up PR rather than blocking this on it.
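
For readers less familiar with the setup, here is a minimal sketch of the kind of function this affects. It is an illustration only: the two modules stand in for separate crates (real sharing only happens across crate boundaries), and the names `upstream`, `downstream`, `describe`, `seven` and `report` are made up for this example.

```rust
// `upstream` stands in for a dependency crate (hypothetical names throughout).
mod upstream {
    // Generic and explicitly not meant to be inlined into callers.
    #[inline(never)]
    pub fn describe<T: core::fmt::Debug>(value: &T) -> String {
        format!("{value:?}")
    }

    // The upstream "crate" uses describe::<u32> itself, so a u32
    // instantiation already exists there and could be reused.
    pub fn seven() -> String {
        describe(&7u32)
    }
}

// `downstream` stands in for a crate that depends on `upstream`.
mod downstream {
    // With sharing, a call like this can link against the existing
    // describe::<u32> instead of instantiating a second local copy.
    pub fn report(n: u32) -> String {
        super::upstream::describe(&n)
    }
}
```

Calling `downstream::report(7)` and `upstream::seven()` gives the same string either way; the difference is only in how many copies of `describe::<u32>` end up being generated and optimized.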

Mark-Simulacrum · 2024-03-30T20:00:13Z

@bors try @rust-timer queue

bors · 2024-03-30T20:01:22Z

⌛ Trying commit 5702d83 with merge 1f2a5ec...

bors · 2024-03-30T21:33:32Z

☀️ Try build successful - checks-actions
Build commit: 1f2a5ec (1f2a5ecd17d6b0415946e70d192b88b566dc73f8)

rust-timer · 2024-03-31T01:09:58Z

Finished benchmarking commit (1f2a5ec): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.4%	[0.3%, 18.5%]	203
Regressions ❌ (secondary)	2.7%	[0.3%, 16.2%]	207
Improvements ✅ (primary)	-1.2%	[-4.1%, -0.3%]	28
Improvements ✅ (secondary)	-3.9%	[-6.8%, -2.7%]	4
All ❌✅ (primary)	1.1%	[-4.1%, 18.5%]	231

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.5%	[0.6%, 3.9%]	16
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-2.7%	[-4.5%, -1.0%]	3
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.7%	[-4.5%, 3.9%]	19

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.6%	[0.9%, 11.5%]	73
Regressions ❌ (secondary)	4.5%	[1.2%, 14.3%]	119
Improvements ✅ (primary)	-1.8%	[-3.8%, -0.9%]	12
Improvements ✅ (secondary)	-4.4%	[-7.8%, -2.7%]	4
All ❌✅ (primary)	2.0%	[-3.8%, 11.5%]	85

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.2%	[0.0%, 0.3%]	15
Regressions ❌ (secondary)	0.2%	[0.0%, 0.3%]	38
Improvements ✅ (primary)	-0.7%	[-2.3%, -0.0%]	77
Improvements ✅ (secondary)	-8.7%	[-18.9%, -0.3%]	14
All ❌✅ (primary)	-0.6%	[-2.3%, 0.3%]	92

Bootstrap: 667.994s -> 656.525s (-1.72%)
Artifact size: 315.79 MiB -> 314.40 MiB (-0.44%)
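
The percentages on the Bootstrap and Artifact size lines are plain relative changes. A small sketch of that arithmetic, using the figures reported above (the helper name `percent_change` is made up for this example):

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean the new measurement is smaller.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Formats a change the way the report prints it, e.g. "-1.72%".
fn format_change(before: f64, after: f64) -> String {
    format!("{:.2}%", percent_change(before, after))
}
```

For example, `format_change(667.994, 656.525)` reproduces the bootstrap figure and `format_change(315.79, 314.40)` the artifact size figure.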

Mark-Simulacrum · 2024-03-31T13:33:40Z

It looks like the majority of the additional cost comes from additional indirection in calls to standard library functions, e.g., a diff like this:

-  f61064:       e8 73 f4 49 03          call   44004dc <_RINvNtCsavh3npScQaX_5alloc7raw_vec11finish_growNtNtB4_5alloc6GlobalECs1UKmS5rlRwk_21rustc_trait_selection>
+  ffe57a:       ff 15 a0 24 03 03       call   *0x30324a0(%rip)        # 4030a20 <_ZN5alloc7raw_vec11finish_grow17h78aea5cebcfaa28aE@Base>

This means more work, particularly for short-lived compilations, since the symbol needs to get resolved at runtime now. That cost should be eliminated with #122362, which might take some time to land but is making progress now. Most downstream programs also don't pay it since they're not linking std dynamically.
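
As a loose analogy in Rust (not what the compiler or dynamic linker literally does), the difference between the two call instructions looks like a direct call versus a call through a pointer slot that is filled in at run time. All names here (`finish_grow_stub`, `SLOT`, `call_direct`, `call_via_slot`) are hypothetical:

```rust
use std::sync::OnceLock;

// Stand-in for a function that lives in another shared object.
#[inline(never)]
fn finish_grow_stub(cap: usize) -> usize {
    cap.saturating_mul(2).max(8)
}

// Hypothetical slot playing the role of a GOT entry: empty until the
// first use, then it holds the resolved function address.
static SLOT: OnceLock<fn(usize) -> usize> = OnceLock::new();

// Direct call: the target is known when the binary is linked,
// comparable to the `call 44004dc` line in the diff.
fn call_direct(cap: usize) -> usize {
    finish_grow_stub(cap)
}

// Indirect call: load the pointer from the slot, then jump through it,
// comparable to `call *0x30324a0(%rip)`.
fn call_via_slot(cap: usize) -> usize {
    let target = SLOT.get_or_init(|| finish_grow_stub as fn(usize) -> usize);
    (*target)(cap)
}
```

Both paths compute the same result; the indirect one pays for the one-time resolution plus an extra load on every call, which matters most in short-lived processes.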

A little of the extra cost from there seems to be due to these non-inlined functions now being codegen'd with frame pointers (due to #122646). That doesn't seem like something worth worrying about, since we accepted some regression from those changes already.

My sense is that something like this is probably still a good idea despite the regressions. We do see good improvements in binary sizes (including >1MB of librustc_driver.so), and bootstrap times are significantly reduced. That suggests that this is a pretty good win for the larger crates while being a slight loss for smaller crates (instruction count timings are down for some of the larger primary crates as well, e.g., ripgrep and cranelift). That seems consistent with the loss coming from the additional indirection caused by librustc_driver dynamically linking with the standard library.

Going to mark as ready for review as such.

r? compiler

fmease · 2024-04-03T08:29:18Z

r? compiler

Mark-Simulacrum · 2024-04-05T14:50:57Z

cc #14527

Mark-Simulacrum · 2024-04-27T17:49:42Z

Poking @wesleywiser as it's been ~3 weeks here.

saethlin · 2024-05-25T18:26:45Z

It's been another 3 weeks, and this looks really interesting.

r? saethlin

saethlin · 2024-05-25T22:42:31Z

tests/codegen-units/partitioning/auxiliary/cgu_generic_function.rs

+// This is generic, but it's only instantiated with a u32 argument and that instantiation is present
+// in the local crate (see F above).

saethlin · 2024-05-25

Is this testing what we want? I was expecting to see a cross-crate call to a #[inline(never)] generic function in a test, because I think the point of this PR is to change the behavior for such calls, right?

Mark-Simulacrum · 2024-05-27

I think this is indeed checking that behavior, though it's rather obscure. As per the comment just above ("These should not contribute...") we are implicitly checking that these functions are absent in the mono-items of the downstream crate (tests/codegen-units/partitioning/extern-generic.rs) since they're not listed as MONO-ITEM declarations in that file. They are referenced though via the foo function above which has to get codegen'd in the downstream crate since it is generic and never called in this one.

I suppose we could try to write a more direct test case? But this also tests that the property holds transitively (i.e., we don't codegen the called function twice with inline(never) even if it's not directly being called).

IIRC, before my changes, this would fail.

saethlin · 2024-05-27

Oh I see! This is very interconnected, and I should really be used to that by now.

saethlin · 2024-05-27

Also I checked and yes the test does fail without the rest of this PR.
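
To make the transitive case easier to picture, here is a hedged sketch of the shape being discussed. It is not the contents of cgu_generic_function.rs: the modules stand in for the auxiliary and downstream crates, and `shared_helper`, `LOCAL_USE`, `foo`'s signature and `user::call` are all invented for illustration.

```rust
// Stand-in for the auxiliary crate (hypothetical contents).
mod aux {
    // Generic and inline(never): the candidate for sharing.
    #[inline(never)]
    pub fn shared_helper<T: Clone>(x: &T) -> (T, T) {
        (x.clone(), x.clone())
    }

    // A local use, so that shared_helper::<u32> is instantiated in this
    // "crate". This plays the role of what the quoted comment calls F.
    pub static LOCAL_USE: fn(&u32) -> (u32, u32) = shared_helper::<u32>;

    // Generic and never called here, so it must be instantiated by the
    // downstream crate. It only ever uses shared_helper with u32.
    pub fn foo<T>(tag: T) -> (T, (u32, u32)) {
        (tag, shared_helper(&0u32))
    }
}

// Stand-in for the downstream crate.
mod user {
    // Instantiating foo here pulls in a reference to shared_helper::<u32>.
    // The property under test is that this reference is satisfied by the
    // upstream copy rather than by a second, local instantiation.
    pub fn call() -> (&'static str, (u32, u32)) {
        super::aux::foo("tag")
    }
}
```

In the real test the check is expressed through which MONO-ITEM lines appear for the downstream crate; the sketch only shows why `foo` has to be generated downstream while the helper does not.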

bors · 2024-11-28T06:19:25Z

⌛ Testing commit efc4eb3 with merge 215dc0e...

bors · 2024-11-28T07:12:13Z

💔 Test failed - checks-actions

Mark-Simulacrum · 2024-11-28T19:52:11Z

@bors r=saethlin

More normalization: we build std with debuginfo on some builders that run tests, which caused the output to vary. I copied the normalization logic from other pre-existing tests that had panic output.
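
For anyone unfamiliar with what normalization means here: the expected-output files are compared after builder-dependent parts of the output are replaced by placeholders. The real tests do this with compiletest directives, which are not shown in this thread; the function below is only a standalone sketch of the idea, and both its name and the `$LOCATION` placeholder are assumptions.

```rust
/// Replaces the source location on "panicked at" lines with a fixed
/// placeholder so that output compares equal across builders.
fn normalize_panic_location(output: &str) -> String {
    output
        .lines()
        .map(|line| match line.find(" panicked at ") {
            // Keep the thread prefix, drop the path:line:col part.
            Some(idx) => format!("{} panicked at $LOCATION:", &line[..idx]),
            None => line.to_string(),
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Two outputs that differ only in the reported file, line, or column then normalize to the same text.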

bors · 2024-11-28T19:52:13Z

📌 Commit 4a216a2 has been approved by saethlin

It is now in the queue for this repository.

bors · 2024-11-28T21:44:38Z

⌛ Testing commit 4a216a2 with merge d53f0b1...

bors · 2024-11-29T00:29:28Z

☀️ Test successful - checks-actions
Approved by: saethlin
Pushing d53f0b1 to master...

rust-timer · 2024-11-29T01:47:28Z

Finished benchmarking commit (d53f0b1): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Our benchmarks found a performance regression caused by this PR.
This might be an actual regression, but it can also be just noise.

Next Steps:

If the regression was expected or you think it can be justified,
please write a comment with sufficient written justification, and add
@rustbot label: +perf-regression-triaged to it, to mark the regression as triaged.
If you think that you know of a way to resolve the regression, try to create
a new PR with a fix for the regression.
If you do not understand the regression or you think that it is just noise,
you can ask the @rust-lang/wg-compiler-performance working group for help (members of this group
were already notified of this PR).

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

	mean	range	count
Regressions ❌ (primary)	0.5%	[0.1%, 5.5%]	58
Regressions ❌ (secondary)	0.6%	[0.1%, 1.6%]	84
Improvements ✅ (primary)	-0.7%	[-3.8%, -0.1%]	44
Improvements ✅ (secondary)	-1.6%	[-8.0%, -0.2%]	25
All ❌✅ (primary)	-0.0%	[-3.8%, 5.5%]	102

Max RSS (memory usage)

Results (primary 0.5%, secondary 2.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.0%	[1.1%, 3.2%]	6
Regressions ❌ (secondary)	2.2%	[1.0%, 3.7%]	5
Improvements ✅ (primary)	-3.9%	[-4.1%, -3.7%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.5%	[-4.1%, 3.2%]	8

Cycles

Results (primary 0.0%, secondary 5.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.7%	[1.3%, 5.6%]	6
Regressions ❌ (secondary)	8.6%	[1.8%, 16.8%]	16
Improvements ✅ (primary)	-2.0%	[-3.8%, -1.1%]	8
Improvements ✅ (secondary)	-5.3%	[-8.9%, -2.6%]	5
All ❌✅ (primary)	0.0%	[-3.8%, 5.6%]	14

Binary size

Results (primary -0.5%, secondary -7.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.1%	[0.0%, 0.3%]	9
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.5%	[-2.1%, -0.0%]	69
Improvements ✅ (secondary)	-7.1%	[-21.0%, -0.3%]	19
All ❌✅ (primary)	-0.5%	[-2.1%, 0.3%]	78

Bootstrap: 791.904s -> 774.051s (-2.25%)
Artifact size: 335.89 MiB -> 331.96 MiB (-1.17%)

Mark-Simulacrum · 2024-11-29T02:00:36Z

Regressions remain pretty similar to what we saw before (primarily in incremental it looks like?), and bootstrap times reflect the expectation that this significantly helps with larger crate graphs where there's more opportunity for reuse. Binary size win is also pretty nice.

Perf regression remains triaged.

uweigand · 2024-12-03T16:00:32Z

Hi @Mark-Simulacrum, this seems to have somehow introduced a regression on s390x. I'm now seeing:

thread 'io::tests::try_oom_error' panicked at std/src/io/tests.rs:822:62:
called `Result::unwrap_err()` on an `Ok` value: ()
stack backtrace:
   0:      0x3fff7dd6702 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h0eec3d9053c23c0f
   1:      0x3fff7e37506 - core::fmt::write::h66866b531685abe5
   2:      0x3fff7dc575e - std::io::Write::write_fmt::h89ced3ac9904279e
   3:      0x3fff7dd6570 - std::sys::backtrace::BacktraceLock::print::h363d5b9cad1f5c19
   4:      0x3fff7df62e4 - std::panicking::default_hook::{{closure}}::ha4b8eaf1f6a37f57
   5:      0x3fff7df60da - std::panicking::default_hook::hda41cc1e1c3b4efa
   6:      0x2aa00430d78 - test::test_main::{{closure}}::h4d9e2859f981c511
   7:      0x3fff7df6aa0 - std::panicking::rust_panic_with_hook::heff88192ef2a89fb
   8:      0x3fff7dd6d52 - std::panicking::begin_panic_handler::{{closure}}::hff5589d5c45993a6
   9:      0x3fff7dd69b4 - std::sys::backtrace::__rust_end_short_backtrace::h165daf71d9abcca8
  10:      0x3fff7df63ca - rust_begin_unwind
  11:      0x3fff7d4aa6a - core::panicking::panic_fmt::hec8c29ccd1751d1e
  12:      0x3fff7d4b948 - core::result::unwrap_failed::h47cf11019e236d96
  13:      0x2aa001c5a9a - core::ops::function::FnOnce::call_once::h5453841f675c42ec
  14:      0x2aa00436d74 - test::__rust_begin_short_backtrace::h31f93d45aa944e21
  15:      0x2aa00436f62 - test::run_test_in_process::h617ed5302028c350
  16:      0x2aa0042a67e - std::sys::backtrace::__rust_begin_short_backtrace::hbc434a15ea7a090f
  17:      0x2aa00425e14 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h2f86d2c09a8a35d2
  18:      0x3fff7df33a8 - std::sys::pal::unix::thread::Thread::new::thread_start::hce74d4c3b42eec78
  19:      0x3fff7bac3fa - start_thread
                               at /usr/src/debug/glibc-2.39-17.1.ibm.fc40.s390x/nptl/pthread_create.c:447:8
  20:      0x3fff7c2bde0 - thread_start
                               at /usr/src/debug/glibc-2.39-17.1.ibm.fc40.s390x/misc/../sysdeps/unix/sysv/linux/s390/s390-64/clone3.S:71
  21:                0x0 - <unknown>

Interestingly, a bisect shows that the regression is introduced on the merge commit:

commit d53f0b1d8e261f2f3535f1cd165c714fc0b0b298
Merge: a2545fd6fc6 4a216a25d14
Author: bors <[email protected]>
Date:   Thu Nov 28 21:44:34 2024 +0000

    Auto merge of #123244 - Mark-Simulacrum:share-inline-never-generics, r=saethlin

but in both parent commits (a2545fd and 4a216a2) the test passes. Not sure what's going on here ...

I've tried debugging the test, but if I'm reading this correctly, the test function was already completely optimized out and replaced by a failed assertion at compile time:

Dump of assembler code for function _ZN4core3ops8function6FnOnce9call_once17h5453841f675c42ecE:
   0x000002aa001c5a60 <+0>:     stmg    %r6,%r15,48(%r15)
   0x000002aa001c5a66 <+6>:     aghi    %r15,-168
   0x000002aa001c5a6a <+10>:    lgr     %r11,%r15
   0x000002aa001c5a6e <+14>:    lgrl    %r1,0x2aa00568f28
   0x000002aa001c5a74 <+20>:    lb      %r0,0(%r1)
   0x000002aa001c5a7a <+26>:    la      %r4,167(%r11)
   0x000002aa001c5a7e <+30>:    larl    %r2,0x2aa00481e7c <anon.6846cc147164699b42462cc8b979de03.18.llvm.3644326088524771271>
   0x000002aa001c5a84 <+36>:    lghi    %r3,46
   0x000002aa001c5a88 <+40>:    larl    %r5,0x2aa00545d08 <anon.6846cc147164699b42462cc8b979de03.17.llvm.3644326088524771271>
   0x000002aa001c5a8e <+46>:    larl    %r6,0x2aa00546f78 <anon.6846cc147164699b42462cc8b979de03.473.llvm.3644326088524771271>
   0x000002aa001c5a94 <+52>:    brasl   %r14,0x2aa0005c0e0 <_ZN4core6result13unwrap_failed17h47cf11019e236d96E@plt>

Note the unconditional call to unwrap_failed.
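
For context on what an `unwrap_err()` failure of this kind means: the test expected a fallible operation to return `Err`, and the compiled code behaves as if it returned `Ok(())`. The test body is not shown in this thread, so the following is only a sketch of that pattern, assuming (from the test name) that the operation is a fallible `Vec` reservation. The helper name `reservation_fails` is made up.

```rust
/// Returns true when reserving `additional` bytes in an empty Vec<u8>
/// reports an error instead of succeeding.
fn reservation_fails(additional: usize) -> bool {
    let mut v: Vec<u8> = Vec::new();
    // try_reserve returns Err rather than aborting when the request
    // cannot be satisfied.
    v.try_reserve(additional).is_err()
}
```

A request of `usize::MAX` bytes exceeds the maximum capacity and is rejected before any allocation is attempted, while a small request such as 16 bytes succeeds. A test that calls `unwrap_err()` on such a result panics with exactly the message above if the reservation unexpectedly reports success.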

saethlin · 2024-12-03T16:30:47Z

Oh dear. Can you file a new issue for this problem?

And while you're at it, can you check whether you run into trouble with adding the flag -Zshare-generics before this PR?

uweigand · 2024-12-03T16:41:42Z

> Oh dear. Can you file a new issue for this problem?

This is now #133806

> And while you're at it, can you check whether you run into trouble with adding the flag -Zshare-generics before this PR?

What would be the best way to do that? I'm not really sure the compilation of which file is the problem. Can I add the flag in config.toml somewhere?

saethlin · 2024-12-03T17:06:35Z

I don't think it can be set in config.toml. I would try setting RUSTFLAGS_NOT_BOOTSTRAP=-Zshare-generics when running x test.

rustbot assigned Mark-Simulacrum Mar 30, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 30, 2024

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 30, 2024

rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Mar 31, 2024

Mark-Simulacrum force-pushed the share-inline-never-generics branch from 5702d83 to 65a0301 Compare March 31, 2024 13:29

Mark-Simulacrum added the perf-regression-triaged The performance regression has been triaged. label Mar 31, 2024

rustbot assigned fmease and unassigned Mark-Simulacrum Mar 31, 2024

Mark-Simulacrum marked this pull request as ready for review March 31, 2024 13:33

rustbot assigned wesleywiser and unassigned fmease Apr 3, 2024

rustbot assigned saethlin and unassigned wesleywiser May 25, 2024

saethlin reviewed May 25, 2024

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 27, 2024

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 28, 2024

Mark-Simulacrum force-pushed the share-inline-never-generics branch from efc4eb3 to 9bce05f Compare November 28, 2024 18:08

Share inline(never) generics across crates

4a216a2

This reduces code sizes and better respects programmer intent when marking inline(never). Previously such a marking was essentially ignored for generic functions, as we'd still inline them in remote crates.

Mark-Simulacrum force-pushed the share-inline-never-generics branch from 9bce05f to 4a216a2 Compare November 28, 2024 18:43

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 28, 2024

bors added the merged-by-bors This PR was explicitly merged by bors. label Nov 29, 2024

bors merged commit d53f0b1 into rust-lang:master Nov 29, 2024
7 checks passed

rustbot added this to the 1.85.0 milestone Nov 29, 2024

Mark-Simulacrum deleted the share-inline-never-generics branch November 29, 2024 01:58

Kobzol mentioned this pull request Dec 4, 2024

Do not unify dereferences of shared borrows in GVN #133474

Merged

Zalathar mentioned this pull request Dec 7, 2024

Test failures in tests/ui/panics on aarch64-apple-darwin #133997

Open
