-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manipulating slice through &mut parameter not optimized well #27130
Comments
The first example is much harder for LLVM to optimize because each iteration of the loop stores to a value which would be globally visible if the loop panics. I think LLVM would figure it out with the right set of optimization passes, though. The other examples are much easier to optimize because LLVM can trivially eliminate the stack traffic using scalarrepl. |
According to the LLVM IR, the |
The slice "escapes" like so: use std::sync::Mutex;
use std::sync::Arc;
pub fn trim_in_place(a: &mut &[u8]) {
while a.first() == Some(&42) {
*a = &a[2..];
}
}
fn main() {
static X: &'static [u8] = &[42, 0, 42];
let m = Arc::new(Mutex::new(X));
let m2 = m.clone();
let _ = std::thread::spawn(move || {
let mut r = m.lock().unwrap();
trim_in_place(&mut *r)
}).join();
let r = match m2.lock() {
Ok(r) => r,
Err(r) => r.into_inner()
};
assert_eq!(*r, &[42]);
} |
The mentioned transformation that leads to better code is what @reinerp suggested here: #26494 (comment) |
I'm preparing a PR that will eliminate the extra null check. For the loop itself, I don't have a fix for rustc (yet?), but this optimizes better: pub fn trim_in_place(a: &mut &[u8]) {
let mut x = *a;
while x.first() == Some(&42) {
x = &x[1..];
}
*a = x;
} You get the tight loop with that, and with the PR I'm preparing, the result is this: _ZN13trim_in_place20h5d93414c92587a1aeaaE:
.cfi_startproc
movq (%rdi), %rax
movq 8(%rdi), %rdx
xorl %ecx, %ecx
testq %rdx, %rdx
je .LBB0_5
xorl %ecx, %ecx
.align 16, 0x90
.LBB0_2:
movzbl (%rax), %esi
cmpl $42, %esi
jne .LBB0_3
incq %rax
decq %rdx
jne .LBB0_2
jmp .LBB0_5
.LBB0_3:
movq %rdx, %rcx
.LBB0_5:
movq %rax, (%rdi)
movq %rcx, 8(%rdi)
retq Still slightly weird WRT its usage of |
@dotdash yeah, LLVM register usage can be a little weird sometimes. It's live-range analysis can produce some odd results at times, meaning that it won't re-use a register even it can. |
We currently copy fat pointers using memcpy intrinsics, which SROA will split up into load/store pairs. Unfortunately, nonnull metadata on loads from said fat pointers might get lost in the process. So instead of using memcpy we can go ahead and use load/store pairs right away including the appropriate nonnull metadata on the loads. That way the metadata doesn't get lost and LLVM has better chances of eliminating null checks. One example for this is the code in rust-lang#27130 that currently has an extra null check that disappears after this change.
@rkruppe still a problem? |
@brson still as bad in stable and beta, but it even regressed from there in nightly :-/ _ZN8rust_out13trim_in_place17ha198dc1ee798e461E:
.cfi_startproc
pushq %rax
.Ltmp0:
.cfi_def_cfa_offset 16
movq 8(%rdi), %rax
jmp .LBB0_1
.p2align 4, 0x90
.LBB0_5:
incq (%rdi)
movq %rax, 8(%rdi)
.LBB0_1:
decq %rax
cmpq $-1, %rax
setne %cl
je .LBB0_3
movq (%rdi), %rcx
cmpb $42, (%rcx)
sete %cl
.LBB0_3:
testb %cl, %cl
je .LBB0_6
cmpq $-1, %rax
jne .LBB0_5
movl $1, %edi
xorl %esi, %esi
callq _ZN4core5slice22slice_index_order_fail17h9eb379df958d4186E@PLT
.LBB0_6:
popq %rax
retq
.Lfunc_end0:
.size _ZN8rust_out13trim_in_place17ha198dc1ee798e461E, .Lfunc_end0-_ZN8rust_out13trim_in_place17ha198dc1ee798e461E
.cfi_endproc |
OK, so the regression here is interesting. Thanks to a change in #38854 which makes us avoid FCA loads/stores in favor of memcpys, LLVM got a bit better at understanding this code and sees that |
I've found some a few places where tweaking things in LLVM would prevent this from regressing, the most promising one being a fix to folding zexts into PHIs with only two incoming values, a special case that is currently not handled. A patch on top of the current LLVM trunk at least gets rid of the regression, though the original problem still remains. |
The patch to handle the "regression" has landed in LLVM trunk |
Could we backport it? |
@arielb1 we probably could (can't check right now), but I'm not sure how useful it is in real world code. The commit is llvm-mirror/llvm@7c4d39c |
The "regression" is no more, but the original issue still persists. Cc @rust-lang/wg-codegen |
Looks like the bounds check is being optimized away since rustc 1.25. |
Do we have any perf tests that tell us when we regress it? |
I would assume that a regression would show up somewhere in lolbench, but there are a lot of events showing up there currently with no effort yet to triage or reproduce them, so I doubt we'll know based on those until the tooling is further along. |
I don't know how helpful this is, but it looks like rustc really likes to math patterns: https://rust.godbolt.org/z/TP9zKc |
This issue seems to be fixed in nightly. |
Add test for issue rust-lang#27130 rust-lang#27130 seems to be fixed by the llvm 11 update. The issue is marked with needs-test, so here it is. As some historical context, the generated code was fine until 1.38, and remained unoptimized from 1.38 up until the current nightly. I've also added a pattern matching version that was fine on 1.45.2.
Add test for issue rust-lang#27130 rust-lang#27130 seems to be fixed by the llvm 11 update. The issue is marked with needs-test, so here it is. As some historical context, the generated code was fine until 1.38, and remained unoptimized from 1.38 up until the current nightly. I've also added a pattern matching version that was fine on 1.45.2.
Add test for issue rust-lang#27130 rust-lang#27130 seems to be fixed by the llvm 11 update. The issue is marked with needs-test, so here it is. As some historical context, the generated code was fine until 1.38, and remained unoptimized from 1.38 up until the current nightly. I've also added a pattern matching version that was fine on 1.45.2.
Add test for issue rust-lang#27130 rust-lang#27130 seems to be fixed by the llvm 11 update. The issue is marked with needs-test, so here it is. As some historical context, the generated code was fine until 1.38, and remained unoptimized from 1.38 up until the current nightly. I've also added a pattern matching version that was fine on 1.45.2.
Add test for issue rust-lang#27130 rust-lang#27130 seems to be fixed by the llvm 11 update. The issue is marked with needs-test, so here it is. As some historical context, the generated code was fine until 1.38, and remained unoptimized from 1.38 up until the current nightly. I've also added a pattern matching version that was fine on 1.45.2.
Rollup of 14 pull requests Successful merges: - rust-lang#75832 (Move to intra-doc links for wasi/ext/fs.rs, os_str_bytes.rs…) - rust-lang#75852 (Switch to intra-doc links in `core::hash`) - rust-lang#75874 (Shorten liballoc doc intra link while readable) - rust-lang#75881 (Expand rustdoc theme chooser x padding) - rust-lang#75885 (Fix another clashing_extern_declarations false positive.) - rust-lang#75892 (Fix typo in TLS Model in Unstable Book) - rust-lang#75910 (Add test for issue rust-lang#27130) - rust-lang#75917 (Move to intra doc links for core::ptr::non_null) - rust-lang#75975 (Allow --bess ing expect-tests in tools) - rust-lang#75990 (Add __fastfail for Windows on arm/aarch64) - rust-lang#76015 (Fix loading pretty-printers in rust-lldb script) - rust-lang#76022 (Clean up rustdoc front-end source code) - rust-lang#76029 (Move to intra-doc links for library/core/src/sync/atomic.rs) - rust-lang#76057 (Move retokenize hack to save_analysis) Failed merges: r? @ghost
The test is added by #75910, closing. |
When passing a
&mut &[u8]
to a function that trims elements from the start/end of the slice, the code contains pointless length checks and stack traffic. Writing the function slightly differently, or inlining it, does produce the code I'd expect.This is a minimal example (playground of all following snippets):
I expect this to compile to a tight loop that simply decrements the slice length and increments the data pointer in registers, until the length reaches 0 or a byte value of 42 is encountered.
Instead, the loop writes to the stack on every iteration and checks twice whether the length is zero, trying to panic for the second check (presumably comes from the
a[1..]
slicing).The
.LBB0_9
branch cannot ever be taken, and is indeed optimized out when writing the function in differently. The stack traffic also disappears, though that probably has an entirely different cause. This variant compiles to better code:It does all the loop calculations in registers. Also, marking the first function as
inline(always)
and defining the second function in terms of it ({ trim_in_place(&mut a); a }
) optimizes to the same code.Both variants check if the slice is empty before entering the loop and return immediately in that case.
Background
I encountered this problem in the context of my dec2flt work, in a larger and more complicated function that trims two slices on both ends, and the slices are stored in a struct. Applying
inline(always)
to that function gives a small but measurable performance improvement across the board,but increases code size by 10 KiB(Edit: I just didn't measure properly. Size is about the same.). I haven't checked the assembly code (far too much code) but I suspect this means this problem, and solution, carries over to real code.Meta
Reproduces on nightly playground. Stable is worse for all variants, presumably because it doesn't have #26411 (yay, progress!). My local
rustc
has the same problem. The exact commit is meaningless because it only exists in myrust.git
fork. It's basedf9f580936d81ed527115ae86375f69bb77723b1c
.The text was updated successfully, but these errors were encountered: