Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop emitting one-at-a-time byte ops when swapping byte arrays #134946

Open
scottmcm opened this issue Dec 31, 2024 · 0 comments · May be fixed by #134954
Open

Stop emitting one-at-a-time byte ops when swapping byte arrays #134946

scottmcm opened this issue Dec 31, 2024 · 0 comments · May be fixed by #134954
Assignees
Labels
A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@scottmcm
Copy link
Member

scottmcm commented Dec 31, 2024

Mentioned on Discord https://discord.com/channels/273534239310479360/592856094527848449/1319367290902286367

Demo when swapping [u8; 44]: https://rust.godbolt.org/z/rznror9aG

Especially with opt-level 2, there's still

        movzx   eax, byte ptr [rdi + 32]
        movzx   ecx, byte ptr [rsi + 32]
        mov     byte ptr [rdi + 32], cl
        mov     byte ptr [rsi + 32], al
        movzx   eax, byte ptr [rdi + 33]
        movzx   ecx, byte ptr [rsi + 33]
        mov     byte ptr [rdi + 33], cl
        mov     byte ptr [rsi + 33], al
        movzx   eax, byte ptr [rdi + 34]
        movzx   ecx, byte ptr [rsi + 34]
        mov     byte ptr [rdi + 34], cl
        mov     byte ptr [rsi + 34], al
        movzx   eax, byte ptr [rdi + 35]
        movzx   ecx, byte ptr [rsi + 35]
        mov     byte ptr [rdi + 35], cl
        mov     byte ptr [rsi + 35], al
        movzx   eax, byte ptr [rdi + 36]
        movzx   ecx, byte ptr [rsi + 36]
        mov     byte ptr [rdi + 36], cl
        mov     byte ptr [rsi + 36], al
        movzx   eax, byte ptr [rdi + 37]
        movzx   ecx, byte ptr [rsi + 37]
        mov     byte ptr [rdi + 37], cl
        mov     byte ptr [rsi + 37], al
        movzx   eax, byte ptr [rdi + 38]
        movzx   ecx, byte ptr [rsi + 38]
        mov     byte ptr [rdi + 38], cl
        mov     byte ptr [rsi + 38], al
        movzx   eax, byte ptr [rdi + 39]
        movzx   ecx, byte ptr [rsi + 39]
        mov     byte ptr [rdi + 39], cl
        mov     byte ptr [rsi + 39], al
        movzx   eax, byte ptr [rdi + 40]
        movzx   ecx, byte ptr [rsi + 40]
        mov     byte ptr [rdi + 40], cl
        mov     byte ptr [rsi + 40], al
        movzx   eax, byte ptr [rdi + 41]
        movzx   ecx, byte ptr [rsi + 41]
        mov     byte ptr [rdi + 41], cl
        mov     byte ptr [rsi + 41], al
        movzx   eax, byte ptr [rdi + 42]
        movzx   ecx, byte ptr [rsi + 42]
        mov     byte ptr [rdi + 42], cl
        mov     byte ptr [rsi + 42], al
        movzx   eax, byte ptr [rdi + 43]
        movzx   ecx, byte ptr [rsi + 43]
        mov     byte ptr [rdi + 43], cl
        mov     byte ptr [rsi + 43], al
        ret

And in opt-level 3 it's

  %x16.i.i.i = getelementptr inbounds i8, ptr %a, i64 32, !dbg !55
  %y18.i.i.i = getelementptr inbounds i8, ptr %b, i64 32, !dbg !62
  %a19.i.i.i = load i8, ptr %x16.i.i.i, align 1, !dbg !12
  %b20.i.i.i = load i8, ptr %y18.i.i.i, align 1, !dbg !44
  store i8 %b20.i.i.i, ptr %x16.i.i.i, align 1, !dbg !48
  store i8 %a19.i.i.i, ptr %y18.i.i.i, align 1, !dbg !53
  %x16.i.i.i.1 = getelementptr inbounds i8, ptr %a, i64 33, !dbg !55
  %y18.i.i.i.1 = getelementptr inbounds i8, ptr %b, i64 33, !dbg !62
  %a19.i.i.i.1 = load i8, ptr %x16.i.i.i.1, align 1, !dbg !12
  %b20.i.i.i.1 = load i8, ptr %y18.i.i.i.1, align 1, !dbg !44
  store i8 %b20.i.i.i.1, ptr %x16.i.i.i.1, align 1, !dbg !48
  store i8 %a19.i.i.i.1, ptr %y18.i.i.i.1, align 1, !dbg !53
  %x16.i.i.i.2 = getelementptr inbounds i8, ptr %a, i64 34, !dbg !55
  %y18.i.i.i.2 = getelementptr inbounds i8, ptr %b, i64 34, !dbg !62
  %a19.i.i.i.2 = load i8, ptr %x16.i.i.i.2, align 1, !dbg !12
  %b20.i.i.i.2 = load i8, ptr %y18.i.i.i.2, align 1, !dbg !44
  store i8 %b20.i.i.i.2, ptr %x16.i.i.i.2, align 1, !dbg !48
  store i8 %a19.i.i.i.2, ptr %y18.i.i.i.2, align 1, !dbg !53
  %x16.i.i.i.3 = getelementptr inbounds i8, ptr %a, i64 35, !dbg !55
  %y18.i.i.i.3 = getelementptr inbounds i8, ptr %b, i64 35, !dbg !62
  %a19.i.i.i.3 = load i8, ptr %x16.i.i.i.3, align 1, !dbg !12
  %b20.i.i.i.3 = load i8, ptr %y18.i.i.i.3, align 1, !dbg !44
  store i8 %b20.i.i.i.3, ptr %x16.i.i.i.3, align 1, !dbg !48
  store i8 %a19.i.i.i.3, ptr %y18.i.i.i.3, align 1, !dbg !53
  %x16.i.i.i.4 = getelementptr inbounds i8, ptr %a, i64 36, !dbg !55
  %y18.i.i.i.4 = getelementptr inbounds i8, ptr %b, i64 36, !dbg !62
  %a19.i.i.i.4 = load i8, ptr %x16.i.i.i.4, align 1, !dbg !12
  %b20.i.i.i.4 = load i8, ptr %y18.i.i.i.4, align 1, !dbg !44
  store i8 %b20.i.i.i.4, ptr %x16.i.i.i.4, align 1, !dbg !48
  store i8 %a19.i.i.i.4, ptr %y18.i.i.i.4, align 1, !dbg !53
  %x16.i.i.i.5 = getelementptr inbounds i8, ptr %a, i64 37, !dbg !55
  %y18.i.i.i.5 = getelementptr inbounds i8, ptr %b, i64 37, !dbg !62
  %a19.i.i.i.5 = load i8, ptr %x16.i.i.i.5, align 1, !dbg !12
  %b20.i.i.i.5 = load i8, ptr %y18.i.i.i.5, align 1, !dbg !44
  store i8 %b20.i.i.i.5, ptr %x16.i.i.i.5, align 1, !dbg !48
  store i8 %a19.i.i.i.5, ptr %y18.i.i.i.5, align 1, !dbg !53
  %x16.i.i.i.6 = getelementptr inbounds i8, ptr %a, i64 38, !dbg !55
  %y18.i.i.i.6 = getelementptr inbounds i8, ptr %b, i64 38, !dbg !62
  %a19.i.i.i.6 = load i8, ptr %x16.i.i.i.6, align 1, !dbg !12
  %b20.i.i.i.6 = load i8, ptr %y18.i.i.i.6, align 1, !dbg !44
  store i8 %b20.i.i.i.6, ptr %x16.i.i.i.6, align 1, !dbg !48
  store i8 %a19.i.i.i.6, ptr %y18.i.i.i.6, align 1, !dbg !53
  %x16.i.i.i.7 = getelementptr inbounds i8, ptr %a, i64 39, !dbg !55
  %y18.i.i.i.7 = getelementptr inbounds i8, ptr %b, i64 39, !dbg !62
  %a19.i.i.i.7 = load i8, ptr %x16.i.i.i.7, align 1, !dbg !12
  %b20.i.i.i.7 = load i8, ptr %y18.i.i.i.7, align 1, !dbg !44
  store i8 %b20.i.i.i.7, ptr %x16.i.i.i.7, align 1, !dbg !48
  store i8 %a19.i.i.i.7, ptr %y18.i.i.i.7, align 1, !dbg !53
  %x16.i.i.i.8 = getelementptr inbounds i8, ptr %a, i64 40, !dbg !55
  %y18.i.i.i.8 = getelementptr inbounds i8, ptr %b, i64 40, !dbg !62
  %a19.i.i.i.8 = load i8, ptr %x16.i.i.i.8, align 1, !dbg !12
  %b20.i.i.i.8 = load i8, ptr %y18.i.i.i.8, align 1, !dbg !44
  store i8 %b20.i.i.i.8, ptr %x16.i.i.i.8, align 1, !dbg !48
  store i8 %a19.i.i.i.8, ptr %y18.i.i.i.8, align 1, !dbg !53
  %x16.i.i.i.9 = getelementptr inbounds i8, ptr %a, i64 41, !dbg !55
  %y18.i.i.i.9 = getelementptr inbounds i8, ptr %b, i64 41, !dbg !62
  %a19.i.i.i.9 = load i8, ptr %x16.i.i.i.9, align 1, !dbg !12
  %b20.i.i.i.9 = load i8, ptr %y18.i.i.i.9, align 1, !dbg !44
  store i8 %b20.i.i.i.9, ptr %x16.i.i.i.9, align 1, !dbg !48
  store i8 %a19.i.i.i.9, ptr %y18.i.i.i.9, align 1, !dbg !53
  %x16.i.i.i.10 = getelementptr inbounds i8, ptr %a, i64 42, !dbg !55
  %y18.i.i.i.10 = getelementptr inbounds i8, ptr %b, i64 42, !dbg !62
  %a19.i.i.i.10 = load i8, ptr %x16.i.i.i.10, align 1, !dbg !12
  %b20.i.i.i.10 = load i8, ptr %y18.i.i.i.10, align 1, !dbg !44
  store i8 %b20.i.i.i.10, ptr %x16.i.i.i.10, align 1, !dbg !48
  store i8 %a19.i.i.i.10, ptr %y18.i.i.i.10, align 1, !dbg !53
  %x16.i.i.i.11 = getelementptr inbounds i8, ptr %a, i64 43, !dbg !55
  %y18.i.i.i.11 = getelementptr inbounds i8, ptr %b, i64 43, !dbg !62
  %a19.i.i.i.11 = load i8, ptr %x16.i.i.i.11, align 1, !dbg !12
  %b20.i.i.i.11 = load i8, ptr %y18.i.i.i.11, align 1, !dbg !44
  store i8 %b20.i.i.i.11, ptr %x16.i.i.i.11, align 1, !dbg !48
  store i8 %a19.i.i.i.11, ptr %y18.i.i.i.11, align 1, !dbg !53
  ret void, !dbg !65
@scottmcm scottmcm self-assigned this Dec 31, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 31, 2024
@saethlin saethlin added A-codegen Area: Code generation T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Dec 31, 2024
bors added a commit to rust-lang-ci/rust that referenced this issue Dec 31, 2024
Redo the swap code for better tail & padding handling

A couple of parts here

## Fixes rust-lang#134713

Actually using the type passed to `ptr::swap_nonoverlapping` for anything other than its size + align turns out to not work, so this redo goes back to *always* swapping via not-`!noundef` integers.

(Except in `const`, which keeps doing the same thing as before to preserve `@RalfJung's` fix from rust-lang#134689)

## Fixes rust-lang#134946

I'd previously moved the swapping to use auto-vectorization, but someone pointed out on Discord that the tail loop handling from that left a whole bunch of byte-by-byte swapping around.  This PR goes back to a manual chunk, with at most logarithmic more instructions for the tail.

(There are other ways that could potentially handle the tail even better, but this seems to be pretty good, since it's how LLVM ends up lowering operations on types like `i56`.)

## Polymorphization

Since it's silly to have separate copies of swapping -- especially *untyped* swapping! -- for `u32`, `i32`, `f32`, `[u16; 2]`, etc, this sends everything to byte versions, but still mono'd by alignment.  That should make it more ok that the code is somewhat more complex, since we only get 7 monomorphizations of the complicated bit.

(One day we'll be able to remove some of the hacks by being able to just call `foo::<{align_of::<T>()}>`, but since alignments are only powers of two, the manual dispatch out isn't a big deal.)
@jieyouxu jieyouxu added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants