Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

u32::from_be_bytes(*bytes) generates suboptimal code on riscv #88852

Closed
xobs opened this issue Sep 11, 2021 · 9 comments
Closed

u32::from_be_bytes(*bytes) generates suboptimal code on riscv #88852

xobs opened this issue Sep 11, 2021 · 9 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-heavy Issue: Problems and improvements with respect to binary size of generated code. O-riscv Target: RISC-V architecture T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@xobs
Copy link
Contributor

xobs commented Sep 11, 2021

I tried this code:

pub fn from_be_slice_manual(bytes: &[u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
    | ((bytes[1] as u32) << 16)
    | ((bytes[2] as u32) << 8)
    | (bytes[3] as u32)
}

pub fn from_be_slice_intrinsic(bytes: &[u8; 4]) -> u32 {
    u32::from_be_bytes(*bytes)
}

I expected to see this happen: Both should produce equal output, or at the very least the one using the intrinsic should be acceptable. When building on x86 these produce the same output, and on arm-unknown-linux-gnueabi the output is different but not terrible. On riscv64gc-unknown-linux-gnu the asm generated by the intrinsic is massive.

Instead, this happened:

example::from_be_slice_manual:
        lb      a1, 0(a0)
        lbu     a2, 1(a0)
        slli    a1, a1, 24
        lbu     a3, 2(a0)
        slli    a2, a2, 16
        lbu     a0, 3(a0)
        or      a1, a1, a2
        slli    a2, a3, 8
        or      a1, a1, a2
        or      a0, a0, a1
        ret

example::from_be_slice_intrinsic:
        lbu     a1, 1(a0)
        lbu     a2, 0(a0)
        lb      a3, 3(a0)
        lbu     a0, 2(a0)
        slli    a1, a1, 8
        or      a1, a1, a2
        slli    a2, a3, 8
        or      a0, a0, a2
        slli    a0, a0, 16
        or      a0, a0, a1
        slli    a1, a0, 8
        addi    a2, zero, 255
        slli    a3, a2, 32
        and     a1, a1, a3
        slli    a3, a0, 24
        slli    a4, a2, 40
        and     a3, a3, a4
        or      a1, a1, a3
        slli    a3, a0, 40
        slli    a2, a2, 48
        and     a2, a2, a3
        slli    a0, a0, 56
        or      a0, a0, a2
        or      a0, a0, a1
        srli    a0, a0, 32
        ret

Meta

rustc --version --verbose:

rustc 1.55.0
1
rustc 1.55.0 - 786ms

rustc 1.55.0 (c8dfcfe04 2021-09-06)
binary: rustc
commit-hash: c8dfcfe046a7680554bf4eb612bad840e7631c4b
commit-date: 2021-09-06
host: x86_64-unknown-linux-gnu
release: 1.55.0
LLVM version: 12.0.1

Godbolt link: https://godbolt.org/z/aPPdnond5

@xobs xobs added the C-bug Category: This is a bug. label Sep 11, 2021
@workingjubilee workingjubilee added I-heavy Issue: Problems and improvements with respect to binary size of generated code. O-riscv Target: RISC-V architecture labels Sep 12, 2021
@klensy
Copy link
Contributor

klensy commented Sep 12, 2021

This is not slices, this is arrays, yes?

Taking array by value converts from_be_slice_intrinsic to from_be_slice_manual (or better)
arm-unknown-linux-gnueabi https://godbolt.org/z/684TfW15E
riscv64gc-unknown-linux-gnu https://godbolt.org/z/fGYc5Yvcr

this examples on rustc 1.57.0-nightly (b69fe5726 2021-09-10) or newer

@xobs
Copy link
Contributor Author

xobs commented Sep 13, 2021

I'm not sure if these are considered slices or arrays -- I'm a bit fuzzy on the terminology.

Some background: This issue was spotted in some networking code that operates on &[u8], and gets subslices with val[0..4]. These subslices are then converted into integers, which is why we're using big endian. When bytes come in from the network, they're behind a slice, which is why we need to dereference things at some point.

On RISC-V, taking array by value is still less efficient than manually packing because it adds two more instructions.

https://godbolt.org/z/5Khj5M3GK

@klensy
Copy link
Contributor

klensy commented Sep 13, 2021

u32::from_be_bytes (with --emit=llvm-ir flag to view IR) use single instruction call i32 @llvm.bswap.i32(i32 %0), where from_be_slice_manual compiles into a bunch of instructions. Even if manual version it shorter, is it faster?

Ofc there possibility that implementation of bswap for that target not the fastest.

@nagisa nagisa added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Sep 13, 2021
@workingjubilee
Copy link
Member

I'm not sure if these are considered slices or arrays -- I'm a bit fuzzy on the terminology.

[u8; 4] is an array.
[u8] is a slice.
&[u8; 4] is a borrowed array (or reference to an array).
&[u8] is a borrowed slice (or reference to a slice).
[u8] is dynamically sized so it must be used via indirection.

I believe borrowing arrays can coerce them into slices in some cases, though I am always slightly fuzzy on the rules for that, but it is something to watch out for, as an array generally has a known length and so is easier to optimize. Likewise small byte counts like this should generally be taken by-value and not by-reference.

@workingjubilee workingjubilee added the A-codegen Area: Code generation label Sep 13, 2021
@workingjubilee
Copy link
Member

I should note that "two instructions" is not necessarily a significant performance hit. I am not super-familiar with the performance details of RISC-V architectures, but there is the usual issue of superscalar out-of-order execution leading to surprising, non-intuitive results, and for RISC-V at least some implementers advocate "macro-op fusion". At decode time the architecture is entitled to fuse two machine-assembly instructions and execute them as one machine-internal operation. So we should be very careful when we talk about "efficient"... do we mean in speed or in binary size? On the large scale, they are often very similar, but not always on this tiny nanosecond scale. This is why I added I-heavy and not I-slow.

When changing your version of the code to

pub fn from_be_array_manual(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
    | ((bytes[1] as u32) << 16)
    | ((bytes[2] as u32) << 8)
    | (bytes[3] as u32)
}

it emits identical code to

pub fn from_be_array_intrinsic(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}

I think this issue still matters as this is unexpectedly heavy in the case with the indirection: Either rustc or LLVM should see we do a copy anyways and thus try to make everything more transparent. Since you mentioned this is an issue in a larger codebase, I would like to know if you have a slightly more complex example that shows it does not optimize well in the hot path of some loop: that would help troubleshoot this greatly.

@memoryruins
Copy link
Contributor

Trying the basic bit manipulation target feature mentioned in #100528, some of these examples reduce in size.

pub fn from_be_array_intrinsic(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}

Without the target feature,

example::from_be_array_intrinsic:
        srliw   a1, a0, 8
        lui     a2, 16
        addiw   a2, a2, -256
        and     a1, a1, a2
        srliw   a2, a0, 24
        or      a1, a1, a2
        slli    a2, a0, 8
        lui     a3, 4080
        and     a2, a2, a3
        slliw   a0, a0, 24
        or      a0, a0, a2
        or      a0, a0, a1
        ret

With -Ctarget-feature=+zbb,

example::from_be_array_intrinsic:
        rev8    a0, a0
        srli    a0, a0, 32
        ret

https://godbolt.org/z/5abahK6r1

@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
@istankovic
Copy link
Contributor

With 1.73, even without using the target feature, I think this is good enough with respect to the original description:
image

@istankovic
Copy link
Contributor

@xobs could you please check whether current rustc behaves as you expect so we can close this?

@xobs
Copy link
Contributor Author

xobs commented Jan 31, 2024

Yes, this behaves as expected. Thank you for following up!

@xobs xobs closed this as completed Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-heavy Issue: Problems and improvements with respect to binary size of generated code. O-riscv Target: RISC-V architecture T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants