
[Issue #296] Rust: Use a tighter inner type for Error #356

Closed

Conversation


@foxhlchen foxhlchen commented Jun 5, 2021

Use NonZeroI32 as the inner type. This allows Result<()> to fit in 32 bits instead of 64.

V1->V2

  • Use NonZeroI32 instead of NonZeroI16
  • Rewrite kernel_const_to_error as a const function
  • Use unsafe NonZeroI32 creator in from_kernel_errno to avoid unwrap()

V2->V3

  • Add a comment explaining the reason for choosing NonZeroI32 for Error
  • Rewrite kernel_const_to_error() back to a macro and add static_assert! to it.
  • Use Error(unsafe { NonZeroI16::new_unchecked(errno as i16) }) in from_kernel_errno() to make it terser

V3->V4

  • Following Miguel's review, fix some comment format problems

V4->V5

  • Rewrite kernel_const_to_error() to a const function again. 😮‍💨

Related commit:
#296
May influence:
#358

Collaborator

@TheSven73 TheSven73 left a comment


This is feedback mostly for @nbdd0121 and @ojeda - because @foxhlchen is doing what #296 is asking for.

What are the benefits of using a NonZero inner type here? Sure, it cannot be 0 - but that doesn't provide us with any benefit, since we already strictly enforce the [-MAX_ERRNO..-1] invariant.

I see plenty of downsides:

  • more complex code for little or no gain
  • if we use NonZeroI16::new() (as happens here), there is no need for unsafe code, but we need an unwrap() which could potentially panic! if things go wrong. This is rather the opposite of the "fail gracefully" we are working towards
  • if we use the unsafe NonZeroI16::new_unchecked() (as also happens here), then we are using unsafe code for no good reason, with a good chance of introducing UB when 0 is passed by mistake. Again, this is the opposite of the "fail gracefully" we are working towards

Suggestion: let's simply use an i16 for the inner type. The invariant is still enforced. And it allows us to get rid of awkward casts. This is actually a very elegant and simple change.

@nbdd0121
Member

nbdd0121 commented Jun 5, 2021

It has performance implications. NonZeroU16 gives the compiler the opportunity to use the value 0 to represent other meanings. So for Result<(), NonZeroU16>, Ok(()) is represented simply as the value 0, while Err(value) is represented as value. For Result<(), u16>, however, Err(value) is represented with the errno plus a tag of 1. So we need two registers for the return value instead of just one. You can see the assembly comparison: https://godbolt.org/z/r7E7vMh9x
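The layout effect described here can be checked with a few lines of plain std Rust (a standalone sketch outside the kernel tree, not the actual kernel code):

```rust
use std::mem::size_of;
use std::num::NonZeroI32;

fn main() {
    // NonZeroI32 reserves 0 as a niche, so Result can use 0 to encode
    // Ok(()) and the whole Result fits in the payload's 4 bytes.
    assert_eq!(size_of::<Result<(), NonZeroI32>>(), 4);
    // A plain i32 payload has no niche, so a separate discriminant is
    // needed and the Result doubles in size.
    assert_eq!(size_of::<Result<(), i32>>(), 8);
}
```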

/// # Safety
///
/// The parameter must be a valid kernel error number.
macro_rules! kernel_const_to_error {
Member

This can be made into a const fn.


+1 on converting this to a const function. I think it makes sense for it to be a constructor of Error, something like:

    const fn from_const(errno: u32) -> Self {
        let value = -(errno as i32);
        if value < -(bindings::MAX_ERRNO as i32) || value >= 0 {
            panic!("Error number out of range");
        }
        // SAFETY: We have checked above that `value` is within the right range.
        unsafe { Self::from_kernel_errno_unchecked(value) }
    }

We should only use macros when no reasonable alternative exists within the core language.

@@ -76,7 +87,8 @@ impl Error {

// INVARIANT: the check above ensures the type invariant
// will hold.
Error(errno)
let nzi16_errno = NonZeroI16::new(errno as i16).unwrap();
Member

Use new_unchecked since errno is checked to be non-zero.

@TheSven73
Collaborator

It has performance implications. NonZeroU16 gives the compiler the opportunity to use the value 0 to represent other meanings.

That's fascinating to see - genuinely, no snark! Yet, do we have any measurements or benchmarks to show that this really makes a difference in the real world?

In the absence of the latter, should we choose a solution that favours simplicity, and least chance of inadvertent panic! or UB? I believe i16 could be that solution.

There's nothing preventing us from reviewing a NonZero inner type in a separate PR. But my review comment there would be identical - favour simplicity unless real-world evidence that it does make a difference. And even then, it's a trade-off.

Collaborator

@TheSven73 TheSven73 left a comment


I am not on board with a NonZero inner type, but will review just in case :)

@@ -102,11 +117,14 @@ impl fmt::Debug for Error {
fn rust_helper_errname(err: c_types::c_int) -> *const c_types::c_char;
}
// SAFETY: FFI call.
let name = unsafe { rust_helper_errname(-self.0) };
let name = unsafe { rust_helper_errname(-self.to_kernel_errno()) };
Collaborator

Is the to_kernel_errno() call required here? Why not use -self.0.get() ?

Author

Is the to_kernel_errno() call required here?

no

Why not use -self.0.get() ?

To me, using to_kernel_errno() is clearer, and if we change the inner representation of Error again in the future, we don't need to touch this code. But that is just my personal taste. :)

//
// Safety: `errno` must not be zero, which is guaranteed by the contract
// of this function.
Error(unsafe { NonZeroI16::new_unchecked(errno as i16) })
Collaborator

Does this need a // CAST annotation that explains why casting a c_int to an i16 won't lose any bits and result in UB?

@ojeda
Member

ojeda commented Jun 5, 2021

Hmm... you guys make it tough for me to choose. :-)

On one hand, I agree with @TheSven73 that it might be slightly "too much" before we hit mainline. People will be likely to look at this file in particular, since some may have heard about "Rust nice error handling", "Result", "pattern matching", "sum types", etc. and instead of seeing something straightforward and having the impression "oh, this actually makes sense to me, neat!", they may leave with a feeling of "wow... too fancy, another C++?!".

On the other hand, perhaps Result<()> is common enough that it merits this. And we can show there is an actual advantage with a simple Compiler Explorer example. Some people like a lot seeing such kind of optimizations.

@TheSven73
Collaborator

they may leave with a feeling of "wow... too fancy, another C++?!".
[...]
Some people like a lot seeing such kind of optimizations.

Consider very carefully the viewpoint and incentives of the people who are most important for you to convince.

@TheSven73
Collaborator

On the other hand, perhaps Result<()> is common enough that it merits this.

Allow me to play Devil's advocate - I love doing that :)
How would you (or we) know that it does?

@ojeda
Member

ojeda commented Jun 5, 2021

A benchmark! :)

In general, we should strive for simplicity unless there is a reason not to (a reason being a benchmark showing it makes a difference), especially before mainline.

@TheSven73
Collaborator

TheSven73 commented Jun 5, 2021

A benchmark! :)

🥇 I'm assuming we only have a single non-trivial, performance-sensitive, real-world Rust-for-Linux application right now? That would be binder?

Something tells me that @wedsonaf might have benchmarks somewhere. But that discussion would belong in a separate PR IMHO.

@ojeda
Member

ojeda commented Jun 5, 2021

That would be binder?

Yeah, I would say so. (Wedson is on holidays, I think he is back next week).

@nbdd0121
Member

nbdd0121 commented Jun 5, 2021

Actually, maybe we can make the inner type a NonZeroI32 -- in that case Result<(), Error> is ABI-compatible with a c_int! It'll be truly zero-cost ;)

@foxhlchen
Author

Actually, maybe we can make the inner type a NonZeroI32 -- in that case Result<(), Error> is ABI-compatible with a c_int! It'll be truly zero-cost ;)

Then it's no better than using i16? Both will occupy 4 bytes.

@nbdd0121
Member

nbdd0121 commented Jun 5, 2021

Then it's no better than using i16? Both will occupy 4 bytes.

You don't need sign extensions when converting a Result<(), Error> to a c_int. Many architectures can handle i32 better than i16; e.g. on x86-64, a simple mov will clear the upper 32 bits (so it can be elided away in many cases), but you'll need a movsx to clear the upper 48 bits.

@foxhlchen
Author

Then it's no better than using i16? Both will occupy 4 bytes.

You don't need sign extensions when converting a Result<(), Error> to a c_int. Many architectures can handle i32 better than i16; e.g. on x86-64, a simple mov will clear the upper 32 bits (so it can be elided away in many cases), but you'll need a movsx to clear the upper 48 bits.

OK, but this is also true for NonZeroI16, right? We can save a lot of space but may incur performance penalties.

@nbdd0121
Member

nbdd0121 commented Jun 6, 2021

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

So performance-wise we have NonZeroI32 > NonZeroI16 > i32 > i16, (on-memory) space-wise we have NonZeroI16 < NonZeroI32 = i16 < i32 (assuming all wrapped inside Result<(), E>), (in-register) space-wise we have NonZeroI16 = NonZeroI32 < i16 = i32.
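The on-memory part of this ordering can be verified directly with size_of in std Rust (a standalone sketch; the register behaviour is only visible in generated assembly):

```rust
use std::mem::size_of;
use std::num::{NonZeroI16, NonZeroI32};

fn main() {
    // Niche-optimized payloads: the Result is the same size as the payload.
    assert_eq!(size_of::<Result<(), NonZeroI16>>(), 2);
    assert_eq!(size_of::<Result<(), NonZeroI32>>(), 4);
    // Plain payloads: a discriminant plus padding is added on top.
    assert_eq!(size_of::<Result<(), i16>>(), 4);
    assert_eq!(size_of::<Result<(), i32>>(), 8);
}
```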

@foxhlchen
Author

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

So performance-wise we have NonZeroI32 > NonZeroI16 > i32 > i16, (on-memory) space-wise we have NonZeroI16 < NonZeroI32 = i16 < i32 (assuming all wrapped inside Result<(), E>), (in-register) space-wise we have NonZeroI16 = NonZeroI32 < i16 = i32.

Got it, thanks for the explanation. I now think NonZeroI32 is the best choice. :)

Let me change it to use NonZeroI32.


//
// Safety: `errno` must be non zero, which is guaranteed by the check
// above.
let nz_errno = unsafe { NonZeroI32::new_unchecked(errno) };
Collaborator

Would it be better to call from_kernel_errno_unchecked() here to prevent having to write

Error(unsafe { NonZeroI32::new_unchecked(errno) })

twice?

Author

You mean

unsafe { Error::from_kernel_errno_unchecked() }

?
It's not terser.

@@ -21,44 +21,49 @@ use core::str::{self, Utf8Error};
///
/// The value is a valid `errno` (i.e. `>= -MAX_ERRNO && < 0`).
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct Error(c_types::c_int);
pub struct Error(NonZeroI32);
Collaborator

If you guys believe that NonZeroI32 is optimal, perhaps put a comment that explains why?
Otherwise we might get PRs from newcomers who think that NonZeroI32 occupies too much space, so they'll replace with NonZeroI16...


impl Error {
/// Invalid argument.
pub const EINVAL: Self = Error(-(bindings::EINVAL as i32));
pub const EINVAL: Self = kernel_const_to_error(bindings::EINVAL);
Collaborator

I am unfamiliar with Rust macros, but I wonder if it would be possible to write a macro that takes the Error name as the input parameter, and expands to the whole pub const definition? Maybe even a static_assert! too?

Example:

declare_const_error!(EINVAL);
// would expand to
static_assert!(bindings::EINVAL > 0 && bindings::EINVAL <= bindings::MAX_ERRNO);
pub const EINVAL: Self = Error(unsafe { NonZeroI32::new_unchecked(-(bindings::EINVAL as i32)) });

Author

Good idea!

But I don't know why even static_assert!(bindings::ENOMEM < bindings::MAX_ERRNO) doesn't work.

I got the following error:

error: `const` items in this context need a name
  --> rust/kernel/static_assert.rs:36:15
   |
30 | / macro_rules! static_assert {
31 | |     ($condition:expr) => {
32 | |         // Based on the latest one in `rustc`'s one before it was [removed].
33 | |         //
...  |
36 | |         const _: () = [()][!($condition) as usize];
   | |               ^ `_` is not a valid name for this `const` item
37 | |     };
38 | | }
   | |_- in this expansion of `static_assert!`
   |
  ::: rust/kernel/error.rs:42:5
   |
42 |       static_assert!(bindings::ENOMEM < bindings::MAX_ERRNO);
   |       ------------------------------------------------------- in this macro invocation

Member

If you write it the way @TheSven73 suggests, the const _ in static_assert! will become an associated constant rather than just a normal item, so it must have a name. It'll need to expand into something like

pub const EINVAL: Self = {
    static_assert!(bindings::EINVAL > 0 && bindings::EINVAL <= bindings::MAX_ERRNO);
    Error(unsafe { NonZeroI32::new_unchecked(-(bindings::EINVAL as i32)) })
};

But I think it'll be better to put the check in a const fn + const panic instead of using static_assert in this case:

const fn i_am_bad_at_naming_things(errno: i32) -> Error {
    // Panic at compile time unless `errno` is in `[-MAX_ERRNO, -1]`.
    assert!(errno >= -(bindings::MAX_ERRNO as i32) && errno < 0);
    Error(unsafe { NonZeroI32::new_unchecked(errno) })
}

pub const EINVAL: Self = i_am_bad_at_naming_things(-(bindings::EINVAL as i32));

Author

Got it. Thank you, Gary!!

Can I combine these? I.e. declare an i_am_bad_at_naming_things() function and write a macro to wrap
pub const EINVAL: Self = i_am_bad_at_naming_things(-(bindings::EINVAL as i32));
into declare_const_error!(EINVAL);

Do you think it's a good idea?

Author

nvm, I opted for the first approach.

macro_rules! declare_const_error {
    ($($tt:tt)*) => {
        pub const $($tt)*: Self = {
            static_assert!(bindings::$($tt)* <= bindings::MAX_ERRNO && bindings::$($tt)* > 0);
            Error(unsafe { NonZeroI32::new_unchecked(-(bindings::$($tt)* as i32)) })
        };
    };
}

Thank you again!

Author

Agreed, putting docs inside the macro invocation seems bizarre to me.
I don't think it will be clearer.

Member

Actually, you can have everything in a single macro:

macro_rules! declare_const_error {
    ($($(#[$attr:meta])* $name:ident)*) => { $(
        $(#[$attr])*
        pub const $name: Self = {
            static_assert!(bindings::$name <= bindings::MAX_ERRNO && bindings::$name > 0);
            Error(unsafe { NonZeroI32::new_unchecked(-(bindings::$name as i32)) })
        };
    )* };
}

declare_const_error! {
    /// Invalid argument.
    EINVAL
    /// Out of memory.
    ENOMEM
    // and more
}

Collaborator

I wonder if this const error declaration macro (and its invocations) should go into its own dedicated PR?
Also, perhaps it could link up with @sladyn98's #358 somehow?

@foxhlchen foxhlchen Jun 8, 2021

Actually, you can have everything in a single macro:

macro_rules! declare_const_error {
    ($($(#[$attr:meta])* $name:ident)*) => { $(
        $(#[$attr])*
        pub const $name: Self = {
            static_assert!(bindings::$name <= bindings::MAX_ERRNO && bindings::$name > 0);
            Error(unsafe { NonZeroI32::new_unchecked(-(bindings::$name as i32)) })
        };
    )* };
}

declare_const_error! {
    /// Invalid argument.
    EINVAL
    /// Out of memory.
    ENOMEM
    // and more
}

This is cool. However, I strongly disagree with using something like it.
After experimenting with it, I realize the initial way is clearest.

For

impl Error {
    pub const EINVAL: Self = kernel_const_to_error(bindings::EINVAL);
}

A glimpse is enough to understand that we have an EINVAL const in the Error scope whose value is somehow related to the kernel's EINVAL (and you can easily infer that the value of EINVAL equals negative bindings::EINVAL).

But with

impl Error {
    declare_const_error! {
        /// Invalid argument.
        EINVAL
        /// Out of memory.
        ENOMEM
        // and more
    }
}

You need to spend time figuring out the magic here, especially if you are not familiar with Rust macros.
It adds mental burden for developers.

Author

I wonder if this const error declaration macro (and its invocations) should go into its own dedicated PR?
Also, perhaps it could link up with @sladyn98's #358 somehow?

Let me add a reference.

@bjorn3
Member

bjorn3 commented Jun 6, 2021

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

Result<(), i16> is not an aggregate, but a scalar pair. If it were an aggregate, it would actually be passed in a single register in the Rust ABI. The ABI calculation logic for the "Rust" ABI is currently roughly:

  • Type is uninhabited or a ZST => skip it.
  • Type is a scalar => pass in a single register.
  • Type is a scalar pair => pass in two registers.
  • Type is a vector => pass a pointer to the value. (this is necessary to allow different functions to have different target features enabled to allow for runtime target feature detection as the way vectors are passed otherwise depends on the enabled target features)
  • Type is an aggregate:
    • Type is at most two registers big => pass in at most two registers.
    • Type is bigger than two registers => pass a pointer to the value.

https://github.com/rust-lang/rust/blob/9a576175cc9a0aecb85d0764a4f66ee29e26e155/compiler/rustc_middle/src/ty/layout.rs#L2904-L2960

@foxhlchen
Author

foxhlchen commented Jun 6, 2021

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

Result<(), i16> is not an aggregate, but a scalar pair. If it were an aggregate, it would actually be passed in a single register in the Rust ABI. The ABI calculation logic for the "Rust" ABI is currently roughly:

  • Type is uninhabited or a ZST => skip it.

  • Type is a scalar => pass in a single register.

  • Type is a scalar pair => pass in two registers.

  • Type is a vector => pass a pointer to the value. (this is necessary to allow different functions to have different target features enabled to allow for runtime target feature detection as the way vectors are passed otherwise depends on the enabled target features)

  • Type is an aggregate:

    • Type is at most two registers big => pass in at most two registers.
    • Type is bigger than two registers => pass a pointer to the value.

https://github.com/rust-lang/rust/blob/9a576175cc9a0aecb85d0764a4f66ee29e26e155/compiler/rustc_middle/src/ty/layout.rs#L2904-L2960

Ok.

So performance-wise we have NonZeroI32 > NonZeroI16 > i32 > i16, (on-memory) space-wise we have NonZeroI16 < NonZeroI32 = i16 < i32 (assuming all wrapped inside Result<(), E>), (in-register) space-wise we have NonZeroI16 = NonZeroI32 < i16 = i32.

But his conclusion is correct.
https://godbolt.org/z/GPqfEGfhd

@bjorn3
Member

bjorn3 commented Jun 6, 2021

But his conclusion is correct.

Indeed

@nbdd0121
Member

nbdd0121 commented Jun 6, 2021

Result<(), i16> is not an aggregate, but a scalar pair. If it were an aggregate, it would actually be passed in a single register in the Rust ABI. The ABI calculation logic for the "Rust" ABI is currently roughly:

  • Type is uninhabited or a ZST => skip it.

  • Type is a scalar => pass in a single register.

  • Type is a scalar pair => pass in two registers.

  • Type is a vector => pass a pointer to the value. (this is necessary to allow different functions to have different target features enabled to allow for runtime target feature detection as the way vectors are passed otherwise depends on the enabled target features)

  • Type is an aggregate:

    • Type is at most two registers big => pass in at most two registers.
    • Type is bigger than two registers => pass a pointer to the value.

rust-lang/rust@9a57617/compiler/rustc_middle/src/ty/layout.rs#L2904-L2960

Thanks for the explanation and the pointer! Is the scalar pair Rust-specific? Is there a document describing the Rust or LLVM behaviour here? I couldn't seem to find much material on the Internet, and there are no mentions of scalar pairs in the SysV ABI.

@bjorn3
Member

bjorn3 commented Jun 6, 2021

Is scalar pair Rust-specific?

AFAIK it is indeed Rust-specific. The Rust ABI is unstable, and apart from the source code I linked, I believe there is no documentation. I had to look at it in the past to use the same ABI in rustc_codegen_cranelift as in the LLVM backend.

@TheSven73
Collaborator

TheSven73 commented Jun 9, 2021

In Rust, similar to what is chosen for C. In the client executables, -O3.
ETA: kernel C uses "Optimize for performance".

@TheSven73
Collaborator

TheSven73 commented Jun 9, 2021

I have a performance improvement of 1.96‰

Just making sure that I read this correctly: that's 1.96 per-thousand (per mille), not 1.96 percent, correct?

@foxhlchen
Author

Can you do a microbenchmark just for the Result?
I haven't looked at binder's code, but I don't think this benchmark can really tell us much.
Too many factors can add up to the variation.

@nbdd0121
Member

nbdd0121 commented Jun 9, 2021

I have a performance improvement of 1.96‰

Just making sure that I read this correctly: that's 1.96 per-thousand, not 1.96 percent, correct?

Yes, I misread the number of zeros originally 🤦 and wrote percent. I was a bit surprised by such a large change and double-checked the result.

I think binder might not be a reliable benchmark for other code (we are more likely just testing how often the slow path of mutex_lock is being hit, which is affected by many factors on SMP systems; e.g. slower code can make the mutex less contended and thus make things faster).

@TheSven73
Collaborator

TheSven73 commented Jun 9, 2021

AFAIK the binder benchmark is a real-world performance benchmark which returns Results all over the place, including in more than a few hot paths? Curious to hear @wedsonaf 's opinion.

@wedsonaf

AFAIK the binder benchmark is a real-world performance benchmark which returns Results all over the place, including in more than a few hot paths? Curious to hear @wedsonaf 's opinion.

I think it is a real-world benchmark but there are lots of other real-world combinations that are also important but aren't captured by this single benchmark. IOW, I think we can use this benchmark as a data point, but we shouldn't base our decisions on it alone.

Android has more extensive benchmarks involving Binder, for example: https://source.android.com/compatibility/vts/performance

Once we have enough of it implemented to boot Android, we'll be able to get these to run. Those will be better numbers to base our decisions on.

@TheSven73
Collaborator

Once we have enough of it implemented to boot Android, we'll be able to get these to run. Those will be better numbers to base our decisions on.

Excellent point. How far are you from being able to run at least some of those?

@foxhlchen
Author

I've written a tiny microbenchmark.
https://github.com/foxhlchen/rustlinux_error_mb.git

assembly is here:
https://godbolt.org/z/6WvnMze1b

On x86 the differences are in the return:

  • i16 uses two 16-bit registers
  • i32 uses two 32-bit registers
  • nzi32 uses one 32-bit register
  • nzi16 uses one 16-bit register

use_result is no different for all four types; it uses ax to decide whether the return is Ok or Err.

fox:rustlinux_error_mb/ (main) $ cargo bench
    Finished bench [optimized] target(s) in 0.02s
     Running unittests (target/release/deps/error_benchmark-2ca556197417b0f8)

running 12 tests
test bench::bench_i16_100       ... bench:           2 ns/iter (+/- 0)
test bench::bench_i16_10000     ... bench:           2 ns/iter (+/- 0)
test bench::bench_i16_1000000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_i32_100       ... bench:           2 ns/iter (+/- 0)
test bench::bench_i32_10000     ... bench:           2 ns/iter (+/- 0)
test bench::bench_i32_1000000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi16_100     ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi16_10000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi16_1000000 ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi32_100     ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi32_10000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi32_1000000 ... bench:           2 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 19.20s

no significant performance difference was found from the benchmark.

@nbdd0121 nbdd0121 closed this Jun 11, 2021
@nbdd0121
Member

It seems that you didn't use black_box properly and everything has been optimized out. You'll need to give all the functions that return a Result a black-boxed input, and black-box the output before giving it to use_result.

@nbdd0121
Member

Somehow my phone glitched and closed the PR, sorry.

@nbdd0121 nbdd0121 reopened this Jun 11, 2021
@foxhlchen
Author

It seems that you didn't use black_box properly and everything has been optimized out. You'll need to give all the functions that return a Result a black-boxed input, and black-box the output before giving it to use_result.

Can you elaborate on that?

I added state to the test functions and made the return value depend on the inner state to prevent optimization. But I am not sure it works. How do we know whether it is optimized out or not?

@nbdd0121
Member

If you add the -O option to your godbolt example you'll see it's all gone :)

The inner state you implemented is not sufficient. It's almost impossible to use static mut correctly; you might want to pass in a &AtomicUsize instead.

State isn't the only problem though: even with state, inlining will also mess up the benchmark. You can use #[inline(never)] and/or black_box (https://doc.rust-lang.org/nightly/std/hint/fn.black_box.html) to prevent the compiler from seeing through the values.
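A minimal sketch of the pattern being suggested, using black_box on both the input and the output of the benchmarked function (assumes a toolchain where std::hint::black_box is available; the function name is illustrative, not from the benchmark repo):

```rust
use std::hint::black_box;
use std::num::NonZeroI32;

// Returns Err for any non-zero errno, mirroring the Error type under discussion.
#[inline(never)]
fn returns_result(errno: i32) -> Result<(), NonZeroI32> {
    match NonZeroI32::new(errno) {
        Some(e) => Err(e),
        None => Ok(()),
    }
}

fn main() {
    let mut errors = 0u64;
    for i in 0..1000 {
        // black_box the input so the compiler cannot constant-fold the call...
        let r = returns_result(black_box(i % 2));
        // ...and black_box the output so the Result itself cannot be elided.
        if black_box(r).is_err() {
            errors += 1;
        }
    }
    assert_eq!(errors, 500);
}
```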

@foxhlchen
Author

If you add the -O option to your godbolt example you'll see it's all gone :)

The inner state you implemented is not sufficient. It's almost impossible to use static mut correctly; you might want to pass in a &AtomicUsize instead.

State isn't the only problem though: even with state, inlining will also mess up the benchmark. You can use #[inline(never)] and/or black_box (https://doc.rust-lang.org/nightly/std/hint/fn.black_box.html) to prevent the compiler from seeing through the values.

Thanks!

Let me try to use black_box instead.

@foxhlchen
Author

If you add the -O option to your godbolt example you'll see it's all gone :)

The inner state you implemented is not sufficient. It's almost impossible to use static mut correctly; you might want to pass in a &AtomicUsize instead.

State isn't the only problem though: even with state, inlining will also mess up the benchmark. You can use #[inline(never)] and/or black_box (https://doc.rust-lang.org/nightly/std/hint/fn.black_box.html) to prevent the compiler from seeing through the values.

Another question:

Why can't we simply turn optimizations off? I can't find a way to disable them in cargo bench.

@ojeda
Member

ojeda commented Jun 11, 2021

Normally you would want to benchmark with optimizations enabled, but you can customize profiles in Cargo.toml.

@foxhlchen
Author

Normally you would want to benchmark with optimizations enabled, but you can customize profiles in Cargo.toml.

Oh, I figured it out! Thanks!
I should put

[profile.bench]
opt-level = 0

in Cargo.toml.

Now it looks reasonable:

fox:rustlinux_error_mb/ (main✗) $ cargo bench                                                   [21:51:39]
   Compiling error_benchmark v0.1.0 (/Users/fox/Workspace/RustProjects/rustlinux_error_mb)
    Finished bench [unoptimized] target(s) in 0.94s
     Running unittests (target/release/deps/error_benchmark-cb0c48d3b0fd54b7)

running 12 tests
test bench::bench_i16_100       ... bench:       2,069 ns/iter (+/- 23)
test bench::bench_i16_10000     ... bench:     197,681 ns/iter (+/- 591)
test bench::bench_i16_1000000   ... bench:  19,798,150 ns/iter (+/- 175,781)
test bench::bench_i32_100       ... bench:       1,754 ns/iter (+/- 15)
test bench::bench_i32_10000     ... bench:     171,666 ns/iter (+/- 4,611)
test bench::bench_i32_1000000   ... bench:  16,894,095 ns/iter (+/- 334,175)
test bench::bench_nzi16_100     ... bench:       2,280 ns/iter (+/- 28)
test bench::bench_nzi16_10000   ... bench:     218,514 ns/iter (+/- 4,271)
test bench::bench_nzi16_1000000 ... bench:  22,019,433 ns/iter (+/- 1,272,562)
test bench::bench_nzi32_100     ... bench:       1,850 ns/iter (+/- 20)
test bench::bench_nzi32_10000   ... bench:     180,518 ns/iter (+/- 2,496)
test bench::bench_nzi32_1000000 ... bench:  18,052,704 ns/iter (+/- 412,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 26.04s

It seems i32 is the best on my test machine (aarch64).

@bjorn3
Copy link
Member

bjorn3 commented Jun 11, 2021

With opt-level=0 you can't do any realistic benchmarks. The compiler won't optimize away any zero-cost abstractions. For example for item in vec.iter() { ... } can be faster than while i < vec.len() { ...; i+=1; } with optimizations due to no bounds checking, but without optimizations it is likely much slower.
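The zero-cost-abstraction point can be illustrated with the two loop shapes mentioned above (a sketch, not the benchmark code itself):

```rust
fn sum_iter(v: &[i32]) -> i32 {
    // Iterator form: with optimizations enabled, bounds checks are
    // elided and this compiles down to a tight loop.
    v.iter().sum()
}

fn sum_index(v: &[i32]) -> i32 {
    // Manual index form: every v[i] carries a bounds check that only
    // the optimizer can prove away; at opt-level=0 none of them are.
    let mut total = 0;
    let mut i = 0;
    while i < v.len() {
        total += v[i];
        i += 1;
    }
    total
}

fn main() {
    let v = [1, 2, 3, 4];
    assert_eq!(sum_iter(&v), 10);
    assert_eq!(sum_index(&v), 10);
}
```

Both return the same result; only the generated code differs, which is exactly what an unoptimized benchmark fails to reflect.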

The black_box calls didn't help because you likely passed a function item, not a function pointer, through the black box. This means the callee is still known from the type of the function item. Taking fn() -> Result<V, E> instead of F: Fn() -> Result<V, E> fixes this issue. Replacing the rt = match ... with rt += match ... inside use_result is also necessary to prevent the match from being optimized away.
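Condensed, the two fixes look like this (a sketch in the same shape as the benchmark, with a hypothetical return_ok standing in for the real return_* helpers):

```rust
use std::hint::black_box;

fn return_ok() -> Result<(), i32> {
    Ok(())
}

// Taking a plain function pointer `fn()` (rather than a generic
// `F: Fn()`) means the callee is no longer encoded in the type,
// so black_box can genuinely hide which function is called.
fn use_result(n: i32, f: fn() -> Result<(), i32>) -> i32 {
    let mut rt = 0;
    for _ in 0..n {
        // Accumulating (+=) instead of overwriting (=) keeps the
        // match from being optimized away entirely.
        rt += match f() {
            Ok(_) => 0,
            Err(_) => -1,
        };
    }
    rt
}

fn main() {
    let total = use_result(100, black_box(return_ok as fn() -> Result<(), i32>));
    assert_eq!(total, 0);
}
```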

With these changes I get the following result:

running 12 tests
test bench::bench_i16_100       ... bench:         195 ns/iter (+/- 1)
test bench::bench_i16_10000     ... bench:      18,602 ns/iter (+/- 58)
test bench::bench_i16_1000000   ... bench:   1,859,804 ns/iter (+/- 10,857)
test bench::bench_i32_100       ... bench:         233 ns/iter (+/- 1)
test bench::bench_i32_10000     ... bench:      22,339 ns/iter (+/- 190)
test bench::bench_i32_1000000   ... bench:   2,230,379 ns/iter (+/- 7,577)
test bench::bench_nzi16_100     ... bench:         196 ns/iter (+/- 0)
test bench::bench_nzi16_10000   ... bench:      18,607 ns/iter (+/- 280)
test bench::bench_nzi16_1000000 ... bench:   1,858,772 ns/iter (+/- 7,667)
test bench::bench_nzi32_100     ... bench:         233 ns/iter (+/- 1)
test bench::bench_nzi32_10000   ... bench:      22,298 ns/iter (+/- 112)
test bench::bench_nzi32_1000000 ... bench:   2,230,413 ns/iter (+/- 7,312)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 9.52s
$ cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz

Diff:

@@ -1,8 +1,10 @@
-#![feature(test)]
+#![feature(test, bench_black_box)]
 
-use core::num::{NonZeroI16, NonZeroI32};
 extern crate test;
 
+use std::hint::black_box;
+use std::num::{NonZeroI16, NonZeroI32};
+
 #[derive(Clone, Copy, PartialEq, Eq)]
 pub struct ErrorNzi32(NonZeroI32);
 pub struct ErrorNzi16(NonZeroI16);
@@ -67,13 +69,12 @@
         }
     }
 
-    fn use_result<F, V, E>(n: i32, f: F) -> (i32, Result<V, E>)
-    where
-        F: Fn() -> Result<V, E>,
+    #[inline(never)]
+    fn use_result<V, E>(n: i32, f: fn() -> Result<V, E>) -> (i32, Result<V, E>)
     {
         let mut rt :i32 = 0;
         for _ in 0..n {
-            rt = match f() {
+            rt += match f() {
                 Ok(_) => 0,
                 Err(_) => -1,
             };
@@ -84,62 +85,62 @@
 
     #[bench]
     fn bench_nzi32_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_nzi32));
+        b.iter(|| use_result(100, black_box(return_nzi32)));
     }
 
     #[bench]
     fn bench_nzi32_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_nzi32));
+        b.iter(|| use_result(10000, black_box(return_nzi32)));
     }
 
     #[bench]
     fn bench_nzi32_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_nzi32));
+        b.iter(|| use_result(1000000, black_box(return_nzi32)));
     }
 
     #[bench]
     fn bench_nzi16_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_nzi16));        
+        b.iter(|| use_result(100, black_box(return_nzi16)));
     }
 
     #[bench]
     fn bench_nzi16_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_nzi16));        
+        b.iter(|| use_result(10000, black_box(return_nzi16)));
     }
 
     #[bench]
     fn bench_nzi16_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_nzi16));        
+        b.iter(|| use_result(1000000, black_box(return_nzi16)));
     }
 
     #[bench]
     fn bench_i32_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_i32)); 
+        b.iter(|| use_result(100, black_box(return_i32)));
     }
 
     #[bench]
     fn bench_i32_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_i32)); 
+        b.iter(|| use_result(10000, black_box(return_i32)));
     }
 
     #[bench]
     fn bench_i32_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_i32)); 
+        b.iter(|| use_result(1000000, black_box(return_i32)));
     }    
 
     #[bench]
     fn bench_i16_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_i16)); 
+        b.iter(|| use_result(100, black_box(return_i16)));
     }
 
     #[bench]
     fn bench_i16_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_i16)); 
+        b.iter(|| use_result(10000, black_box(return_i16)));
     }
 
     #[bench]
     fn bench_i16_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_i16)); 
+        b.iter(|| use_result(1000000, black_box(return_i16)));
     }    
 }

@nbdd0121
Copy link
Member

Actually, I looked at the assembly: black_box requires values to be stored to memory and re-loaded, so it's probably better to avoid black_box-ing the Result.

foxhlchen added a commit to foxhlchen/rustlinux_error_mb that referenced this pull request Jun 11, 2021
By Bjorn3:

With opt-level=0 you can't do any realistic benchmarks. The compiler won't optimize away any zero-cost abstractions. For example for item in vec.iter() { ... } can be faster than while i < vec.len() { ...; i+=1; } with optimizations due to no bounds checking, but without optimizations it is likely much slower.

black_box calls didn't help as you likely passed a function item and not function pointer through the black box. This means that it is still known which function is called based on the type of the function item. Taking fn() -> Result<V, E> instead of F: Fn() -> Result<V, E> fixes this issue. Replacing the rt = match ... with rt += match ... inside use_result is also necessary to prevent optimizing away the match.
@foxhlchen
Copy link
Author

With opt-level=0 you can't do any realistic benchmarks. The compiler won't optimize away any zero-cost abstractions. For example for item in vec.iter() { ... } can be faster than while i < vec.len() { ...; i+=1; } with optimizations due to no bounds checking, but without optimizations it is likely much slower.

black_box calls didn't help as you likely passed a function item and not function pointer through the black box. This means that it is still known which function is called based on the type of the function item. Taking fn() -> Result<V, E> instead of F: Fn() -> Result<V, E> fixes this issue. Replacing the rt = match ... with rt += match ... inside use_result is also necessary to prevent optimizing away the match.

With these changes I get the following result:

running 12 tests
test bench::bench_i16_100       ... bench:         195 ns/iter (+/- 1)
test bench::bench_i16_10000     ... bench:      18,602 ns/iter (+/- 58)
test bench::bench_i16_1000000   ... bench:   1,859,804 ns/iter (+/- 10,857)
test bench::bench_i32_100       ... bench:         233 ns/iter (+/- 1)
test bench::bench_i32_10000     ... bench:      22,339 ns/iter (+/- 190)
test bench::bench_i32_1000000   ... bench:   2,230,379 ns/iter (+/- 7,577)
test bench::bench_nzi16_100     ... bench:         196 ns/iter (+/- 0)
test bench::bench_nzi16_10000   ... bench:      18,607 ns/iter (+/- 280)
test bench::bench_nzi16_1000000 ... bench:   1,858,772 ns/iter (+/- 7,667)
test bench::bench_nzi32_100     ... bench:         233 ns/iter (+/- 1)
test bench::bench_nzi32_10000   ... bench:      22,298 ns/iter (+/- 112)
test bench::bench_nzi32_1000000 ... bench:   2,230,413 ns/iter (+/- 7,312)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 9.52s
$ cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz

Diff:

Wow, thanks for such a detailed explanation!

I've merged this change and run it on my laptop (aarch64 M1)

   Compiling error_benchmark v0.1.0 (/Users/fox/Workspace/RustProjects/rustlinux_error_mb)
    Finished bench [optimized] target(s) in 1.31s
     Running unittests (target/release/deps/error_benchmark-2ca556197417b0f8)

running 12 tests
test bench::bench_i16_100       ... bench:         263 ns/iter (+/- 4)
test bench::bench_i16_10000     ... bench:      25,172 ns/iter (+/- 1,867)
test bench::bench_i16_1000000   ... bench:   2,504,335 ns/iter (+/- 18,154)
test bench::bench_i32_100       ... bench:         294 ns/iter (+/- 8)
test bench::bench_i32_10000     ... bench:      28,133 ns/iter (+/- 176)
test bench::bench_i32_1000000   ... bench:   2,835,037 ns/iter (+/- 25,389)
test bench::bench_nzi16_100     ... bench:         298 ns/iter (+/- 7)
test bench::bench_nzi16_10000   ... bench:      28,260 ns/iter (+/- 265)
test bench::bench_nzi16_1000000 ... bench:   2,815,281 ns/iter (+/- 16,268)
test bench::bench_nzi32_100     ... bench:         262 ns/iter (+/- 2)
test bench::bench_nzi32_10000   ... bench:      25,016 ns/iter (+/- 1,392)
test bench::bench_nzi32_1000000 ... bench:   2,502,383 ns/iter (+/- 11,788)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 13.84s

@nbdd0121
Copy link
Member

It seems that the compiler optimizes away

if I % 2 == 0 {
    return Ok(());
}

for the i16/i32 case to return_value.discriminant = I % 2, because Ok has a discriminant of 0 and Err has a discriminant of 1. Maybe this benchmark is a little too synthetic.
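The size win the PR is after, on the other hand, is easy to check directly: NonZeroI32 has a niche (the value 0) that the Ok/Err discriminant can occupy, so Result<(), NonZeroI32> fits in 32 bits. A minimal check:

```rust
use core::mem::size_of;
use core::num::NonZeroI32;

fn main() {
    // Niche optimization: NonZeroI32 can never be 0, so the compiler
    // uses 0 to encode Ok(()) and any nonzero value to encode Err(..),
    // and the whole Result fits in 4 bytes.
    assert_eq!(size_of::<Result<(), NonZeroI32>>(), 4);

    // A plain i32 payload has no forbidden value, so a separate
    // discriminant is needed and the Result grows to 8 bytes.
    assert_eq!(size_of::<Result<(), i32>>(), 8);
}
```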

Have you started working on ramfs @foxhlchen?

@foxhlchen
Copy link
Author

Have you started working on ramfs @foxhlchen?

Yes, I'm exploring the VFS interfaces, and then I will carefully figure out how to abstract them.
I don't expect it to be finished very soon.

What's your opinion?

@nbdd0121
Copy link
Member

Yes, I'm exploring the VFS interfaces, and then I will carefully figure out how to abstract them.
I don't expect it to be finished very soon.

What's your opinion?

I just think that ramfs might be a good real-life benchmark for this PR ;)

@foxhlchen
Copy link
Author

Yes, I'm exploring the VFS interfaces, and then I will carefully figure out how to abstract them.
I don't expect it to be finished very soon.
What's your opinion?

I just think that ramfs might be a good real-life benchmark for this PR ;)

Ah, I'm skeptical. Even though Result is used basically everywhere, it takes only a fraction of the total time, so the results will end up buried in noise.

I still think a microbenchmark is better. If it proves nzi32 is better, then we can put it in a real-life benchmark just to check that there is no significant regression.

@foxhlchen foxhlchen closed this Jan 3, 2023
fbq pushed a commit that referenced this pull request Dec 28, 2023
With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  #77      fentry_fexit:FAIL
  #78/1    fentry_test/fentry:FAIL
  #78      fentry_test:FAIL
  #82/1    fexit_test/fexit:FAIL
  #82      fexit_test:FAIL
  #112/1   kprobe_multi_test/skel_api:FAIL
  #112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  #112     kprobe_multi_test:FAIL
  #356/17  test_global_funcs/global_func17:FAIL
  #356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]