
[Issue #296] Rust: Use a tighter inner type for Error #356

Closed

Conversation


@foxhlchen foxhlchen commented Jun 5, 2021

Use NonZeroI32 as the inner type. This allows Result<()> to fit in 32 bits instead of 64.

V1->V2

  • Use NonZeroI32 instead of NonZeroI16
  • Rewrite kernel_const_to_error as a const function
  • Use unsafe NonZeroI32 creator in from_kernel_errno to avoid unwrap()

V2->V3

  • Add a comment explaining the reason for choosing NonZeroI32 for Error
  • Rewrite kernel_const_to_error() back to a macro and add static_assert! to it.
  • Use Error(unsafe { NonZeroI16::new_unchecked(errno as i16) }) in from_kernel_errno() to make it terser

V3->V4

  • Following Miguel's review, fix some comment format problems

V4->V5

  • Rewrite kernel_const_to_error() to a const function again. 😮‍💨

Related commit:
#296
May influence:
#358

Collaborator

@TheSven73 TheSven73 left a comment


This is feedback mostly for @nbdd0121 and @ojeda - because @foxhlchen is doing what #296 is asking for.

What are the benefits of using a NonZero inner type here? Sure, it cannot be 0 - but that doesn't provide us with any benefit, since we already strictly enforce the [-MAX_ERRNO..-1] invariant.

I see plenty of downsides:

  • more complex code for little or no gain
  • if we use NonZeroI16::new() (as happens here), there is no need for unsafe code, but we need an unwrap() which could potentially panic! if things go wrong. This is rather the opposite of the "fail gracefully" we are working towards
  • if we use the unsafe NonZeroI16::new_unchecked() (as also happens here), then we are using unsafe code for no good reason, with a good chance of introducing UB when 0 is passed by mistake. Again, this is the opposite of the "fail gracefully" we are working towards

Suggestion: let's simply use an i16 for the inner type. The invariant is still enforced. And it allows us to get rid of awkward casts. This is actually a very elegant and simple change.

@nbdd0121
Member

nbdd0121 commented Jun 5, 2021

It has performance implications. NonZeroU16 gives the compiler the opportunity to use the value 0 to represent other meanings. So for Result<(), NonZeroU16>, Ok(()) is represented simply as the value 0, while Err(value) is represented as value. For Result<(), u16>, however, Err(value) is represented with the errno plus a tag of 1. So we need two registers for the return value instead of just one. You can see the assembly comparison: https://godbolt.org/z/r7E7vMh9x
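The layout effect described here can be checked with a few lines of plain std Rust (a standalone sketch outside the kernel tree, not the actual kernel code):

```rust
use std::mem::size_of;
use std::num::NonZeroI32;

fn main() {
    // NonZeroI32 reserves 0 as a niche, so Result can use 0 to encode
    // Ok(()) and the whole Result fits in the payload's 4 bytes.
    assert_eq!(size_of::<Result<(), NonZeroI32>>(), 4);
    // A plain i32 payload has no niche, so a separate discriminant is
    // needed and the Result doubles in size.
    assert_eq!(size_of::<Result<(), i32>>(), 8);
}
```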

/// # Safety
///
/// The parameter must be a valid kernel error number.
macro_rules! kernel_const_to_error {
Member

This can be made into a const fn.


+1 on converting this to a const function. I think it makes sense for it to be a constructor of Error, something like:

    const fn from_const(errno: u32) -> Self {
        let value = -(errno as i32);
        if value < -(bindings::MAX_ERRNO as i32) || value >= 0 {
            panic!("Error number out of range");
        }
        // SAFETY: We have checked above that `value` is within the right range.
        unsafe { Self::from_kernel_errno_unchecked(value) }
    }

We should only use macros when no reasonable alternative exists within the core language.

@@ -76,7 +87,8 @@ impl Error {

// INVARIANT: the check above ensures the type invariant
// will hold.
Error(errno)
let nzi16_errno = NonZeroI16::new(errno as i16).unwrap();
Member

Use new_unchecked since errno is checked to be non-zero.

@TheSven73
Collaborator

It has performance implications. NonZeroU16 gives the compiler the opportunity to use the value 0 to represent other meanings.

That's fascinating to see - genuinely, no snark! Yet, do we have any measurements or benchmarks to show that this really makes a difference in the real world?

In the absence of the latter, should we choose a solution that favours simplicity, and least chance of inadvertent panic! or UB? I believe i16 could be that solution.

There's nothing preventing us from reviewing a NonZero inner type in a separate PR. But my review comment there would be identical - favour simplicity unless real-world evidence that it does make a difference. And even then, it's a trade-off.

Collaborator

@TheSven73 TheSven73 left a comment


I am not on board with a NonZero inner type, but will review just in case :)

@@ -102,11 +117,14 @@ impl fmt::Debug for Error {
fn rust_helper_errname(err: c_types::c_int) -> *const c_types::c_char;
}
// SAFETY: FFI call.
let name = unsafe { rust_helper_errname(-self.0) };
let name = unsafe { rust_helper_errname(-self.to_kernel_errno()) };
Collaborator

Is the to_kernel_errno() call required here? Why not use -self.0.get() ?

Author

Is the to_kernel_errno() call required here?

no

Why not use -self.0.get() ?

To me, using to_kernel_errno() is clearer, and if we change the inner representation of Error again in the future, we don't need to touch this code. But that is just my personal taste. :)

//
// Safety: `errno` must not be zero, which is guaranteed by the contract
// of this function.
Error(unsafe { NonZeroI16::new_unchecked(errno as i16) })
Collaborator

Does this need a // CAST annotation that explains why casting a c_int to an i16 won't lose any bits and result in UB?

@ojeda
Member

ojeda commented Jun 5, 2021

Hmm... you guys make it tough for me to choose. :-)

On one hand, I agree with @TheSven73 that it might be slightly "too much" before we hit mainline. People will be likely to look at this file in particular, since some may have heard about "Rust nice error handling", "Result", "pattern matching", "sum types", etc. and instead of seeing something straightforward and having the impression "oh, this actually makes sense to me, neat!", they may leave with a feeling of "wow... too fancy, another C++?!".

On the other hand, perhaps Result<()> is common enough that it merits this. And we can show there is an actual advantage with a simple Compiler Explorer example. Some people like a lot seeing such kind of optimizations.

@TheSven73
Collaborator

they may leave with a feeling of "wow... too fancy, another C++?!".
[...]
Some people like a lot seeing such kind of optimizations.

Consider very carefully the viewpoint and incentives of the people who are most important for you to convince.

@TheSven73
Collaborator

On the other hand, perhaps Result<()> is common enough that it merits this.

Allow me to play Devil's advocate - I love doing that :)
How would you (or we) know that it does?

@ojeda
Member

ojeda commented Jun 5, 2021

A benchmark! :)

In general, we should strive for simplicity unless there is a reason not to (a reason being a benchmark showing it makes a difference), especially before mainline.

@TheSven73
Collaborator

TheSven73 commented Jun 5, 2021

A benchmark! :)

🥇 I'm assuming we only have a single non-trivial, performance-sensitive, real-world Rust-for-Linux application right now? That would be binder?

Something tells me that @wedsonaf might have benchmarks somewhere. But that discussion would belong in a separate PR IMHO.

@ojeda
Member

ojeda commented Jun 5, 2021

That would be binder?

Yeah, I would say so. (Wedson is on holidays, I think he is back next week).

@nbdd0121
Member

nbdd0121 commented Jun 5, 2021

Actually, maybe we can make the inner type a NonZeroI32 -- in that case Result<(), Error> is ABI-compatible with a c_int! It'll be truly zero-cost ;)

@foxhlchen
Author

Actually, maybe we can make the inner type a NonZeroI32 -- in that case Result<(), Error> is ABI-compatible with a c_int! It'll be truly zero-cost ;)

Then it's no better than using i16? Both will occupy 4 bytes.

@nbdd0121
Member

nbdd0121 commented Jun 5, 2021

Then it's no better than using i16? Both will occupy 4 bytes.

You don't need sign extensions when converting a Result<(), Error> to a c_int. Many architectures can handle i32 better than i16; e.g. on x86-64, a simple mov will clear the upper 32 bits (so it can be elided away in many cases), but you'll need a movsx to clear the upper 48 bits.

@foxhlchen
Author

Then it's no better than using i16? Both will occupy 4 bytes.

You don't need sign extensions when converting a Result<(), Error> to a c_int. Many architectures can handle i32 better than i16; e.g. on x86-64, a simple mov will clear the upper 32 bits (so it can be elided away in many cases), but you'll need a movsx to clear the upper 48 bits.

OK, but this is also true for NonZeroI16, right? We can save a lot of space but may incur performance penalties.

@nbdd0121
Member

nbdd0121 commented Jun 6, 2021

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

So performance-wise we have NonZeroI32 > NonZeroI16 > i32 > i16, (on-memory) space-wise we have NonZeroI16 < NonZeroI32 = i16 < i32 (assuming all wrapped inside Result<(), E>), (in-register) space-wise we have NonZeroI16 = NonZeroI32 < i16 = i32.
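The on-memory part of this ordering can be verified directly with size_of in std Rust (a standalone sketch; the register behaviour is only visible in generated assembly):

```rust
use std::mem::size_of;
use std::num::{NonZeroI16, NonZeroI32};

fn main() {
    // Niche-optimized payloads: the Result is the same size as the payload.
    assert_eq!(size_of::<Result<(), NonZeroI16>>(), 2);
    assert_eq!(size_of::<Result<(), NonZeroI32>>(), 4);
    // Plain payloads: a discriminant plus padding is added on top.
    assert_eq!(size_of::<Result<(), i16>>(), 4);
    assert_eq!(size_of::<Result<(), i32>>(), 8);
}
```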

@foxhlchen
Author

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

So performance-wise we have NonZeroI32 > NonZeroI16 > i32 > i16, (on-memory) space-wise we have NonZeroI16 < NonZeroI32 = i16 < i32 (assuming all wrapped inside Result<(), E>), (in-register) space-wise we have NonZeroI16 = NonZeroI32 < i16 = i32.

Got it, thanks for the explanation. I now think NonZeroI32 is the best choice. :)

Let me change it to use NonZeroI32.


//
// Safety: `errno` must be non zero, which is guaranteed by the check
// above.
let nz_errno = unsafe { NonZeroI32::new_unchecked(errno) };
Collaborator

Would it be better to call from_kernel_errno_unchecked() here to prevent having to write

Error(unsafe { NonZeroI32::new_unchecked(errno) })

twice?

Author

You mean

unsafe { Error::from_kernel_errno_unchecked() }

?
It's not terser.

@@ -21,44 +21,49 @@ use core::str::{self, Utf8Error};
///
/// The value is a valid `errno` (i.e. `>= -MAX_ERRNO && < 0`).
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct Error(c_types::c_int);
pub struct Error(NonZeroI32);
Collaborator

If you guys believe that NonZeroI32 is optimal, perhaps put a comment that explains why?
Otherwise we might get PRs from newcomers who think that NonZeroI32 occupies too much space, so they'll replace with NonZeroI16...


impl Error {
/// Invalid argument.
pub const EINVAL: Self = Error(-(bindings::EINVAL as i32));
pub const EINVAL: Self = kernel_const_to_error(bindings::EINVAL);
Collaborator

I am unfamiliar with Rust macros, but I wonder if it would be possible to write a macro that takes the Error name as the input parameter, and expands to the whole pub const definition? Maybe even a static_assert! too?

Example:

declare_const_error!(EINVAL);
// would expand to
static_assert!(bindings::EINVAL > 0 && bindings::EINVAL <= bindings::MAX_ERRNO);
pub const EINVAL: Self = Error(unsafe { NonZeroI32::new_unchecked(-(bindings::EINVAL as i32)) });

Author

Good idea!

But I don't know why even static_assert!(bindings::ENOMEM < bindings::MAX_ERRNO) doesn't work.

I got the following error:

error: `const` items in this context need a name
  --> rust/kernel/static_assert.rs:36:15
   |
30 | / macro_rules! static_assert {
31 | |     ($condition:expr) => {
32 | |         // Based on the latest one in `rustc`'s one before it was [removed].
33 | |         //
...  |
36 | |         const _: () = [()][!($condition) as usize];
   | |               ^ `_` is not a valid name for this `const` item
37 | |     };
38 | | }
   | |_- in this expansion of `static_assert!`
   |
  ::: rust/kernel/error.rs:42:5
   |
42 |       static_assert!(bindings::ENOMEM < bindings::MAX_ERRNO);
   |       ------------------------------------------------------- in this macro invocation

Member

If you write it the way @TheSven73 suggests, the const _ in static_assert! will become an associated constant rather than just a normal item, so it must have a name. It'll need to expand into something like

pub const EINVAL: Self = {
    static_assert!(bindings::EINVAL > 0 && bindings::EINVAL <= bindings::MAX_ERRNO);
    Error(unsafe { NonZeroI32::new_unchecked(-(bindings::EINVAL as i32)) })
};

But I think it'll be better to put the check in a const fn + const panic instead of using static_assert in this case:

const fn i_am_bad_at_naming_things(errno: i32) -> Error {
    // Panic at compile time unless `errno` is in `[-MAX_ERRNO, -1]`.
    assert!(errno >= -(bindings::MAX_ERRNO as i32) && errno < 0);
    Error(unsafe { NonZeroI32::new_unchecked(errno) })
}

pub const EINVAL: Self = i_am_bad_at_naming_things(-(bindings::EINVAL as i32));

Author

Got it. Thank you, Gary!!

Can I combine these? I.e. declare an i_am_bad_at_naming_things() function and write a macro to wrap
pub const EINVAL: Self = i_am_bad_at_naming_things(-(bindings::EINVAL as i32));
into declare_const_error!(EINVAL);

Do you think it's a good idea?

Author

nvm, I opted for the first approach.

macro_rules! declare_const_error {
    ($($tt:tt)*) => {
        pub const $($tt)*: Self = {
            static_assert!(bindings::$($tt)* <= bindings::MAX_ERRNO && bindings::$($tt)* > 0);
            Error(unsafe { NonZeroI32::new_unchecked(-(bindings::$($tt)* as i32)) })
        };
    };
}

Thank you again!

Author

Agreed, putting docs inside the macro invocation seems bizarre to me.
I don't think it will be clearer.

Member

Actually, you can have everything in a single macro:

macro_rules! declare_const_error {
    ($($(#[$attr:meta])* $name:ident)*) => { $(
        $(#[$attr])*
        pub const $name: Self = {
            static_assert!(bindings::$name <= bindings::MAX_ERRNO && bindings::$name > 0);
            Error(unsafe { NonZeroI32::new_unchecked(-(bindings::$name as i32)) })
        };
    )* };
}

declare_const_error! {
    /// Invalid argument.
    EINVAL
    /// Out of memory.
    ENOMEM
    // and more
}

Collaborator

I wonder if this const error declaration macro (and its invocations) should go into its own dedicated PR?
Also, perhaps it could link up with @sladyn98's #358 somehow?

@foxhlchen foxhlchen Jun 8, 2021

Actually, you can have everything in a single macro:

macro_rules! declare_const_error {
    ($($(#[$attr:meta])* $name:ident)*) => { $(
        $(#[$attr])*
        pub const $name: Self = {
            static_assert!(bindings::$name <= bindings::MAX_ERRNO && bindings::$name > 0);
            Error(unsafe { NonZeroI32::new_unchecked(-(bindings::$name as i32)) })
        };
    )* };
}

declare_const_error! {
    /// Invalid argument.
    EINVAL
    /// Out of memory.
    ENOMEM
    // and more
}

This is cool. However, I strongly disagree with using something like it.
After experimenting with it, I realize the initial way is clearest.

For

impl Error {
    pub const EINVAL: Self = kernel_const_to_error(bindings::EINVAL);
}

A glimpse is enough to understand that we have an EINVAL const in the Error scope whose value is somehow related to the kernel's EINVAL (and you can easily infer that the value of EINVAL equals negative bindings::EINVAL).

But with

impl Error {
    declare_const_error! {
        /// Invalid argument.
        EINVAL
        /// Out of memory.
        ENOMEM
        // and more
    }
}

You need to spend time figuring out the magic here, especially if you are not familiar with Rust macros.
It adds mental burden for developers.

Author

I wonder if this const error declaration macro (and its invocations) should go into its own dedicated PR?
Also, perhaps it could link up with @sladyn98's #358 somehow?

Let me add a reference.

@bjorn3
Member

bjorn3 commented Jun 6, 2021

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

Result<(), i16> is not an aggregate, but a scalar pair. If it were an aggregate, it would actually be passed in a single register in the Rust ABI. The ABI calculation logic for the "Rust" ABI is currently roughly:

  • Type is uninhabited or a ZST => skip it.
  • Type is a scalar => pass in a single register.
  • Type is a scalar pair => pass in two registers.
  • Type is a vector => pass a pointer to the value. (this is necessary to allow different functions to have different target features enabled to allow for runtime target feature detection as the way vectors are passed otherwise depends on the enabled target features)
  • Type is an aggregate:
    • Type is at most two registers big => pass in at most two registers.
    • Type is bigger than two registers => pass a pointer to the value.

https://github.com/rust-lang/rust/blob/9a576175cc9a0aecb85d0764a4f66ee29e26e155/compiler/rustc_middle/src/ty/layout.rs#L2904-L2960

@foxhlchen
Author

foxhlchen commented Jun 6, 2021

For Result<(), i16>, even though it takes only 4 bytes when stored in memory, it requires 2 registers for argument passing or return as ABI-wise Result<(), i16> is an aggregate while Result<(), NonZeroI16> is equivalent to i16 ABI-wise so it is a scalar.

Result<(), i16> is not an aggregate, but a scalar pair. If it were an aggregate, it would actually be passed in a single register in the Rust ABI. The ABI calculation logic for the "Rust" ABI is currently roughly:

  • Type is uninhabited or a ZST => skip it.

  • Type is a scalar => pass in a single register.

  • Type is a scalar pair => pass in two registers.

  • Type is a vector => pass a pointer to the value. (this is necessary to allow different functions to have different target features enabled to allow for runtime target feature detection as the way vectors are passed otherwise depends on the enabled target features)

  • Type is an aggregate:

    • Type is at most two registers big => pass in at most two registers.
    • Type is bigger than two registers => pass a pointer to the value.

https://github.com/rust-lang/rust/blob/9a576175cc9a0aecb85d0764a4f66ee29e26e155/compiler/rustc_middle/src/ty/layout.rs#L2904-L2960

Ok.

So performance-wise we have NonZeroI32 > NonZeroI16 > i32 > i16, (on-memory) space-wise we have NonZeroI16 < NonZeroI32 = i16 < i32 (assuming all wrapped inside Result<(), E>), (in-register) space-wise we have NonZeroI16 = NonZeroI32 < i16 = i32.

But his conclusion is correct.
https://godbolt.org/z/GPqfEGfhd

@bjorn3
Member

bjorn3 commented Jun 6, 2021

But his conclusion is correct.

Indeed

@nbdd0121
Member

nbdd0121 commented Jun 6, 2021

Result<(), i16> is not an aggregate, but a scalar pair. If it were an aggregate, it would actually be passed in a single register in the Rust ABI. The ABI calculation logic for the "Rust" ABI is currently roughly:

  • Type is uninhabited or a ZST => skip it.

  • Type is a scalar => pass in a single register.

  • Type is a scalar pair => pass in two registers.

  • Type is a vector => pass a pointer to the value. (this is necessary to allow different functions to have different target features enabled to allow for runtime target feature detection as the way vectors are passed otherwise depends on the enabled target features)

  • Type is an aggregate:

    • Type is at most two registers big => pass in at most two registers.
    • Type is bigger than two registers => pass a pointer to the value.

rust-lang/rust@9a57617/compiler/rustc_middle/src/ty/layout.rs#L2904-L2960

Thanks for the explanation and the pointer! Is the scalar pair Rust-specific? Is there a document describing the Rust or LLVM behaviour here? I couldn't seem to find much material on the Internet, and there are no mentions of scalar pairs in the SysV ABI.

@bjorn3
Member

bjorn3 commented Jun 6, 2021

Is scalar pair Rust-specific?

AFAIK it is indeed Rust-specific. The Rust ABI is unstable, and apart from the source code I linked, I believe there is no documentation. I had to look at it in the past to use the same ABI in rustc_codegen_cranelift as in the LLVM backend.

@TheSven73
Collaborator

TheSven73 commented Jun 9, 2021

In Rust, similar to what is chosen for C. In the client executables, -O3.
ETA: kernel C uses "Optimize for performance".

@TheSven73
Collaborator

TheSven73 commented Jun 9, 2021

I have a performance improvement of 1.96‰

Just making sure that I read this correctly: that's 1.96 per-thousand (per mille), not 1.96 percent, correct?

@foxhlchen
Author

Can you do a microbenchmark just for the Result?
I haven't looked at binder's code, but I don't think this benchmark can really tell us much.
Too many factors can add up to the variation.

@nbdd0121
Member

nbdd0121 commented Jun 9, 2021

I have a performance improvement of 1.96‰

Just making sure that I read this correctly: that's 1.96 per-thousand, not 1.96 percent, correct?

Yes, I misread the number of zeros originally 🤦 and wrote percent. I was a bit surprised by such a large change and double-checked the result.

I think binder might not be a reliable benchmark for other code (we are more likely just testing how often the slow path of mutex_lock is being hit, which is affected by many factors on SMP systems; e.g. slower code can make the mutex less contended and thus make things faster).

@TheSven73
Collaborator

TheSven73 commented Jun 9, 2021

AFAIK the binder benchmark is a real-world performance benchmark which returns Results all over the place, including in more than a few hot paths? Curious to hear @wedsonaf 's opinion.

@wedsonaf

AFAIK the binder benchmark is a real-world performance benchmark which returns Results all over the place, including in more than a few hot paths? Curious to hear @wedsonaf 's opinion.

I think it is a real-world benchmark but there are lots of other real-world combinations that are also important but aren't captured by this single benchmark. IOW, I think we can use this benchmark as a data point, but we shouldn't base our decisions on it alone.

Android has more extensive benchmarks involving Binder, for example: https://source.android.com/compatibility/vts/performance

Once we have enough of it implemented to boot Android, we'll be able to get these to run. Those will be better numbers to base our decisions on.

@TheSven73
Collaborator

Once we have enough of it implemented to boot Android, we'll be able to get these to run. Those will be better numbers to base our decisions on.

Excellent point. How far are you from being able to run at least some of those?

@foxhlchen
Author

I've written a tiny microbenchmark.
https://github.com/foxhlchen/rustlinux_error_mb.git

assembly is here:
https://godbolt.org/z/6WvnMze1b

On x86 the differences are in the return:

  • i16 uses two 16-bit registers
  • i32 uses two 32-bit registers
  • nzi32 uses one 32-bit register
  • nzi16 uses one 16-bit register

use_result is no different for all four types; it uses ax to decide whether the return is Ok or Err.

fox:rustlinux_error_mb/ (main) $ cargo bench
    Finished bench [optimized] target(s) in 0.02s
     Running unittests (target/release/deps/error_benchmark-2ca556197417b0f8)

running 12 tests
test bench::bench_i16_100       ... bench:           2 ns/iter (+/- 0)
test bench::bench_i16_10000     ... bench:           2 ns/iter (+/- 0)
test bench::bench_i16_1000000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_i32_100       ... bench:           2 ns/iter (+/- 0)
test bench::bench_i32_10000     ... bench:           2 ns/iter (+/- 0)
test bench::bench_i32_1000000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi16_100     ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi16_10000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi16_1000000 ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi32_100     ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi32_10000   ... bench:           2 ns/iter (+/- 0)
test bench::bench_nzi32_1000000 ... bench:           2 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 19.20s

no significant performance difference was found from the benchmark.

@nbdd0121 nbdd0121 closed this Jun 11, 2021
@nbdd0121
Member

It seems that you didn't use black_box properly and everything has been optimized out. You'll need to give all the functions that return a Result a black-boxed input, and black-box the output before giving it to use_result.

@nbdd0121
Member

Somehow my phone glitched and closed the PR, sorry.

@nbdd0121 nbdd0121 reopened this Jun 11, 2021
@foxhlchen
Author

It seems that you didn't use black_box properly and everything has been optimized out. You'll need to give all the functions that return a Result a black-boxed input, and black-box the output before giving it to use_result.

Can you elaborate on that?

I added state to the test functions and made the return value depend on the inner state to prevent optimization. But I am not sure it works. How do we know whether it is optimized out or not?

@nbdd0121
Member

If you add the -O option to your godbolt example you'll see it's all gone :)

The inner state you implemented is not sufficient. It's almost impossible to use static mut correctly; you might want to pass in a &AtomicUsize instead.

State isn't the only problem though: even with state, inlining will also mess up the benchmark. You can use #[inline(never)] and/or black_box (https://doc.rust-lang.org/nightly/std/hint/fn.black_box.html) to prevent the compiler from seeing through the values.
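A minimal sketch of the pattern being suggested, using black_box on both the input and the output of the benchmarked function (assumes a toolchain where std::hint::black_box is available; the function name is illustrative, not from the benchmark repo):

```rust
use std::hint::black_box;
use std::num::NonZeroI32;

// Returns Err for any non-zero errno, mirroring the Error type under discussion.
#[inline(never)]
fn returns_result(errno: i32) -> Result<(), NonZeroI32> {
    match NonZeroI32::new(errno) {
        Some(e) => Err(e),
        None => Ok(()),
    }
}

fn main() {
    let mut errors = 0u64;
    for i in 0..1000 {
        // black_box the input so the compiler cannot constant-fold the call...
        let r = returns_result(black_box(i % 2));
        // ...and black_box the output so the Result itself cannot be elided.
        if black_box(r).is_err() {
            errors += 1;
        }
    }
    assert_eq!(errors, 500);
}
```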

@foxhlchen
Author

If you add the -O option to your godbolt example you'll see it's all gone :)

The inner state you implemented is not sufficient. It's almost impossible to use static mut correctly; you might want to pass in a &AtomicUsize instead.

State isn't the only problem though: even with state, inlining will also mess up the benchmark. You can use #[inline(never)] and/or black_box (https://doc.rust-lang.org/nightly/std/hint/fn.black_box.html) to prevent the compiler from seeing through the values.

Thanks!

Let me try to use black_box instead.

@foxhlchen
Author

If you add the -O option to your godbolt example you'll see it's all gone :)

The inner state you implemented is not sufficient. It's almost impossible to use static mut correctly; you might want to pass in a &AtomicUsize instead.

State isn't the only problem though: even with state, inlining will also mess up the benchmark. You can use #[inline(never)] and/or black_box (https://doc.rust-lang.org/nightly/std/hint/fn.black_box.html) to prevent the compiler from seeing through the values.

Another question:

Why can't we simply turn optimizations off? I can't find a way to disable them in cargo bench.

@ojeda
Member

ojeda commented Jun 11, 2021

Normally you would want to benchmark with optimizations enabled, but you can customize profiles in Cargo.toml.

@foxhlchen
Author

Normally you would want to benchmark with optimizations enabled, but you can customize profiles in Cargo.toml.

Oh, I figured it out! Thanks!
I should put

[profile.bench]
opt-level = 0

in Cargo.toml.

Now it looks reasonable:

fox:rustlinux_error_mb/ (main✗) $ cargo bench                                                   [21:51:39]
   Compiling error_benchmark v0.1.0 (/Users/fox/Workspace/RustProjects/rustlinux_error_mb)
    Finished bench [unoptimized] target(s) in 0.94s
     Running unittests (target/release/deps/error_benchmark-cb0c48d3b0fd54b7)

running 12 tests
test bench::bench_i16_100       ... bench:       2,069 ns/iter (+/- 23)
test bench::bench_i16_10000     ... bench:     197,681 ns/iter (+/- 591)
test bench::bench_i16_1000000   ... bench:  19,798,150 ns/iter (+/- 175,781)
test bench::bench_i32_100       ... bench:       1,754 ns/iter (+/- 15)
test bench::bench_i32_10000     ... bench:     171,666 ns/iter (+/- 4,611)
test bench::bench_i32_1000000   ... bench:  16,894,095 ns/iter (+/- 334,175)
test bench::bench_nzi16_100     ... bench:       2,280 ns/iter (+/- 28)
test bench::bench_nzi16_10000   ... bench:     218,514 ns/iter (+/- 4,271)
test bench::bench_nzi16_1000000 ... bench:  22,019,433 ns/iter (+/- 1,272,562)
test bench::bench_nzi32_100     ... bench:       1,850 ns/iter (+/- 20)
test bench::bench_nzi32_10000   ... bench:     180,518 ns/iter (+/- 2,496)
test bench::bench_nzi32_1000000 ... bench:  18,052,704 ns/iter (+/- 412,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 26.04s

It seems i32 is the best on my test machine (aarch64).

@bjorn3
Copy link
Member

bjorn3 commented Jun 11, 2021

With opt-level=0 you can't do any realistic benchmarks. The compiler won't optimize away any zero-cost abstractions. For example for item in vec.iter() { ... } can be faster than while i < vec.len() { ...; i+=1; } with optimizations due to no bounds checking, but without optimizations it is likely much slower.
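The zero-cost-abstraction point can be illustrated with the two loop shapes mentioned above (a sketch, not the benchmark code itself):

```rust
fn sum_iter(v: &[i32]) -> i32 {
    // Iterator form: with optimizations enabled, bounds checks are
    // elided and this compiles down to a tight loop.
    v.iter().sum()
}

fn sum_index(v: &[i32]) -> i32 {
    // Manual index form: every v[i] carries a bounds check that only
    // the optimizer can prove away; at opt-level=0 none of them are.
    let mut total = 0;
    let mut i = 0;
    while i < v.len() {
        total += v[i];
        i += 1;
    }
    total
}

fn main() {
    let v = [1, 2, 3, 4];
    assert_eq!(sum_iter(&v), 10);
    assert_eq!(sum_index(&v), 10);
}
```

Both return the same result; only the generated code differs, which is exactly what an unoptimized benchmark fails to reflect.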

The black_box calls didn't help because you likely passed a function item, not a function pointer, through the black box. This means the callee is still known from the type of the function item. Taking fn() -> Result<V, E> instead of F: Fn() -> Result<V, E> fixes this issue. Replacing the rt = match ... with rt += match ... inside use_result is also necessary to prevent the match from being optimized away.
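Condensed, the two fixes look like this (a sketch in the same shape as the benchmark, with a hypothetical return_ok standing in for the real return_* helpers):

```rust
use std::hint::black_box;

fn return_ok() -> Result<(), i32> {
    Ok(())
}

// Taking a plain function pointer `fn()` (rather than a generic
// `F: Fn()`) means the callee is no longer encoded in the type,
// so black_box can genuinely hide which function is called.
fn use_result(n: i32, f: fn() -> Result<(), i32>) -> i32 {
    let mut rt = 0;
    for _ in 0..n {
        // Accumulating (+=) instead of overwriting (=) keeps the
        // match from being optimized away entirely.
        rt += match f() {
            Ok(_) => 0,
            Err(_) => -1,
        };
    }
    rt
}

fn main() {
    let total = use_result(100, black_box(return_ok as fn() -> Result<(), i32>));
    assert_eq!(total, 0);
}
```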

With these changes I get the following result:

running 12 tests
test bench::bench_i16_100       ... bench:         195 ns/iter (+/- 1)
test bench::bench_i16_10000     ... bench:      18,602 ns/iter (+/- 58)
test bench::bench_i16_1000000   ... bench:   1,859,804 ns/iter (+/- 10,857)
test bench::bench_i32_100       ... bench:         233 ns/iter (+/- 1)
test bench::bench_i32_10000     ... bench:      22,339 ns/iter (+/- 190)
test bench::bench_i32_1000000   ... bench:   2,230,379 ns/iter (+/- 7,577)
test bench::bench_nzi16_100     ... bench:         196 ns/iter (+/- 0)
test bench::bench_nzi16_10000   ... bench:      18,607 ns/iter (+/- 280)
test bench::bench_nzi16_1000000 ... bench:   1,858,772 ns/iter (+/- 7,667)
test bench::bench_nzi32_100     ... bench:         233 ns/iter (+/- 1)
test bench::bench_nzi32_10000   ... bench:      22,298 ns/iter (+/- 112)
test bench::bench_nzi32_1000000 ... bench:   2,230,413 ns/iter (+/- 7,312)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 9.52s
$ cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz

Diff:

@@ -1,8 +1,10 @@
-#![feature(test)]
+#![feature(test, bench_black_box)]
 
-use core::num::{NonZeroI16, NonZeroI32};
 extern crate test;
 
+use std::hint::black_box;
+use std::num::{NonZeroI16, NonZeroI32};
+
 #[derive(Clone, Copy, PartialEq, Eq)]
 pub struct ErrorNzi32(NonZeroI32);
 pub struct ErrorNzi16(NonZeroI16);
@@ -67,13 +69,12 @@
         }
     }
 
-    fn use_result<F, V, E>(n: i32, f: F) -> (i32, Result<V, E>)
-    where
-        F: Fn() -> Result<V, E>,
+    #[inline(never)]
+    fn use_result<V, E>(n: i32, f: fn() -> Result<V, E>) -> (i32, Result<V, E>)
     {
         let mut rt :i32 = 0;
         for _ in 0..n {
-            rt = match f() {
+            rt += match f() {
                 Ok(_) => 0,
                 Err(_) => -1,
             };
@@ -84,62 +85,62 @@
 
     #[bench]
     fn bench_nzi32_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_nzi32));
+        b.iter(|| use_result(100, black_box(return_nzi32)));
     }
 
     #[bench]
     fn bench_nzi32_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_nzi32));
+        b.iter(|| use_result(10000, black_box(return_nzi32)));
     }
 
     #[bench]
     fn bench_nzi32_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_nzi32));
+        b.iter(|| use_result(1000000, black_box(return_nzi32)));
     }
 
     #[bench]
     fn bench_nzi16_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_nzi16));        
+        b.iter(|| use_result(100, black_box(return_nzi16)));
     }
 
     #[bench]
     fn bench_nzi16_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_nzi16));        
+        b.iter(|| use_result(10000, black_box(return_nzi16)));
     }
 
     #[bench]
     fn bench_nzi16_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_nzi16));        
+        b.iter(|| use_result(1000000, black_box(return_nzi16)));
     }
 
     #[bench]
     fn bench_i32_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_i32)); 
+        b.iter(|| use_result(100, black_box(return_i32)));
     }
 
     #[bench]
     fn bench_i32_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_i32)); 
+        b.iter(|| use_result(10000, black_box(return_i32)));
     }
 
     #[bench]
     fn bench_i32_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_i32)); 
+        b.iter(|| use_result(1000000, black_box(return_i32)));
     }    
 
     #[bench]
     fn bench_i16_100(b: &mut Bencher) {
-        b.iter(|| use_result(100, return_i16)); 
+        b.iter(|| use_result(100, black_box(return_i16)));
     }
 
     #[bench]
     fn bench_i16_10000(b: &mut Bencher) {
-        b.iter(|| use_result(10000, return_i16)); 
+        b.iter(|| use_result(10000, black_box(return_i16)));
     }
 
     #[bench]
     fn bench_i16_1000000(b: &mut Bencher) {
-        b.iter(|| use_result(1000000, return_i16)); 
+        b.iter(|| use_result(1000000, black_box(return_i16)));
     }    
 }

@nbdd0121
Copy link
Member

Actually, I looked at the assembly: black_box requires values to be stored to memory and re-loaded, so it's probably better to avoid black_box-ing the Result.

foxhlchen added a commit to foxhlchen/rustlinux_error_mb that referenced this pull request Jun 11, 2021
By Bjorn3:

With opt-level=0 you can't do any realistic benchmarks. The compiler won't optimize away any zero-cost abstractions. For example for item in vec.iter() { ... } can be faster than while i < vec.len() { ...; i+=1; } with optimizations due to no bounds checking, but without optimizations it is likely much slower.

black_box calls didn't help as you likely passed a function item and not function pointer through the black box. This means that it is still known which function is called based on the type of the function item. Taking fn() -> Result<V, E> instead of F: Fn() -> Result<V, E> fixes this issue. Replacing the rt = match ... with rt += match ... inside use_result is also necessary to prevent optimizing away the match.
@foxhlchen
Copy link
Author

With opt-level=0 you can't do any realistic benchmarks. The compiler won't optimize away any zero-cost abstractions. For example for item in vec.iter() { ... } can be faster than while i < vec.len() { ...; i+=1; } with optimizations due to no bounds checking, but without optimizations it is likely much slower.

black_box calls didn't help as you likely passed a function item and not function pointer through the black box. This means that it is still known which function is called based on the type of the function item. Taking fn() -> Result<V, E> instead of F: Fn() -> Result<V, E> fixes this issue. Replacing the rt = match ... with rt += match ... inside use_result is also necessary to prevent optimizing away the match.

With these changes I get the following result:

running 12 tests
test bench::bench_i16_100       ... bench:         195 ns/iter (+/- 1)
test bench::bench_i16_10000     ... bench:      18,602 ns/iter (+/- 58)
test bench::bench_i16_1000000   ... bench:   1,859,804 ns/iter (+/- 10,857)
test bench::bench_i32_100       ... bench:         233 ns/iter (+/- 1)
test bench::bench_i32_10000     ... bench:      22,339 ns/iter (+/- 190)
test bench::bench_i32_1000000   ... bench:   2,230,379 ns/iter (+/- 7,577)
test bench::bench_nzi16_100     ... bench:         196 ns/iter (+/- 0)
test bench::bench_nzi16_10000   ... bench:      18,607 ns/iter (+/- 280)
test bench::bench_nzi16_1000000 ... bench:   1,858,772 ns/iter (+/- 7,667)
test bench::bench_nzi32_100     ... bench:         233 ns/iter (+/- 1)
test bench::bench_nzi32_10000   ... bench:      22,298 ns/iter (+/- 112)
test bench::bench_nzi32_1000000 ... bench:   2,230,413 ns/iter (+/- 7,312)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 9.52s
$ cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz
model name      : Intel(R) Core(TM) i3-7130U CPU @ 2.70GHz

Diff:

Wow, thanks for such a detailed explanation!

I've merged this change and run it on my laptop (aarch64 M1)

   Compiling error_benchmark v0.1.0 (/Users/fox/Workspace/RustProjects/rustlinux_error_mb)
    Finished bench [optimized] target(s) in 1.31s
     Running unittests (target/release/deps/error_benchmark-2ca556197417b0f8)

running 12 tests
test bench::bench_i16_100       ... bench:         263 ns/iter (+/- 4)
test bench::bench_i16_10000     ... bench:      25,172 ns/iter (+/- 1,867)
test bench::bench_i16_1000000   ... bench:   2,504,335 ns/iter (+/- 18,154)
test bench::bench_i32_100       ... bench:         294 ns/iter (+/- 8)
test bench::bench_i32_10000     ... bench:      28,133 ns/iter (+/- 176)
test bench::bench_i32_1000000   ... bench:   2,835,037 ns/iter (+/- 25,389)
test bench::bench_nzi16_100     ... bench:         298 ns/iter (+/- 7)
test bench::bench_nzi16_10000   ... bench:      28,260 ns/iter (+/- 265)
test bench::bench_nzi16_1000000 ... bench:   2,815,281 ns/iter (+/- 16,268)
test bench::bench_nzi32_100     ... bench:         262 ns/iter (+/- 2)
test bench::bench_nzi32_10000   ... bench:      25,016 ns/iter (+/- 1,392)
test bench::bench_nzi32_1000000 ... bench:   2,502,383 ns/iter (+/- 11,788)

test result: ok. 0 passed; 0 failed; 0 ignored; 12 measured; 0 filtered out; finished in 13.84s

@nbdd0121
Copy link
Member

It seems that the compiler optimizes away

if I % 2 == 0 {
    return Ok(());
}

for the i16/i32 case to return_value.discriminant = I % 2, because Ok has a discriminant of 0 and Err has a discriminant of 1. Maybe this benchmark is a little too synthetic.
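The size win the PR is after, on the other hand, is easy to check directly: NonZeroI32 has a niche (the value 0) that the Ok/Err discriminant can occupy, so Result<(), NonZeroI32> fits in 32 bits. A minimal check:

```rust
use core::mem::size_of;
use core::num::NonZeroI32;

fn main() {
    // Niche optimization: NonZeroI32 can never be 0, so the compiler
    // uses 0 to encode Ok(()) and any nonzero value to encode Err(..),
    // and the whole Result fits in 4 bytes.
    assert_eq!(size_of::<Result<(), NonZeroI32>>(), 4);

    // A plain i32 payload has no forbidden value, so a separate
    // discriminant is needed and the Result grows to 8 bytes.
    assert_eq!(size_of::<Result<(), i32>>(), 8);
}
```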

Have you started working on ramfs @foxhlchen?

@foxhlchen
Copy link
Author

Have you started working on ramfs @foxhlchen?

Yes, I'm exploring the VFS interfaces, and then I will carefully figure out how to abstract them.
I don't expect it to be finished very soon.

What's your opinion?

@nbdd0121
Copy link
Member

Yes, I'm exploring the VFS interfaces, and then I will carefully figure out how to abstract them.
I don't expect it to be finished very soon.

What's your opinion?

I just think that ramfs might be a good real-life benchmark for this PR ;)

@foxhlchen
Copy link
Author

Yes, I'm exploring the VFS interfaces, and then I will carefully figure out how to abstract them.
I don't expect it to be finished very soon.
What's your opinion?

I just think that ramfs might be a good real-life benchmark for this PR ;)

Ah, I'm skeptical. Even though Result is used basically everywhere, it takes only a fraction of the total time, so the results will end up buried in noise.

I still think a microbenchmark is better. If it proves nzi32 is better, then we can put it in a real-life benchmark just to check that there is no significant regression.

@foxhlchen foxhlchen closed this Jan 3, 2023
fbq pushed a commit that referenced this pull request Dec 28, 2023
With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  #77      fentry_fexit:FAIL
  #78/1    fentry_test/fentry:FAIL
  #78      fentry_test:FAIL
  #82/1    fexit_test/fexit:FAIL
  #82      fexit_test:FAIL
  #112/1   kprobe_multi_test/skel_api:FAIL
  #112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  #112     kprobe_multi_test:FAIL
  #356/17  test_global_funcs/global_func17:FAIL
  #356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]