Deprecate most implicit calls to BUG() ? #283

Closed
TheSven73 opened this issue May 21, 2021 · 18 comments
Labels
• lib Related to the `rust/` library.

Comments

@TheSven73
Collaborator

TheSven73 commented May 21, 2021

Linus Torvalds / the kernel community has a strong dislike of BUG()/BUG_ON() calls, but the Rust core implicitly calls them from various places, such as assert macros, unwrap()/expect(), or overflow checks.

If I interpret the documentation correctly, then:

  • BUG()/BUG_ON() should “never” be called, as it’s deprecated. There are however some very limited cases where it would be appropriate. But it should never make it in without a carefully considered, valid reason - the exception, not the norm.
  • For "expected to be unreachable" or “impossible error conditions” use WARN()/WARN_ON(). Note that these functions return, and the kernel should continue “as gracefully as possible” on their return.
  • For “reachable but undesirable” situations, use pr_warn(). This is because some system owners set the “panic on WARN()” sysctl, which kills the system on a call to WARN(). We do not want this to happen for “reachable but undesirable” situations.

How would this translate to Rust? Bold suggestion:

  • For exceptional calls to BUG(), we create a Rust kernel_bug! macro.
  • Rust assert!, build_assert! and friends map to “expected to be unreachable” or “impossible” error conditions, so they must map to WARN(). Since WARN() returns and assert! does not, we should prohibit these standard macros outside of the Rust core. Replace with new assertion macros which do return.
  • Rust unwrap(), expect() calls also map to “impossible” error conditions. Since there is no way to recover gracefully from these, they should be disabled outside of the Rust core.
  • In fact, Rust kernel code should “never” map an “impossible error condition” to a BUG() call, with some very limited exceptions such as when it’s not possible or desirable to recover gracefully, e.g. when the Rust core triggers one of its internal assertions, or when an overflow is detected, as suggested previously by @kees?
  • For “reachable but undesirable situations” we can simply use pr_warn! as suggested.

I’m aware that my bold suggestion may generate some strong objections. But it’s good to discuss this and find out what the consensus or strategy is. We may refer back to any consensus reached here if the issue comes up again in the future?

Follow-up from this discussion.

@bjorn3
Member

bjorn3 commented May 21, 2021

In Rust, std::process::abort() should be used for cases where memory safety would be compromised by trying to continue execution of the rest of a thread inside a process. This is equivalent to a BUG() call in Linux.

In Rust a panic normally causes the current thread to unwind and exit, but not the whole process, unless it was either the main thread or the panic propagated through a poisoned mutex or closed channel. A kernel oops causes the current userspace process to exit, but not the whole kernel. Panics should map to an oops, I think.

Rust-analyzer uses https://docs.rs/always-assert/ for "recoverable assertions" that shouldn't happen but can be gracefully handled without tearing down the current thread or process. It is used like if never!(the_assertion) { /* handle assertion failure */ } (or its counterpart always!()).

@kees
Member

kees commented May 21, 2021

The kernel technically has WARN (issue Oops text and continue), BUG (issue Oops text and terminate kernel thread), and panic (issue text and kill the entire system).

Linus's objection to BUG is that it is usually not much better than panic, in the sense that it leaves locks held, etc. Using BUG wrecks system state so badly that the system will become unusable, hang, or panic anyway. Basically, surviving a BUG in a stable fashion should be considered "lucky". ;)

Now, that said, BUG is still preferred (in my mind) over panic. A panic should be used when the integrity of the entire system is compromised (e.g. stack canary overflows without VMAP stacks means other thread stacks could have been written to).

Is build_assert only a build failure? That should trivially map to BUILD_BUG_ON() if so.

I think math overflows should wrap around only when explicitly marked as "expected to wrap". Outside of that, we need to choose a way to deal:

  • WARN, wrap, and continue
  • WARN and saturate (see refcount_t)
  • BUG when user has selected a stronger mitigation

Right now, WARN can be upgraded to panic with the panic_on_warn=1 sysctl. I'm hoping to turn BUG_ON_DATA_CORRUPTION into a per-boot configurable, which then we can use here and in other data-corruption places.

As for memory errors, the kernel already has tons of ENOMEM handling, so plumbing that into the Rust core is likely the best approach. I don't think there is a one-to-one mapping for Rust error states and kernel error states.

@bjorn3
Member

bjorn3 commented May 21, 2021

Is build_assert only a build failure?

Yes

That should trivially map to BUILD_BUG_ON() if so.

That is a macro, right? Calling C macros from Rust requires writing a wrapper C function. Without cross-language LTO this can't be inlined, and as such it will always cause the build to fail, even when the assertion succeeds. Instead, on the Rust side, a macro has been written which I think does pretty much the same thing, but works without cross-language LTO.
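For illustration, the general link-time trick looks roughly like the sketch below (this is not the actual implementation of the kernel's build_assert!, and the symbol name is made up): a call to a deliberately undefined function only survives when the compiler cannot prove the condition, so the assertion fails at link time rather than at run time.

extern "C" {
    // Hypothetical symbol that is deliberately never defined anywhere.
    fn __hypothetical_build_assert_failed() -> !;
}

macro_rules! my_build_assert {
    ($cond:expr) => {
        if !$cond {
            // SAFETY: this call never executes at runtime; if the optimizer
            // cannot prove the condition and remove the call, the missing
            // symbol makes the link step fail instead of the running kernel.
            unsafe { __hypothetical_build_assert_failed() }
        }
    };
}

// Builds (and links) only if the compiler can see the condition always holds.
fn check_word_size() {
    my_build_assert!(core::mem::size_of::<usize>() >= 4);
}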

I think math overflows should wrap around only when explicitly marked as "expected to wrap". Outside of that, we need to choose a way to deal:

Rust only has options for panic-on-overflow or wrap-on-overflow, as defined by RFC 560. You can manually use methods like .saturating_add() or .checked_add() (which returns an Option), though.
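For reference, a minimal sketch of those explicit-behavior methods on plain u32 values:

// The three explicit-behavior families mentioned above.
fn add_variants(a: u32, b: u32) -> Option<u32> {
    let _wrapped = a.wrapping_add(b);   // always wraps (two's complement)
    let _clamped = a.saturating_add(b); // clamps at u32::MAX on overflow
    a.checked_add(b)                    // None on overflow, Some(sum) otherwise
}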

As for memory errors, the kernel already has tons of ENOMEM handling, so plumbing that into the Rust core is likely the best approach. I don't think there is a one-to-one mapping for Rust error states and kernel error states.

libcore doesn't do any allocations at all. liballoc will either be extended or completely replaced to handle OOM more gracefully.

@ojeda
Member

ojeda commented May 21, 2021

Linking another comment I wrote elsewhere: #40 (comment)

My interpretation of Linus Torvalds' comment on your RFC was that any panics from the Rust code were unacceptable to him. He focused on the out-of-memory panic that can be fixed by using a panic-free memory API but I'd think that he would have the same reaction to the 'invisible' panics inserted by the compiler to implement runtime overflow checks and runtime array bounds checks because they also cause the kernel to halt with loss of service and potential loss of data.
Of course, I may be over-generalizing his statement and/or he may shift his position...

They are very different cases:

  • The alloc-related panics are there simply because there is no fallible API for them yet. It is an interface design problem. An allocation failure does not mean the kernel has a bug, just that it does not have memory to serve a particular request.
  • On the other hand, OOB accesses, unexpected overflows, etc. are severe bugs with potentially unbounded consequences. In general, one cannot trust the environment after they occur. They are not a design problem in the sense of the alloc ones.

That is why the latter ones should panic, the same way dereferencing the start of the address space does (whether such a panic should kill only the current thread or the entire kernel is another question).

Furthermore, in general, potential loss of data is way better than silent corruption of data. Similarly, denial of service is way better than getting a server compromised.

@nbdd0121
Member

For exceptional calls to BUG(), we create a Rust kernel_bug! macro.

panic! is fundamental to Rust and its meaning cannot be altered. It must not return, so it is best mapped to BUG. I believe BUG will oops and kill the current thread. It is a little bit unfortunate that Rust panic! and Linux panic do not mean exactly the same thing. We probably need to document this so kernel developers know that panic! maps to BUG instead.

Rust assert!, build_assert! and friends map to “expected to be unreachable” or “impossible” error conditions, so they must map to WARN(). Since WARN() returns and assert! does not, we should prohibit these standard macros outside of the Rust core. Replace with new assertion macros which do return.

assert!, unwrap() and expect() must not return upon failure, otherwise invariants are violated and safety cannot be guaranteed. If an "unreachable" condition is reached or an "impossible" error condition occurs, it is very likely that something is messed up and it may not be safe for the driver to continue. I agree that in most cases the driver should try its best to exit gracefully in such a case, but I don't find it too awkward to BUG.

For “reachable but undesirable situations” we can simply use pr_warn! as suggested.

Agree.

Rust-analyzer uses docs.rs/always-assert for "recoverable assertions" that shouldn't happen but can be gracefully handled without tearing down the current thread or process. It is used like if never!(the_assertion) { /* handle assertion failure */ } (or its counterpart always!()).

It looks so nice! We probably want something similar, performing a WARN_ONCE on failure.
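A rough sketch of what such a macro could look like (the never! name and semantics are borrowed from always-assert and are hypothetical here; pr_warn! stands in for the WARN_ONCE-style reporting, with a plain AtomicBool providing the once-per-call-site behavior):

use core::sync::atomic::{AtomicBool, Ordering};

macro_rules! never {
    ($cond:expr) => {{
        let failed = $cond;
        if failed {
            // One report per call site, in the spirit of WARN_ONCE().
            static WARNED: AtomicBool = AtomicBool::new(false);
            if !WARNED.swap(true, Ordering::Relaxed) {
                pr_warn!("'impossible' condition hit: {}\n", stringify!($cond));
            }
        }
        failed
    }};
}

// Usage: recover gracefully instead of panicking.
fn clamp_len(len: usize) -> usize {
    if never!(len > 4096) {
        return 4096;
    }
    len
}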

I think math overflows should wrap around only when explicitly marked as "expected to wrap". Outside of that, we need to choose a way to deal:

  • WARN, wrap, and continue
  • WARN and saturate (see refcount_t)
  • BUG when user has selected a stronger mitigation

As @bjorn3 said, Rust has checked_add, saturating_add and wrapping_add. For normal adds, it can currently be configured to either silently wrap or trigger a panic. I don't think it'll be possible to redefine it to saturate, but I think it is possible that with a few tweaks to rustc we could make it call an arbitrary function instead of panicking. Obviously it's a lot of work and probably not worth the effort unless we're upstreamed.

For allocs, as @bjorn3 and @ojeda said, libcore does not do any memory allocation and we currently only panic! on OOM, as that's what liballoc currently does. Eventually we will move to fully fallible allocation.

@ojeda
Member

ojeda commented May 21, 2021

This has been my "panic model":

  • For (what I call) "unneeded" panics, like the current alloc ones: we just need to write the proper APIs. If a request can reasonably be fallible, it should be. [*]
  • For things that have unbounded consequences, like OOB stores or unintended overflows: there is no "fallible" way of dealing with them, because they were not expected nor intended to begin with. These are simply major bugs in kernel code, and the code should be fixed. [**]

[*] See my next message for an explanation/flowchart on how to approach it.

[**] What is the best trade-off here (killing the current thread, the entire device, etc.) depends on the particular system. For instance, I may want my desktop to not die due to a faulty driver so that I have a chance to save my work, since it is most likely just that: a bug. But for a public facing server? Surely I prefer it dies if someone managed to find a way to trigger an OOB store.

@ojeda
Member

ojeda commented May 21, 2021

(This will be part of the docs I am writing for coding guidelines etc.)
(Re-posting from scratch since I did a lot of edits, in case you are reading from email).


How to approach things that may fail in Rust?

  1. If you statically know that something cannot fail, then calling Rust-panicking functions is fine (e.g. unwrap()/expect()): this is where our // NOPANIC annotation comes into play.
  2. If you do not know, then it is most likely you should be using a fallible API instead (e.g. Result/Option) and bubbling up the error to fail the outer request (e.g. taking advantage of the ? operator).
  3. In the very rare cases where there is no reasonable way to continue, an actual kernel panic may be warranted. We document this with the // PANIC annotation.

Notice the difference between Cases 1 and 3. In the latter (// PANIC), the code is saying something may fail but there is no way to do anything reasonable if it happens. It is not a bug if the Rust-panic occurs. On the other hand, the former (// NOPANIC) asserts Rust-panicking is impossible by construction. If it actually happens, it is a bug that should be fixed.
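A minimal illustration of Cases 1 and 2 (Error is a placeholder type here, not the kernel's error type):

struct Error;

// Case 1: the proof is local and static, so an infallible signature and an
// indexing operation that could in principle panic are fine.
fn first_byte(buf: &[u8; 16]) -> u8 {
    // NOPANIC: the array always has 16 elements, so index 0 exists.
    buf[0]
}

// Case 2: failure is plausible, so the signature is fallible and the error
// bubbles up to the caller via `?`.
fn first_byte_checked(buf: &[u8]) -> Result<u8, Error> {
    let byte = buf.first().ok_or(Error)?;
    Ok(*byte)
}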

To be extra clear: "statically" really means statically known. If you think any of the following:

  • "It will almost never fail".
  • "Nobody will hit this in all probability -- this code path is very rare to begin with".
  • "Other subsystems or drivers would need to be failing for this to happen, so perhaps I can ignore it".

Then it is not statically known. Think about it the same way you think about unsafe: you need to be able to prove it is OK. If you cannot prove it, fall back to fallible APIs (the same way you fall back to marking a function as unsafe). Sometimes you might be able to design things in a different way to avoid using the fallible APIs. static_assert! and build_assert! may also help prove conditions -- use them liberally.

Let's see a particularly interesting example: integer arithmetic overflows. The guidelines above apply exactly the same way:

  • First of all, if you just need a particular behavior on overflow, then request that explicitly via saturating_*(), wrapping_*(), or use the Wrapping type, etc. These situations are not related to error handling or Rust-panics: they are just about writing what we actually want for the happy path.
  • Otherwise:
    • If you statically know that your arithmetic never overflows, use "naked" arithmetic (e.g. a + b) [†]. The compiler will insert overflow checks as needed [‡]. (Case 1 above)
    • If you do not know, then it is most likely you should be using a fallible API instead like checked_*() (if you do not care about the overflowed result) or overflowing_*() (if you do). (Case 2 above)
    • In the very rare cases where there is no reasonable way to continue if something has overflowed, an actual kernel panic may be warranted. (Case 3 above)

[†] Indeed, all our usage of "naked" integer arithmetic is asserting there is no possibility of overflow.

[‡] Unless the user disables it via CONFIG_RUST_OVERFLOW_CHECKS. Even then, overflowing with "naked" arithmetic is not undefined behavior in Rust (it has two's complement semantics). Nevertheless, we still consider it a bug. If you can justify performance needs, you may opt into unchecked_*(). This does introduce UB if you get it wrong, and so it is unsafe.
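A small sketch of how Cases 1 and 2 look for arithmetic (the helper and its types are made up for illustration):

// Case 1 vs. Case 2 for integer arithmetic (hypothetical helper, u32 only).
fn total_len(header_len: u32, body_len: u32) -> Option<u32> {
    // Case 1: both operands are masked to 16 bits, so the sum provably fits
    // in a u32; "naked" arithmetic asserts exactly that. // NOPANIC
    let _tag = (header_len & 0xffff) + (body_len & 0xffff);

    // Case 2: arbitrary inputs, so overflow cannot be ruled out statically;
    // use the fallible API and let the caller deal with None.
    header_len.checked_add(body_len)
}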

@ojeda added the prio: high and lib labels May 21, 2021
@TheSven73
Collaborator Author

I fear that there might be a significant misunderstanding/disconnect here between kernel and Rust communities. Look, us Rust people are primed to put safety and correctness first. That’s what attracted us to Rust in the first place. So when we encounter a situation where it’s possible or even likely that undefined behaviour could occur, our natural instinct is to use the standard Rust abstractions that panic and kill off the process. Imagine that we have “proven” that an Option cannot possibly be None at some point in our code. Or that a counter can never go below zero. So we believe it’s ok to call unwrap() and keep overflow panics on, because, well, if our proof doesn’t hold, then “there is no reasonable way to continue”, right? And when that Option turns out to be None, or that counter does go below zero, Rust panics, BUG() gets called, and we die. Dead code doesn’t tell any lies. And surely dead code must be better than potentially corrupted, exploitable code?

Except that it might not work that way. Linus has a huge user base that cares very deeply about safety and correctness. But, as unlikely as it may sound to us, he also has a huge interest in allowing a potentially broken, buggy, or corrupted system to go on as well as it can, for as long as it can. Even if that would risk memory corruption or exploits.

And that’s probably why the docs tell us to use WARN_ON() for “impossible” error conditions, and recover as gracefully as possible. Because this allows Linus to have it both ways: those who care about absolute safety can make WARN_ON() kill the system. Those who want the system to live as long as possible, simply tell Linux to warn and continue.

Of course BUG() or even panic() is still there for situations where it would be impossible or impractical to recover and keep the system running.

I suspect that this will also apply to things like math overflow. Users that need to keep going, want to have some form of best-effort recovery, such as saturate-with-WARN, or wrap-with-WARN, as @kees was suggesting. Again, those who value absolute safety can make WARN() kill the system. Users should likely be able to choose the desired wrap or saturate recovery behaviour via defconfig. Which would affect all math operations not explicitly marked as wrap in the Rust code itself.

And that’s why I proposed eliminating all use of Rust abstractions that panic and do not return, such as unwrap()/expect(), runtime assertions, or panicking overflow checks. We might keep a kernel_bug! type macro around just so we can tell people not to use it when they try to merge code that calls it :)

Seen in this light, it may be that linking an unmodified Rust core to the kernel could work against us. I’m unfamiliar with core’s internals, but it sounds plausible to me that it favours correctness. So when a bug creeps into core or someone misuses it - by creating a Duration where the nanosecond carry overflows the seconds counter, say - then some unwrap()/expect() or panic! will trigger deep inside core, and call BUG(). It’s simply not designed to recover gracefully. The first time this happens in the real world, there might be some uncomfortable questions to answer.

So, am I 100% certain about all of this? Not exactly. I understand only just enough to suspect that trouble might be brewing. It would be great to discuss this further with LKML and Rust people, hopefully more knowledgeable than me.

IMHO a targeted question on LKML along the lines of “is it acceptable for Rust kernel code to call BUG() if it thinks it got into an impossible situation?” could provide useful insights.

@TheSven73
Collaborator Author

TheSven73 commented May 22, 2021

(PS if what I outlined above is indeed an issue, I believe 80-90% of @ojeda's rule table is still sound. None of this should invalidate those very sensible policies)

@bjorn3
Member

bjorn3 commented May 22, 2021

Seen in this light, it may be that linking an unmodified Rust core to the kernel could work against us.

Modifying it is not an option. libcore works with exactly one rustc version and one rustc version only. It is deeply integrated with rustc, for example by defining all intrinsics, many essential lang items, and more. Trying to use the wrong libcore will cause compilation errors or even internal compiler errors (ICEs).

I’m unfamiliar with core’s internals, but it sounds plausible to me that it favours correctness.

Correct. Bugs in either libcore or user code that would violate memory safety will always panic or abort. If a way is found to violate memory safety, the GitHub issue for it will be marked I-unsafe and, most of the time, P-high or P-critical.

So when a bug creeps into core or someone misuses it - by creating a Duration where the nanosecond carry overflows the seconds counter, say - then some unwrap()/expect() or panic! will trigger deep inside core, and call BUG().

This particular example doesn't cause a panic, but I get what you mean. I think it depends on the implications of the bug. A bug that could violate memory safety would cause an abort or panic. Other bugs inside libcore are often caught by debug_assert!, I think, and thus not enabled in release mode. A quick search turned up a few assert! calls for a zero step_by() argument (which would otherwise create an infinite iterator) and some in float parsing/printing. Most assertions in libcore are debug_assert!() and thus don't run normally.

https://doc.rust-lang.org/stable/std/time/struct.Duration.html#method.new

If the number of nanoseconds is greater than 1 billion (the number of nanoseconds in a second), then it will carry over into the seconds provided.

@ojeda
Member

ojeda commented May 22, 2021

Except that it might not work that way. Linus has a huge user base that cares very deeply about safety and correctness. But, as unlikely as it may sound to us, he also has a huge interest in allowing a potentially broken, buggy, or corrupted system to go on as well as it can, for as long as it can. Even if that would risk memory corruption or exploits.

But this is already what we do -- see below.

And that’s probably why the docs tell us to use WARN_ON() for “impossible” error conditions, and recover as gracefully as possible. Because this allows Linus to have it both ways: those who care about absolute safety can make WARN_ON() kill the system. Those who want the system to live as long as possible, simply tell Linux to warn and continue.

There is a difference between "impossible" error conditions and statically known impossible conditions:

  • The former are things that you believe are impossible, but cannot prove. For those, you are still in Case 2, and you should still use fallible APIs.
  • The latter (Case 1) are things that come with a proof, and we require justification for them (same as unsafe). We are either:
    • For unsafe code: relying on the proof to avoid UB, and if UB happens, we are "continuing" because we do not even know it happened. So no BUG call here.
    • For safe code: Rust-panicking because would-be UB was detected, and that is the entire point of using Rust and the safe/unsafe split. (Note that if panicking is unreasonable, you should not be in Case 1.) The kernel also does this, e.g. on a null pointer deref.

Now, within Case 2, the guidelines above do not talk about how to report undesirable or unexpected conditions (that you cannot prove will not happen):

  • If what you see is something that is not expected at all by the code (yet cannot be statically proven impossible), you should call WARN. This is the domain of defensive programming. Typically it will be a bug that needs to be fixed, but not always (e.g. it could also be a hardware failure, a single-event upset, a compiler bug, etc.). Here is where Linus has a problem with calling BUG, but we are not doing that.
  • If what you see is something that is just undesirable for the user, use pr_warn.

In both of those you are still bubbling up and writing fallible APIs, regardless of what you do to report the detected problem.

And that’s why I proposed eliminating all use of Rust abstractions that panic and do not return, such as unwrap()/expect(), runtime assertions, or panicking overflow checks. We might keep a kernel_bug! type macro around just so we can tell people not to use it when they try to merge code that calls it :)

There is no reasonable way to eliminate all of them, just as there is not on the C side. Say you get a null pointer deref: what do you do?

Seen in this light, it may be that linking an unmodified Rust core to the kernel could work against us. I’m unfamiliar with core’s internals, but it sounds plausible to me that it favours correctness. So when a bug creeps into core or someone misuses it - by creating a Duration where the nanosecond carry overflows the seconds counter, say - then some unwrap()/expect() or panic! will trigger deep inside core, and call BUG(). It’s simply not designed to recover gracefully. The first time this happens in the real world, there might be some uncomfortable questions to answer.

That would be just a bug like any other. I do not see what the problem is. There are countless bugs going on in the kernel C side too, in the compilers themselves, in hardware... Nobody is claiming our Rust code (or core, or rustc, or LLVM) is bug-free.

IMHO a targeted question on LKML along the lines of “is it acceptable for Rust kernel code to call BUG() if it thinks it got into an impossible situation?” could provide useful insights.

What an "impossible situation" is matters -- see above.

@nbdd0121
Member

Some of the ideas are similar to @ojeda's, but since I typed it up I'll keep it anyway...


I don't think it's fair to just look at Rust code and say we need to get rid of all panic!s. What we should do is to compare with C code instead:

  • Rust code oopses on a bounds check failure. In C code, if you fail to check bounds and the buffer overflows, you either 1) trigger a page fault if you're lucky, and the kernel oopses anyway; or 2) corrupt the stack/heap and get a CVE, and the kernel will likely panic later at a different (potentially totally unrelated) location due to the corruption.
  • Rust code oopses if you unwrap a None. In C code, if you fail to check for a null pointer and dereference it, you get a trap and the kernel still oopses.

Do people check pointers for null and handle it in C code if they're certain that a pointer isn't null? No. They'll just dereference it, and the kernel will trap and oops as a result. So you shouldn't require Rust code to not unwrap an Option that is certainly not None. It's an oops in both cases.

@TheSven73
Collaborator Author

TheSven73 commented May 25, 2021

Thanks for the constructive discussion folks, it’s awesome to exchange ideas with such clever and knowledgeable people.

@bjorn3

Modifying it is not an option.

What is possible and what is acceptable are two separate things. If C required the kernel to link to a library with plenty of infallible checks that had no choice but to fail through BUG(), it would probably not be accepted by the community. They’d roll their own libraries - which they already do.

This particular example doesn't cause a panic

I’ll see your safe Rust and raise you a panic! straight from the library :)
https://doc.rust-lang.org/src/core/time.rs.html#173

@ojeda @nbdd0121
I believe that we agree at least 80% - 90% on Rust safety/failure rules. Sure, when we run into a null pointer deref or segmentation violation, we can only follow what C does and oops. We don't have any choice in the matter anyway, as this is enforced mostly outside of the language.

Now when it comes to things like (language) bounds checks, overflows, runtime assertions, panics coming out of the core library - I have a strong suspicion that they’ll have to be fallible through WARN(). IMHO this should apply even to unwrap()/expect() where we “prove” it can “never fail”. Human proofs are fallible. Even proofs that were correct at one point in time, could become invalidated by code changing around it, or even further away. There is no automatic mechanism to prevent this from happening - it all relies on human infallibility.

At this point I think we understand each other, we simply differ in opinion, and that’s fine. I am curious to see what feedback we’ll get on these issues when the next LKML drop happens. Maybe it’ll be just fine. Maybe there’ll be BUG/WARN issues we didn't discuss or think about yet.

@nbdd0121
Member

Sure, when we run into a null pointer deref or segmentation violation, we can only follow what C does and OOPS. We don’t have any choice in the matter anyways, as this is enforced mostly outside of the language.

Safe Rust should never do a null pointer deref or cause a segfault.

Now when it comes to things like (language) bounds checks, overflows, runtime assertions, panics coming out of the core library - I have a strong suspicion that they’ll have to be fallible through WARN(). IMHO this should apply even to unwrap()/expect() where we “prove” it can “never fail”.

I don't agree. I might not have been clear enough in my last message; let me try to persuade again :)

Suppose in C code we have a T*, and on the Rust side we may have an Option<&mut T>. We are very certain that the value is not null. So on the C side we just proceed to dereference it, and on the Rust side we do an unwrap(). A null pointer dereference in C certainly leads to an oops, and so should the Rust side. It does not make any sense to force the Rust code to WARN() instead. A similar argument applies to bounds checks.

In a sense, in C code you may have a lot of implicit oops opportunities, while in Rust code it's an explicit unwrap. There is nothing evil about the possibility of panicking in unwrap and expect -- the possibility is always there. It's as natural as dereferencing a pointer that is known to be non-null.

We shouldn't compare how likely Rust code is to call BUG() vs. C; we should compare how likely Rust code is to oops vs. C. Since safe Rust can't oops via a null pointer deref or segfault, counting BUG() calls will bias against Rust.
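To make the comparison concrete, here is the same "known to be non-null" assumption in both languages (illustrative only; the C version is shown as a comment so the sketch stays in Rust):

// C:    struct foo *p = get_foo(); ... use(p->field);   /* NULL deref => oops */
// Rust: the same assumption is written as an explicit unwrap(); if it turns
// out to be wrong, the panic plays the role of the oops.
fn read_value(value: Option<&u32>) -> u32 {
    *value.unwrap()
}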

One more thing: panic! in Rust doesn't have to be implemented as BUG(). panic! in Rust means that a bad thing was about to happen but didn't. Because the bad thing has not yet happened, there is a higher chance that all data structures are in a good state for the continued operation of the kernel. So panic! can be implemented as anything, as long as the function never returns. There is nothing preventing us from defining a Rust panic! as:

WARN(1, "Rust code panicked");
for (;;) {
	schedule_timeout_uninterruptible(MAX_SCHEDULE_TIMEOUT);
}
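For illustration, wiring Rust panics to such behavior could look like the sketch below in the kernel's #![no_std] environment, where rust_panic_stall is a hypothetical C helper wrapping the WARN()-and-sleep loop above:

use core::panic::PanicInfo;

extern "C" {
    // Hypothetical C helper wrapping the WARN() + sleep loop shown above.
    fn rust_panic_stall() -> !;
}

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    // Never returns, but does not have to be BUG().
    unsafe { rust_panic_stall() }
}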

@ojeda
Member

ojeda commented May 26, 2021

I’ll see your safe Rust and raise you a panic! straight from the library :) https://doc.rust-lang.org/src/core/time.rs.html#173

Yeah, for every API that has a precondition there could be three versions provided (fallible, panicking, unchecked). And I think it is a good thing to go and improve Rust's standard library with whatever we actually need.

In the particular case you mention, the panic is so hard to hit (huge values of u64 needed for secs) that it is possible one can statically prove it is not hit (in particular if you can assume you do not need durations of billions of years... :-).

Anyway, for fallible APIs, it is likely that we may want to provide a wrapper that returns a kernel::Result. Or we may need our own type for other reasons (e.g. like the binary string one). So I do not think it will be a big issue.

Now when it comes to things like (language) bounds checks, overflows, runtime assertions, panics coming out of the core library - I have a strong suspicion that they’ll have to be fallible through WARN(). IMHO this should apply even to unwrap()/expect() where we “prove” it can “never fail”.

@nbdd0121 has answered this, but one more point: for panics coming out of core or anywhere else (e.g. unwrap()/expect()), there is not much we can "do" because the code does not even account for the possibility of continuing, e.g.:

fn f() -> T {
    panic!() // What do we return as `T` if we "continue"?
}

Thus e.g. a failed unwrap() calling WARN() does not make sense to begin with. So either we use a fallible API or we write the proof; there is no other option.
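For comparison, the fallible shape of the same function (with placeholder types, not any existing kernel API) gives the failure path something to return and lets the caller bubble it up:

struct Error;

fn f() -> Result<u32, Error> {
    Err(Error) // the failure path now has a value to hand back
}

fn caller() -> Result<(), Error> {
    let value = f()?; // bubble the error up instead of Rust-panicking
    let _ = value;
    Ok(())
}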

For the rest, yes, we could define some "meaning" to the failure, e.g.:

  • For failed bound checks, performing the access.
  • For unexpected overflows, using the result.
  • For failed runtime asserts, not panicking.

But it would make the code unsound, which partially defeats the purpose of the safe subset. And we would need to be careful about not having the optimizer remove our WARN() call since, after all, it is "impossible" given it is UB.

If we were to do that, then at that point I would instead go with unchecked calls and SAFETY comments; that way at least we get soundness, the proofs written down, and a performance boost on top ;)

Human proofs are fallible. Even proofs that were correct at one point in time, could become invalidated by code changing around it, or even further away. There is no automatic mechanism to prevent this from happening - it all relies on human infallibility.

Of course, mistakes happen, and we will definitely have wrong or outdated proofs more than once. Which is why we should not use NOPANIC for no reason, the same way we should not use unsafe blocks (SAFETY) for no reason.

Proofs (for both NOPANIC and SAFETY) should be as simple as possible (e.g. using a type invariant) or, even better, avoided by going to a fallible API unless it is worth it (e.g. for performance gains or to avoid making an API fallible that could otherwise be infallible for users). The same way in C we do not make code extra complex and full of assumptions unless we get something in exchange.
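As an illustration of the type-invariant idea (names are made up, not existing kernel APIs): establish the invariant once at construction, and every later use gets a trivial NOPANIC proof.

const LEN: usize = 64;

/// An index into a `Buffer`.
///
/// INVARIANT: the contained value is always less than `LEN`.
struct Index(usize);

impl Index {
    /// Fallible only at the boundary: this is the single place the check lives.
    fn new(idx: usize) -> Option<Self> {
        if idx < LEN { Some(Self(idx)) } else { None }
    }
}

struct Buffer([u8; LEN]);

impl Buffer {
    fn get(&self, i: Index) -> u8 {
        // NOPANIC: by the invariant of `Index`, `i.0 < LEN`, so this indexing
        // operation can never fail.
        self.0[i.0]
    }
}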

And, hopefully, most of these proofs should live in the kernel crate, and not be scattered around in drivers etc.

@TheSven73
Collaborator Author

Can we freeze this discussion until after the next LKML drop? The feedback received from the kernel community should allow me to update my "internal model" of what's desirable and what's not.

In the meantime I'll treat this as settled, and follow @ojeda's rules above. You won't see any further PRs or Issues from me on this subject.

@ojeda
Member

ojeda commented May 26, 2021

I think that is fair. As you say, we may need to change our model after the next LKML submission. Nevertheless, I think it is worth trying to follow the approach above and see where we land.

To recap, the discussion is mainly about what to do with NOPANIC panicking operations that could instead have some meaningful behavior on failure (such as failed bounds checks, detected overflows, and UB-preventing asserts). There are a few approaches:

  • Force the use of fallible APIs. Sound, but it is a big burden for Rust code that not even C has: it would force us to make some requests fallible that we know are otherwise infallible. It adds complexity, including adding code paths that cannot be tested and will never be used (since they are supposed to be NOPANIC).
  • Rust-panic (the current one, see guidelines above). Here, there are several things a panic could do: panic(), BUG() or sleep/spin/etc. All of them are sound, because we do not continue on detected UB. It could be configurable. The disadvantage is making the device unusable in some cases.
  • WARN() and continue. Unsound because we are continuing on detected UB.
  • Force the use of unchecked APIs. Sound because we need to prove there is no UB. Performance advantage with respect to the others. Riskier because we need to maintain more unsafe code, i.e. if we get a SAFETY proof wrong, it is possible we just introduced a vulnerability (unlike with a NOPANIC proof).

@ojeda
Member

ojeda commented Oct 29, 2024

Closing -- when exactly to design APIs as panicking, or unchecked, or fallible, or WARN_ON_ONCE() + a "default" behavior is decided per-API, depending on the task/subsystem/... at hand, e.g. most recently on the fsleep() abstractions currently being discussed. Rust panics must be avoided unless there is really no reasonable way out. We do not currently use the // {NO,}PANIC annotations mentioned here. For arithmetic, we decided to keep using the operators/methods as usual in Rust (e.g. using + means it is a bug to overflow), and that any types defining operators should follow the same pattern -- we are adding this to the guidelines soon.

@ojeda closed this as completed Oct 29, 2024