Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

significantly cleanup and flesh out page on UB #158

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Gankra
Copy link
Contributor

@Gankra Gankra commented Aug 16, 2019

No description provided.

@Gankra
Copy link
Contributor Author

Gankra commented Aug 16, 2019

this is a followup on #149

cc @RalfJung: welcome to round 2, fight me

@Gankra
Copy link
Contributor Author

Gankra commented Aug 16, 2019

Extremely Rough Note before I wander off: I feel like I want to define a concrete notion of two different ways a type can occupy memory to address how unions/enums have a strange relationship with uninit memory.

Perhaps it's sufficient to just appeal to size vs "actual" memory? I believe we do not actually require/guarantee that e.g. moving an Option<u32> moves the memory footprint of the Some variant (ignoring the specific ongoing discussion of ptr::copy and padding which is weird).

Like it should be fine for us to check/know that the variant is None and just copy the tag. Too dizzy to think and need to bounce.

@RalfJung
Copy link
Member

I believe we do not actually require/guarantee that e.g. moving an Option moves the memory footprint of the Some variant (ignoring the specific ongoing discussion of ptr::copy and padding which is weird).

Indeed, padding does not have to be copied. Some UCG resources on this:

@@ -37,14 +40,10 @@ language cares about is preventing the following things:
* `dyn Trait` metadata is invalid if it is not a pointer to a vtable for
`Trait` that matches the actual dynamic trait the reference points to
* a `str` that isn't valid UTF-8
* an integer (`i*`/`u*`), floating point value (`f*`), or raw pointer read from
[uninitialized memory][]
* a non-padding byte that is [uninitialized memory][] (see discussion below)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one sticks out. Everything else here is "check the type, then do something", this one here is not. It also introduces the notion of padding, IMO unnecessarily. And finally I think it is wrong, unless you are saying that MaybeUninit<u8> has a padding byte, which however would also be wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, I am using very ambiguous wording here that requires the "discussion below" to have a meaning rigorous enough for your standards :)

But yes this is something that needs consideration (although remember: we just need to be conservatively correct, and not precisely correct). Things being "too UB" aren't an issue unless they preclude very important cases that everyone agrees must work.

Copy link
Member

@RalfJung RalfJung Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue with the wording here is that is is not ambiguous enough, making it just plain wrong in my eyes.

Things being "too UB" aren't an issue unless they preclude very important cases that everyone agrees must work.

They are an issue if we tell people on reddit (as I did) that e.g.

let x: [MaybeUninit<T>; N] = unsafe { MaybeUninit::uninit().assume_init() };

is okay (and is on current stable the only way to construct an element of this type) -- and then they go to the nomicon and see this described as UB, and their reaction will be "well screw the nomicon, it's clearly bogus".

Put another way, if you only want to be conservatively correct, just declare every unsafe block as UB. ;)


"Producing" a value happens any time a value is assigned, passed to a
function/primitive operation or returned from a function/primitive operation.
feature, but some stable stdlib types, like `NonNull`, make use of it.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"stdlib" is AFAIK not a term we use anyhwere. If you don't like "libstd" (because it seems to exclude libcore), what about "standard library"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genuinely shocked to learn that this is the only occurrence of either string in the nomicon, with "std" only being used in the title of "beneath std". Cool with keeping your version, completely thought I was just homogenizing the word with the rest of the book.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"libstd" is likely a bad choice though as some people read that to exclude libcore and liballoc.

So, I'd go for "standard library".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, sure, I guess

from something. Keep in mind references get to assume their referents are valid,
so you can't even create a reference to an invalid value.

Additionally, [uninitialized memory][] is **always invalid**, so you can't assign it to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, this is incorrect in its generality. You are restricting this later, but I think that's too late. Also, first saying one thing and then later "well but we didn't really mean" it is really confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fair criticism. I wanted to try this approach out, but it's definitely worth rethinking.

Basically I kinda like this approach because, again, our primary interest is in preventing programmers from doing bad things. So "you can't do this" followed by "...except for here" isn't a terrible approach with that goal.

If people bounce off, they come away with a hyper-conservative model and try to avoid messing with uninitialized memory, which is good imo.

If people see that and get confused/angry, they can keep reading and go "aha!".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience (which is mostly based on writing scientific papers), one has to be very careful when leading with a wrong statement and correcting that later. This is definitely sometimes a good strategy, but it's a double-edged sword when you also consider e.g. your reader's faith in you. Usually, the least I'd do in a case like this is to add a footnote (would have to be a parenthetical here because footnotes are rendered too far away) saying something like "this is not strictly correct, we will refine this statement later". This at least prepares the reader for the blow coming later.

That said, in this case, I disagree with the entire approach. I think it is wrong to call out uninitialized memory as anything special. The fact that an uninitialized bool is UB should fall out of the same general principle that makes a NULL reference UB. We can have the desired rules for uninitialized memory based just on the notion of "valid value", with no need for programmers to learn another not-entirely-but-somewhat orthogonal concept.

On top of that, we should mentally prepare for when rust-lang/unsafe-code-guidelines#71 gets resolved. The likely resolution (and IMO the best one) is that we will declare uninitialized integers as not being UB. Only when an uninitialized integer is fed into a primitive operation (like + or | or ==), then we have UB if there is anything uninitialized. With that in mind, I think the route you are trying here is even less appealing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression we were very far from resolving the uninit integer question satisfactorily. If that's not the case, I agree we should focus on a tighter definition.

Copy link
Member

@RalfJung RalfJung Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't dare making predictions. But the person arguing most strongly against allowing uninit integers was me, and I changed my mind. gnzlbg went back and forth, not sure what their current stanza is. I don't actually know any argument against allowing uninit integers, besides the few that I opened the thread with, and I don't consider them convincing enough (any more) given the benefits of allowing uninit integers (mostly, everyone does it anyway^^).

If I had infinite time, the RFC would already have been written. I consider this the least controversial amongst the open questions around validity that we have. I might be missing a controversy though.

invalid value. No, it's not ok if you "don't use" or "don't read" the value.
Invalid values are **instant Undefined Behaviour**. The only correct way to
manipulate memory that could be invalid is with raw pointers using methods
like write and copy. If you want to leave a local variable or struct field
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be links to the method docs.


Well, ok, only sort of.

While it's true that the language itself doesn't define that much Undefined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really true either because intrinsics are part of the language spec. That's what separates them from normal standard library functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree. Intrinsics are part of the compiler+stdlib fused implementation. It is not, to my knowledge, expected that you can hot-swap any stdlib implementation with any compiler implementation. Certainly this is not the case for C++.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intrinsics are definitely considered T-lang jurisdiction in Rust. They also have to be defined as primitive operations in the Rust Abstract Machine.

The only difference between intrinsics and e.g. MIR binops is their syntax and data structure representation, really. Both + and intrinsics::overflowing_add are primitive language operations, and that one is an "intrinsic" and one is not is basically an implementation detail.

I don't see any way to treat intinsics as not being part of the language. Sure, they can be compiler-specific language extensions, but that still makes them very different from library functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be quite surprised and disappointed if writing a correct Rust program ever required a programmer to meaningfully distinguishing a function call from "an operation in the Rust abstract machine".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But anyway, the important point here is that intrinsic UB only shows up when you use the instrinsic. I'll think about tweaking this line.

Copy link
Member

@RalfJung RalfJung Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be quite surprised and disappointed if writing a correct Rust program ever required a programmer to meaningfully distinguishing a function call from "an operation in the Rust abstract machine".

I don't understand what you mean here.

Most function calls, say to String::new, are an instance of the general "function call" operation of the abstract machine: the machine goes on running the code of the function body. But intrinsics only look like function calls, they really are not. There is no code to run. They are a primitive operation.

So, much like 1+2 is not defined by "look up some code somewhere and execute that" (addition is a primitive operation of the language/machine), in the same way overflowing_add(1,2) is not defined by "look up some code somewhere" -- this, too is a primitive operation of the language/machine.

Put differently, if you are saying intrinsics are not "part of the language", then why do you call + "part of the language"? Or do you not?

But anyway, the important point here is that intrinsic UB only shows up when you use the instrinsic. I'll think about tweaking this line.

Indeed most our primitive operations do not have UB. Pointer deref (*) does, though, as does union field access. Both have UB that "only shows up when you use" them, exactly like intrinsics.


It is *technically not* Undefined Behavior to run a value's destructor twice.
Authors of destructors may however assume this doesn't happen. For instance, if
you drop a Box twice it will almost certainly result in Undefined Behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to hedge, this is a double-free.

Copy link
Contributor Author

@Gankra Gankra Aug 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am hedging here because technically I think it's not UB if someone manages to reallocate the pointer before you run the second drop. You just freed their allocation, and they're probably going to Do An UB, but not necessarily. Or is their some fancy compiler-knows-about-malloc shenanigans where this is still UB because the compiler "knows" you're not allowed to get lucky and free someone else's allocation like that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointer provenance rules (even the basic C-style provenance described in the UCG glossary) mean that even if the physical address gets reused, you would not be allowed to use old pointers to access the new allocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet




## Not Technically Fundamental Undefined Behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is useful to introduce the terminology of "language-level UB" and "library-level UB" here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the interesting thing to define here is the notion of "things safe code can't ever be allowed to do, but isn't actually UB (but will almost certainly lead to UB)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say this because this is specifically a page about the distinction between safe/unsafe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would basically be what I call the "safety invariant".


In simple and blunt terms: you cannot ever even *suggest* the existence of an
invalid value. No, it's not ok if you "don't use" or "don't read" the value.
Invalid values are **instant Undefined Behaviour**. The only correct way to

This comment was marked as resolved.

@nico-abram
Copy link
Contributor

Should there be a mention that null vtable pointers for even trait object raw pointers (And not just references which I believe could be understood from https://github.com/rust-lang-nursery/nomicon/pull/149/files#diff-9f9e3daa6dcba4f53211916eb094e123R37) are UB (Which iirc they are. Like in https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=82583ed18f9d0891adb54e5403269dec . rust-lang/rfcs#433 (comment) says it is).

@RalfJung
Copy link
Member

@nico-abram Agreed -- if we decide that we really want this non-NULL property, we should add it here. But I don't think this is clear-cut yet.

@nico-abram
Copy link
Contributor

Wouldn't being conservative on what's considered UB be best?

@RalfJung
Copy link
Member

Fair. In my mind this is still a rustc bug but I guess many won't agree. ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants