Skip to content

Commit

Permalink
Merge pull request rust-lang#175 from RalfJung/memory-interface
Browse files Browse the repository at this point in the history
WIP documents for memory interface and value domain
  • Loading branch information
RalfJung authored Aug 9, 2019
2 parents c426fdb + d63559b commit 09c0f97
Show file tree
Hide file tree
Showing 2 changed files with 293 additions and 0 deletions.
156 changes: 156 additions & 0 deletions wip/memory-interface.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
# Rust Memory Interface

**Note:** This document is not normative nor endorsed by the UCG WG. Its purpose is to be the basis for discussion and to set down some key terminology.

The purpose of this document is to describe the interface between a Rust program and memory.
This interface is a key part of the Rust Abstract Machine: it lets us separate concerns by splitting the Machine (i.e., its specification) into two pieces, connected by this well-defined interface:
* The *expression/statement semantics* of Rust boils down to explaining which "memory events" (calls to the memory interface) happen in which order - expressed as calls to the methods of this interface, and reactions to its return values.
This part of the specification is *pure* in the sense that it has no "state": everything that needs to be remembered from one expression evaluation to the next is communicated through memory.
* The Rust *memory model* explains which interactions with the memory are legal (the others are UB), and which values can be returned by reads.
A memory model is defined by implementing the memory interface.

The interface shown below is also opinionated in several ways.
It is not intended to be able to support *any imaginable* memory model, but rather start the process of reducing the design space of what we consider a "reasonable" memory model for Rust.
For example, it explicitly acknowledges that pointers are not just integers and that uninitialized memory is special (both are true for C and C++ as well but you have to read the standard very careful, and consult [defect report responses](http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm), to see this).
Another key property of the interface presented below is that it is *untyped*.
This implies that in Rust, *operations are typed, but memory is not* - a key difference to C and C++ with their type-based strict aliasing rules.
At the same time, the memory model provides a *side-effect free* way to turn pointers into "raw bytes", which is *not* [the direction C++ is moving towards](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf), and we might have to revisit this choice later if it turns out to not be workable.

## Pointers

One key question a memory model has to answer is *what is a pointer*.
It might seem like the answer is just "an integer of appropriate size", but [that is not the case][pointers-complicated].
This becomes even more prominent with aliasing models such as [Stacked Borrows].
So we leave it up to the memory model to answer this question, and make `Pointer` an associated type.
Practically speaking, `Pointer` will be some representation of an "address", plus [provenance] information.

[provenance]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/reference/src/glossary.md#pointer-provenance

## Bytes

The unit of communication between the memory model and the rest of the program is a *byte*.
Again, the question of "what is a byte" is not as trivial as it might seem; beyond `u8` values we have to represent `Pointer`s and [uninitialized memory][uninit].
We define the `Byte` type as follows, where `Pointer` will later be instantiated with the `Memory::Pointer` associated type.

```rust
enum Byte<Pointer> {
/// The "normal" case: a (frozen, initialized) integer in `0..256`.
Raw(u8),
/// An uninitialized byte.
Uninit,
/// One byte of a pointer.
PtrFragment {
/// The pointer of which this is a byte.
/// That is, the byte is a fragment of this pointer.
ptr: Pointer,
/// Which byte of the pointer this is.
/// `idx` will always be in `0..PTR_SIZE`.
idx: u8,
}
}
```

The purpose of `PtrFragment` is to enable a byte-wise representation of a `Pointer`.
On a 32-bit system, the sequence of 4 bytes representing `ptr: Pointer` is:
```
[
PtrFragment { ptr, idx: 0 },
PtrFragment { ptr, idx: 1 },
PtrFragment { ptr, idx: 2 },
PtrFragment { ptr, idx: 3 },
]
```

Based on the `PtrToInt` trait (see below), we can turn every initialized `Byte` into an integer in `0..256`:

```rust
impl<Pointer: PtrToInt> Byte<Pointer> {
fn as_int(self) -> Option<u8> {
match self {
Byte::Raw(int) => Some(int),
Byte::Uninit => None,
Byte::PtrFragment { ptr, idx } =>
Some(ptr.get_byte(idx)),
}
}
}
```

## Memory interface

The Rust memory interface is described by the following (not-yet-complete) trait definition:

```rust
/// All operations are fallible, so they return `Result`. If they fail, that
/// means the program caused UB. What exactly the `UndefinedBehavior` type is
/// does not matter here.
type Result<T=()> = std::result::Result<T, UndefinedBehavior>;

/// *Note*: All memory operations can be non-deterministic, which means that
/// executing the same operation on the same memory can have different results.
/// We also let all operations potentially mutate memory. For example, reads
/// actually do change the current state when considering concurrency or
/// Stacked Borrows.
/// This is pseudo-Rust, so we just use fully owned types everywhere for
/// symmetry and simplicity.
trait Memory {
/// The type of pointer values.
type Pointer: Copy + PtrToInt;

/// The size of pointer values.
const PTR_SIZE: u64;

/// Create a new allocation.
fn allocate(&mut self, size: u64, align: u64) -> Result<Self::Pointer>;

/// Remove an allocation.
fn deallocate(&mut self, ptr: Self::Pointer, size: u64, align: u64) -> Result;

/// Write some bytes to memory.
fn write(&mut self, ptr: Self::Pointer, bytes: Vec<Byte<Self::Pointer>>) -> Result;

/// Read some bytes from memory.
fn read(&mut self, ptr: Self::Pointer, len: u64) -> Result<Vec<Byte<Self::Pointer>>>;

/// Offset the given pointer.
fn offset(&mut self, ptr: Self::Pointer, offset: u64, mode: OffsetMode)
-> Result<Self::Pointer>;

/// Cast the given integer to a pointer. (The other direction is handled by `PtrToInt` below.)
fn int_to_ptr(&mut self, int: u64) -> Result<Self::Pointer>;
}

/// The `Pointer` type must know how to extract its bytes, *without any access to the `Memory`*.
trait PtrToInt {
/// Get the `idx`-th byte of the pointer. `idx` must be in `0..PTR_SIZE`.
fn get_byte(self, idx: u8) -> u8;
}

/// The rules applying to this pointer offset operation.
enum OffsetMode {
/// Wrapping offset; never UB.
Wrapping,
/// Non-wrapping offset; UB if it wraps.
NonWrapping,
/// In-bounds offset; UB if it wraps or if old and new pointer are not both
/// in-bounds of the same allocation (details are specified by the memory
/// model).
Inbounds,
}
```

We will generally assume we have a particular memory model in scope, and freely refer to its `PTR_SIZE` and `Pointer` items.
We will also write `Byte` for `Byte<Pointer>`.

This is a very basic memory interface that is incomplete in at least the following ways:

* To implement rules like "dereferencing a null, unaligned, or dangling raw pointer is UB" (even if no memory access happens), there needs to be a way to do an "alignment, bounds and null-check".
* There needs to be some way to do alignment checks -- either using the above operation, or by adding `align` parameters to `read` and `write`.
* To represent concurrency, many operations need to take a "thread ID" and `read` and `write` need to take an [`Ordering`].
* To represent [Stacked Borrows], there needs to be a "retag" operation, and that one will in fact be "lightly typed" (it cares about `UnsafeCell`).
* Maybe we want operations that can compare pointers without casting them to integers.

[pointers-complicated]: https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html
[uninit]: https://www.ralfj.de/blog/2019/07/14/uninit.html
[`Ordering`]: https://doc.rust-lang.org/nightly/core/sync/atomic/enum.Ordering.html
[Stacked Borrows]: stacked-borrows.md
137 changes: 137 additions & 0 deletions wip/value-domain.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Rust Value Domain

**Note:** This document is not normative nor endorsed by the UCG WG. Its purpose is to be the basis for discussion and to set down some key terminology.

The purpose of this document is to describe what the set of *all possible values* is in Rust.
This is an important definition: one key element of a Rust specification will be to define the [representation relation][representation] of every type.
This relation relates values with lists of bytes: it says, for a given value and list of bytes, if that value is represented by that list.
However, before we can even start specifying the relation, we have to specify the involved domains.
`Byte` is defined as part of [the memory interface][memory-interface]; this document is about defining `Value`.

[representation]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/reference/src/glossary.md#representation
[memory-interface]: memory-interface.md

## Values

The Rust value domain is described by the following (incomplete) type definition (using the `Pointer` type defined by [the memory interface][memory-interface]):

```rust
enum Value {
/// A mathematical integer, used for `i*`/`u*` types.
Int(BigInt),
/// A Boolean value, used for `bool`.
Bool(bool),
/// A pointer value, used for (thin) references and raw pointers.
Ptr(Pointer),
/// An uninitialized value.
Uninit,
/// An n-tuple, used for arrays, structs, tuples, SIMD vectors.
Tuple(Vec<Self>),
/// A variant of a sum type, used for enums.
Variant {
idx: BigInt,
data: Box<Self>,
},
/// A "bag of raw bytes", used for unions.
RawBag(Vec<Byte>),
/* ... */
}
```

The point of this type is to capture the mathematical concepts that are represented by the data we store in memory.
The definition is likely incomplete, and even if it was complete now, we might expand it as Rust grows.
That is okay; all previously defined representation relations are still well-defined when the domain grows, the newly added values will just not be valid for old types as one would expect.

## Example value relations

We show some examples for how one might want to use this `Value` domain to define the value relation for a type.

### `bool`

The value relation for `bool` relates `Bool(b)` with `[r]` if and only if `r.as_int() == Some(if b { 1 } else { 0 })`.
(`as_int` is defined in [the memory interface][memory-interface].)

### `()`

The value relation for the `()` type relates the empty tuple `Tuple([])` (assuming we can use array notation to "match" on `Vec`) with the empty byte list `[]`, and that's it.

### `!`

The value relation for the `!` type is empty: nothing is related to anything at this type.

### `#[repr(C)] struct Pair<T, U>(T, U)`

The value relation for `Pair` is based on the value relations for `T` and `U`.
A value `Tuple([t, u])` is represented by a list of bytes `rt ++ pad1 ++ ru ++ pad2` (using `++` for list concatenation) if:

* `t` is represented by `rt` at type `T`.
* `u` is represented by `ru` at type `U`.
* The length of `rt ++ pad1` is equal to the offset of the `U` field.
* The length of the entire list `rt ++ pad1 ++ ru ++ pad2` is equal to the size of `Pair<T, U>`.

This relation specifies that values of type `Pair` are always 2-tuples (aka, pairs).
It also says that the actual content of the padding bytes is entirely irrelevant, we only care to have the right number of them to "pad" `ru` to the right place and to "pad" the entire list to have the right length.
So, for example when considering `Pair<u8, u16>`, the value `Tuple[42, 119]` is represented on a little-endian target by `[Raw(42), byte, Raw(119), Raw(0)]` for *any* `byte: Byte`.

### `&T`/`&mut T`

Reference types are tricky.
But a possible value relation for sized `T` is:
A value `Ptr(ptr)` is related to `[PtrFragment { ptr, idx: 0 }, ..., PtrFragment { ptr, idx: PTR_SIZE-1 }]` if `ptr` is non-NULL and appropriately aligned (defining alignment is left open for now).

### `u8`

For the value representation of integer types, there are two different reasonable choices.
Certainly, a value `Int(i)` where `i` in `0..256` is related to `[b]` if `b.as_int() == Some(i)`.

And then, maybe, we also want to additionally say that value `Uninit` is related to byte list `[Uninit]`.
This essentially corresponds to saying that uninitialized memory is a valid representation of a `u8` value (namely, the uninitialized value).

### `NonZeroU8`

`NonZeroU8` is basically the same as `u8` except that we remove the 0, and certainly do not consider `Uninit` a valid value.
A value `Int(i)` where `i` in `1..256` is related to `[b]` if `b.as_int() == Some(i)`, and that's it.

### `union`

The `union` type does not even try to interpret memory, so for a `union` of size `n`, the value relation says that for any byte list `bytes` of that length, `RawBag(bytes)` is related to `bytes`.
(Note however that [this definition might not be implementable](https://github.com/rust-lang/unsafe-code-guidelines/issues/156).)

## The role of the value representation in the operational semantics

One key use of the value representation is to define a "typed" interface to memory:

```rust
trait TypedMemory: Memory {
/// Write a value of the given type to memory.
fn typed_write(&mut self, ptr: Self::Pointer, val: Value, ty: Type) -> Result;

/// Read a value of the given type.
fn typed_read(&mut self, ptr: Self::Pointer, ty: Type) -> Result<Value>;
}
```

Here, `Type` is some representation of the Rust type system (akin to [`Ty`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc/ty/type.Ty.html) in the compiler).
This interface is inspired by [Cerberus](https://www.cl.cam.ac.uk/~pes20/cerberus/).

We can implement `TypedMemory` for any `Memory` as follows:
* For `typed_write`, pick any representation of `val` for `ty`, and call `Memory::write`. If no representation exists, we have UB.
* For `typed_read`, read `ty.size()` many bytes from memory, and then determine which value this list of bytes represents. If it does not represent any value, we have UB.

In particular, this defines the semantics of a "typed copy": when an assignment is executed in Rust (at some type `T`), or when an argument of type `T` is passed to a function or a return value of type `T` returned from a function, this is called a "typed copy".
Its semantics is basically to do a `typed_read` at `T` followed by a `typed_write` at `T`.
In particular, this means doing a "typed copy" at `T` of some memory that has no valid representation at `T` is undefined behavior!
This also means that for types that have padding, the "typed copy" does not preserve the padding bytes.

## Relation to validity invariant

One way we *could* also use the value representation (and the author thinks this is exceedingly elegant) is to define the validity invariant.
Certainly, it is the case that if a list of bytes is not related to any value for a given type `T`, then that list of bytes is *invalid* for `T` and it should be UB to produce such a list of bytes at type `T`.
We could decide that this is an "if and only if", i.e., that the validity invariant for a type is exactly "must be in the value representation".
For many types this is likely what we will do anyway (e.g., for `bool` and `!` and `()` and integers), but for references, this choice would mean that *validity of the reference cannot depend on what memory looks like*---so "dereferencable" and "points to valid data" cannot be part of the validity invariant for references.
The reason this is so elegant is that, as we have seen above, a "typed copy" already very naturally is UB when the memory that is copied is not a valid representation of `T`.
This means we do not even need a special clause in our specification for the validity invariant---in fact, the term does not even have to appear in the specification---as everything juts falls out of how a "typed copy" applies the value representation twice.

Justifying the `dereferencable` LLVM attribute is, in this case, left to the aliasing model (e.g. [Stacked Borrows]), just like the `noalias` attribute.

[Stacked Borrows]: stacked-borrows.md

0 comments on commit 09c0f97

Please sign in to comment.