3540: Clarifications, better wording, and change PC (#3715)
* Some updates and clarifications

* Better wording

* Change PC to be within the code section
axic authored Aug 11, 2021
1 parent ac1fcbc commit d64ec9a
Showing 1 changed file with 18 additions and 9 deletions.
27 changes: 18 additions & 9 deletions EIPS/eip-3540.md
@@ -14,6 +14,11 @@ requires: 3541

We introduce an extensible and versioned container format for the EVM with a once-off validation at deploy time. The version described here brings the tangible benefit of code and data separation, and allows for easy introduction of a variety of changes in the future. This change relies on the reserved byte introduced by [EIP-3541](./eip-3541.md).

To summarise, EOF bytecode has the following layout:
```
format, magic, version, (section_kind, section_size)+, 0, <section contents>
```
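
As an illustration of this layout, the following is a minimal sketch that assembles a container from a code and an optional data section. The concrete values are assumptions not shown in this diff: `MAGIC = 0x00` is the value assumed in the pattern-matching example further down, `VERSION`, `KIND_CODE` and `KIND_DATA` are assumed values for version 1, and `build_container` is a hypothetical helper name. Section sizes are written as 16-bit big-endian values.

```python
import struct

FORMAT = 0xEF      # reserved byte from EIP-3541
MAGIC = 0x00       # assumption: same value as in the pattern-matching example
VERSION = 0x01     # assumption: EOF version 1
KIND_CODE = 0x01   # assumption: section kind for code
KIND_DATA = 0x02   # assumption: section kind for data
TERMINATOR = 0x00  # the "0" that ends the list of section headers


def build_container(code: bytes, data: bytes = b"") -> bytes:
    """Assemble header and section contents in the order given by the layout."""
    sections = [(KIND_CODE, code)]
    if data:
        sections.append((KIND_DATA, data))

    header = bytes([FORMAT, MAGIC, VERSION])
    for kind, body in sections:
        # One kind byte plus a 16-bit big-endian size;
        # struct.error is raised if the size does not fit in 16 bits.
        header += struct.pack(">BH", kind, len(body))
    header += bytes([TERMINATOR])

    # Contents follow the header in declaration order, without padding.
    return header + b"".join(body for _, body in sections)
```

Under these assumptions the header of a container with both a code and a data section is 10 bytes long, and 7 bytes for a code-only container.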

## Motivation

On-chain deployed EVM bytecode contains no pre-defined structure today. Code is typically validated in clients to the extent of `JUMPDEST` analysis at runtime, every single time prior to execution. This poses not only an overhead, but also a challenge for introducing new or deprecating existing features.
@@ -25,12 +30,12 @@ The format described in this EIP introduces a simple and extensible container wi
The first tangible feature it provides is separation of code and data. This separation is especially beneficial for on-chain code validators (like those utilised by layer-2 scaling tools, such as Optimism), because they can distinguish code and data (this includes deployment code and constructor arguments too). Currently they a) require changes prior to contract deployment; b) implement a fragile method; or c) implement an expensive and restrictive jump analysis. Code and data separation can result in ease of use and significant gas savings for such use cases. Additionally, various (static) analysis tools can also benefit, though off-chain tools can already deal with existing code, so the impact is smaller.

A non-exhaustive list of proposed changes which could benefit from this format:
-- Including a `JUMPDEST`-table (to avoid analysis at execution time) or removing `JUMPDEST`s entirely.
+- Including a `JUMPDEST`-table (to avoid analysis at execution time) and/or removing `JUMPDEST`s entirely.
- Introducing static jumps (with relative addresses) and jump tables, and disallowing dynamic jumps at the same time.
- Requiring code section(s) to be terminated by `STOP`. (Assumptions like this can provide significant speed improvements in interpreters, such as a speed up of ~7% seen in [evmone](https://github.com/ethereum/evmone/pull/295).)
- Multi-byte opcodes without any workarounds.
- Representing functions as individual code sections instead of subroutines.
-- Introducing a specific section for the [EIP-2938 Account Abstraction](./eip-2938.md) "top-level AA execution frame", simplifying the proposal.
+- Introducing special sections for different use cases, notably Account Abstraction.

## Specification

@@ -86,7 +91,7 @@ If the terminator is encountered, section size MUST NOT follow.

The section contents follow after the header, in the order and size they are defined, without any padding bytes.

-To summarise, the bytecode has the following format:
+To summarise, the bytecode has the following layout:
```
format, magic, version, (section_kind, section_size)+, 0, <section contents>
```
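
The header rules above can be sketched as a small parser. This is an illustration under stated assumptions, not a reference implementation: `parse_container` is a hypothetical name, the magic and version bytes are skipped rather than validated, and section sizes are read as 16-bit big-endian values.

```python
import struct
from typing import List, Tuple


def parse_container(container: bytes) -> List[Tuple[int, bytes]]:
    """Return (section_kind, contents) pairs; raise ValueError if malformed."""
    if len(container) < 3 or container[0] != 0xEF:
        raise ValueError("missing EOF prefix")
    # container[1] is the magic and container[2] the version;
    # their values are not checked in this sketch.
    pos = 3

    headers = []
    while True:
        if pos >= len(container):
            raise ValueError("header terminator not found")
        kind = container[pos]
        pos += 1
        if kind == 0:
            break  # terminator: no section_size follows
        if pos + 2 > len(container):
            raise ValueError("truncated section_size")
        (size,) = struct.unpack_from(">H", container, pos)
        pos += 2
        headers.append((kind, size))
    if not headers:
        raise ValueError("at least one section is required")

    # Contents follow in the order and size declared, without padding.
    sections = []
    for kind, size in headers:
        body = container[pos:pos + size]
        if len(body) != size:
            raise ValueError("section contents truncated")
        sections.append((kind, body))
        pos += size
    if pos != len(container):
        raise ValueError("trailing bytes after last section")
    return sections
```

For example, `parse_container(bytes.fromhex("ef00010100010200020000aabb"))` yields one section of kind 1 holding a single `0x00` byte and one section of kind 2 holding `0xaabb`.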
@@ -109,11 +114,11 @@ A bytestream starting with the *EOF prefix* declares itself conforming to the ru

For clarity, the *container* refers to the complete account code, while *code* refers to the contents of the code section only.

-1. Jumpdest analysis is only run on the *code*.
-2. Execution starts at the first byte of the *code*, and `PC` is set to this position within the container format (e.g. `PC=10` for a *container* with a code and data section).
+1. JUMPDEST analysis is only run on the *code*.
+2. Execution starts at the first byte of the *code*, and `PC` is set to 0.
3. If `PC` goes outside of the code section bounds, execution aborts with failure.
-4. `PC` returns the current position within the *container*.
-5. `JUMP`/`JUMPI` uses an absolute offset within the *container*.
+4. `PC` returns the current position within the *code*.
+5. `JUMP`/`JUMPI` uses an absolute offset within the *code*.
6. `CODECOPY`/`CODESIZE`/`EXTCODECOPY`/`EXTCODESIZE`/`EXTCODEHASH` keeps operating on the entire *container*.
7. The input to `CREATE`/`CREATE2` is still the entire *container*.
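
A minimal sketch of what the revised rules mean for an interpreter: JUMPDEST analysis walks only the bytes of the code section, and the resulting offsets are relative to the start of that section. The function names and the `code_start` parameter are hypothetical; the opcode values (`JUMPDEST = 0x5b`, `PUSH1`..`PUSH32 = 0x60..0x7f`) are the standard EVM ones.

```python
from typing import Set

JUMPDEST = 0x5B
PUSH1, PUSH32 = 0x60, 0x7F


def valid_jumpdests(code: bytes) -> Set[int]:
    """Code-relative offsets of JUMPDEST opcodes, skipping PUSH immediates."""
    dests = set()
    pc = 0  # PC starts at 0 at the first byte of the code section
    while pc < len(code):
        op = code[pc]
        if op == JUMPDEST:
            dests.add(pc)
        elif PUSH1 <= op <= PUSH32:
            pc += op - PUSH1 + 1  # skip the immediate data bytes
        pc += 1
    return dests


def container_offset(pc: int, code_start: int) -> int:
    """Map a code-relative PC to an offset in the container.

    `code_start` is the (hypothetical) offset of the code section inside
    the container; only container-wide instructions such as CODECOPY
    would need this translation.
    """
    return code_start + pc
```

For the code bytes `60 5b 5b 00` (`PUSH1 0x5b`, `JUMPDEST`, `STOP`) the only valid destination is offset 2, because the first `0x5b` is push data.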

@@ -169,7 +174,11 @@ The `0xEF` byte was chosen because it is reserved for this purpose by [EIP-3541]
We have considered different questions for the sections:
- Streaming headers (i.e. `section_header, section_data, section_header, section_data, ...`) are used in some other formats (such as WebAssembly). They are handy for formats which are subject to editing (adding/removing sections). That is not a useful feature for EVM. One minor benefit applicable to our case is that they do not require a specific "header terminator". On the other hand they seem to play worse with code chunking / merkleization, as it is better to have all section headers in a single chunk.
- Whether to have a header terminator or to encode `number_of_sections` or `total_size_of_headers`. Both raise the question how large of a value these fields should be able to hold. While today there will be only two sections, in case each "EVM function" would become a separate code section, a fixed 8-bit field may not be big enough. A terminator byte seems to avoid these problems.
-- Whether to encode `section_size` as a fixed 16-bit value or some kind of variable length field (e.g. [LEB128](https://en.wikipedia.org/wiki/LEB128)). We have opted for fixed size, because it simplifies client implementations, and 16-bit seems enough, because of the currently exposed code size limit of 24576 bytes (see [EIP-170](./eip-170.md) and [EIP-2677](./eip-2677.md)). Should this be limiting in the future, a new EOF version could change the format.
+- Whether to encode `section_size` as a fixed 16-bit value or some kind of variable length field (e.g. [LEB128](https://en.wikipedia.org/wiki/LEB128)). We have opted for fixed size, because it simplifies client implementations, and 16-bit seems enough, because of the currently exposed code size limit of 24576 bytes (see [EIP-170](./eip-170.md) and [EIP-2677](./eip-2677.md)). Should this be limiting in the future, a new EOF version could change the format. Besides simplifying client implementations, not using LEB128 also greatly simplifies on-chain parsing.
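
To make the trade-off in the last point concrete, here is a small sketch comparing the two encodings of `section_size`. The function names are hypothetical; the 16-bit value is written big-endian, and the LEB128 variant is the standard unsigned encoding.

```python
import struct


def encode_size_u16(size: int) -> bytes:
    """Fixed-width encoding: always two bytes, big-endian."""
    return struct.pack(">H", size)


def encode_size_leb128(size: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, low bits first,
    high bit set on every byte except the last."""
    if size < 0:
        raise ValueError("size must be non-negative")
    out = bytearray()
    while True:
        byte = size & 0x7F
        size >>= 7
        if size:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

For the current code size limit of 24576 bytes, `encode_size_u16` produces `60 00` while `encode_size_leb128` produces `80 c0 01`. A reader of the fixed-width form needs a single shift and OR, whereas the variable-length form needs a loop with a continuation check per byte.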

### PC starts with 0 at the code section

The values for `PC` and `JUMP`/`JUMPI` start with 0 and are within the *code* section. We considered keeping `PC`/`JUMP`/`JUMPI` values to operate on the whole *container* and be consistent with `CODECOPY`/`EXTCODECOPY` but in the end decided otherwise. It looks to be much easier to propose EOF extensions that affect jumps and jumpdests when `JUMP`/`JUMPI` already operates on indexes within *code* section only. This also feels more natural and easier to implement in EVM: the new EOF EVM should only care about traversing *code* and accessing other parts of the *container* only on special occasions (e.g. in `CODECOPY` instruction).

## Backwards Compatibility

@@ -192,7 +201,7 @@ The choice of *magic* guarantees that none of the contracts existing on the chai
Given the rigid rules of EOF1 it is possible to implement support for the container in clients using very simple pattern matching (the following assumes `magic = 0x00`):

1. If code starts with `0xEF 0x00 0x01 codelen1 codelen2 0x02 datalen1 datalen2 0x00`, then calculate `total_size = (9 + (codelen1 << 8 | codelen2) + (datalen1 << 8 | datalen2))`. If `total_size == container_size` then it is a valid EOF1 code with a code and data section.
-2. If code starts with `0xEF 0x00 0x01 codelen1 codelen2 0x00`, then calculate `total_size = 7 + (codelen1 << 8 | codelen2)`. If `total_size == container_size` then it is a valid EOF1 code with a code section only.
+2. If code starts with `0xEF 0x00 0x01 codelen1 codelen2 0x00`, then calculate `total_size = 6 + (codelen1 << 8 | codelen2)`. If `total_size == container_size` then it is a valid EOF1 code with a code section only.
3. Otherwise if it starts with `0xEF`, it is invalid.
4. Otherwise if it does not start with `0xEF`, it is valid legacy code.
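
The four rules can be sketched as follows. The sketch matches the byte patterns exactly as quoted in the list (with `magic = 0x00`), using the constants 9 and 6 as the lengths of the two quoted header patterns; `classify` and its return strings are hypothetical names. If the version and the code section kind were encoded as two separate bytes, as the layout summary suggests, each offset and constant here would grow by one.

```python
def classify(container: bytes) -> str:
    """Return 'eof1', 'invalid' or 'legacy', following the rules literally."""
    c = container
    prefix = b"\xef\x00\x01"

    # Rule 1: EF 00 01 codelen1 codelen2 02 datalen1 datalen2 00
    if len(c) >= 9 and c[:3] == prefix and c[5] == 0x02 and c[8] == 0x00:
        total_size = 9 + (c[3] << 8 | c[4]) + (c[6] << 8 | c[7])
        if total_size == len(c):
            return "eof1"  # code and data section

    # Rule 2: EF 00 01 codelen1 codelen2 00
    if len(c) >= 6 and c[:3] == prefix and c[5] == 0x00:
        total_size = 6 + (c[3] << 8 | c[4])
        if total_size == len(c):
            return "eof1"  # code section only

    # Rule 3: anything else starting with 0xEF is invalid.
    if c[:1] == b"\xef":
        return "invalid"

    # Rule 4: everything else is legacy code.
    return "legacy"
```

Because the two patterns differ at byte 5 (`0x02` versus `0x00`), at most one of the first two rules can match a given input.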
