Docs at https://rainprotocol.github.io/rain.lib.hash
When producing hashes of just about anything that isn't already `bytes` the common suggestions look something like `keccak256(abi.encode(...))` or `keccak256(abi.encodePacked(...))`. This appears reasonable as Solidity itself does not provide an "any type" `keccak256` function but does provide one (almost) for abi encoding.

When I say "common suggestion" I mean literally the compiler itself gives outputs like this for any type other than `bytes`.
```
❯ keccak256(address(0))
```

Compiler errors:

```
error[7556]: TypeError: Invalid type for argument in function call. Invalid implicit conversion from address to bytes memory requested. This function requires a single bytes argument. Use abi.encodePacked(...) to obtain the pre-0.5.0 behaviour or abi.encode(...) to use ABI encoding.
  --> ReplContract.sol:14:19:
   |
14 |         keccak256(address(0));
```
This approach raises two questions for me:
- Why are we relying on interface encodings to satisfy cryptographic properties?
- Encoding requires complex recursive/nested data processing, memory expansion, and more than a full copy of the data to include headers; is that significant gas cost strictly necessary?
For the purpose of this document we are NOT attempting any specific compatibility with external systems or standards, etc.
The basic use case is that we are writing contracts that need to convince themselves that they should authorize some state change.
Often we find ourselves with a lot of state that informs the authorization verification logic. Too much to store, sign, etc. so first we want to "hash the data" and just store, compare, sign the hash.
It doesn't really matter in this case what the hashing algorithm is, as long as it gives us the security guarantees that the contract needs. If the hash is needed to be known offchain, e.g. so it can be passed back to a future call on the contract, then the contract can emit the hash into the logs etc.
We are even fine with changing the patterns described in this document over time. There's no requirement that a hash produced by one contract is compatible with the hash produced by another. Our goal is that contracts implementing these patterns can securely accept arbitrary inputs, NOT that the basic approach ossifies due to unrelated contracts doing different things to each other.
That is to say, there's no "upgradeable contract" support.
Further, while we do want to be able to support "any" data type in our pattern, we do NOT need to support "every" data type in our implementations. It may be relatively onerous to implement and maintain the required assembly logic to safely hash some structure, relative to just slapping an abi encoding on the problem and walking away. The intention is that this work would only be needed for maybe 1 or 2 structs within a codebase, because there wouldn't be a large variety of security sensitive hashing to be done for some given contract/context.
It's also assumed that, because these structs are used on the critical security path, there would be good reasons to design them for stability and simplicity already. In this case, the maintainability concerns that always arise when handling structs in assembly (adding/removing/reordering fields!) are naturally less of a concern due to their context/usage.
An earlier version of the EIP712 spec outlined the difficulty in relying on encoding formats to provide cryptographic guarantees.
> A good hashing algorithm should satisfy security properties such as determinism, second pre-image resistance and collision resistance. The `keccak256` function satisfies the above criteria when applied to bytestrings. If we want to apply it to other sets we first need to map this set to bytestrings. It is critically important that this encoding function is deterministic and injective. If it is not deterministic then the hash might differ from the moment of signing to the moment of verifying, causing the signature to incorrectly be rejected. If it is not injective then there are two different elements in our input set that hash to the same value, causing a signature to be valid for a different unrelated message.
>
> An illustrative example of the above breakage can be found in Ethereum. Ethereum has two kinds of messages, transactions 𝕋 and bytestrings 𝔹⁸ⁿ. These are signed using `eth_sendTransaction` and `eth_sign` respectively. Originally the encoding function `encode : 𝕋 ∪ 𝔹⁸ⁿ → 𝔹⁸ⁿ` was defined as follows:
>
> - `encode(t : 𝕋) = RLP_encode(t)`
> - `encode(b : 𝔹⁸ⁿ) = b`
>
> While individually they satisfy the required properties, together they do not: if we take `b = RLP_encode(t)` we have a collision. This was mitigated by modifying the second leg of the encoding function:
>
> `encode(b : 𝔹⁸ⁿ) = "\x19Ethereum Signed Message:\n" ‖ len(b) ‖ b` where `len(b)` is the ascii-decimal encoding of the number of bytes in `b`.
>
> This solves the collision between the legs since `RLP_encode(t : 𝕋)` never starts with `\x19`. There is still the risk of the new encoding function not being deterministic or injective. It is instructive to consider those in detail.
>
> As is, the definition above is not deterministic. For a 4-byte string `b` both encodings with `len(b) = "4"` and `len(b) = "004"` are valid. This can be solved by further requiring that the decimal encoding of the length has no leading zeros and `len("") = "0"`.
>
> The above definition is not obviously collision free. Does a bytestring starting with `"\x19Ethereum Signed Message:\n42a…"` mean a 42-byte string starting with `a` or a 4-byte string starting with `2a`? This was pointed out in Geth issue #14794 and motivated Trezor to not implement the standard as-is. Fortunately this does not lead to actual collisions as the total length of the encoded bytestring provides sufficient information to disambiguate the cases.
>
> Both determinism and injectiveness would be trivially true if `len(b)` was left out entirely. The point is, it is difficult to map arbitrary sets to bytestrings without introducing security issues in the encoding function. Yet the current design of `eth_sign` still takes a bytestring as input and expects implementors to come up with an encoding.
This is good context that has sadly been removed from current versions of the EIP.
The key takeaways are:
- We need something deterministic and injective, which can probably be summarised in a single word as "unambiguous"
- Hashing bytes is secure by default and any encoding scheme's security can only be less than or equal to the security of the hash of the raw data before it is encoded
- It is difficult to assess the cryptographic qualities of an encoding scheme and high profile mistakes can be found in the wild, including formal standards
Perhaps unsurprisingly we can find one of these issues in `abi.encodePacked`, as this encoding scheme simply concatenates bytes together. This means that `"abc" + "def"` and `"ab" + "cdef"` will pack to the same final bytestring, `"abcdef"`.
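For example, a minimal sketch (illustrative names only) showing that both packings produce identical bytes and therefore identical hashes:

```solidity
// Both calls pack to the 6 bytes "abcdef", so the two hashes are equal even
// though the input string pairs are different.
function packedCollisionExample() pure returns (bool) {
    bytes32 packedAbcDef_ = keccak256(abi.encodePacked("abc", "def"));
    bytes32 packedAbCdef_ = keccak256(abi.encodePacked("ab", "cdef"));
    // true: the packed encodings are indistinguishable.
    return packedAbcDef_ == packedAbCdef_;
}
```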
We can suggest potential workarounds like "only use packed encoding for fixed length input data", but it's clear that packed encoding is situationally useful at best, and dangerous at worst.
The suggested fix is usually to use `abi.encode`, which adds additional information to the raw data as part of the encoding. Something like `3"abc" + 3"def"` with length prefixes and `2"ab" + 4"cdef"`, and then additional head/tail structures that encode the offsets of the dynamic length data in an overall prefix to the encoded data.

https://docs.soliditylang.org/en/develop/abi-spec.html#formal-specification-of-the-encoding

Importantly, in light of the discussion in EIP712, the lengths are fixed length themselves, always represented as a `uint256`, so the full `abi.encode` encoding of the underlying data is probably safe.
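A quick illustration (again, illustrative names only) that the length prefixes in `abi.encode` disambiguate the same inputs that collide under packed encoding:

```solidity
// abi.encode includes offsets and lengths for dynamic data, so the string
// pairs that packed to identical bytes above now encode, and hash, differently.
function encodeNoCollisionExample() pure returns (bool) {
    bytes32 encodedAbcDef_ = keccak256(abi.encode("abc", "def"));
    bytes32 encodedAbCdef_ = keccak256(abi.encode("ab", "cdef"));
    // false: the length prefixes (3 and 3 vs. 2 and 4) differ in the encoding.
    return encodedAbcDef_ == encodedAbCdef_;
}
```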
So `abi.encode` doesn't have the problems of `abi.encodePacked` nor early geth implementations, but is that a strong proof that it doesn't introduce new problems?

If `abi.encode` was an efficient function, we could probably be happy that it has sufficient adoption and time without exploit to use it. Even if there was some issue, "nobody ever got fired for using `abi.encode`", right?
Doing the same thing as everyone else, and as the security researchers recommend in audits, is usually a good idea.
The issue here is that `abi.encode` is not particularly gas efficient. This is a fundamental issue and not at all the "fault" of Solidity. To encode anything with any algorithm and not cause the original data to be corrupted/unsafe to use, the EVM must allocate a new region of memory to house the encoded data. If we allow for dynamic length data types, the UNAVOIDABLE runtime overhead of ANY schemaless/uncompressed encoding algorithm is:
- Calculate the size of memory to allocate for the encoded output by recursively traversing the input data
- Allocate the memory and pay nonlinear gas for expansion costs
- Make a complete copy of the input data
- Write additional data for the encoding itself, e.g. type/length prefixes, headers, magic numbers, etc.
The Solidity type system can definitely make a lot of this more efficient, especially the traversal, by generating the traversal process at compile time, but it can't hand wave away the need for allocating and copying.
Typically, my experience has shown that if some algorithm `f(x)` is implemented in two functionally equivalent ways, where one implementation internally encodes `x` and the other avoids it, the no-encode solution often costs 40-80%+ less gas.

This saving is of course most noticeable when the algorithm is relatively efficient, or involves a tight internal loop over encoding, such that the encoding starts to dominate the profile. Even in cases where that is not true, such as comparing the reference SSTORE2 implementation to LibDataContract, we can still see 1k+ gas savings per-write for common usage patterns, with identical outcomes.
It really just seems to come down to the fact that memory expansion and bulk copying of nested/dynamic data is not cheap. It's typically not millions of gas, but it can easily be 1-10k+ gas for what is often unnecessary work.
Note however that `keccak256` itself is non destructive; it can happily produce a hash on the stack without modifying or allocating any memory at all. Even in the case that some data is NOT in memory yet and we want to hash it (e.g. it is on the stack), there is a dedicated region of memory from 0-0x40 called "scratch space for hashing methods". We can put any two words in the scratch space and hash them together without interacting with the allocator at all.
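For example, a minimal sketch (assuming `a_` and `b_` are two words already in scope) of hashing two stack values together using only the scratch space:

```solidity
bytes32 hash_;
assembly ("memory-safe") {
    // Write both words into the 0x00-0x40 scratch space and hash them,
    // without touching the free memory pointer or allocating anything.
    mstore(0, a_)
    mstore(0x20, b_)
    hash_ := keccak256(0, 0x40)
}
```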
What perhaps is the "fault" of Solidity is that it doesn't implement `keccak256` for any type other than `bytes`, so we are forced to go all the way to Yul and write assembly the moment we want to do anything other than `abi.encode`.

With all that in mind, the goals here are to:
- Define a pattern to hash any Solidity data structure that avoids allocation, minimises memory reads/writes, and is generally efficient
- Convince ourselves the pattern is unambiguous/secure, being both deterministic and injective
- Provide a reference implementation of the pattern that can be fuzzed against to show inline implementations of the pattern provide valid outputs
The memory layout of data in Solidity is very regular across all data types.
https://docs.soliditylang.org/en/v0.8.19/internals/layout_in_memory.html
It is optimised so that the allocator always has the free memory pointer at a multiple of 32.
Note that the memory layout is completely different to e.g. the storage layout. Everything discussed here is specific to data in memory and does not generalise at all.
All non-struct types end up in one of 3 buckets:
- 1 or more 32 byte words, of length defined by the type
- A 32 byte `length` followed by `length` 32 byte words (most dynamic types)
- A 32 byte `length` followed by `length` bytes (`bytes` and `string` only)
Note that in the last case, the allocator will still move the free memory pointer to a multiple of 32 bytes even if that points past the end of the data structure.
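As a quick illustration of the second bucket (a sketch; the function and variable names are illustrative only):

```solidity
// A uint256[] in memory is a 32 byte length immediately followed by `length`
// 32 byte words. list_ itself is just a pointer to the length word.
function wordListLayoutExample() pure returns (uint256 length_, uint256 firstItem_) {
    uint256[] memory list_ = new uint256[](2);
    list_[0] = 1;
    list_[1] = 2;
    assembly ("memory-safe") {
        // The word at the pointer is the length (2 here).
        length_ := mload(list_)
        // Items follow the length prefix, one word each.
        firstItem_ := mload(add(list_, 0x20))
    }
}
```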
The variables and structs that reference these things are pointers to them, or even nested pointers in the case of structs. The pointers can be either on the stack or in memory, depending on context.
Consider the struct
```solidity
struct Foo {
    uint256 a;
    address b;
    uint256[] c;
    bytes d;
}
```
If we had some `foo_` such that `Foo memory foo_ = Foo(...);` then `foo_` will be a pointer, either on the stack or in memory, depending on compiler optimisations.

The thing it points to falls into the first bucket, a 4-word region of memory defined by its type. This may not be intuitive but all of `uint256`, `address`, `uint256[]` and `bytes`, and all other types, are a full singular word in the struct.

Any types that are smaller than 1 word are padded with 0's such that they retain the same `uint256` equivalent value.

Any types that are larger, or potentially larger, than 1 word are pointers to that data, from the perspective of the struct.

This logic is applied recursively.

This means that structs are NOT dynamic length, regardless of how nested or how many dynamic types appear in their definition, or the definitions of types within their fields. For example, a `Foo` is ALWAYS 4 words, i.e. 0x80 bytes long.
Given the above, we can
- Define a pattern for hashing each of the 3 possible memory layouts
- Explain how to handle pointers across non-contiguous regions of memory
- Discuss the security of the composition
- Provide a guide for implementation, maintenance and quality assurance
In all cases where the size of the data is a known number of words at compile time we are free to simply hash the known memory region.
For example, we could hash a `foo_` as above like so

```solidity
assembly ("memory-safe") {
    let hash_ := keccak256(foo_, 0x80)
}
```
Ignore for now that `c` and `d` are pointers, as that will be discussed later in this document.
The basic point is that the code example shows that Yul handles what we need for known memory regions very naturally.
Other than implementation bugs, there's no potential for
- Collisions
- Including data that we did not intend to include in the hash input
- Failing to include some part of the struct
Because the size of the data never changes, we can just hardcode it per-type.
Most dynamic length types in Solidity are a list of 32 byte words. This includes lists of pointers like `Foo[]`, single byte values `bytes1[]`, etc.

The ONLY exceptions to the rule are the `bytes` and `string` types.
Again, ignoring pointers for now, we can hash any dynamic length word list as

```solidity
assembly ("memory-safe") {
    // Assume bar_ is some dynamic length list of words
    let hash_ := keccak256(
        // Skip the length prefix
        add(bar_, 0x20),
        // Read the length prefix and multiply by 0x20 to know how many _words_
        // to hash
        mul(mload(bar_), 0x20)
    )
}
```
Note that here we DO NOT include the length prefix in the bytes that we hash.
This gives us the same behaviour as the case of hashing static length data, but with lengths known only at runtime.
When it comes to composition we do not want to rely on length prefixes for safety guarantees, as they are not always available. As EIP712 explained, length prefixes can introduce ambiguity as easily as they can resolve it, as in the case of packed encoding. Ideally we can show security without the need for any additional metadata about our words.
The two byte-length types, `bytes` and `string`, are the only types in Solidity that may not have whole-word lengths. Even though the allocator retains a multiple of 0x20 on the free memory pointer, we MUST respect the true length of `bytes` and `string` in bytes, otherwise we introduce ambiguity.

If we did not respect the length then `hex"01"` and `hex"0100"` would hash to the same value, as they both have 0x20 bytes allocated to them in memory, even though the first is 1 byte and the second is 2 bytes in length.

The assembly for this is actually simpler than dealing with words as we do not need to convert the length into a byte count. It is the same for `string` and `bytes`.
```solidity
assembly ("memory-safe") {
    // Assume baz_ is some bytes/string
    let hash_ := keccak256(
        // Skip the length prefix
        add(baz_, 0x20),
        // Read the length prefix to know how many _bytes_ to hash
        mload(baz_)
    )
}
```
Note that pointers never appear in `bytes` nor `string`, or if they do, they are not going to be dereferenced by our hashing logic. The assembly above is all that is needed to hash `bytes` and `string` types.
It would be pointless to hash pointers (no pun intended). A pointer is merely an offset in memory, which has very little to do with the data on the other side of it, and is not even deterministic.
We find pointers in Solidity wherever something that is potentially larger than 1 word needs to fit in a single word slot. For example, any time a struct or dynamic type is an item or field in another struct or dynamic type.
Solidity does not allow mixed type lists so all pointers are at least found in predictable positions. We always know at compile time whether something is a pointer or not, either because it's a field at a known offset, or because we are dealing with an individual pointer or a list of pointers directly.
To reliably handle pointers without allocations:
- Compute the hash of all data up to the pointer
- Compute the hash of all data referenced by the pointer
- Hash these two hashes together
Using our `Foo` struct from above as an example this would look like:

- Hash the first two words as a contiguous memory region of known size as `A`
- Hash the dynamic word list `foo_.c` as `B`
- Write `A` and `B` to scratch space at `0` and `0x20` respectively
- Hash the scratch space to produce `C`
- Hash the bytes `foo_.d` as `D`
- Write `C` and `D` to scratch space as above
- Hash the scratch space to produce `E`, which is our final hash of `Foo`
As assembly it would look like

```solidity
assembly ("memory-safe") {
    // Hash foo_.a and foo_.b together to produce hash A
    // Store A in scratch
    mstore(0, keccak256(foo_, 0x40))
    // Follow the pointer to hash foo_.c into B
    let deref_ := mload(add(foo_, 0x40))
    // Store B in scratch
    mstore(0x20, keccak256(add(deref_, 0x20), mul(mload(deref_), 0x20)))
    // Hash A and B to produce C which can be stored direct in scratch
    mstore(0, keccak256(0, 0x40))
    // Follow the pointer to hash foo_.d
    deref_ := mload(add(foo_, 0x60))
    // Store D in scratch
    mstore(0x20, keccak256(add(deref_, 0x20), mload(deref_)))
    // Hash C and D in scratch to produce the final hash E
    let E := keccak256(0, 0x40)
}
```
If we had a list of pointers, such as a `Foo[]`, then this would be modelled as a simple fold/reduce style accumulator where each item is hashed as above individually then hashed into the accumulator. I.e. hash `foos_[0]` to hash A, then hash `foos_[1]` to hash B, then write both to scratch and hash to produce C, then hash `foos_[2]` to hash D, and hash C and D to produce E, etc.
This process of iterating and accumulating a hash incrementally seems to be about 40% cheaper in gas terms than ABI encoding then hashing, based on some simple testing with Foundry.
If we are handling a pointer and have no hash yet, e.g. we're directly hashing an array of pointers, we start with the hash of nil bytes, i.e. `keccak256(0, 0)`.
If the array is 0 length then the hash will be the nil hash, regardless of the type behind the pointers.
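As a rough sketch of the accumulator (illustrative names only, not the library's API), assuming a `hashFoo` helper that implements the pointer-aware `Foo` hashing shown in the assembly above:

```solidity
function hashFooArray(Foo[] memory foos_) pure returns (bytes32 hash_) {
    assembly ("memory-safe") {
        // No prior hash to fold into, so start from the hash of nil bytes.
        // A zero length array therefore hashes to the nil hash.
        hash_ := keccak256(0, 0)
    }
    for (uint256 i_ = 0; i_ < foos_.length; i_++) {
        // hashFoo is assumed to hash a single Foo as described above.
        bytes32 itemHash_ = hashFoo(foos_[i_]);
        assembly ("memory-safe") {
            // Fold the item hash into the accumulator via the scratch space.
            mstore(0, hash_)
            mstore(0x20, itemHash_)
            hash_ := keccak256(0, 0x40)
        }
    }
}
```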
Assume that we're comfortable with concepts like blockchains and merkle trees, that rely on hashes of hashes to iteratively build a single hash out of a system of hashes.
We need to convince ourselves that the hashing process above is unambiguous for all permutations.
This relies on us accepting that `hash(hash(a) + hash(b))` will not collide with `hash(hash(c) + hash(d))` for all values of `a`, `b`, `c`, `d` except the trivial case where `a` = `c` and `b` = `d`.
As long as `hash(x)` and `hash(x) + hash(y)` don't produce collisions then this is true. This guarantee seems to come down to the strength of `keccak256` itself.

As `keccak256` always produces hashes exactly 32 bytes long for all inputs, we avoid all the issues seen with both including and excluding length prefixes, such as those seen in `abi.encodePacked`.
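A sketch of why the composition inherits this strength, assuming `keccak256` behaves as a collision resistant hash `H` with fixed 32 byte outputs (an assumption, not something proven here):

```latex
\text{Suppose } H(H(a) \parallel H(b)) = H(H(c) \parallel H(d)) \text{ with } (a,b) \neq (c,d). \text{ Then either:}
\begin{cases}
H(a) \parallel H(b) \neq H(c) \parallel H(d) & \Rightarrow \text{the outer } H \text{ collides on two distinct 64 byte inputs,} \\
H(a) \parallel H(b) = H(c) \parallel H(d) & \Rightarrow H(a) = H(c) \text{ and } H(b) = H(d) \text{ (fixed 32 byte outputs split unambiguously),} \\
& \phantom{\Rightarrow} \text{so at least one equality is an inner collision in } H.
\end{cases}
```

Either way, a collision in the composition yields a collision in `keccak256` itself.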
I'm not sure this constitutes a formal mathematical proof, but I'm not sure if one exists for `abi.encode` either :)
Uses nixos.

Install Nix - https://nixos.org/download.html.

Run `nix develop` in this repo to drop into the shell. Please ONLY use the nix version of `foundry` for development, to ensure versions are all compatible.

Read the `flake.nix` file to find some additional commands included for dev and CI usage.
Everything is under DecentraLicense 1.0 (DCL-1.0) which can be found in `LICENSES/`.

This is basically `CAL-1.0` which is an open source license: https://opensource.org/license/cal-1-0
The non-legal summary of DCL-1.0 is that the source is open, as expected, but also user data in the systems that this code runs on must also be made available to those users as relevant, and that private keys remain private.
Roughly it's "not your keys, not your coins" aware, as close as we could get in legalese.
This is the default situation on permissionless blockchains, so shouldn't require any additional effort by dev-users to adhere to the license terms.
This repo is REUSE 3.2 compliant https://reuse.software/spec-3.2/ and compatible with `reuse` tooling (also available in the nix shell here).

```
nix develop -c rainix-sol-legal
```
Contributions are welcome under the same license as above.
Contributors agree and warrant that their contributions are compliant.