NEP | Title | Author | Status | DiscussionsTo | Type | Category | Created |
---|---|---|---|---|---|---|---|
399 | Flat Storage | Aleksandr Logunov <[email protected]>, Min Zhang <[email protected]> | Final | | Protocol Track | Storage | 30-Sep-2022 |
This NEP proposes Flat Storage, which stores a flattened map of key/value pairs of the current blockchain state on disk. Note that the original Trie (persistent merkelized trie) is not removed; Flat Storage makes storage reads faster, makes storage fees more predictable, and potentially allows decreasing them.
Currently, the blockchain state is stored only in the form of persistent merkelized tries. While this is needed to compute state roots and prove the validity of states, reading from a trie requires a traversal from the trie root to the leaf node that contains the key-value pair. Since each key byte corresponds to two nibbles on the trie path, this can mean up to 2 * key_length disk accesses in the worst case.
In addition, we charge receipts by the number of trie nodes they touched (TTN cost). Note that the number of touched trie nodes does not always equal the key length; it depends on the internal trie structure. Feedback collected from contract developers in the past shows that they are interested in predictable fees, but TTN costs are hard to predict and can lead to unexpectedly exceeding the gas limit. They are also a burden for NEAR Protocol client implementations, i.e. nearcore, because the exact TTN number must be computed deterministically by all clients. This prevents storage optimizations that use strategies other than the one nearcore uses today.
With Flat Storage, the number of disk reads is reduced from a worst case of 2 * key_length to exactly 2, and storage read gas fees are simplified by getting rid of the TTN cost; they can potentially be reduced further because fewer disk reads are needed.
Q: Why is this design the best in the space of possible designs?
A: The space of possible designs is quite big here; let's go over some meaningful alternatives.
The most straightforward one is to increase the TTN cost, or to align it with the biggest possible value, or alternatively to increase the base fees for storage reads and writes. However, technically the biggest possible TTN is 4096 as of today, and we strongly prefer to avoid increasing fees, because it may break existing contract calls, not to mention that it would greatly reduce the capacity of NEAR blocks: for current mainnet use cases the trie depth is usually below 20.
We also considered changing the tree type from a trie to an AVL tree, a B-tree, etc., to make the number of traversed nodes more stable and predictable. We explored the AVL idea, but the implementation turned out to be tricky, so we didn't go much further than a POC: near/nearcore#4815. Also, for most key-value pairs the tree depth would actually increase: for example, with 1M keys the depth is always about 20, which would increase fees as well. The size of intermediate nodes also increases, because we need to store a key in them to decide whether to go to the left or the right child.
A separate idea is to get rid of the global state root completely: #425. Instead, we could track the latest account state in the block where it was changed. On closer look this brings us to similar questions: if some key was untouched for a long time, it becomes harder to find the block that holds its latest value, and we need some tree-shaped structure again. Because ideas like this could also be extremely invasive, we stopped considering them at this point.
Other ideas around storage include exploring databases other than RocksDB, or moving State to a separate database. We could also tweak the State representation, e.g. start the trie key with the account id to achieve data locality within the same account. However, the main goal of Flat Storage is to speed up reads and make their costs predictable, and these ideas are orthogonal to that, although they can still improve storage in other ways.
Q: What other designs have been considered and what is the rationale for not choosing them?
A: There were several ideas on how to implement Flat Storage. One question is whether we need a single global flat storage for all shards or a separate flat storage per shard. Due to how sharding works, we have to make flat storages separate, because in the future a node may have to catch up a new shard while already tracking old shards, and the flat storage heads (see Specification) must be different for these shards.
Flat storage deltas are another tricky part of the design, but we cannot avoid them, because at certain points in time different nodes can disagree about the current chain head, and they have to support reads for some subset of the latest blocks with decent speed. We can't simply fall back to Trie in such cases because its storage reads are much slower.
Another implementation detail is where to put the flat storage head. The planned implementation doesn't rely on this choice significantly and it can be changed, but for the MVP we assume flat storage head = chain final head as the simplest solution.
Q: What is the impact of not doing this?
A: Storage reads will remain inefficiently implemented and cost more than they should, and the gas fees will remain difficult for the contract developers to predict.
The key idea of Flat Storage is to store a direct mapping from trie keys to values in the DB. The values in this mapping can be either the value corresponding to the trie key itself, or a value ref, a hash that points to the address of the value. If the value itself is stored, only one disk read is needed to look up a value from flat storage; if the value ref is stored, two disk reads are needed. We discuss whether to use values or value refs in the following section. For the purpose of the high-level discussion, it suffices to say that with Flat Storage, at most two disk reads are needed to perform a storage read.
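To make the read path concrete, below is a minimal sketch, with simplified stand-in types rather than the actual nearcore code, of how a lookup resolves a trie key in at most two disk reads when the flat state column stores value refs:

```rust
use std::collections::HashMap;

// Simplified stand-ins for nearcore types; illustration only.
type TrieKey = Vec<u8>;
type ValueRef = [u8; 32]; // hash identifying where the full value is stored

struct Db {
    flat_state: HashMap<TrieKey, ValueRef>, // analogue of the flat state column
    values: HashMap<ValueRef, Vec<u8>>,     // value storage addressed by hash
}

impl Db {
    /// Disk read #1: trie key -> value ref.
    fn read_value_ref(&self, key: &[u8]) -> Option<ValueRef> {
        self.flat_state.get(key).copied()
    }

    /// Disk read #2: value ref -> value bytes.
    fn read_value(&self, value_ref: &ValueRef) -> Option<Vec<u8>> {
        self.values.get(value_ref).cloned()
    }
}

/// At most two reads per storage lookup, regardless of trie depth.
fn flat_storage_get(db: &Db, key: &[u8]) -> Option<Vec<u8>> {
    let value_ref = db.read_value_ref(key)?; // read 1
    db.read_value(&value_ref)                // read 2
}
```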
The simple design above won't work because there could be forks in the chain. In the following case, FlatStorage must support key value lookups for states of the blocks on both forks.
```
        /  Block B1 - Block B2 - ...
Block A
        \  Block C1 - Block C2 - ...
```
The handling of forks will be the main consideration of the following design. More specifically, the design should satisfy the following requirements:
- It should support concurrent block processing. Blocks on different forks are processed concurrently in the nearcore Client code, the struct whose responsibilities include receiving blocks from the network, scheduling chunk application and writing the results to disk. The Flat Storage API must be aligned with that.
- In case of long forks, block processing time should not be too much longer than in the average case. We don't want this case to be exploitable. It is acceptable if block processing takes 200 ms longer, which may slow down block production but probably won't cause missed blocks and chunks. 10 s delays are not acceptable and may lead to more forks and instability in the network.
- The design must decrease the storage access cost in all cases, since we are going to change the storage read fees based on flat storage. We can't conditionally enable Flat Storage for some blocks and disable it for others, because the fees we charge must be consistent.
The mapping of key-value pairs that FlatStorage stores on disk matches the state at some block. We call this block the head of flat storage, or the flat head. During block processing, the flat head is set to the last final block. The Doomslug consensus algorithm guarantees that if a block is final, all future final blocks must be descendants of this block. In other words, any block that is not built on top of the last final block can be discarded because it will never be finalized. As a result, if we use the last final block as the flat head, any block FlatStorage needs to process is a descendant of the flat head.
To support key-value lookups for other blocks that are not the flat head, FlatStorage stores key-value changes (deltas) per block for these blocks. We call these deltas FlatStorageDelta (FSD). Let's say the flat storage head is at block h, and we are applying transactions based on block h'. Since h is the last final block, h is an ancestor of h'. To access the state at block h', we need the FSDs of all blocks between h and h'. Note that all these FSDs must be stored in memory; otherwise, accessing FSDs would trigger more disk reads and we would have to set the storage key read fee higher.
We prefer to store deltas in memory, because a memory read is much faster than a disk read, and even a single extra RocksDB access would require increasing storage fees, which is not desirable. To reduce delta size, we will store hashes of trie keys instead of the keys themselves, which is possible because deltas are read-only. Now let's carefully estimate the FSD size.
We can do so using the protocol fees as of today. Assume that flat state stores a mapping from keys to value refs. The maximal key length is ~2 KiB, which is the limit on a contract data key size. During wasm execution, we pay `wasm_storage_write_base` = 64 Ggas per call. The entry size is 68 B for the key hash and value ref. Then the total size of keys changed in a block is at most

chunk_gas_limit / gas_per_entry * entry_size * num_shards = (1300 Tgas / 64 Ggas) * 68 B * 4 ~= 5.5 MiB.

Assuming that we can increase RAM requirements by 1 GiB, we can afford to store deltas for 100-200 blocks simultaneously.
Note that if we store values instead of value refs, the size of FSDs can potentially be much larger. Because the value size limit is 4 MiB, we can't apply the previous argument about the base cost. Since `wasm_storage_write_value_byte` = 31 Mgas, the contribution of values to the FSD size can be estimated as (1300 Tgas / wasm_storage_write_value_byte * num_shards), or ~167 MiB. The same estimation for trie keys gives 54 MiB.
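For reference, these bounds can be reproduced directly from the constants quoted above; the following sketch uses exactly those numbers (the actual protocol parameters may change over time):

```rust
// Back-of-the-envelope check of the bounds above, using the numbers quoted in
// this section (the actual protocol parameters may differ).
fn main() {
    let chunk_gas_limit: f64 = 1300e12; // 1300 Tgas, as quoted above
    let write_base_gas: f64 = 64e9;     // wasm_storage_write_base = 64 Ggas
    let entry_size: f64 = 68.0;         // key hash + value ref, 68 B per entry
    let num_shards: f64 = 4.0;

    // Value-ref case: the number of entries is bounded by the base write fee.
    let max_entries_per_shard = chunk_gas_limit / write_base_gas;
    let delta_size = max_entries_per_shard * entry_size * num_shards;
    println!("max FSD size per block ~= {:.1} MiB", delta_size / (1024.0 * 1024.0));
    // ~5.3 MiB; with ~1 GiB of extra RAM this leaves room for 100-200 blocks.

    // Full-value case: bounded by the per-byte write fee instead.
    let write_value_byte_gas: f64 = 31e6; // wasm_storage_write_value_byte = 31 Mgas
    let value_bytes = chunk_gas_limit / write_value_byte_gas * num_shards;
    println!("values contribution ~= {:.0} MB", value_bytes / 1e6);
    // ~168 MB, matching the order of magnitude quoted above.
}
```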
The advantage of storing values instead of value refs is that it saves one disk read if the key has been
modified in the recent blocks. It may be beneficial if we get many transactions or receipts touching the same
trie keys in consecutive blocks, but it is hard to estimate the value of such benefits without more data.
We may store only short values ("inlining"), but this idea is orthogonal and can be applied separately.
Flat Storage itself doesn't change the protocol; we only change the affected storage costs to reflect the change in performance. Below we describe reads and writes separately.
The latest proposal for shipping storage reads is here. It solves several issues with costs, but the major impact of flat storage is that for reads, `wasm_touching_trie_node` and `wasm_read_cached_trie_node` are essentially reduced to 0. The reason is that previously we had to cover the costs of reading nodes from memory or disk, while with flat storage we make only 2 DB reads. The latest gas and compute costs can be found in the nearcore repo.
Storage writes are charged similarly to reads and include the TTN cost as well, because updating the leaf node that stores the value for a trie key requires updating all trie nodes on the path to that leaf. All writes are committed at once in one DB transaction at the end of block processing, outside of the runtime, after all receipts in a block are executed. However, at the time of execution the runtime needs to calculate the cost, which means it needs to know how many trie nodes the write affects, so the runtime issues a read for every write to calculate its TTN cost. Such reads cannot be served by FlatStorage because FlatStorage does not provide the path to the trie node.
There are multiple proposals on how storage writes can work with FlatStorage.
- Keep it the same. The cost of writes remains the same. Note that this can increase the cost of writes in some cases, for example if a contract first reads from a key and then writes to the same key in the same chunk. Without FlatStorage, the key is cached in the chunk cache after the read, so the write costs less. With FlatStorage, the read goes through FlatStorage, the write does not find the key in the chunk cache, and it therefore costs more.
- Remove the TTN cost from storage write fees. Currently, there are two ideas in this direction.
- Charge based on maximum depth of a contract’s state, instead of per-touch-trie node.
- Charge based on key length only.

Both of the above ideas would allow us to get rid of trie traversal ("reads-for-writes") on the critical path of block execution. However, it is unclear at this point what the new cost would look like and whether further optimizations are needed to bring down the cost of writes in the new cost model.
See https://gov.near.org/t/storage-write-optimizations/30083 for more details.
While storage writes are not fully implemented yet, we may increase the compute cost parameter (implemented in #455) for storage writes as an intermediate solution.
There are two main questions regarding how to enable FlatStorage.
- Whether there should be a database migration. The main challenge of enabling FlatStorage is building the flat state column, which requires iterating over the entire state. Estimations showed that this takes about 10 hours for archival nodes and 5 hours for RPC and validator nodes when using 8 threads. The main concern is that if migration takes too long for archival nodes, they may have a hard time catching up afterwards, since the block processing speed of archival nodes is not very fast.
Alternatively, we can build the flat state in a background process while the node is running. This provides a better experience for both archival and validator nodes, since the migration process is transparent to them. It would require more implementation effort on our side.
We currently proceed with background migration using 8 threads.
- Whether there should be a protocol upgrade. Enabling FlatStorage itself does not require a protocol upgrade, since it is an internal storage implementation that doesn't change anything at the protocol level. However, a protocol upgrade is needed if we want to adjust fees based on the storage performance with FlatStorage. These two changes can happen in one release, or they can be released separately. We propose that enabling FlatStorage and the protocol upgrade that adjusts fees happen in separate releases to reduce risk. The period between the two releases can be used to test the stability and performance of FlatStorage. Because enabling FlatStorage is not a protocol change, it is easy to roll back in case any issue arises.
FlatStorage will implement the following structs.
- `FlatStorageChunkView`: an interface for getting a value or value reference from flat storage for a specific shard, block hash and trie key. In the current design we plan to make it part of `Trie`, and all trie reads will be directed to this object. Though we could work with chunk hashes, we don't, because block hashes are easier to navigate.
- `FlatStorage`: an API for interacting with the flat storage of a fixed shard, including updating the head, adding new deltas and creating `FlatStorageChunkView`s. It holds the data flat storage keeps in memory, for example, all block deltas stored in flat storage and the flat storage head. A `FlatStorageChunkView` can access `FlatStorage` to get the list of deltas it needs to apply on top of the state at the current flat head in order to compute the state of a target block.
- `FlatStorageManager`: owns the flat storages of all shards, is stored in `NightshadeRuntime`, and accepts updates from the `Chain` side caused by successful processing of a chunk or block.
- `FlatStorageCreator`: handles creation of the flat storage structs, or initiates background creation (aka the migration process) if flat storage data is not present in the DB yet.
- `FlatStateDelta`: a HashMap that contains the state changes introduced in a chunk. Deltas can be applied on top of the state at the flat storage head to compute the state at another block.
The reason for having separate flat storages is that there are two modes of block processing: normal block processing and block catchup. Since they are performed on different ranges of blocks, flat storage needs to be able to support different ranges of blocks on different shards. Therefore, we separate the flat storage objects used for different shards.
`DBCol::FlatState` stores a mapping from trie keys to the values corresponding to those keys, based on the state of the block at the flat storage head.
- Rows: trie key (`Vec<u8>`)
- Column type: `ValueRef`
`DBCol::FlatStateDeltas` stores all existing FSDs as a mapping from `(shard_id, block_hash, trie_key)` to `ValueRef`. To read a whole delta, we read all values for the given key prefix. Such a delta stores all state changes introduced in the given shard by the given block.
- Rows: `{ shard_id, block_hash, trie_key }`
- Column type: `ValueRef`
Note that the `FlatStateDelta`s that are needed are kept in memory, so this column won't be used at all during block processing. It is only used to load deltas into memory at `FlatStorage` initialization time when the node starts.
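As an illustration of why a whole delta can be read with a single prefix iteration, a hedged sketch of how such a row key could be composed follows; the exact on-disk encoding is an internal nearcore detail and may differ:

```rust
/// Illustrative composition of a `DBCol::FlatStateDeltas` row key:
/// shard_id | block_hash | trie_key, so that all changes of one
/// (shard, block) pair share a common prefix and the whole delta can be read
/// with a single prefix iteration. The exact encoding in nearcore may differ.
fn flat_state_delta_row_key(shard_id: u64, block_hash: &[u8; 32], trie_key: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 32 + trie_key.len());
    key.extend_from_slice(&shard_id.to_le_bytes());
    key.extend_from_slice(block_hash);
    key.extend_from_slice(trie_key);
    key
}
```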
`DBCol::FlatStateMetadata` stores miscellaneous data about the flat storage layout, including the current flat storage head, the current creation status and information about the existence of deltas. We don't specify the exact format here because it is under discussion and can be tweaked until release.
Similarly, the flat head is also kept in `FlatStorage` in memory, so this column is only used to initialize `FlatStorage` when the node starts.
`FlatStateDelta` stores a mapping from trie keys to value refs. If the value is `None`, it means the key is deleted in the block.
```rust
pub struct FlatStateDelta(HashMap<Vec<u8>, Option<ValueRef>>);
```
```rust
pub fn from_state_changes(changes: &[RawStateChangesWithTrieKey]) -> FlatStateDelta
```
Converts raw state changes to a flat state delta. The raw state changes are returned as part of the result of `Runtime::apply_transactions`. They are converted to a `FlatStateDelta` to be added to `FlatStorage` during `Chain::postprocess_block` or `Chain::catch_up_postprocess`.
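A simplified sketch of what this conversion could look like, assuming each raw change record carries a trie key together with an ordered list of changes of which only the last one is effective; the types below are stand-ins, not the exact nearcore definitions:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the nearcore types referenced above; the real
// definitions live in nearcore and differ in detail.
type ValueRef = [u8; 32];

struct RawStateChange {
    data: Option<Vec<u8>>, // None means the key was deleted
}

struct RawStateChangesWithTrieKey {
    trie_key: Vec<u8>,
    changes: Vec<RawStateChange>, // ordered changes within the chunk
}

pub struct FlatStateDelta(HashMap<Vec<u8>, Option<ValueRef>>);

/// Placeholder for the real hashing used to produce value refs.
fn value_ref_of(data: &[u8]) -> ValueRef {
    let mut out = [0u8; 32];
    for (i, byte) in data.iter().enumerate() {
        out[i % 32] ^= *byte;
    }
    out
}

impl FlatStateDelta {
    /// Only the last change per trie key matters for the resulting state;
    /// `None` marks a key deleted in this chunk.
    pub fn from_state_changes(changes: &[RawStateChangesWithTrieKey]) -> FlatStateDelta {
        let mut delta = HashMap::new();
        for change in changes {
            if let Some(last) = change.changes.last() {
                let value_ref = last.data.as_deref().map(value_ref_of);
                delta.insert(change.trie_key.clone(), value_ref);
            }
        }
        FlatStateDelta(delta)
    }
}
```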
A `FlatStorageChunkView` will be created for a shard `shard_id` and a block `block_hash`, and it can perform key-value lookups for the state of shard `shard_id` after block `block_hash` is applied.
```rust
pub struct FlatStorageChunkView {
    /// Used to access flat state stored at the head of flat storage.
    store: Store,
    /// The block for which key-value pairs of its state will be retrieved. The flat state
    /// will reflect the state AFTER the block is applied.
    block_hash: CryptoHash,
    /// Stores the state of the flat storage
    flat_storage: FlatStorage,
}
```
`FlatStorageChunkView` will provide the following interface.
```rust
pub fn get_ref(
    &self,
    key: &[u8],
) -> Result<Option<ValueRef>, StorageError>
```
Returns the value or value reference corresponding to the given key for the state that this `FlatStorageChunkView` object represents, i.e., the state after block `self.block_hash` is applied.
`FlatStorageManager` will be stored as part of `ShardTries` and `NightshadeRuntime`. Similar to how `ShardTries` is used to construct new `Trie` objects given a state root and a shard id, `FlatStorageManager` is used to construct a new `FlatStorageChunkView` object given a block hash and a shard id.
```rust
pub fn new_flat_storage_chunk_view(
    &self,
    shard_id: ShardId,
    block_hash: Option<CryptoHash>,
) -> FlatStorageChunkView
```
Creates a new `FlatStorageChunkView` to be used for performing key-value lookups on the state of shard `shard_id` after block `block_hash` is applied.
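To illustrate how these interfaces are intended to be used together during chunk application, here is a small self-contained toy model; the stand-in types below only mirror the interfaces described in this section and are not the actual nearcore implementation:

```rust
use std::collections::HashMap;

// Minimal stand-ins mirroring the interfaces described in this section,
// not the actual nearcore types; error handling is simplified.
type ShardId = u64;
type CryptoHash = [u8; 32];
type ValueRef = [u8; 32];

#[derive(Debug)]
struct StorageError;

struct FlatStorageChunkView {
    // The real struct holds a Store handle, the block hash and the shard's
    // FlatStorage; here the state is inlined for brevity.
    state: HashMap<Vec<u8>, ValueRef>,
}

impl FlatStorageChunkView {
    fn get_ref(&self, key: &[u8]) -> Result<Option<ValueRef>, StorageError> {
        Ok(self.state.get(key).copied())
    }
}

struct FlatStorageManager {
    shard_states: HashMap<ShardId, HashMap<Vec<u8>, ValueRef>>,
}

impl FlatStorageManager {
    fn new_flat_storage_chunk_view(
        &self,
        shard_id: ShardId,
        _block_hash: Option<CryptoHash>,
    ) -> FlatStorageChunkView {
        FlatStorageChunkView {
            state: self.shard_states.get(&shard_id).cloned().unwrap_or_default(),
        }
    }
}

/// During chunk application: one chunk view per (shard, block), and all trie
/// reads for that chunk go through `get_ref`.
fn read_during_chunk_apply(
    manager: &FlatStorageManager,
    shard_id: ShardId,
    block_hash: CryptoHash,
    trie_key: &[u8],
) -> Result<Option<ValueRef>, StorageError> {
    let chunk_view = manager.new_flat_storage_chunk_view(shard_id, Some(block_hash));
    chunk_view.get_ref(trie_key)
}
```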
```rust
pub fn get_flat_storage(
    &self,
    shard_id: ShardId,
) -> Result<FlatStorage, FlatStorageError>
```
Returns the `FlatStorage` for shard `shard_id`. This function is needed because even though `FlatStorage` is part of `NightshadeRuntime`, `Chain` also needs access to `FlatStorage` to update the flat head. We will also create a function with the same name in `NightshadeRuntime` that calls this function, so that `Chain` can access `FlatStorage`.
```rust
pub fn remove_flat_storage(
    &self,
    shard_id: ShardId,
) -> Result<FlatStorage, FlatStorageError>
```
Removes flat storage for shard if we stopped tracking it.
`FlatStorage` is created per shard. It provides information about which blocks the flat storage of the given shard currently supports, and which block deltas need to be applied on top of the flat state stored on disk to get the state of a target block.
```rust
fn get_blocks_to_head(
    &self,
    target_block_hash: &CryptoHash,
) -> Result<Vec<CryptoHash>, FlatStorageError>
```
Returns the list of deltas between block `target_block_hash` (inclusive) and the flat head (exclusive). Returns an error if `target_block_hash` is not a descendant of the current flat head. This function is used in `FlatStorageChunkView::get_ref`. Note that we can't call it once and store the result while applying a chunk, because another block can be processed in parallel and the flat head can be updated.
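A minimal sketch of how this walk could look, assuming flat storage keeps an in-memory map from each block after the flat head to its previous block hash; the types are simplified stand-ins, not the exact nearcore structures:

```rust
use std::collections::HashMap;

type CryptoHash = [u8; 32];

/// Per-block information kept in memory (simplified).
struct BlockInfo {
    prev_hash: CryptoHash,
}

struct FlatStorageInner {
    flat_head: CryptoHash,
    /// Blocks after the flat head for which deltas are kept in memory.
    blocks: HashMap<CryptoHash, BlockInfo>,
}

#[derive(Debug)]
struct FlatStorageError;

impl FlatStorageInner {
    /// Returns block hashes from `target_block_hash` (inclusive) down to the
    /// flat head (exclusive), or an error if the target is not a descendant
    /// of the flat head known to flat storage.
    fn get_blocks_to_head(
        &self,
        target_block_hash: &CryptoHash,
    ) -> Result<Vec<CryptoHash>, FlatStorageError> {
        let mut hashes = Vec::new();
        let mut current = *target_block_hash;
        while current != self.flat_head {
            let info = self.blocks.get(&current).ok_or(FlatStorageError)?;
            hashes.push(current);
            current = info.prev_hash;
        }
        Ok(hashes)
    }
}
```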
```rust
fn update_flat_head(&self, new_head: &CryptoHash) -> Result<(), FlatStorageError>
```
Updates the head of the flat storage. This includes updating the flat head in memory and on disk, updating the flat state on disk to reflect the state at the new head, and garbage-collecting the `FlatStateDelta`s that are no longer needed, both from memory and from disk.
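A simplified sketch of what moving the head entails, assuming the deltas of all blocks between the old head (exclusive) and the new head (inclusive) are available in memory and ordered oldest first; this is an illustration only, not the nearcore implementation:

```rust
use std::collections::HashMap;

type CryptoHash = [u8; 32];
type ValueRef = [u8; 32];
type FlatStateDelta = HashMap<Vec<u8>, Option<ValueRef>>;

/// Moves the flat head forward by applying the deltas of the blocks between
/// the old head (exclusive) and the new head (inclusive), oldest first, to the
/// on-disk flat state, then garbage-collects the applied deltas.
fn update_flat_head(
    flat_state_on_disk: &mut HashMap<Vec<u8>, ValueRef>,
    deltas: &mut HashMap<CryptoHash, FlatStateDelta>,
    blocks_old_to_new: &[CryptoHash],
) {
    for block_hash in blocks_old_to_new {
        if let Some(delta) = deltas.remove(block_hash) {
            for (key, change) in delta {
                match change {
                    Some(value_ref) => { flat_state_on_disk.insert(key, value_ref); }
                    None => { flat_state_on_disk.remove(&key); }
                }
            }
        }
    }
    // Deltas of blocks that can no longer be finalized (abandoned forks)
    // would also be dropped here; omitted for brevity.
}
```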
```rust
fn add_block(
    &self,
    block_hash: &CryptoHash,
    delta: FlatStateDelta,
) -> Result<StoreUpdate, FlatStorageError>
```
Adds `delta` to `FlatStorage` and returns a `StoreUpdate` object that contains the DB transaction that must be committed to persist the change.
```rust
fn get_ref(
    &self,
    block_hash: &CryptoHash,
    key: &[u8],
) -> Result<Option<ValueRef>, FlatStorageError>
```
Returns the `ValueRef` from the flat storage state on top of block `block_hash`. Returns `None` if the key is not present, or an error if the block is not supported.
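A minimal sketch of the lookup order, with simplified types and the error handling for unsupported blocks omitted: deltas of the blocks between the target block and the flat head are consulted newest first, and the on-disk flat state is the final fallback.

```rust
use std::collections::HashMap;

type ValueRef = [u8; 32];

/// Looks up `key` for a target block. `deltas_target_to_head` is the delta
/// chain produced by a walk like `get_blocks_to_head`, newest block first;
/// the on-disk flat state holds the state at the flat head.
fn lookup_value_ref(
    deltas_target_to_head: &[&HashMap<Vec<u8>, Option<ValueRef>>],
    flat_state_on_disk: &HashMap<Vec<u8>, ValueRef>,
    key: &[u8],
) -> Option<ValueRef> {
    for delta in deltas_target_to_head {
        if let Some(change) = delta.get(key) {
            // A `None` change means the key was deleted in that block.
            return change.clone();
        }
    }
    // Key untouched since the flat head: read the flat state column.
    flat_state_on_disk.get(key).copied()
}
```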
We should note that the implementation of `FlatStorage` must be thread-safe because it can be accessed concurrently by multiple threads. A node can process multiple blocks at the same time if they are on different forks, and chunks from these blocks can trigger storage reads in parallel. Therefore, `FlatStorage` will be guarded by an `RwLock` so its access can be shared safely:
```rust
pub struct FlatStorage(Arc<RwLock<FlatStorageInner>>);
```
The main drawback is that we need to control the total size of state updates in the blocks after the current final head. Under the current testnet/mainnet load the number of blocks above the final head doesn't exceed 5 in 99.99% of cases, but we still have to consider extreme cases, because the Doomslug consensus doesn't give guarantees or an upper limit on that number. If we don't consider this at all and there is no finality for a long time, validator nodes can crash because too many FSDs are held in memory, and the chain slows down and stalls, which can have a negative impact on user/validator experience and reputation. For now, we claim that we keep enough deltas in memory for the chain to get finalized, and the proper discussion is likely to happen in NEPs like #460.
The risk of DB corruption slightly increases, and it becomes harder to replay blocks on the chain. While `Trie` entries are essentially immutable (in fact, the value for each key is unique, because the key is a value hash), `FlatStorage` is read-modify-write, because the values for the same `TrieKey` can be completely different. We believe that such a flat mapping is reasonable to maintain anyway, for example for the newly discovered state sync idea. But if some change is applied incorrectly, we may have to recompute the whole flat storage, and for block hashes before the flat head we can't access flat storage at all.
Though Flat Storage significantly reduces the amount of storage reads, we have to keep it up to date, which results in 1 extra disk write per changed key and 1 auxiliary disk write plus removal for each FSD. Disk requirements also slightly increase. We think this is acceptable, because the actual disk writes are executed in the background and are not a bottleneck for block processing. For storage writes in general, Flat Storage is even a net improvement, because it removes the need to traverse changed nodes during write execution ("reads-for-writes"), and we can apply optimizations there (see the "Storage Writes" section).
Implementing FlatStorage requires a lot of engineering effort and introduces code that makes the codebase more complicated. In particular, we had to extend the `RuntimeAdapter` API with a flat storage-related method after thorough consideration. We are confident that FlatStorage will bring a significant performance benefit, but we can only measure the exact improvement after the implementation. We may find that the benefit FlatStorage brings is not worth the effort, but that is very unlikely.
It will make state rollbacks harder in the future when we enable challenges in phase 2 of sharding. When a challenge is accepted and the state needs to be rolled back to a previous block, the entire flat state needs to be rebuilt, which could take a long time. Alternatively, we could postpone garbage collection of deltas and add support for applying them backwards.
Speaking of new sharding phases, once nodes no longer track all shards, Flat Storage must support adding or removing the state of a specific shard. Adding a new shard is a tricky but natural extension of the catchup process. Our current approach for removal is to iterate over all entries in `DBCol::FlatState` and determine for each trie key which shard it belongs to. We would be happy to assume that each shard is represented by a set of contiguous ranges in `DBCol::FlatState`, which would make removals simpler, but this is still under discussion.
Last but not least, resharding is not supported by current implementation yet.
Flat Storage maintains all state keys in sorted order, which seems beneficial. We are currently investigating an opportunity to speed up state sync: instead of traversing a state part in the Trie, we can extract a range of keys and values from Flat Storage and build the corresponding range of Trie nodes from it. It is well known that reading Trie nodes is a bottleneck for state sync as well.
The NEP was approved by Protocol Working Group members on March 16, 2023 (meeting recording):
- The proposal makes serving reads more efficient, making the NEAR protocol cheaper to use and increasing the capacity of the network.
- The proposal makes estimating gas costs for a transaction easier, as the fees for reading are no longer a function of the trie structure, whose shape the smart contract developer does not know ahead of time and which can continuously change.
- The proposal should open doors to enabling future efficiency gains in the protocol and further simplifying gas fee estimations.
- The proposal provides a 'secondary' index over the state data, which would allow further optimisations in the future.
# | Concern | Resolution | Status |
---|---|---|---|
1 | The cache requires additional database storage | There is an upper bound on how much additional storage is needed. The costs for the additional disk storage should be negligible | Not an issue |
2 | Additional implementation complexity | Given the benefits of the proposal, I believe the complexity is justified | Not an issue |
3 | Additional memory requirement | Most node operators are already operating over-provisioned machines which can handle the additional memory requirement. The minimum requirements should be raised but it appears that minimum requirements are already not enough to operate a node | This is a concern but it is not specific to this project |
4 | Slowing down the read-update-write workload | This is a common pattern in smart contracts, so it is indeed a concern. However, there are future plans to address this by serving writes from flat storage as well, which will also reduce the fees for serving writes and bring further improvements to the NEAR protocol | This is a concern but hopefully will be addressed in future iterations of the project |
Copyright and related rights waived via CC0.