Chain key value storage #3774

cheme · 2019-10-07T06:59:28Z

This is ongoing work on a new storage kind for Substrate. I am opening the PR to track progress and get feedback.
The amount of code is starting to grow, and I am at a point where I want to stabilize, clean up and document the code; putting a roadmap in place will help with this.

It also seems like a good time to start the design discussion.

Description

Some information needs to stay in sync with the blockchain state (that is, be accessible for any block), but does not require the proof property.

This PR intends to provide such a storage. Design goals are:

lower overhead than trie storage: we store key values directly (with the need for some history management: so double indexing and a bit of history reading).
sync of information at the granularity of an extrinsic call:
- support revert when evaluating an extrinsic, and state update for the next extrinsic
- support chain reorg

Basically the same thing as the existing state trie storage without the trie proof overhead.
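
A minimal sketch of these goals, to make the intended behaviour concrete. All names here (`KvStore`, `History`, `commit_extrinsic`, and so on) are hypothetical and are not the API of this PR; the sketch assumes a single canonical chain and leaves reorg handling out.

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical types for illustration only.
type BlockNumber = u64;

/// Committed values of one key, indexed by the block that wrote them.
/// `None` records a deletion at that block.
#[derive(Default)]
struct History(BTreeMap<BlockNumber, Option<Vec<u8>>>);

impl History {
    /// Latest value written at or before block `at`.
    fn get(&self, at: BlockNumber) -> Option<&Vec<u8>> {
        self.0.range(..=at).next_back().and_then(|(_, v)| v.as_ref())
    }
}

#[derive(Default)]
struct KvStore {
    /// Plain key to history mapping, no trie nodes involved.
    committed: HashMap<Vec<u8>, History>,
    /// Changes of extrinsics already applied in the block being built.
    block_changes: HashMap<Vec<u8>, Option<Vec<u8>>>,
    /// Changes of the extrinsic currently being evaluated.
    extrinsic_changes: HashMap<Vec<u8>, Option<Vec<u8>>>,
}

impl KvStore {
    fn set(&mut self, key: &[u8], value: Option<Vec<u8>>) {
        self.extrinsic_changes.insert(key.to_vec(), value);
    }

    /// Value as seen by the running extrinsic, on top of block `parent`.
    fn get(&self, key: &[u8], parent: BlockNumber) -> Option<Vec<u8>> {
        if let Some(v) = self.extrinsic_changes.get(key) {
            return v.clone();
        }
        if let Some(v) = self.block_changes.get(key) {
            return v.clone();
        }
        self.committed.get(key).and_then(|h| h.get(parent)).cloned()
    }

    /// The extrinsic succeeded: the next extrinsic sees its changes.
    fn commit_extrinsic(&mut self) {
        self.block_changes.extend(self.extrinsic_changes.drain());
    }

    /// The extrinsic failed: drop its changes only.
    fn revert_extrinsic(&mut self) {
        self.extrinsic_changes.clear();
    }

    /// End of block execution: push the block changes into the history.
    fn commit_block(&mut self, number: BlockNumber) {
        for (key, value) in self.block_changes.drain() {
            self.committed.entry(key).or_default().0.insert(number, value);
        }
    }
}
```

Reading a key at an older block returns whatever was written at or before that block, which is the "bit of history reading" mentioned in the goals.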

Use case

child trie ids:

Child trie key isolation can be based on the parent storage path, but this does not play well with move or delete child trie operations. So associating a unique id with every child trie seems needed (see #2209). Yet those unique ids are only an implementation detail and should not be part of the state (and with full storage with pruning they could not be; this case does not exist at this time, but it indicates a severe design limitation of #2209).
Using this storage for those child trie unique ids (or keyspaces) seems fine. For instance, a child trie deletion followed by a re-creation (at the same address in the same transaction) is doable: deletion registers the old unique id for pruning and deletes the association from the new storage; creation gets a new unique id for the new child trie, so no delta for pruning is needed (global id for full history with pruning, or index from the same storage if only using canonical).
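
The delete-then-recreate case can be sketched as follows. This is only an illustration under assumed names (`ChildTrieIds`, `KeySpace`, `to_prune` are hypothetical); in the PR the association would live in the new storage rather than in an in-memory map.

```rust
use std::collections::HashMap;

// Hypothetical types for illustration only.
type KeySpace = u64;

#[derive(Default)]
struct ChildTrieIds {
    /// Next unused unique id (a global counter in this sketch).
    next_id: KeySpace,
    /// Association from child trie storage path to its unique id.
    by_path: HashMap<Vec<u8>, KeySpace>,
    /// Unique ids whose content can be pruned later.
    to_prune: Vec<KeySpace>,
}

impl ChildTrieIds {
    /// Creation always gets a fresh id, even when the path was used before.
    fn create(&mut self, path: &[u8]) -> KeySpace {
        let id = self.next_id;
        self.next_id += 1;
        if let Some(old) = self.by_path.insert(path.to_vec(), id) {
            // Overwriting a live child trie: its old content becomes prunable.
            self.to_prune.push(old);
        }
        id
    }

    /// Deletion drops the association and registers the old id for pruning.
    fn delete(&mut self, path: &[u8]) -> Option<KeySpace> {
        let old = self.by_path.remove(path)?;
        self.to_prune.push(old);
        Some(old)
    }

    fn id_of(&self, path: &[u8]) -> Option<KeySpace> {
        self.by_path.get(path).copied()
    }
}
```

Because the re-created child trie writes under a different id, its keys never collide with the keys still waiting to be pruned under the old id.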

An alternative to get back to the #2209 state would be to use the storage cache, but it is a bit less flexible (no direct access from the state-machine overlay, so it would be temporary code until move or delete operations are really needed), and it is subject to exploit (as is the current implementation): it stores one rocksdb key value per block history as a linked list.

offchain local storage:

cc @todr, iirc there is such local storage in the offchain storage api, but it is not yet implemented.
It seems to me this answers the need for local offchain storage; it will require exposing the extrinsic (see TODOs), and probably handling bigger content (see TODOs on indexing, and maybe handling files as content). Also maybe adding the per-key lock that is in place in offchain storage and not needed in other use cases.

any technical info that can differ between clients, e.g. current use cases of storage-cache.
info that should remain identical between clients but does not fit a trie storage (we still need to sync some proof on the trie). This can even be extended to state storage with lazy proof access (or only a few recalculated trie levels).

TODOs in this PR

overlay db handle: push transaction at the end of a block execution
state-db branches local storage, usage of branch index
state-db handling tree history of data. Indexing by branch index and block number.
canonical storage: simple indexing by block number, no need for branch index.
client testing (canonical): client tests should be extended, notably to assert the block number sent from state-db.
clean up code.
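
To illustrate the "indexing by branch index and block number" item, here is a rough sketch of a lookup over tree-shaped history. The representation is an assumption made for this example, not the PR's data structure: `TreeHistory` and the fork path given as `(branch index, last block of that branch on the path)` pairs are hypothetical.

```rust
// Hypothetical types for illustration only.
type BranchIndex = u64;
type BlockNumber = u64;

/// All values written for one key across forks, tagged with the branch
/// and block that wrote them.
struct TreeHistory(Vec<(BranchIndex, BlockNumber, Vec<u8>)>);

impl TreeHistory {
    /// `path` describes the fork being queried, from the most recent branch
    /// back to the oldest one. Each entry gives a branch index and the last
    /// block number of that branch that belongs to the queried fork.
    fn get(&self, path: &[(BranchIndex, BlockNumber)]) -> Option<&Vec<u8>> {
        for &(branch, upper) in path {
            // Most recent write on this branch that is part of the fork.
            let best = self
                .0
                .iter()
                .filter(|(b, n, _)| *b == branch && *n <= upper)
                .max_by_key(|(_, n, _)| *n);
            if let Some((_, _, value)) = best {
                return Some(value);
            }
        }
        None
    }
}
```

For canonical storage there is a single branch, so the same lookup reduces to "latest value at or before a block number", which is why no branch index is needed there.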

TODOs for other PR

Those TODOs, even if a bit too much to be reviewable in this PR, are strictly needed for this storage.

lazy pruning key set (no need to keep the set in memory). Additionally we may be pruning too often: since we do not have to maintain a deathrow delta, less frequent pruning can be fine.
Genesis storage initialization (may be needed for client test) and 'reset_storage'
Serialized implementation scaling: do not store the full history in one rocksdb value (this obviously does not scale), but implement two strategies:
- linear: size-limited range index and value: linked list of ranges, with values possibly included in the range.
- mmr: ranges indexed as in an MMR, which allows quick access to the latest blocks and significantly reduces accesses for very big histories (e.g. a block index). Size limits similar to the previous impl (may need to allow splitting mmr node indexes too).
- bench with different size limits.
Full history mode (archive all):
- serialized with history: likely a first indexing by branch index and canonical only indexing behind it.
- storing the association block -> branch index: that was done in one of my previous branches (client-db-ix), where I mistakenly implemented branch index management at client level.
- storing the last branch index at client level.
- transmitting the branch index in the commit set of state-db and initiating state-db at the right branch index
- this mode is way less efficient than the standard canonical-only mode, so there could be a pluggable canonical-only storage mode (same as current) as a complement to the full trie indexed one.
storage cache: quite needed.
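
The "linear" strategy from the list above could look roughly like the sketch below: the history of one key is split into size-limited chunks, each one pointing to the previous chunk, so that no single db value grows without bound. Names (`LinearHistory`, `Chunk`, `max_per_chunk`) are hypothetical, a `HashMap` stands in for rocksdb, and writes are assumed to arrive in ascending block order. The mmr variant is not shown.

```rust
use std::collections::HashMap;

// Hypothetical types for illustration only.
type BlockNumber = u64;
type ChunkId = u64;

/// One size-limited range of the history; would be one db value.
struct Chunk {
    /// First block covered by this chunk.
    start: BlockNumber,
    /// Values in ascending block order.
    values: Vec<(BlockNumber, Vec<u8>)>,
    /// Link to the chunk holding older values.
    prev: Option<ChunkId>,
}

struct LinearHistory {
    max_per_chunk: usize,
    /// Stand-in for the key value db.
    chunks: HashMap<ChunkId, Chunk>,
    head: Option<ChunkId>,
    next_id: ChunkId,
}

impl LinearHistory {
    fn new(max_per_chunk: usize) -> Self {
        LinearHistory { max_per_chunk, chunks: HashMap::new(), head: None, next_id: 0 }
    }

    /// Append a value; starts a new chunk when the head chunk is full.
    fn push(&mut self, at: BlockNumber, value: Vec<u8>) {
        let needs_new = match self.head.and_then(|h| self.chunks.get(&h)) {
            Some(chunk) => chunk.values.len() >= self.max_per_chunk,
            None => true,
        };
        if needs_new {
            let id = self.next_id;
            self.next_id += 1;
            self.chunks.insert(id, Chunk { start: at, values: Vec::new(), prev: self.head });
            self.head = Some(id);
        }
        let head = self.head.expect("head is set above");
        self.chunks.get_mut(&head).expect("head chunk exists").values.push((at, value));
    }

    /// Latest value at or before `at`, walking back from the newest chunk.
    fn get(&self, at: BlockNumber) -> Option<&Vec<u8>> {
        let mut cursor = self.head;
        while let Some(id) = cursor {
            let chunk = self.chunks.get(&id)?;
            if chunk.start <= at {
                return chunk.values.iter().rev().find(|(n, _)| *n <= at).map(|(_, v)| v);
            }
            cursor = chunk.prev;
        }
        None
    }
}
```

Queries for recent blocks touch only the head chunk; queries deep in the past walk the whole list, which is the cost the mmr indexing is meant to reduce.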

Optional TODOs

Api similar to offchain, add a prefix, have reserved columns for some static prefix (or add both column and prefix).
- means that result transaction will also be grouped by those.
- means adding extrinsics to query the storage (such extrinsics will probably be needed soon for testing, but gated behind a feature).
full history with pruning: currently all non-canonical state is only used in state-db, in memory. Implementing full history for this storage basically means that those keys do not have to be in memory until canonicalisation. This is not currently possible for the state-db trie because it would result in key collisions. Still, a costly approach would be to extend the trie library to add the pair (branch number, block number) to the encoded nodes in the underlying db (we already prefix with the node partial path). A stored branch would then become [Encoded branch ++ array of child (branch ix + block ix)]. This would not change hash calculation, but would certainly be a bit costly memory-wise (two u64 per child, a bit less with small optimisations, such as treating undefined trailing children as identical to the previous one). This could be used along with a canonical compact encoding, but it would involve a pretty heavy canonicalisation process (rewriting all branch pointers and node indexes for canonicalized content).
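
As a rough illustration of the [Encoded branch ++ array of child (branch ix + block ix)] layout, the sketch below appends two u64 per child after the encoded branch. The byte layout (little-endian integers and a trailing child count byte used to find the split point) is an assumption made for this example only; the function names are hypothetical and this is not how the trie library encodes nodes.

```rust
// Hypothetical layout for illustration only:
// stored = encoded_branch ++ (branch ix, block ix) per child ++ child count.
type ChildIndex = (u64, u64);

fn encode_stored(encoded_branch: &[u8], children: &[ChildIndex]) -> Vec<u8> {
    assert!(children.len() <= 16, "a radix 16 branch has at most 16 children");
    let mut out = encoded_branch.to_vec();
    for (branch_ix, block_ix) in children {
        out.extend_from_slice(&branch_ix.to_le_bytes());
        out.extend_from_slice(&block_ix.to_le_bytes());
    }
    out.push(children.len() as u8);
    out
}

/// Splits a stored value back into the part that is hashed (the plain
/// encoded branch) and the per-child index.
fn decode_stored(stored: &[u8]) -> Option<(&[u8], Vec<ChildIndex>)> {
    let (&count, rest) = stored.split_last()?;
    let index_len = count as usize * 16;
    if rest.len() < index_len {
        return None;
    }
    let (encoded_branch, index) = rest.split_at(rest.len() - index_len);
    let children: Vec<ChildIndex> = index
        .chunks_exact(16)
        .map(|c| {
            let mut branch_ix = [0u8; 8];
            let mut block_ix = [0u8; 8];
            branch_ix.copy_from_slice(&c[..8]);
            block_ix.copy_from_slice(&c[8..]);
            (u64::from_le_bytes(branch_ix), u64::from_le_bytes(block_ix))
        })
        .collect();
    Some((encoded_branch, children))
}
```

Since the hash would still be computed over the encoded branch part alone, the appended index costs 16 bytes per child in the db without affecting node hashes.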

Also this pr could be split for review:

overlay (all changes in state-machine)
statedb (state-db changes with tree historied data)
client code (client with linear historied data)

This could also be organized into a clean 3-commit history.

complete some test method by providing child access to values.

is. Need to use historied values, maybe remove pinned, and put some context as parameter for get_offstate

pin).

probably need to store state, apply gc on state and depending on result restore state or actually gc historied values.

should get updated.

for gc. Need to change commit set to contain an history of values and adjust test db.

gui1117 · 2019-10-28T11:57:02Z