
Enable getting a proof for non-current account states #527

Open
igamigo opened this issue Oct 24, 2024 · 19 comments

@igamigo
Collaborator

igamigo commented Oct 24, 2024

Currently, the GetAccountProofs RPC method provides a way to retrieve public account data, alongside a Merkle proof of the account's inclusion in a block. Specifically, you can get the account header representing the account's state, and an account storage header, which contains all top-level elements (meaning either values or roots of maps).

This proof is currently generated exclusively for the chain tip. However, realistically, a user will not always have the latest block when executing a transaction that uses FPIs. To address this, the current endpoint could be updated to return proofs for an arbitrary block (close to the chain tip), making transaction execution easier to set up on the user side.

@bobbinth
Contributor

I think the same strategy should be applied to the GetAccountDetailsRequest request. Though, maybe this should be a separate issue.

@bobbinth bobbinth added this to the v0.7 milestone Oct 24, 2024
@polydez polydez self-assigned this Nov 11, 2024
@polydez polydez moved this to In Progress in User's testnet Nov 11, 2024
@polydez
Contributor

polydez commented Nov 12, 2024

@igamigo I have a concern, or rather need additional clarification, regarding this issue. If the account was changed after the requested block, is the obsolete account state still useful for the transaction? That is, should we rather return an error (something like "account was updated") in this case? In my view, the result is useful for transaction execution only when the latest account update happened before the requested block. If so, this would significantly simplify the overall solution, since we would not need to track account changes, only proofs.

@igamigo
Collaborator Author

igamigo commented Nov 12, 2024

In the context of transaction execution, there are a couple of things to keep in mind:

  • The network/protocol might have a specific block delta after which transaction might be considered outdated (Require transactions to refer to recent blocks miden-base#933).
  • Through the account's contracts, the creator of an account can decide up to which block delta its state is valid (See this). In this sense, if the user executes a transaction that uses FPI with outdated data, the transaction will immediately be considered expired and thus invalid.

Additionally, the client might have extra validations of its own to avoid retrieving stale data, etc.

But in general, I think there can be cases where the user retrieves a state that is not the latest and could still be considered valid (though there should probably be a hard limit after which it does not make sense to retrieve account proofs). I am also not sure whether there will be any other use cases for this endpoint (or GetAccountDetailsRequest).

@polydez
Contributor

polydez commented Nov 13, 2024

For implementing this we actually have to solve two problems:

  • How to get full account state for specific block number?
  • How to get account inclusion proof for specific block number?

I've tried different ideas and approaches, and I think I've ended up with a solution that is the simplest to implement now and yet rather effective.

Getting Account State

Currently we store the latest account state and all account state deltas from the very beginning of the account's lifetime. Given the initial account state, we can transition it by sequentially applying deltas to its intermediate states.

There are two feasible solutions (without a complete account storage redesign) to get an account's state prior to the latest update:

  1. Store some intermediate account states (not older than 256 blocks or so). This is difficult because we can't predict future updates, and it might also require plenty of additional space and sophisticated algorithms for keeping only the needed states and pruning excessive updates. But having one of the states before the requested block number gives us the ability to transition the account's state to the requested one by sequentially applying deltas.
  2. Use the information we already have (the latest state and deltas) to reconstruct the account's state at any point in time by applying deltas in reverse order. The only difficulties I can see here are that we need to redesign the storage delta to make it reversible, probably add checks that the nonce in each delta immediately follows the previous one, and implement reverse application, of course. But it seems easier and more effective in most cases than saving previous account states.
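As a rough sketch of option 2, a reversible delta could carry both old and new values and be applied backwards with a nonce-continuity check. All type and field names below are hypothetical stand-ins, not the actual miden types:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins: a reversible delta stores both old and new values
// so it can be applied in either direction.
struct StorageDelta {
    slot: u8,
    old_value: u64,
    new_value: u64,
}

struct AccountDelta {
    old_nonce: u64,
    new_nonce: u64,
    storage: Vec<StorageDelta>,
}

struct AccountState {
    nonce: u64,
    storage: BTreeMap<u8, u64>,
}

impl AccountState {
    /// Roll the state back by one delta, checking nonce continuity first.
    fn apply_reverse(&mut self, delta: &AccountDelta) -> Result<(), String> {
        if self.nonce != delta.new_nonce {
            return Err(format!(
                "nonce mismatch: state is at {}, delta ends at {}",
                self.nonce, delta.new_nonce
            ));
        }
        for d in &delta.storage {
            // Restore the value the slot had before this delta was applied.
            self.storage.insert(d.slot, d.old_value);
        }
        self.nonce = delta.old_nonce;
        Ok(())
    }
}

fn main() {
    // Latest state: nonce 2, slot 0 holds 7.
    let mut state = AccountState { nonce: 2, storage: BTreeMap::from([(0u8, 7u64)]) };
    // The delta that moved the account from nonce 1 (slot 0 = 3) to nonce 2.
    let delta = AccountDelta {
        old_nonce: 1,
        new_nonce: 2,
        storage: vec![StorageDelta { slot: 0, old_value: 3, new_value: 7 }],
    };
    state.apply_reverse(&delta).unwrap();
    assert_eq!(state.nonce, 1);
    assert_eq!(state.storage[&0], 3);
}
```

The nonce check is what makes chained reverse application safe: each delta can only be undone from the exact state it produced.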

Getting Inclusion Proof

Currently we have an in-memory sparse Merkle tree which stores the latest account hash for each account in the blockchain. On each new block, it's updated by computing the update first and then applying it to the current tree.

If we try to avoid ending up with a completely different account hash storage structure, there are a few solutions I can see:

  1. Clone the whole tree every block and keep all copies for the required amount of time (the last $N$ blocks). This requires additional RAM, but might be acceptable at least for current loads.
  2. Keep one previous state ($N$ blocks back) and all computed updates for this range. We can transition the $tip - N - 1$ state to the $tip - N$ state by applying the corresponding update (another solution would be to keep only the state for every $N$th block and purge states older than $N × 2$, but I think applying updates is not too expensive).
  3. Keep only the updates for the last $N$ blocks and apply them in reverse order on each request. To implement this, we would need to support updating the SMT in reverse order, and probably adjust the change sets themselves if they are not ready for such an operation (but it seems they will work).

I think the third solution might be a good compromise for us now.

@Mirko-von-Leipzig
Contributor

Mirko-von-Leipzig commented Nov 13, 2024

Getting Inclusion Proof
...

There is a bit of pain associated with (3). More specifically, the current forward delta is not enough: you need to know the reverse delta, i.e. what value each node started with. It is possible, though I don't think it's super trivial. It's also fairly expensive computationally since you are recomputing hashes (?), unless the deltas are just hash updates.

There is a fourth option, iiuc: cache the complete account proofs for all recently updated accounts. The benefit is that it doesn't involve the original tree at all. The downside is that it's a bit wasteful in terms of memory usage, at least if implemented naively. E.g.:

If account x is updated at block y, then we need to store x's account proofs from y..=N (N being the current chain tip). Once N - y > limit, we drop the proofs associated with y. The account can of course get more updates y' etc., but those would have their own N - y' > limit checks.

Memory usage can be improved by creating a proper tree, with Arc<Node> etc. The nice thing about that is that pruning becomes automatic.

Effectively we would have the latest tree, and a separate sparse caching tree structure that only has the updated account proofs.
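A naive version of this per-account proof cache might look like the following sketch (all names are illustrative; `Proof` stands in for a real Merkle authentication path):

```rust
use std::collections::BTreeMap;

type AccountId = u64;
type BlockNum = u32;
// Stand-in for a real Merkle authentication path.
type Proof = Vec<u64>;

/// Naive sketch of the "fourth option": cache the full proof of every
/// recently updated account, keyed by (account, block), and drop entries
/// once they fall more than `limit` blocks behind the tip.
struct ProofCache {
    limit: u32,
    proofs: BTreeMap<(AccountId, BlockNum), Proof>,
}

impl ProofCache {
    fn insert(&mut self, account: AccountId, block: BlockNum, proof: Proof) {
        self.proofs.insert((account, block), proof);
    }

    /// Drop every cached proof older than `tip - limit`.
    fn prune(&mut self, tip: BlockNum) {
        let cutoff = tip.saturating_sub(self.limit);
        self.proofs.retain(|&(_, block), _| block >= cutoff);
    }

    fn get(&self, account: AccountId, block: BlockNum) -> Option<&Proof> {
        self.proofs.get(&(account, block))
    }
}

fn main() {
    let mut cache = ProofCache { limit: 2, proofs: BTreeMap::new() };
    cache.insert(1, 10, vec![0xaa]);
    cache.insert(1, 11, vec![0xab]);
    cache.insert(2, 8, vec![0xbb]);
    cache.prune(11); // cutoff is block 9: account 2's block-8 proof is dropped
    assert!(cache.get(2, 8).is_none());
    assert_eq!(cache.get(1, 10), Some(&vec![0xaa]));
}
```

The `Arc<Node>` variant would replace the flat map with a shared-subtree structure so identical proof segments are stored once and pruned automatically when the last root referencing them is dropped.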

@polydez
Contributor

polydez commented Nov 13, 2024

There is a bit of pain associated with (3). More specifically, the current forward delta is not enough: you need to know the reverse delta, i.e. what value each node started with.

Yes, but since we need this only for reverse-order application, we can use the same delta format but compute the deltas differently (i.e. in addition to the current compute_mutations, we can implement something like compute_reverse_mutations, which would produce a mutation that, once applied, reverts the SMT to its previous state).

There is a fourth option iiuc. Cache the complete account proofs for all recently updated accounts.

I hadn't thought of it that way, thank you! I will think about it.

Memory usage can be improved by creating a proper tree, with Arc<Node> etc. The nice thing about that would be that pruning is automatic.

Yes, that's the direction I was thinking in when I tried to come up with a "perfect" solution for this task. My idea was to build multiple SMTs which reuse the same subtrees when they have equal hashes. Each root would be the accounts root for a particular block number. Pruning old roots would then prune unused nodes as well. But the main difficulty was how to access account hashes by account ID, and I haven't found an efficient solution for that yet.

@polydez
Contributor

polydez commented Nov 14, 2024

Cache the complete account proofs for all recently updated accounts.

I was thinking about it, and there is an issue with the other accounts. When one account is changed, its subtree has to be recalculated up to the root, and this affects the proofs of all other accounts. For example, if account 1 was updated in block #10 and account 2 was updated in block #11, and the proof of account 1 is requested for block #12, this proof will be different from the one for block #10.

@Mirko-von-Leipzig
Contributor

Cache the complete account proofs for all recently updated accounts.

I was thinking about it, and there is an issue with the other accounts. When one account is changed, its subtree has to be recalculated up to the root, and this affects the proofs of all other accounts. For example, if account 1 was updated in block #10 and account 2 was updated in block #11, and the proof of account 1 is requested for block #12, this proof will be different from the one for block #10.

Indeed you'll have to store the account proof per block for an account until it exceeds the limit.

@polydez
Contributor

polydez commented Nov 14, 2024

Indeed you'll have to store the account proof per block for an account until it exceeds the limit.

My point was that it's not enough to store proofs only for updated accounts, because the proofs of unchanged accounts are also affected by any blockchain state change. In other words, an update of one account affects the proofs of all accounts.

@Mirko-von-Leipzig
Contributor

Indeed you'll have to store the account proof per block for an account until it exceeds the limit.

My point was that it's not enough to store proofs only for updated accounts, because the proofs of unchanged accounts are also affected by any blockchain state change. In other words, an update of one account affects the proofs of all accounts.

It does, but you only need to cache the recently updated accounts. For every account x updated in block y, you'll have to store the proof for x from block y..tip until tip - y > limit.

But in general, all solutions are just different ways of expressing the same subgraph. My suggestion is just a de-normalized variation where you flatten the graph, i.e. the least efficient in terms of memory usage, but also low effort in terms of implementation.

@polydez
Contributor

polydez commented Nov 18, 2024

Taking into account the ideas from our call, we ended up with the following solutions for the corresponding problems:

Getting Account State

Currently we store the latest account state and all account state deltas from the very beginning of the account's lifetime. We store deltas in a single account_deltas table in binary-encoded form.

Let's refactor the deltas to make them reversible (i.e. store old values along with new ones). We should also store deltas in a set of separate tables, so that it becomes possible to calculate a single delta from block A to block B using plain SQL queries:

CREATE TABLE
    account_deltas
(
    account_id  INTEGER NOT NULL,
    block_num   INTEGER NOT NULL,
    old_nonce   INTEGER NOT NULL,
    new_nonce   INTEGER NOT NULL,

    PRIMARY KEY (account_id, block_num),
    FOREIGN KEY (account_id) REFERENCES accounts(account_id),
    FOREIGN KEY (block_num) REFERENCES block_headers(block_num),
    CONSTRAINT account_deltas_nonce_increased CHECK (old_nonce < new_nonce)
) STRICT;


CREATE TABLE
    account_storage_delta_values
(
    account_id  INTEGER NOT NULL,
    block_num   INTEGER NOT NULL,
    slot        INTEGER NOT NULL,
    old_value   BLOB    NOT NULL,
    new_value   BLOB    NOT NULL,

    PRIMARY KEY (account_id, block_num, slot),
    FOREIGN KEY (account_id) REFERENCES accounts(account_id),
    FOREIGN KEY (block_num) REFERENCES block_headers(block_num),
    CONSTRAINT account_storage_delta_value_updated CHECK (old_value != new_value)
) STRICT, WITHOUT ROWID;

CREATE TABLE
    account_storage_delta_map_values
(
    account_id  INTEGER NOT NULL,
    block_num   INTEGER NOT NULL,
    slot        INTEGER NOT NULL,
    key         BLOB    NOT NULL,
    old_value   BLOB    NOT NULL,
    new_value   BLOB    NOT NULL,

    PRIMARY KEY (account_id, block_num, slot, key),
    FOREIGN KEY (account_id) REFERENCES accounts(account_id),
    FOREIGN KEY (block_num) REFERENCES block_headers(block_num),
    CONSTRAINT account_storage_delta_map_value_updated CHECK (old_value != new_value)
) STRICT;

CREATE TABLE
    account_fungible_asset_deltas
(
    account_id  INTEGER NOT NULL,
    block_num   INTEGER NOT NULL,
    faucet_id   INTEGER NOT NULL,
    delta       INTEGER NOT NULL,

    PRIMARY KEY (account_id, block_num, faucet_id),
    FOREIGN KEY (account_id) REFERENCES accounts(account_id),
    FOREIGN KEY (block_num) REFERENCES block_headers(block_num)
) STRICT, WITHOUT ROWID;

CREATE TABLE
    account_non_fungible_asset_delta_actions
(
    account_id  INTEGER NOT NULL,
    block_num   INTEGER NOT NULL,
    asset       BLOB NOT NULL,
    is_remove   INTEGER NOT NULL, -- 0 - add, 1 - remove

    PRIMARY KEY (account_id, block_num, asset),
    FOREIGN KEY (account_id) REFERENCES accounts(account_id),
    FOREIGN KEY (block_num) REFERENCES block_headers(block_num)
) STRICT, WITHOUT ROWID;

So, the procedure for getting the account state at a requested block number will look like:

  1. Get the latest account state.
  2. Compute the reverse delta from the latest account state to the requested one by a series of SQL queries (hopefully around 5 queries).
  3. Apply the reverse delta to the latest account state.
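For step 2, the per-slot part of the reverse delta boils down to picking, for each slot changed in the range, the old_value of its earliest change. A sketch of that merge in Rust, operating on rows as a query over the hypothetical account_storage_delta_values table might return them:

```rust
use std::collections::BTreeMap;

/// (slot, block_num, old_value) rows for one account, in any order. The
/// reverse delta restores, for each touched slot, the old_value of its
/// EARLIEST change in the range (the same thing a GROUP BY slot with
/// MIN(block_num) style query would select).
fn reverse_slot_delta(rows: &[(u8, u32, u64)]) -> BTreeMap<u8, u64> {
    let mut earliest: BTreeMap<u8, (u32, u64)> = BTreeMap::new();
    for &(slot, block, old_value) in rows {
        match earliest.get(&slot) {
            // An earlier (or equal) change is already recorded: keep it.
            Some(&(b, _)) if b <= block => {}
            _ => {
                earliest.insert(slot, (block, old_value));
            }
        }
    }
    earliest.into_iter().map(|(slot, (_, v))| (slot, v)).collect()
}

fn main() {
    // Slot 0 changed at blocks 10 and 12; slot 1 changed at block 11.
    let rows = [(0u8, 10u32, 5u64), (0, 12, 9), (1, 11, 2)];
    let delta = reverse_slot_delta(&rows);
    // Rolling back the whole range restores the pre-block-10 values.
    assert_eq!(delta[&0], 5);
    assert_eq!(delta[&1], 2);
}
```

The map-value and asset tables would merge analogously (earliest old_value per key; summed negated deltas for fungible assets).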

Getting Inclusion Proof

Currently we have an in-memory sparse Merkle tree which stores the latest account hash for each account in the blockchain. On each new block, it's updated by computing the update first and then applying it to the current tree. When we need an account inclusion proof, we traverse the tree from the account's leaf up to the root, adding the hash of each visited node to a list.

We can keep the computed updates for the last $N$ blocks in order to be able to get an inclusion proof for any account in any of the last $N$ blocks. This will require significantly less space than keeping full copies of the account SMT for the last $N$ blocks. But we will also need to keep the $(tip - N)$th block state and advance it with every newly computed update (another approach would be to keep an SMT for every $N$th block and purge data from $tip - 2N$ to $tip - N$, but this would require about twice the memory).

So, the algorithm for getting an account inclusion proof for a requested block number will be:

  1. Get the update for the requested block.
  2. Starting from the root of the update, descend towards the searched account ID: if the account ID is in the left subtree, go left; if it is in the right subtree, go right.
  3. If there is no corresponding node in the current update, look in the previous update. Repeat, falling back update by update, down to the full SMT if necessary.
  4. When the account's leaf is found, append it to the list and output the list in reverse order.
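The fallback in steps 1-3 amounts to a layered lookup: consult change sets from the requested block backwards, and fall through to the full SMT for untouched nodes. A minimal sketch with hypothetical stand-in types:

```rust
use std::collections::{BTreeMap, HashMap};

type NodeIndex = (u8, u64); // (depth, index); illustrative, not the real type
type Digest = u64;          // stand-in for a real hash digest

struct LayeredSmt {
    /// Full SMT as of the oldest retained block (tip - N).
    base: HashMap<NodeIndex, Digest>,
    /// Per-block change sets, keyed by block number.
    updates: BTreeMap<u32, HashMap<NodeIndex, Digest>>,
}

impl LayeredSmt {
    /// Value of a node as of `block`: the newest update at or before `block`
    /// wins; fall through to the base tree if no update touched the node.
    fn node_at(&self, block: u32, idx: NodeIndex) -> Option<Digest> {
        for (_, changes) in self.updates.range(..=block).rev() {
            if let Some(&d) = changes.get(&idx) {
                return Some(d);
            }
        }
        self.base.get(&idx).copied()
    }
}

fn main() {
    let smt = LayeredSmt {
        base: HashMap::from([((1u8, 0u64), 100u64)]),
        updates: BTreeMap::from([
            (10u32, HashMap::from([((1u8, 0u64), 200u64)])),
            (12u32, HashMap::from([((1u8, 0u64), 300u64)])),
        ]),
    };
    assert_eq!(smt.node_at(9, (1, 0)), Some(100)); // older than any retained update
    assert_eq!(smt.node_at(11, (1, 0)), Some(200)); // block 10's change is newest
    assert_eq!(smt.node_at(12, (1, 0)), Some(300));
}
```

Collecting the proof is then just calling `node_at` for each sibling along the path from the leaf to the root.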

UPD: I think that, if this solution performs well, we won't even need to keep the latest-state SMT; it will be enough to have the "initial" ($tip - N$) state and all updates after that state.

IMO, this is the most memory- and computation-efficient solution so far.

@bobbinth
Contributor

  • Get the latest account state.
  • Compute the reverse delta from the latest account state to the requested one by a series of SQL queries (hopefully around 5 queries).
  • Apply the reverse delta to the latest account state.

I think this works. A couple of comments:

  • We may be able to batch all 5 queries into a single database request - though I'm not sure how complicated that is with SQLite. Also, in the future, we could probably move these tables into a separate DB altogether to reduce the load on the main DB.
  • I noticed that in some tables we keep both the old value and the new value. What's the purpose of the old value (since it should usually be present in the previous record for the same element)?

So, an algorithm of getting account inclusion proof for requested block number will be:

  1. Get the update for the requested block.
  2. Starting from the root of the update, descend towards the searched account ID: if the account ID is in the left subtree, go left; if it is in the right subtree, go right.
  3. If there is no corresponding node in the current update, look in the previous update. Repeat, falling back update by update, down to the full SMT if necessary.
  4. When the account's leaf is found, append it to the list and output the list in reverse order.

I think this works. There could also be other ways to keep track of this data. For example, we could merge all change sets into a single data structure - something like:

BTreeMap<NodeIndex, Vec<(u32, Digest)>>

Where the internal tuples are (block_num, node). This may be more efficient from a memory-access standpoint, but probably also more complicated when removing stale data.
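A lookup over this merged structure could be a binary search by block number within each node's history; a sketch with stand-in types:

```rust
use std::collections::BTreeMap;

type NodeIndex = u64; // flattened node index; illustrative
type Digest = u64;    // stand-in for a real hash digest

/// Merged change sets: for each node, a block-ordered history of values.
struct NodeHistory(BTreeMap<NodeIndex, Vec<(u32, Digest)>>);

impl NodeHistory {
    /// Latest value of `idx` at or before `block`. Each Vec is kept sorted
    /// by block number, so a binary search (partition point) suffices.
    fn node_at(&self, idx: NodeIndex, block: u32) -> Option<Digest> {
        let history = self.0.get(&idx)?;
        let pos = history.partition_point(|&(b, _)| b <= block);
        pos.checked_sub(1).map(|i| history[i].1)
    }
}

fn main() {
    let history = NodeHistory(BTreeMap::from([(7u64, vec![(10u32, 200u64), (12, 300)])]));
    assert_eq!(history.node_at(7, 9), None); // node not changed in the retained range
    assert_eq!(history.node_at(7, 11), Some(200));
    assert_eq!(history.node_at(7, 12), Some(300));
}
```

Pruning is the awkward part mentioned above: dropping block numbers below a cutoff means draining the front of every Vec.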

Overall, I'd probably abstract this away behind a new struct - maybe something like this:

pub struct AccountTree {
    tree: Smt,
    updates: SmtUpdates,
}

In terms of PRs, I think we actually have 2 PRs here:

  1. Update the account delta structure.
  2. Update the account tree.

I'd probably start with the delta structure as we need it for other purposes too.

@polydez
Contributor

polydez commented Nov 19, 2024

  • We may be able to batch all 5 queries into a single database request - though, not sure how complicated it is with SQLite. Also, in the future, we could probably move these tables into a separate DB altogether to reduce the load on the main DB.

We can join the different tables in a single request, maybe UNION them, but I'm not sure it will improve performance. I will think about it more deeply while implementing the merging of deltas.

  • I noticed that in some tables we keep both old value and new value. What's the purpose of the old value (since it should be usually present in the previous record for the same element)?

That's a good idea, and in our current solution it should work, but we first need to decide whether we keep all deltas from the very beginning or prune deltas older than some number of blocks.

I think this works. There could also be other ways to keep track of this data. For example, we could merge all changesets into a single data structure - something like:

BTreeMap<NodeIndex, Vec<(u32, Digest)>>

Where the internal tuples are (block_num, node). This may be more efficient from the memory access standpoint, but also probably more complicated to remove stale data.

Thank you! I totally agree: this will require more operations for pruning old data, especially moving data within the vectors. I will think about some optimizations here.

@igamigo
Collaborator Author

igamigo commented Dec 9, 2024

Regarding #563 (comment), here are some of the ways we currently use these endpoints and how retrieving old account states would come into play:

  • As mentioned in the original description, GetAccountProofs is currently used to retrieve an account's state at the chain tip, specifically for FPI transactions. Clients will not necessarily have the block header information for that chain tip, as header/MMR data is retrieved (pretty much) only during syncs. What this means currently is that FPI transactions imply retrieving block data separately (through a sync and GetBlockHeader calls), so having a way to specify an older block for which to retrieve the account state would simplify this.
  • There's also the GetAccountDetails endpoint, which @bobbinth mentioned as another candidate for this refactor (I'm not entirely sure what his original intent was for this request specifically). We currently use it in two places:
    • If a private account has been locked (i.e., the local hash has deviated from that of the node) and the user imported the actual state of the account, we check the imported state's hash against the network hash. If it does not match, the import does not go through.
    • During sync_state, we send all account IDs that we are tracking (private or public) and receive a list of (AccountId, AccountHashUpdate) pairs. From this response, we call GetAccountDetails to retrieve the new account state. AFAIK, the node only keeps the latest state of public accounts, so this does not really apply until the client has synced to the last block where each public account was updated.
      • Reviewing this, we now have the GetAccountDeltas endpoint (I did not know it was in the RPC set of requests; I thought it was store-only). I wonder if we should progressively fetch deltas for public accounts using this instead of retrieving the full account object.

@bobbinth
Contributor

Seems like overall we have a need for 3 actions:

  1. Get account inclusion proof for some account state close to the chain tip. This should work for both public and private accounts.
  2. Get account deltas for a given account between blocks $x$ and $y$.
  3. Get full account data for some account state close to the chain tip. This should work only for public accounts.

For the first point, we should be able to come up with a decent solution and I know @polydez is currently working on it.

For the second point, we kind of already have a solution - though one thing that bothers me is how to keep the size of the returned deltas reasonable (i.e., if a lot of updates were made to the account between blocks $x$ and $y$, we may end up sending large amounts of data). Maybe we should think about refactoring this somehow.

For the third point, the solution we currently have is to return the entire account data, and only for the chain tip. We need to think about how to refactor this, as both aspects could be issues (the account may be too big to return in a single request, but if we return the account data over multiple requests, the account state may change between them).

The biggest issue here is mostly dealing with storage maps and the account vault, as all other data takes up at most 8KB per account, so in theory we could store multiple versions of it for states close to the chain tip. How to store/retrieve account vault and storage map data is still an open question for me.

@polydez
Contributor

polydez commented Dec 10, 2024

@igamigo, @bobbinth, thank you for sharing your vision!

  1. Get full account data for some account state close to the chain tip. This should work only for public accounts.

We can use the solution from #563, but in order to limit computation we should also add purging of obsolete deltas on each block application. This can be achieved by executing queries like "remove all values for blocks older than $tip - N$ which have newer updates also before $tip - N$", where $N$ is the number of latest blocks for which we want to support getting public account states.

This solution has several advantages over the rewrite we discussed before:

  1. It's almost done; only purging is left, and we can implement it in a separate PR.
  2. Purging old deltas will reduce the growth of the database.
  3. Calculating the state from the beginning up to a given block is simpler than calculating and applying backward deltas from the chain tip down to the given block, which would also require implementing reverse delta application.
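The purging rule (drop pre-cutoff deltas that are superseded by a newer pre-cutoff delta for the same slot, keeping the newest one as the reconstruction base) can be sketched in memory as follows; the key layout is hypothetical:

```rust
use std::collections::BTreeMap;

type Slot = u8;
type BlockNum = u32;

/// Purging sketch over an in-memory stand-in for the delta tables: keep every
/// delta at or after the cutoff, plus the single newest pre-cutoff delta per
/// slot (it remains the base needed to reconstruct states in the window).
fn purge(deltas: &mut BTreeMap<(Slot, BlockNum), u64>, tip: BlockNum, n: BlockNum) {
    let cutoff = tip.saturating_sub(n);
    // Find the newest pre-cutoff block per slot.
    let mut keep_base: BTreeMap<Slot, BlockNum> = BTreeMap::new();
    for &(slot, block) in deltas.keys() {
        if block < cutoff {
            let e = keep_base.entry(slot).or_insert(block);
            if block > *e {
                *e = block;
            }
        }
    }
    deltas.retain(|&(slot, block), _| block >= cutoff || keep_base.get(&slot) == Some(&block));
}

fn main() {
    let mut deltas =
        BTreeMap::from([((0u8, 5u32), 1u64), ((0, 8), 2), ((0, 12), 3), ((1, 3), 9)]);
    purge(&mut deltas, 20, 10); // cutoff is block 10
    // (0, 5) is superseded by (0, 8) before the cutoff and gets removed.
    assert!(!deltas.contains_key(&(0, 5)));
    assert!(deltas.contains_key(&(0, 8)));
    assert!(deltas.contains_key(&(0, 12)));
    assert!(deltas.contains_key(&(1, 3)));
}
```

The SQL version would express the same condition as a DELETE with a correlated subquery finding a newer pre-cutoff row for the same slot.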

@bobbinth:

For the third point, the solution we currently have is to return the entire account data and only for the chain tip. We need to think of how to refactor this as both of these could be issues (the account may be too big to return in a single request, but if we return the account data over multiple requests, the account state may change between these requests).

For the proposed solution, the only difficulty I see is that the account state might be too big to return in a single request. We can limit the response size and introduce paging, with caching of generated responses for a short period of time (e.g. 20 seconds or so after the last request; we would probably also want to share cached results between similar requests from different users).

There is also no problem with account state changes during the fetching of pages: the user requests the account state for a specified block number, and the only thing we take from the latest state is the account's code, which is immutable for now. Once we add support for mutable account code, we will store code updates in the account deltas table.

@bobbinth
Contributor

the only difficulty I see is that the account state might be too big to return in a single request. We can limit the response size and introduce paging, with caching of generated responses for a short period of time

Right, this is my biggest concern: how do we deal with accounts whose state is big (e.g., 100 MB or even 1 GB)? We could use pagination, but what would we paginate over, and how much overhead would supporting pagination add? If we break the account down, we have several components:

  1. Account header - this is pretty small (~128 bytes).
  2. Storage header - usually pretty small but in the worst case around 8KB.
  3. Code - I think we'll have an upper bound on this (I believe we discussed something like 32KB).
  4. Storage maps - potentially many maps, each theoretically unbounded in size.
  5. Asset vault - theoretically unbounded in size.

We could design an endpoint which always returns the first 3 items in this list and maybe some number of entries for the other items. Then, we'd need endpoints to retrieve additional items for storage maps and asset vault and depending on how we'd want to do that, we may need to refactor how we store account data in the database.

We can use the solution from #563, but in order to limit computation we should also add purging of obsolete deltas on each block application. This can be achieved by executing queries like "remove all values for blocks older than $tip - N$ which have newer updates also before $tip - N$", where $N$ is the number of latest blocks for which we want to support getting public account states.

I think whether we go with this solution or something else will depend on the answers to the above question. For example, we may decide to store account assets in a table with the following fields: (account_id, asset, added_on, removed_on), where added_on and removed_on are block numbers. Then, getting the state of the asset vault at a given point in time may not be that difficult (though this approach would work only for non-fungible assets).
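The point-in-time vault query over such a table reduces to a simple filter; a sketch with hypothetical row types:

```rust
type BlockNum = u32;

/// Row of the hypothetical (account_id, asset, added_on, removed_on) table,
/// already filtered to one account.
struct AssetRow {
    asset: u64,
    added_on: BlockNum,
    removed_on: Option<BlockNum>,
}

/// Non-fungible assets held at `block`: added at or before it and not yet
/// removed. This mirrors the WHERE clause of the point-in-time query.
fn vault_at(rows: &[AssetRow], block: BlockNum) -> Vec<u64> {
    rows.iter()
        .filter(|r| r.added_on <= block && r.removed_on.map_or(true, |rem| rem > block))
        .map(|r| r.asset)
        .collect()
}

fn main() {
    let rows = vec![
        AssetRow { asset: 1, added_on: 5, removed_on: Some(10) },
        AssetRow { asset: 2, added_on: 8, removed_on: None },
    ];
    assert_eq!(vault_at(&rows, 9), vec![1, 2]);
    assert_eq!(vault_at(&rows, 10), vec![2]); // asset 1 was removed in block 10
    assert!(vault_at(&rows, 4).is_empty());
}
```

As noted above, this interval trick only works for non-fungible assets; fungible balances would still need summed deltas.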

It is also possible I'm overthinking this as dealing with accounts which require 100MB or 1GB of storage may require a completely different approach.

@polydez
Contributor

polydez commented Dec 11, 2024

Right, this is my biggest concern: how do we deal with accounts whose state is big (e.g., 100 MB or even 1 GB)?

I doubt we really need to support such big accounts. I think we should limit account storage size, and bigger account storage should also require bigger fees.

A few examples from different blockchains:

  • Solana account size is limited to 10 MB, and the account's owner must pay rent for storage (that payment is now fully refundable when the account is shrunk or destroyed).
  • TON account storage is limited to 65,535 cells (each 1,024 bits max, i.e. storage size is limited to 8 MB), and the owner has to pay non-refundable storage rent fees for each block.
  • Ethereum contracts have no practical storage size limit (the theoretical limit is $2^{256}$ words), but each word costs 20,000 gas (each update costs 5,000 gas per word, with only a 20% refund on zeroing), so storing 1 MB would cost tens of thousands of USD, and 1 GB millions of USD, which practically limits account storage size.

And talking about the asset vault, how many assets would we practically see? For a DEX it might be plenty of fungible tokens, let's say a thousand. For an NFT market it might be really many: millions of generated non-fungible assets. But we should probably force (or incentivize) developers to spread such large numbers across many smaller accounts.

@bobbinth
Contributor

  • Ethereum contracts have no practical storage size limit (the theoretical limit is $2^{256}$ words), but each word costs 20,000 gas (each update costs 5,000 gas per word, with only a 20% refund on zeroing), so storing 1 MB would cost tens of thousands of USD, and 1 GB millions of USD, which practically limits account storage size.

I can't find it now, but I think there are quite a few contracts on Ethereum that take up more than 100MB of storage. I agree that storage should be expensive, but I am not sure we should limit it beyond that. Similar thinking applies to the asset vault: we could imagine contracts with millions of NFTs where each NFT is 32 bytes. So we are again in the realm of dozens of megabytes, and potentially more.

So, we might want to refactor our account_details table into multiple tables, and this may address some of the issues with retrieving recent account states. For example, we could make it so that we store only account headers in the account_details table, but keep the last 100 headers for each account. Then we can have supporting tables which store additional data - e.g.,

  • account_code to store account code.
  • account_storage containing account storage for the last 100 account states (there could be a clever way to do this to avoid duplication).
  • account_vault - to store assets, though, maybe this one could also be split into 2 tables.
  • account_storage_maps - this could hold storage map data for each storage map (or maybe this would be a couple of tables as well).

In the future, we may even put public account data into a separate database so that it doesn't slow down the main database - but for now it's simpler to keep them together.
