
Reduce SSD disk storage requirements for Parity nodes to optimise costs #14

Open
medvedev1088 opened this issue May 3, 2020 · 0 comments


medvedev1088 commented May 3, 2020

We currently use the following options for Parity:

tracing = "on"
pruning = "archive"

This allows us to call the trace_block JSON-RPC API to retrieve traces. With these options, Parity consumes more than 4 TB of SSD space.
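For context, a minimal sketch of what a trace_block call looks like over JSON-RPC. This is illustrative only: it assumes the requests library and a Parity node exposing HTTP JSON-RPC at http://localhost:8545 (the default port); neither the endpoint nor the block number is specified in this issue.

```python
import requests  # assumed dependency, not part of this repo

PARITY_RPC = "http://localhost:8545"  # hypothetical local endpoint (Parity's default JSON-RPC port)

def trace_block(block_number):
    """Fetch all traces for a single block via Parity's trace_block JSON-RPC method."""
    payload = {
        "jsonrpc": "2.0",
        "method": "trace_block",
        "params": [hex(block_number)],  # block number as a hex quantity
        "id": 1,
    }
    response = requests.post(PARITY_RPC, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["result"]

if __name__ == "__main__":
    traces = trace_block(10_000_000)  # arbitrary example block
    print(f"block 10000000 has {len(traces or [])} traces")
```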

A few optimisation options to explore here:

  1. Use tracing = "on" and pruning = "auto". Test whether this configuration still lets us call trace_block for all blocks (see the verification sketch after this list). This would presumably keep only the trace data on disk, not the trie state history.
  2. If option 1 doesn't work: the --pruning-history and --pruning-memory options control how many recent trie states are retained (older states are pruned). With these options we could run light nodes from which the Streaming component pulls the latest blocks.
    • We need to test the maximum number of trie states that Parity can keep in memory. In the Streamer we lag 18 blocks behind the tip, and we also need a substantial buffer to account for possible Streamer failures that may require pulling much older blocks.
    • Running just the light node described above will not be enough. We also need a full archive node in case we need to pull the entire history starting from block 0. The full archive node doesn't need to run 24/7 though: we can start it daily, sync the state to the latest block, take a snapshot, and delete it. Assuming historical blocks can be synced 10 times faster than new blocks are mined, the full archive node only needs to run about 2.4 hours a day (24 / 10), which could reduce its cost by roughly a factor of 10. This node will also be used for the daily scrapes in Airflow. For this scenario we can run a dedicated Kubernetes cluster, separate from the light node cluster. Explore Argo - workflow and pipeline management in Kubernetes - for running the daily jobs (spin up the node from the latest snapshot, wait until it's synced - see the sync-wait sketch after this list - stop it, take a disk snapshot, delete the node). Also consider Airflow/Composer for this purpose (some work on this has been started here).
    • A side benefit of the above is that the snapshots of the full archive node can be shared with the community.
    • The same approach with light and full nodes can be used when we migrate from Parity to Geth in the future.
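For option 1, a possible way to verify that traces survive the switch to pruning = "auto" is to sample random historical blocks and check that trace_block returns something for each of them. This is only a sketch: the endpoint, sample size, and the expectation that every block returns at least a block-reward trace are my assumptions, not part of this issue.

```python
import random
import requests  # assumed dependency

PARITY_RPC = "http://localhost:8545"  # hypothetical local endpoint

def rpc(method, params):
    """Minimal JSON-RPC helper."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    return requests.post(PARITY_RPC, json=payload, timeout=30).json()["result"]

def find_untraceable_blocks(sample_size=100):
    """Sample random historical blocks and return those for which trace_block yields nothing."""
    head = int(rpc("eth_blockNumber", []), 16)
    missing = []
    for number in random.sample(range(1, head), sample_size):
        # Every block is expected to return at least a block-reward trace; an empty
        # result suggests traces were not kept for that block under the new pruning mode.
        if not rpc("trace_block", [hex(number)]):
            missing.append(number)
    return missing

if __name__ == "__main__":
    print("blocks without traces:", find_untraceable_blocks())
```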
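For the daily full-archive-node job in option 2, the "wait until it's synced" step can be scripted against the node itself using the standard eth_syncing JSON-RPC method. A minimal sketch, with the endpoint and polling interval as assumptions; the snapshot and node lifecycle steps would live in whatever orchestrator (Argo or Airflow/Composer) we pick.

```python
import time
import requests  # assumed dependency

PARITY_RPC = "http://localhost:8545"  # hypothetical local endpoint

def wait_until_synced(poll_interval_seconds=60):
    """Poll eth_syncing until the node reports that it is no longer syncing."""
    while True:
        payload = {"jsonrpc": "2.0", "method": "eth_syncing", "params": [], "id": 1}
        syncing = requests.post(PARITY_RPC, json=payload, timeout=30).json()["result"]
        # eth_syncing returns false once syncing has finished. It can also report false
        # before syncing has actually started, so a more robust check would additionally
        # compare eth_blockNumber against an external source.
        if syncing is False:
            return
        time.sleep(poll_interval_seconds)
```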