-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
## Description Closes: #10085 Adds some broad documentation for the `snapshots` package. --- ### Author Checklist *All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.* I have... - [x] included the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title - [x] added `!` to the type prefix if API or client breaking change - [x] targeted the correct branch (see [PR Targeting](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting)) - [x] provided a link to the relevant issue or specification - [x] followed the guidelines for [building modules](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules) - [x] included the necessary unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing) - [x] added a changelog entry to `CHANGELOG.md` - [x] included comments for [documenting Go code](https://blog.golang.org/godoc) - [x] updated the relevant documentation or specification - [x] reviewed "Files changed" and left comments if necessary - [x] confirmed all CI checks have passed ### Reviewers Checklist *All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.* I have... - [ ] confirmed the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title - [ ] confirmed `!` in the type prefix if API or client breaking change - [ ] confirmed all author checklist items have been addressed - [ ] reviewed state machine logic - [ ] reviewed API design and naming - [ ] reviewed documentation is accurate - [ ] reviewed tests and test coverage - [ ] manually tested (if applicable)
- Loading branch information
1 parent
cfc8b47
commit fffc2a6
Showing
1 changed file
with
236 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,236 @@ | ||
# State Sync Snapshotting | ||
|
||
The `snapshots` package implements automatic support for Tendermint state sync | ||
in Cosmos SDK-based applications. State sync allows a new node joining a network | ||
to simply fetch a recent snapshot of the application state instead of fetching | ||
and applying all historical blocks. This can reduce the time needed to join the | ||
network by several orders of magnitude (e.g. weeks to minutes), but the node | ||
will not contain historical data from previous heights. | ||
|
||
This document describes the Cosmos SDK implementation of the ABCI state sync | ||
interface, for more information on Tendermint state sync in general see: | ||
|
||
* [Tendermint Core State Sync for Developers](https://medium.com/tendermint/tendermint-core-state-sync-for-developers-70a96ba3ee35) | ||
* [ABCI State Sync Spec](https://docs.tendermint.com/master/spec/abci/apps.html#state-sync) | ||
* [ABCI State Sync Method/Type Reference](https://docs.tendermint.com/master/spec/abci/abci.html#state-sync) | ||
|
||
## Overview | ||
|
||
For an overview of how Cosmos SDK state sync is set up and configured by | ||
developers and end-users, see the | ||
[Cosmos SDK State Sync Guide](https://blog.cosmos.network/cosmos-sdk-state-sync-guide-99e4cf43be2f). | ||
|
||
Briefly, the Cosmos SDK takes state snapshots at regular height intervals given | ||
by `state-sync.snapshot-interval` and stores them as binary files in the | ||
filesystem under `<node_home>/data/snapshots/`, with metadata in a LevelDB database | ||
`<node_home>/data/snapshots/metadata.db`. The number of recent snapshots to keep are given by | ||
`state-sync.snapshot-keep-recent`. | ||
|
||
Snapshots are taken asynchronously, i.e. new blocks will be applied concurrently | ||
with snapshots being taken. This is possible because IAVL supports querying | ||
immutable historical heights. However, this requires `state-sync.snapshot-interval` | ||
to be a multiple of `pruning-keep-every`, to prevent a height from being removed | ||
while it is being snapshotted. | ||
|
||
When a remote node is state syncing, Tendermint calls the ABCI method | ||
`ListSnapshots` to list available local snapshots and `LoadSnapshotChunk` to | ||
load a binary snapshot chunk. When the local node is being state synced, | ||
Tendermint calls `OfferSnapshot` to offer a discovered remote snapshot to the | ||
local application and `ApplySnapshotChunk` to apply a binary snapshot chunk to | ||
the local application. See the resources linked above for more details on these | ||
methods and how Tendermint performs state sync. | ||
|
||
The Cosmos SDK does not currently do any incremental verification of snapshots | ||
during restoration, i.e. only after the entire snapshot has been restored will | ||
Tendermint compare the app hash against the trusted hash from the chain. Cosmos | ||
SDK snapshots and chunks do contain hashes as checksums to guard against IO | ||
corruption and non-determinism, but these are not tied to the chain state and | ||
can be trivially forged by an adversary. This was considered out of scope for | ||
the initial implementation, but can be added later without changes to the | ||
ABCI state sync protocol. | ||
|
||
## Snapshot Metadata | ||
|
||
The ABCI Protobuf type for a snapshot is listed below (refer to the ABCI spec | ||
for field details): | ||
|
||
```protobuf | ||
message Snapshot { | ||
uint64 height = 1; // The height at which the snapshot was taken | ||
uint32 format = 2; // The application-specific snapshot format | ||
uint32 chunks = 3; // Number of chunks in the snapshot | ||
bytes hash = 4; // Arbitrary snapshot hash, equal only if identical | ||
bytes metadata = 5; // Arbitrary application metadata | ||
} | ||
``` | ||
|
||
Because the `metadata` field is application-specific, the Cosmos SDK uses a | ||
similar type `cosmos.base.snapshots.v1beta1.Snapshot` with its own metadata | ||
representation: | ||
|
||
```protobuf | ||
// Snapshot contains Tendermint state sync snapshot info. | ||
message Snapshot { | ||
uint64 height = 1; | ||
uint32 format = 2; | ||
uint32 chunks = 3; | ||
bytes hash = 4; | ||
Metadata metadata = 5 [(gogoproto.nullable) = false]; | ||
} | ||
// Metadata contains SDK-specific snapshot metadata. | ||
message Metadata { | ||
repeated bytes chunk_hashes = 1; // SHA-256 chunk hashes | ||
} | ||
``` | ||
|
||
The `format` is currently `1`, defined in `snapshots.types.CurrentFormat`. This | ||
must be increased whenever the binary snapshot format changes, and it may be | ||
useful to support past formats in newer versions. | ||
|
||
The `hash` is a SHA-256 hash of the entire binary snapshot, used to guard | ||
against IO corruption and non-determinism across nodes. Note that this is not | ||
tied to the chain state, and can be trivially forged (but Tendermint will always | ||
compare the final app hash against the chain app hash). Similarly, the | ||
`chunk_hashes` are SHA-256 checksums of each binary chunk. | ||
|
||
The `metadata` field is Protobuf-serialized before it is placed into the ABCI | ||
snapshot. | ||
|
||
## Snapshot Format | ||
|
||
The current version `1` snapshot format is a zlib-compressed, length-prefixed | ||
Protobuf stream of `cosmos.base.store.v1beta1.SnapshotItem` messages, split into | ||
chunks at exact 10 MB byte boundaries. | ||
|
||
```protobuf | ||
// SnapshotItem is an item contained in a rootmulti.Store snapshot. | ||
message SnapshotItem { | ||
// item is the specific type of snapshot item. | ||
oneof item { | ||
SnapshotStoreItem store = 1; | ||
SnapshotIAVLItem iavl = 2 [(gogoproto.customname) = "IAVL"]; | ||
} | ||
} | ||
// SnapshotStoreItem contains metadata about a snapshotted store. | ||
message SnapshotStoreItem { | ||
string name = 1; | ||
} | ||
// SnapshotIAVLItem is an exported IAVL node. | ||
message SnapshotIAVLItem { | ||
bytes key = 1; | ||
bytes value = 2; | ||
int64 version = 3; | ||
int32 height = 4; | ||
} | ||
``` | ||
|
||
Snapshots are generated by `rootmulti.Store.Snapshot()` as follows: | ||
|
||
1. Set up a `protoio.NewDelimitedWriter` that writes length-prefixed serialized | ||
`SnapshotItem` Protobuf messages. | ||
1. Iterate over each IAVL store in lexicographical order by store name. | ||
2. Emit a `SnapshotStoreItem` containing the store name. | ||
3. Start an IAVL export for the store using | ||
[`iavl.ImmutableTree.Export()`](https://pkg.go.dev/github.com/tendermint/iavl#ImmutableTree.Export). | ||
4. Iterate over each IAVL node. | ||
5. Emit a `SnapshotIAVLItem` for the IAVL node. | ||
2. Pass the serialized Protobuf output stream to a zlib compression writer. | ||
3. Split the zlib output stream into chunks at exactly every 10th megabyte. | ||
|
||
Snapshots are restored via `rootmulti.Store.Restore()` as the inverse of the above, using | ||
[`iavl.MutableTree.Import()`](https://pkg.go.dev/github.com/tendermint/iavl#MutableTree.Import) | ||
to reconstruct each IAVL tree. | ||
|
||
## Snapshot Storage | ||
|
||
Snapshot storage is managed by `snapshots.Store`, with metadata in a `db.DB` | ||
database and binary chunks in the filesystem. Note that this is only used to | ||
store locally taken snapshots that are being offered to other nodes. When the | ||
local node is being state synced, Tendermint will take care of buffering and | ||
storing incoming snapshot chunks before they are applied to the application. | ||
|
||
Metadata is generally stored in a LevelDB database at | ||
`<node_home>/data/snapshots/metadata.db`. It contains serialized | ||
`cosmos.base.snapshots.v1beta1.Snapshot` Protobuf messages with a key given by | ||
the concatenation of a key prefix, the big-endian height, and the big-endian | ||
format. Chunk data is stored as regular files under | ||
`<node_home>/data/snapshots/<height>/<format>/<chunk>`. | ||
|
||
The `snapshots.Store` API is based on streaming IO, and integrates easily with | ||
the `snapshots.types.Snapshotter` snapshot/restore interface implemented by | ||
`rootmulti.Store`. The `Store.Save()` method stores a snapshot given as a | ||
`<- chan io.ReadCloser` channel of binary chunk streams, and `Store.Load()` loads | ||
the snapshot as a channel of binary chunk streams -- the same stream types used | ||
by `Snapshotter.Snapshot()` and `Snapshotter.Restore()` to take and restore | ||
snapshots using streaming IO. | ||
|
||
The store also provides many other methods such as `List()` to list stored | ||
snapshots, `LoadChunk()` to load a single snapshot chunk, and `Prune()` to prune | ||
old snapshots. | ||
|
||
## Taking Snapshots | ||
|
||
`snapshots.Manager` is a high-level snapshot manager that integrates a | ||
`snapshots.types.Snapshotter` (i.e. the `rootmulti.Store` snapshot | ||
functionality) and a `snapshots.Store`, providing an API that maps easily onto | ||
the ABCI state sync API. The `Manager` will also make sure only one operation | ||
is in progress at a time, e.g. to prevent multiple snapshots being taken | ||
concurrently. | ||
|
||
During `BaseApp.Commit`, once a state transition has been committed, the height | ||
is checked against the `state-sync.snapshot-interval` setting. If the committed | ||
height should be snapshotted, a goroutine `BaseApp.snapshot()` is spawned that | ||
calls `snapshots.Manager.Create()` to create the snapshot. | ||
|
||
`Manager.Create()` will do some basic pre-flight checks, and then start | ||
generating a snapshot by calling `rootmulti.Store.Snapshot()`. The chunk stream | ||
is passed into `snapshots.Store.Save()`, which stores the chunks in the | ||
filesystem and records the snapshot metadata in the snapshot database. | ||
|
||
Once the snapshot has been generated, `BaseApp.snapshot()` then removes any | ||
old snapshots based on the `state-sync.snapshot-keep-recent` setting. | ||
|
||
## Serving Snapshots | ||
|
||
When a remote node is discovering snapshots for state sync, Tendermint will | ||
call the `ListSnapshots` ABCI method to list the snapshots present on the | ||
local node. This is dispatched to `snapshots.Manager.List()`, which in turn | ||
dispatches to `snapshots.Store.List()`. | ||
|
||
When a remote node is fetching snapshot chunks during state sync, Tendermint | ||
will call the `LoadSnapshotChunk` ABCI method to fetch a chunk from the local | ||
node. This dispatches to `snapshots.Manager.LoadChunk()`, which in turn | ||
dispatches to `snapshots.Store.LoadChunk()`. | ||
|
||
## Restoring Snapshots | ||
|
||
When the operator has configured the local Tendermint node to run state sync | ||
(see the resources listed in the introduction for details on Tendermint state | ||
sync), it will discover snapshots across the P2P network and offer their | ||
metadata in turn to the local application via the `OfferSnapshot` ABCI call. | ||
|
||
`BaseApp.OfferSnapshot()` attempts to start a restore operation by calling | ||
`snapshots.Manager.Restore()`. This may fail, e.g. if the snapshot format is | ||
unknown (it may have been generated by a different version of the Cosmos SDK), | ||
in which case Tendermint will offer other discovered snapshots. | ||
|
||
If the snapshot is accepted, `Manager.Restore()` will record that a restore | ||
operation is in progress, and spawn a separate goroutine that runs a synchronous | ||
`rootmulti.Store.Restore()` snapshot restoration which will be fed snapshot | ||
chunks until it is complete. | ||
|
||
Tendermint will then start fetching and buffering chunks, providing them in | ||
order via ABCI `ApplySnapshotChunk` calls. These dispatch to | ||
`Manager.RestoreChunk()`, which passes the chunks to the ongoing restore | ||
process, checking if errors have been encountered yet (e.g. due to checksum | ||
mismatches or invalid IAVL data). Once the final chunk is passed, | ||
`Manager.RestoreChunk()` will wait for the restore process to complete before | ||
returning. | ||
|
||
Once the restore is completed, Tendermint will go on to call the `Info` ABCI | ||
call to fetch the app hash, and compare this against the trusted chain app | ||
hash at the snapshot height to verify the restored state. If it matches, | ||
Tendermint goes on to process blocks. |