description |
---|
Aggregation of smaller data pieces to store on Filecoin |
Filecoin is designed to store large data for extended periods. Small-scale data (<4 GiB) can be combined with other small deals into larger ones, either on-chain or off-chain. Smart contracts can handle programmatic data storing. This article explains the process, referring to small-scale data as sub-piece data.
For context, a piece of data in Filecoin refers to a unit of negotiation for data to be stored on Filecoin. A sub-piece refers to a sub-unit of that larger piece. These are typically small data like NFT images, short videos and more.
Aggregation is the process of combining multiple packages of sub-piece data into a single package. This large package is stored on the Filecoin network instead of multiple smaller packages. Aggregation can be done off-chain or on-chain.
@@ -14,61 +16,61 @@ Aggregation is the process of combining multiple packages of _sub-piece data_ in
The base interface for aggregation requires the following components:
- A client who has data to upload.
- An aggregator platform that clients can interact with to request to make a storage deal and retrieve Proof of Deal Sub-piece Inclusion (PoDSI) from.
- An aggregation node to aggregate the sub-piece data into a larger file and to provide PoDSI that can be called via an API endpoint. Aggregation of data always happens off-chain. This is typically hosted by the aggregator platform
- An optional aggregation smart contract that clients can submit an on-chain request to, to request an off-chain aggregation node to make a storage deal
How aggregation happens
Proof of Deal Sub-piece Inclusion (PoDSI) is motivated by a need for sub-piece data uploads to eventually issue verification and proof that the data was included in an associated deal on Filecoin. PoDSI is heavily used in the aggregated deal-making workflow.
PoDSI is a proof construction and is generated for each sub-piece CID (within the large data segment) and stored in an off-chain database.
The proof consists of two elements:
- An inclusion proof of a sub-tree, which contains the size and position of the sub-piece data in the larger aggregated data piece, corresponding to the tree of the aggregator's committed larger aggregated data piece.
- An inclusion proof of the double leaf data segment descriptor, which describes the sub-piece data within the larger data segment index, which is contained at the end of a deal, describing all the data segments contained within that deal.
Lighthouse.storage is the first aggregator platform to enable PoDSI. You can view the tutorial in their docs here.
Read more about the technical details of PoDSI here.
To request for aggregation and PoDSI off-chain, developers interact with an aggregator platform:
- The client submits sub-piece data to an aggregator platform. The aggregator prepares the data and generates the sub-piece CID, known as pCID, and URL to download the CAR file.
- The aggregator hosts an off-chain aggregation node, which aggregates the sub-piece CAR files into a larger aggregated CAR file.
- Simultaneously, the aggregator aggregates indexed data segments (based on specs here). It runs the proofing library and generates PoDSI proofs for each sub-piece pCID, storing them in an off-chain database.
- The aggregator uses programmatic deal-making or manual deal-making to make storage deals with storage providers for the aggregated larger CAR file.
- Storage Providers download the aggregated CAR file and publish storage deals.
- Clients can query a proofing endpoint provided by the aggregator (example here, which will look up the sub-piece CID (pCID) in the database and return the PoDSI proof, aggregated CID, and associated deal ID.
- Clients can use the sub-piece pCID for on-chain verification with the aggregation smart contract, which will verify the Merkle proof to ensure the sub-piece pCID (CommPc) matches the piece CID (CommPa) of the associated deal ID.
Lighthouse.storage is the first aggregator platform available. You can find their docs on how to utilize their SDK for the above process here.
Requesting for aggregation of data to store, via an off-chain platform
On-chain aggregation and PoDSI requests go through aggregator oracle smart contracts:
- The client prepares the data and generates the sub-piece CID, known as pCID (CommPc). Here is an easy data preparation tool by lighthouse.storage.
- The client submits a sub-piece CID (CommPc) with metadata (e.g. URL to download the sub-piece CAR file) directly to the aggregation smart contract.
- The aggregator watches the aggregation contract, and when the aggregator decides there are enough sub-pieces, it downloads all sub-piece data, to generate the aggregated piece from the CAR file URL.
- The aggregator aggregates indexed data segments into a larger data file for deal-making (based on specs here).
- The aggregator combines the sub-piece data into the aggregated CommP (CommPa) by computing within aggregator's off-chain node.
- The aggregator uses programmatic deal-making or manual deal-making to make storage deals with storage providers for the aggregated larger CAR file.
- Storage Providers download the aggregated CAR file and publish storage deals. Upon the client's request, they can find the data via sub-piece CID.
- Clients can query the aggregation smart contract, which notifies the aggregator platform to look up the sub-piece CID (pCID) in its aggregation node's database and return the PoDSI proof, aggregated CID, and associated deal ID.
- Simultaneously, clients can use the sub-piece pCID for on-chain verification with the aggregation smart contract, which will verify the Merkle proof to ensure the sub-piece pCID (CommPc) matches the piece CID (CommPa) of the associated deal ID.
Requesting for aggregation of data to store, via an on-chain smart contract
To build your own on-chain aggregator oracle smart contract, check out one of the implementations with Filecoin Data Tools.