This PR changes the sharding strategy to ensure a pail DAG is deterministic in structure (provided operations do not touch the same keys) i.e. operations can be applied in any order, and they will result in the same DAG.
This makes diffing easier, and can enable pail to operate efficiently in a diff syncing environment without a merkle clock.
It also results in a significant speedup to `put` operations, due to the reduced string comparison and manipulation operations needed.
While the strategy (and configuration) for shards has changed, the main format has not. This means that v1 can read v0 but should not write to it.
The change in the sharding strategy is that instead of sharding at the back of a key (largest common prefix) when the size limit is reached, we now always shard at the front, using the smallest common prefix (which is always a single character), whenever a common prefix exists. This effectively ensures consistent sharding.
For example, previously when putting `aaaa` and then `aabb` we might end up with a DAG like this when the shard size is reached:
If we then put `abbb` we might then get a DAG like:
...but if `abbb` was put BEFORE `aabb` we may have ended up with a DAG like:
Now we always end up with a DAG like the following, because we always shard when there's a common prefix, independent of the order and independent of shard size:
That is to say, there is no longer a configured maximum shard size, and the max key length is now absolute, not just the point at which a shard is created.
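The order independence described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not pail's implementation: shards are plain `Map`s rather than content-addressed blocks, keys are assumed to be non-empty, and the names `put`, `dump` and `build` are hypothetical helpers invented for this example.

```javascript
// Toy model (NOT pail's real API or block format).
// A shard is a Map from key suffix to an entry { value?, shard? }.
// Invariant: no two keys in one shard start with the same character.

function put (shard, key, value) {
  const first = key[0]
  // At most one existing key can share the leading character.
  const existing = [...shard.keys()].find(k => k[0] === first)
  if (existing === undefined) {
    shard.set(key, { value }) // no common prefix: store the key as-is
    return
  }
  if (existing === key) {
    shard.get(key).value = value // same key: overwrite the value
    return
  }
  let entry = shard.get(existing)
  if (existing.length > 1) {
    // Common prefix found: shard on the single leading character and
    // push the existing entry one level down, minus that character.
    const child = new Map([[existing.slice(1), entry]])
    shard.delete(existing)
    entry = { shard: child }
    shard.set(first, entry)
  }
  if (key.length === 1) {
    entry.value = value // the new key is exactly the prefix
    return
  }
  if (!entry.shard) entry.shard = new Map()
  put(entry.shard, key.slice(1), value)
}

// Serialize with sorted keys so two structures can be compared.
function dump (shard) {
  const out = {}
  for (const k of [...shard.keys()].sort()) {
    const { value, shard: child } = shard.get(k)
    out[k] = {}
    if (value !== undefined) out[k].value = value
    if (child) out[k].shard = dump(child)
  }
  return out
}

function build (items) {
  const root = new Map()
  for (const [k, v] of items) put(root, k, v)
  return JSON.stringify(dump(root))
}

// Same keys, different insertion order.
const dagA = build([['aaaa', 1], ['aabb', 2], ['abbb', 3]])
const dagB = build([['aaaa', 1], ['abbb', 3], ['aabb', 2]])
console.log(dagA === dagB) // true
```

In this model `aaaa` and `aabb` end up two levels down under the single-character prefixes `a` and `a`, while `abbb` sits one level down as `bbb`, whichever order the keys arrive in.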
The effective maximum shard size is determined by 2 options specified by the creator of the pail:
- `keyChars` - the characters allowed in keys (default ASCII only).
- `maxKeySize` - maximum size in bytes a UTF-8 encoded key is allowed to be (default 4096 bytes).
A good estimate for the max size of a shard is the number of characters allowed in keys multiplied by the maximum key size. For ASCII keys of 4096 bytes the biggest possible shard will stay well within the maximum block size allowed by libp2p.
Note: The default max key size is the same as `MAX_PATH` - the maximum filename+path size on most Windows/Unix systems so should be sufficient for most purposes.
Note: even if you use Unicode key characters it would still be tricky to exceed the max libp2p block size, but it is not impossible.
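As a rough worked example of the estimate above: since a shard is split whenever two keys share a leading character, a shard can hold at most one entry per allowed character, each with a key of up to `maxKeySize` bytes. The sketch below assumes "ASCII only" means all 128 ASCII code points (the exact default character set is not stated here) and ignores the bytes used by values and links. The helper name `estimateMaxShardSize` is hypothetical and not part of pail.

```javascript
// Hypothetical helper, not part of pail's API.
// Estimate: number of allowed key characters * maximum key size in bytes.
function estimateMaxShardSize (keyCharCount, maxKeySize) {
  return keyCharCount * maxKeySize
}

// Assumption: 128 allowed characters (full ASCII range), default 4096 byte keys.
const asciiEstimate = estimateMaxShardSize(128, 4096)
console.log(asciiEstimate) // 524288 bytes, i.e. 512 KiB
```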
Overall this makes pail skewed less towards quick reads and more towards quick writes. This is contrary to the original goal of the project, but on balance I think it is worth the trade-offs, in view of the DAG structure properties and the write speed increase.
I discovered a bug with `removals` in the current implementation - in theory there's a chance that if you put the same value to multiple keys and then update it, a shard might appear in the `removals` list that is a shard in a different part of the tree. The new version fixes this by encoding the path prefix into each shard.
The following trade-offs exist:
However: