Encode singleton repetition update cleverly #421
Merged
This PR introduces a clever optimization for representing, in a batch, any `(Key, Val)` pairs with a singleton `(Time, Diff)` that happens to exactly match the previous `(Time, Diff)`. This happens often in e.g. snapshot batches, where the time is often "zero" (or equivalent modulo frontiers) and the diff is "one".

Conventionally, for each `(Key, Val)` we record an offset that indicates the upper bound of its corresponding `(Time, Diff)` entries, starting from the previous offset: the upper bound of the prior keyval's list of timediffs. Each recorded offset must be strictly greater than the offset that precedes it, because we simply don't record absent updates. We take advantage of this to signal a repeated timediff by repeating the offset. Essentially, if the two offsets `(lower, upper)` for a keyval are equal, the range that should be used is instead `(lower - 1, upper)`, picking up the single previous timediff. We need to be careful to read out the right data, to encode the data correctly, and to report the total number of updates correctly (it is no longer `updates.len()`).

When running the program, this change improves the limiting memory use (at the end of the execution, with all data in batches) from ~2.75GB to ~1.50GB. This program uses `u32` keys and values, which means that the times and diffs are substantial fat to cut off.
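As a rough illustration of the repeated-offset trick, here is a minimal sketch — not the PR's actual code; the names `push_updates` and `updates_range` are hypothetical — showing how a singleton timediff equal to the previous one is encoded by repeating the offset, and how both the ranges and the total update count are recovered:

```rust
/// Append the timediffs for one keyval. A singleton that exactly matches
/// the previously stored timediff is encoded by repeating the offset and
/// storing nothing.
fn push_updates(updates: &mut Vec<(u64, i64)>, offs: &mut Vec<usize>, tds: &[(u64, i64)]) {
    if tds.len() == 1 && updates.last() == Some(&tds[0]) {
        // Singleton repeat: record the same offset again.
        let prev = *offs.last().unwrap_or(&0);
        offs.push(prev);
    } else {
        updates.extend_from_slice(tds);
        offs.push(updates.len());
    }
}

/// Recover the (lower, upper) range into `updates` for keyval `i`.
fn updates_range(offs: &[usize], i: usize) -> (usize, usize) {
    let lower = if i == 0 { 0 } else { offs[i - 1] };
    let upper = offs[i];
    if lower == upper {
        (lower - 1, upper) // repeated offset: reuse the single previous timediff
    } else {
        (lower, upper)
    }
}

fn main() {
    let mut updates = Vec::new();
    let mut offs = Vec::new();
    push_updates(&mut updates, &mut offs, &[(0, 1)]);          // keyval 0
    push_updates(&mut updates, &mut offs, &[(0, 1)]);          // keyval 1: repeat
    push_updates(&mut updates, &mut offs, &[(0, 1), (3, -1)]); // keyval 2
    assert_eq!(offs, vec![1, 1, 3]);
    assert_eq!(updates.len(), 3); // only three timediffs are stored...
    // ...but the total number of updates must be reported as four.
    let total: usize = (0..offs.len())
        .map(|i| { let (l, u) = updates_range(&offs, i); u - l })
        .sum();
    assert_eq!(total, 4);
    println!("stored {} timediffs for {} updates", updates.len(), total);
}
```

Note that the decoder never confuses a repeated offset with an empty range, precisely because absent updates are never recorded: a legitimate range always has `upper > lower`.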