Encode singleton repetition update cleverly #421
Merged
This PR introduces a clever optimization for representing, in a batch, any `(Key, Val)` pairs with a singleton `(Time, Diff)` that happens to exactly match the previous `(Time, Diff)`. This happens often in e.g. snapshot batches, where the time is often "zero" (or equivalent modulo frontiers) and the diff is "one".

Conventionally, for each `(Key, Val)` we record an offset that indicates the upper bound of its corresponding `(Time, Diff)` entries, starting from the previous offset: the upper bound of the prior keyval's list of timediffs. Each recorded offset must be strictly greater than the offset that precedes it, because we simply don't record absent updates. We take advantage of this to signal a repeated timediff by repeating the offset. Essentially, if the two offsets `(lower, upper)` for a keyval are equal, the range that should be used is instead `(lower - 1, upper)`, picking up the single previous timediff. We need to be careful to read out the right data, to encode the data correctly, and to report the total number of updates correctly (it is no longer `updates.len()`).

When running the program, this change improves the limiting memory use (at the end of the execution, with all data in batches) from ~2.75GB to ~1.50GB. This program uses `u32` keys and values, which means that the times and diffs are substantial fat to cut off.
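As a rough illustration of the repeated-offset trick, here is a minimal sketch — not the PR's actual code; the names `push_updates` and `updates_range` are hypothetical — showing how a singleton timediff equal to the previous one is encoded by repeating the offset, and how both the ranges and the total update count are recovered:

```rust
/// Append the timediffs for one keyval. A singleton that exactly matches
/// the previously stored timediff is encoded by repeating the offset and
/// storing nothing.
fn push_updates(updates: &mut Vec<(u64, i64)>, offs: &mut Vec<usize>, tds: &[(u64, i64)]) {
    if tds.len() == 1 && updates.last() == Some(&tds[0]) {
        // Singleton repeat: record the same offset again.
        let prev = *offs.last().unwrap_or(&0);
        offs.push(prev);
    } else {
        updates.extend_from_slice(tds);
        offs.push(updates.len());
    }
}

/// Recover the (lower, upper) range into `updates` for keyval `i`.
fn updates_range(offs: &[usize], i: usize) -> (usize, usize) {
    let lower = if i == 0 { 0 } else { offs[i - 1] };
    let upper = offs[i];
    if lower == upper {
        (lower - 1, upper) // repeated offset: reuse the single previous timediff
    } else {
        (lower, upper)
    }
}

fn main() {
    let mut updates = Vec::new();
    let mut offs = Vec::new();
    push_updates(&mut updates, &mut offs, &[(0, 1)]);          // keyval 0
    push_updates(&mut updates, &mut offs, &[(0, 1)]);          // keyval 1: repeat
    push_updates(&mut updates, &mut offs, &[(0, 1), (3, -1)]); // keyval 2
    assert_eq!(offs, vec![1, 1, 3]);
    assert_eq!(updates.len(), 3); // only three timediffs are stored...
    // ...but the total number of updates must be reported as four.
    let total: usize = (0..offs.len())
        .map(|i| { let (l, u) = updates_range(&offs, i); u - l })
        .sum();
    assert_eq!(total, 4);
    println!("stored {} timediffs for {} updates", updates.len(), total);
}
```

Note that the decoder never confuses a repeated offset with an empty range, precisely because absent updates are never recorded: a legitimate range always has `upper > lower`.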