Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode singleton repetition update cleverly #421

Merged

Conversation

frankmcsherry
Copy link
Member

This PR introduces a clever optimization for representing in a batch any (Key, Val) pairs with a singleton (Time, Diff) that happens to exactly match the previous (Time, Diff). This happens often in e.g. snapshot batches, where often the time is "zero" (or equivalent modulo frontiers) and the diff is "one".

Conventionally for each (Key, Val) we record an offset that indicates the upper bound of the corresponding (Time, Diff) entries, starting from the previous offset: the upper bound of the prior keyvals list of timediffs. Conventionally, each recorded offset must be strictly greater than the offset that precedes it, because we simply don't record absent updates.

We take advantage of this to signal a repeated timediff by repeating the offset. Essentially, if two updates (lower, upper) are equal, the range that should be used is instead (lower - 1, upper), picking up the single previous timediff. We need to be careful to read out the right data, also to encode the data correctly, and also to report the total number of updates correctly (it is no longer updates.len()).

When running

cargo run --release --example bfs -- 100000000 200000000 1000 0 potato

this change improves the limiting memory use (at the end of the execution, with all data in batches) from ~2.75GB to ~1.50GB. This program uses u32 keys and values, which means that the times and diffs are substantial fat to cut off.

Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The logic is subtle; but I think it does the right thing: Pushing the same (key, value) into the builder actually spells out the prior singleton, and in all other cases, we take it and increment the singleton counter.

self.result.updates.push(time_diff);
}
self.result.updates.push((time, diff));
self.singleton = None;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to assign None as you take() above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@frankmcsherry frankmcsherry merged commit 6027145 into TimelyDataflow:master Nov 20, 2023
1 check passed
This was referenced Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants