DNM: A merge batcher that gracefully handles non-ready data #463

Draft · antiguru wants to merge 3 commits into master from merge_batcher_temporal

Conversation

@antiguru (Member) commented Feb 27, 2024

This PR shows how to implement a merge batcher that avoids repeatedly reconsidering data that is in advance of the current frontier. It does the following things:

  1. It extracts ready data from the existing chains instead of merging all chains and then extracting data from the last remaining chain.
  2. It separates canonicalization from the extraction operation, so it can be reused for both inserts and extracts.
  3. It records a frontier per block, which allows for efficient frontier testing: if the extraction (upper) frontier is less than or equal to the block frontier, no record in the block is ready, and the block does not need to be touched (see the sketch below).

This has the potential to reduce the work spent on outstanding data from $O(n)$, where $n$ is the number of records in the merge batcher, to $O(n/1024)$, by considering only the blocks themselves, not the data they contain.
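A minimal sketch of that per-block test, assuming timely's Antichain type and a hypothetical Block struct (the PR's actual field names may differ):

    use timely::progress::{Antichain, Timestamp};

    /// Hypothetical block shape: a frontier summarizing the times of its updates.
    struct Block<D, T, R> {
        /// Lower bound for every time occurring in `data`.
        frontier: Antichain<T>,
        data: Vec<(D, T, R)>,
    }

    impl<D, T: Timestamp, R> Block<D, T, R> {
        /// True if no record in the block is ready at `upper`, so the block
        /// can be kept wholesale without inspecting its contents.
        fn can_skip(&self, upper: &Antichain<T>) -> bool {
            // Every time in `data` is greater than or equal to some element
            // of `self.frontier`; if each such element is in advance of
            // `upper`, so is every record, and nothing is ready to extract.
            self.frontier.elements().iter().all(|f| upper.less_equal(f))
        }
    }

A skipped block contributes no per-record work, only this frontier comparison.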

I am sorry for the formatting noise, which originates from copying this code from DD to Mz and back again :/

Teaches the merge batcher to extract ready times from the existing
chains, and to maintain the chain invariant after extracting data. This
reduces the effort spent on data that is not yet ready, by maintaining a
frontier per chain block that lets us efficiently decide whether a block
needs to be inspected.

Signed-off-by: Moritz Hoffmann <[email protected]>
antiguru force-pushed the merge_batcher_temporal branch from 7fb5413 to 1d6a5e5 on March 1, 2024 at 19:24
@@ -151,18 +113,20 @@ where

 struct MergeSorter<D, T, R> {
     /// each power-of-two length list of allocations. Do not push/pop directly but use the corresponding functions.
-    queue: Vec<Vec<Vec<(D, T, R)>>>,
+    queue: Vec<Vec<(Antichain<T>, Vec<(D, T, R)>)>>,
Review comment (Member): Leave a comment here about the role of the antichain relative to the vector of updates.
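The requested comment might read roughly as follows, a hedged sketch of the invariant the PR appears to maintain:

    /// Queue of chains of blocks. In each `(Antichain<T>, Vec<(D, T, R)>)`
    /// pair, the antichain is a lower bound for the times of the updates in
    /// the vector: every `(d, t, r)` in the block satisfies
    /// `frontier.less_equal(&t)`. Extraction can therefore skip any block
    /// whose frontier is entirely in advance of the extraction frontier.
    queue: Vec<Vec<(Antichain<T>, Vec<(D, T, R)>)>>,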

Comment on lines +100 to +102
self.lower.clone(),
upper.clone(),
Antichain::from_elem(T::minimum()),
Review comment (Member): Consider (not in this PR) switching these to references.

stash: Vec<Vec<(D, T, R)>>,
pending: Vec<(D, T, R)>,
Review comment (Member): What is this?


-const BUFFER_SIZE_BYTES: usize = 1 << 13;
+const BUFFER_SIZE_BYTES: usize = 64 << 10;
Review comment (Member): Discussed, but future todo: extract this into something wrapping our buffers so that they can express an opinion without the MergeBatcher needing to be up to date on the opinions.
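One shape such a wrapper might take, as a hedged sketch (preferred_capacity is a hypothetical helper, not part of this PR):

    /// Target size per allocation, in bytes.
    const BUFFER_SIZE_BYTES: usize = 64 << 10;

    /// Preferred number of elements per buffer for element type `T`, so the
    /// buffer itself expresses an opinion instead of the MergeBatcher
    /// hard-coding one.
    fn preferred_capacity<T>() -> usize {
        let size = std::mem::size_of::<T>().max(1); // guard zero-sized types
        (BUFFER_SIZE_BYTES / size).max(1)
    }

    fn main() {
        // Allocate roughly 64 KiB worth of `(u64, u64, i64)` triples.
        let buffer: Vec<(u64, u64, i64)> =
            Vec::with_capacity(preferred_capacity::<(u64, u64, i64)>());
        assert!(buffer.capacity() >= preferred_capacity::<(u64, u64, i64)>());
    }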

@@ -179,81 +143,235 @@ impl<D: Ord, T: Ord, R: Semigroup> MergeSorter<D, T, R> {
operator_id,
queue: Vec::new(),
stash: Vec::new(),
pending: Vec::new(),
Review comment (Member): Add a note that ideally all of these start at zero capacity, so that if they are not used (e.g. on a zero-volume channel, like an error path) they do not allocate. This means that we have to check the capacity elsewhere.
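For illustration, a hedged sketch of the allocate-on-first-use pattern the note suggests (the Pool type and names are hypothetical):

    /// Hypothetical buffer pool: starts at zero capacity, so a channel that
    /// never carries data never allocates.
    struct Pool<T> {
        stash: Vec<Vec<T>>,
    }

    impl<T> Pool<T> {
        fn new() -> Self {
            Pool { stash: Vec::new() } // no allocation yet
        }

        /// Real capacity is established only when a buffer is first
        /// requested, which is where the capacity check has to happen.
        fn empty(&mut self, capacity: usize) -> Vec<T> {
            self.stash.pop().unwrap_or_else(|| Vec::with_capacity(capacity))
        }
    }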


// Walk all chains, separate ready data from data to keep.
for mut chain in std::mem::take(&mut self.queue).drain(..) {
let mut block_list = Vec::default();
Review comment (Member): Said aloud: "could this be ship_list?" Maybe not, but the name block_list is not very specific in this context (many lists of blocks here).

// Iterate block, sorting items into ship and keep
for datum in block.drain(..) {
if upper.less_equal(&datum.1) {
frontier.insert_ref(&datum.1);
Review comment (Member): This appears to be the only (?) place we update frontier, even though when we ship none of the updates (the else case below) we still want to reflect those times in the overall frontier.
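A hedged sketch of the fix this points at: when a block is kept without inspection, fold its recorded frontier into the overall frontier as well (names and structure assumed, not the PR's exact code):

    use timely::progress::{Antichain, Timestamp};

    /// Reflect a kept block's recorded frontier in the overall frontier, so
    /// times we never inspected still show up in what `seal` later reports.
    fn absorb_kept_frontier<T: Timestamp>(
        overall: &mut Antichain<T>,
        block_frontier: &Antichain<T>,
    ) {
        for time in block_frontier.elements() {
            overall.insert_ref(time);
        }
    }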

}
keep_list.push((block_frontier, block));
}
}
Review comment (Member): This is potentially a good moment, as we go, to perform the "adjacent blocks" compaction. That would allow us to return memory to self.empty() eagerly, and have it available as we loop. Waiting until maintain() is not wrong, but it seems like it may cause memory to spike during extract_into and return down only after maintain().
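A hedged sketch of such eager compaction, written as a free function with assumed shapes (the actual maintain() logic may differ): whenever the last two kept blocks fit into one allocation, merge them and recycle the freed buffer immediately.

    use timely::progress::{Antichain, Timestamp};

    fn compact_adjacent<D, T: Timestamp, R>(
        keep_list: &mut Vec<(Antichain<T>, Vec<(D, T, R)>)>,
        capacity: usize,
        recycle: &mut Vec<Vec<(D, T, R)>>,
    ) {
        while keep_list.len() >= 2 {
            let n = keep_list.len();
            if keep_list[n - 1].1.len() + keep_list[n - 2].1.len() > capacity {
                break;
            }
            let (frontier_a, mut block_a) = keep_list.pop().expect("len >= 2");
            let (frontier_b, block_b) = keep_list.last_mut().expect("len >= 1");
            // Appending the later block onto the earlier one preserves the
            // chain's sorted order; the frontiers merge by taking minima.
            block_b.append(&mut block_a);
            for time in frontier_a.elements() {
                frontier_b.insert_ref(time);
            }
            recycle.push(block_a); // emptied, but its capacity is retained
        }
    }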

Comment on lines 347 to 350
while ship_list.len() > 1 {
let list1 = ship_list.pop().unwrap();
let list2 = ship_list.pop().unwrap();
ship_list.push(self.merge_by(list1, list2));
Review comment (Member): No reason these need to be in the same geometric order as the held updates. There is the opportunity to re-sort them, or to use a binary heap to continually merge the smallest chains. No strong feelings, because it seems unlikely to be adversarially selected.
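For instance, a hedged sketch that always merges the two shortest chains (merge_by stands in for the sorter's pairwise chain merge; with only logarithmically many chains, re-sorting each round is cheap, though a binary heap would avoid it):

    fn merge_all<C>(
        mut chains: Vec<C>,
        len: impl Fn(&C) -> usize,
        mut merge_by: impl FnMut(C, C) -> C,
    ) -> Option<C> {
        while chains.len() > 1 {
            // Sort descending by length so the two shortest sit at the end.
            chains.sort_by(|a, b| len(b).cmp(&len(a)));
            let a = chains.pop().expect("len > 1");
            let b = chains.pop().expect("len > 1");
            chains.push(merge_by(a, b));
        }
        chains.pop()
    }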

}

impl<D, T, R> MergeSorter<D, T, R>
{
Review comment (Member): ???

antiguru added 2 commits on March 6, 2024 at 18:01
Signed-off-by: Moritz Hoffmann <[email protected]>
@antiguru (Member, Author) commented:

We realized that the approach taken in this PR changes what we report as the next lower bound for the data to be extracted. As before, seal captures a lower frontier of all the times in the batcher, but its semantics differ. Previously, the reported frontier was accurate, i.e., data existed at the reported frontier. Now it is merely a lower bound, with no guarantee that it is accurate.

The reason is that seal used to merge all data into a single chain, in which each (d, t) pair appears at most once. After this change, we maintain a logarithmic number of chains, and each (d, t) pair can occur in every one of them, so we can only compute a lower bound frontier of the uncompacted data, not the precise lower frontier of the compacted data.
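Concretely, the reported bound ends up being something like the minimal elements of the union of the per-block frontiers, sketched here under those assumptions:

    use timely::progress::{Antichain, Timestamp};

    /// Lower bound across all chains: the minimal elements of the union of
    /// the per-block frontiers. Records that cancel across chains still
    /// contribute their times, which is why this is only a lower bound.
    fn lower_bound<'a, T: Timestamp>(
        block_frontiers: impl IntoIterator<Item = &'a Antichain<T>>,
    ) -> Antichain<T> {
        let mut result = Antichain::new();
        for frontier in block_frontiers {
            for time in frontier.elements() {
                result.insert_ref(time);
            }
        }
        result
    }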

This becomes a problem when all data cancels out: it can cause an unknown amount of additional work for the rest of the system, which has to maintain more capabilities and may need to ask for more data more often.

We don't have an immediate solution for this problem, but there are some options:

  • We can accept this problem if we know that most data doesn't cancel, so the lower frontier should be mostly accurate. This requires that operators can handle hallucinated frontiers, i.e., reported times at which no data may actually exist.
  • We can limit the implementation to totally ordered times and combine it with time-sorted chains. This would allow us to drain chains from the front while consolidation cancels out their contents (see the sketch after this list).
  • Longer term, we could form chains of totally ordered data within a partially ordered domain, where each chain can be consolidated efficiently. I favor this option least because of implementation uncertainties.
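For the totally ordered option, a hedged sketch of draining a time-sorted chain from the front (simplified to a flat VecDeque per chain; with a total order, the frontier is at most a single time):

    use std::collections::VecDeque;

    /// All records strictly before `upper` form a prefix of a time-sorted
    /// chain, so ready data can be drained without scanning the rest.
    fn drain_ready<D, T: Ord, R>(
        chain: &mut VecDeque<(D, T, R)>,
        upper: &T,
    ) -> Vec<(D, T, R)> {
        let mut ready = Vec::new();
        while chain.front().map_or(false, |(_, t, _)| t < upper) {
            ready.push(chain.pop_front().expect("front() was Some"));
        }
        ready
    }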

antiguru marked this pull request as draft on June 20, 2024 at 20:45