DNM: A merge batcher that gracefully handles non-ready data #463

Draft · antiguru wants to merge 3 commits into master from merge_batcher_temporal

Conversation

@antiguru (Member) commented Feb 27, 2024

This PR shows how to implement a merge batcher that avoids repeatedly reconsidering data that is in advance of the current frontier. It does the following things:

  1. It extracts ready data from the existing chains instead of merging all chains and then extracting data from the last remaining chain.
  2. It separates canonicalization from the extraction operation, so it can be reused for both inserts and extracts.
  3. It records a frontier per block, which allows for efficient frontier testing: if the extraction (upper) frontier is less than or equal to the block frontier, no record in the block is ready, and the block does not need to be touched (see the sketch below).

This has the potential to reduce the work spent on outstanding data from $O(n)$, where $n$ is the number of records in the merge batcher, to $O(n/1024)$, by considering only the blocks themselves, not the data they contain.
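A minimal sketch of that per-block test, assuming timely's Antichain type and a hypothetical Block struct (the PR's actual field names may differ):

    use timely::progress::{Antichain, Timestamp};

    /// Hypothetical block shape: a frontier summarizing the times of its updates.
    struct Block<D, T, R> {
        /// Lower bound for every time occurring in `data`.
        frontier: Antichain<T>,
        data: Vec<(D, T, R)>,
    }

    impl<D, T: Timestamp, R> Block<D, T, R> {
        /// True if no record in the block is ready at `upper`, so the block
        /// can be kept wholesale without inspecting its contents.
        fn can_skip(&self, upper: &Antichain<T>) -> bool {
            // Every time in `data` is greater than or equal to some element
            // of `self.frontier`; if each such element is in advance of
            // `upper`, so is every record, and nothing is ready to extract.
            self.frontier.elements().iter().all(|f| upper.less_equal(f))
        }
    }

A skipped block contributes no per-record work, only this frontier comparison.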

I am sorry for the formatting noise, which originates from copying this code from DD to Mz and back again :/

Teaches the merge batcher to extract ready times from the existing
chains, and to maintain the chain invariant after extracting data. This
reduces the effort spent on data that is not yet ready, by maintaining a
frontier per chain block that lets us efficiently decide whether a block
needs to be inspected.

Signed-off-by: Moritz Hoffmann <[email protected]>
antiguru force-pushed the merge_batcher_temporal branch from 7fb5413 to 1d6a5e5 on March 1, 2024 at 19:24
@@ -151,18 +113,20 @@ where

 struct MergeSorter<D, T, R> {
     /// each power-of-two length list of allocations. Do not push/pop directly but use the corresponding functions.
-    queue: Vec<Vec<Vec<(D, T, R)>>>,
+    queue: Vec<Vec<(Antichain<T>, Vec<(D, T, R)>)>>,
Review comment (Member): Leave a comment here about the role of the antichain relative to the vector of updates.
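The requested comment might read roughly as follows, a hedged sketch of the invariant the PR appears to maintain:

    /// Queue of chains of blocks. In each `(Antichain<T>, Vec<(D, T, R)>)`
    /// pair, the antichain is a lower bound for the times of the updates in
    /// the vector: every `(d, t, r)` in the block satisfies
    /// `frontier.less_equal(&t)`. Extraction can therefore skip any block
    /// whose frontier is entirely in advance of the extraction frontier.
    queue: Vec<Vec<(Antichain<T>, Vec<(D, T, R)>)>>,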

Comment on lines +100 to +102
self.lower.clone(),
upper.clone(),
Antichain::from_elem(T::minimum()),
Review comment (Member): Consider (not in this PR) switching these to references.

stash: Vec<Vec<(D, T, R)>>,
pending: Vec<(D, T, R)>,
Review comment (Member): What is this?


-const BUFFER_SIZE_BYTES: usize = 1 << 13;
+const BUFFER_SIZE_BYTES: usize = 64 << 10;
Review comment (Member): Discussed, but future todo: extract this into something wrapping our buffers so that they can express an opinion without the MergeBatcher needing to be up to date on the opinions.
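One shape such a wrapper might take, as a hedged sketch (preferred_capacity is a hypothetical helper, not part of this PR):

    /// Target size per allocation, in bytes.
    const BUFFER_SIZE_BYTES: usize = 64 << 10;

    /// Preferred number of elements per buffer for element type `T`, so the
    /// buffer itself expresses an opinion instead of the MergeBatcher
    /// hard-coding one.
    fn preferred_capacity<T>() -> usize {
        let size = std::mem::size_of::<T>().max(1); // guard zero-sized types
        (BUFFER_SIZE_BYTES / size).max(1)
    }

    fn main() {
        // Allocate roughly 64 KiB worth of `(u64, u64, i64)` triples.
        let buffer: Vec<(u64, u64, i64)> =
            Vec::with_capacity(preferred_capacity::<(u64, u64, i64)>());
        assert!(buffer.capacity() >= preferred_capacity::<(u64, u64, i64)>());
    }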

@@ -179,81 +143,235 @@ impl<D: Ord, T: Ord, R: Semigroup> MergeSorter<D, T, R> {
operator_id,
queue: Vec::new(),
stash: Vec::new(),
pending: Vec::new(),
Review comment (Member): Add a note that ideally all of these start at zero capacity, so that if they are not used (e.g. on a zero-volume channel, like an error path) they do not allocate. This means that we have to check the capacity elsewhere.
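For illustration, a hedged sketch of the allocate-on-first-use pattern the note suggests (the Pool type and names are hypothetical):

    /// Hypothetical buffer pool: starts at zero capacity, so a channel that
    /// never carries data never allocates.
    struct Pool<T> {
        stash: Vec<Vec<T>>,
    }

    impl<T> Pool<T> {
        fn new() -> Self {
            Pool { stash: Vec::new() } // no allocation yet
        }

        /// Real capacity is established only when a buffer is first
        /// requested, which is where the capacity check has to happen.
        fn empty(&mut self, capacity: usize) -> Vec<T> {
            self.stash.pop().unwrap_or_else(|| Vec::with_capacity(capacity))
        }
    }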


// Walk all chains, separate ready data from data to keep.
for mut chain in std::mem::take(&mut self.queue).drain(..) {
let mut block_list = Vec::default();
Review comment (Member): Said aloud: "could this be ship_list?" Maybe not, but the name block_list is not very specific in this context (many lists of blocks here).

// Iterate block, sorting items into ship and keep
for datum in block.drain(..) {
if upper.less_equal(&datum.1) {
frontier.insert_ref(&datum.1);
Review comment (Member): This appears to be the only (?) place we update frontier, even though when we ship none of the updates (the else case below) we still want to reflect those times in the overall frontier.
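A hedged sketch of the fix this points at: when a block is kept without inspection, fold its recorded frontier into the overall frontier as well (names and structure assumed, not the PR's exact code):

    use timely::progress::{Antichain, Timestamp};

    /// Reflect a kept block's recorded frontier in the overall frontier, so
    /// times we never inspected still show up in what `seal` later reports.
    fn absorb_kept_frontier<T: Timestamp>(
        overall: &mut Antichain<T>,
        block_frontier: &Antichain<T>,
    ) {
        for time in block_frontier.elements() {
            overall.insert_ref(time);
        }
    }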

}
keep_list.push((block_frontier, block));
}
}
Review comment (Member): This is potentially a good moment, as we go, to perform the "adjacent blocks" compaction. That would allow us to return memory to self.empty() eagerly, and have it available as we loop. Waiting until maintain() is not wrong, but it seems like it may cause memory to spike during extract_into and return down only after maintain().
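A hedged sketch of such eager compaction, written as a free function with assumed shapes (the actual maintain() logic may differ): whenever the last two kept blocks fit into one allocation, merge them and recycle the freed buffer immediately.

    use timely::progress::{Antichain, Timestamp};

    fn compact_adjacent<D, T: Timestamp, R>(
        keep_list: &mut Vec<(Antichain<T>, Vec<(D, T, R)>)>,
        capacity: usize,
        recycle: &mut Vec<Vec<(D, T, R)>>,
    ) {
        while keep_list.len() >= 2 {
            let n = keep_list.len();
            if keep_list[n - 1].1.len() + keep_list[n - 2].1.len() > capacity {
                break;
            }
            let (frontier_a, mut block_a) = keep_list.pop().expect("len >= 2");
            let (frontier_b, block_b) = keep_list.last_mut().expect("len >= 1");
            // Appending the later block onto the earlier one preserves the
            // chain's sorted order; the frontiers merge by taking minima.
            block_b.append(&mut block_a);
            for time in frontier_a.elements() {
                frontier_b.insert_ref(time);
            }
            recycle.push(block_a); // emptied, but its capacity is retained
        }
    }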

Comment on lines 347 to 350
while ship_list.len() > 1 {
let list1 = ship_list.pop().unwrap();
let list2 = ship_list.pop().unwrap();
ship_list.push(self.merge_by(list1, list2));
Review comment (Member): No reason these need to be in the same geometric order as the held updates. There is the opportunity to re-sort them, or to use a binary heap to continually merge the smallest chains. No strong feelings, because it seems unlikely to be adversarially selected.
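For instance, a hedged sketch that always merges the two shortest chains (merge_by stands in for the sorter's pairwise chain merge; with only logarithmically many chains, re-sorting each round is cheap, though a binary heap would avoid it):

    fn merge_all<C>(
        mut chains: Vec<C>,
        len: impl Fn(&C) -> usize,
        mut merge_by: impl FnMut(C, C) -> C,
    ) -> Option<C> {
        while chains.len() > 1 {
            // Sort descending by length so the two shortest sit at the end.
            chains.sort_by(|a, b| len(b).cmp(&len(a)));
            let a = chains.pop().expect("len > 1");
            let b = chains.pop().expect("len > 1");
            chains.push(merge_by(a, b));
        }
        chains.pop()
    }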

}

impl<D, T, R> MergeSorter<D, T, R>
{
Review comment (Member): ???

antiguru added 2 commits on March 6, 2024 at 18:01
Signed-off-by: Moritz Hoffmann <[email protected]>
@antiguru (Member, Author) commented:

We realized that the approach taken in this PR changes what we report as the next lower bound for the data to be extracted. As before, seal captures a lower frontier of all the times in the batcher, but its semantics differ. Previously, the reported frontier was accurate, i.e., data existed at the reported frontier. Now it is merely a lower bound, with no guarantee that it is accurate.

The reason is that seal used to merge all data into a single chain, in which each (d, t) pair appears at most once. After this change, we maintain a logarithmic number of chains, and each (d, t) pair can occur in every one of them, so we can only compute a lower bound frontier of the uncompacted data, not the precise lower frontier of the compacted data.
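Concretely, the reported bound ends up being something like the minimal elements of the union of the per-block frontiers, sketched here under those assumptions:

    use timely::progress::{Antichain, Timestamp};

    /// Lower bound across all chains: the minimal elements of the union of
    /// the per-block frontiers. Records that cancel across chains still
    /// contribute their times, which is why this is only a lower bound.
    fn lower_bound<'a, T: Timestamp>(
        block_frontiers: impl IntoIterator<Item = &'a Antichain<T>>,
    ) -> Antichain<T> {
        let mut result = Antichain::new();
        for frontier in block_frontiers {
            for time in frontier.elements() {
                result.insert_ref(time);
            }
        }
        result
    }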

This becomes a problem when all data cancels out: it can cause an unknown amount of additional work for the rest of the system, which has to maintain more capabilities and may need to ask for more data more often.

We don't have an immediate solution for this problem, but there are some options:

  • We can accept this problem if we know that most data doesn't cancel, so the lower frontier should be mostly accurate. This requires that operators can handle hallucinated frontiers, i.e., reported times at which no data may actually exist.
  • We can limit the implementation to totally ordered times and combine it with time-sorted chains. This would allow us to drain chains from the front while consolidation cancels out their contents (see the sketch after this list).
  • Longer term, we could form chains of totally ordered data within a partially ordered domain, where each chain can be consolidated efficiently. I favor this option least because of implementation uncertainties.
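For the totally ordered option, a hedged sketch of draining a time-sorted chain from the front (simplified to a flat VecDeque per chain; with a total order, the frontier is at most a single time):

    use std::collections::VecDeque;

    /// All records strictly before `upper` form a prefix of a time-sorted
    /// chain, so ready data can be drained without scanning the rest.
    fn drain_ready<D, T: Ord, R>(
        chain: &mut VecDeque<(D, T, R)>,
        upper: &T,
    ) -> Vec<(D, T, R)> {
        let mut ready = Vec::new();
        while chain.front().map_or(false, |(_, t, _)| t < upper) {
            ready.push(chain.pop_front().expect("front() was Some"));
        }
        ready
    }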

antiguru marked this pull request as draft on June 20, 2024 at 20:45