refactor: allocation free cut through algorithms #2567
Conversation
core/src/core/transaction.rs (Outdated)
```rust
let mut outs_insert_idx = 0..;
let mut ncut = 0;
while let (Some(i), Some(i2)) = (inps.peek(), outs.peek()) {
```
I've read through this a few times and I still don't fully understand what we're doing here...

Let's say -

- inputs: [A,B,F,H]
- outputs: [B,C,H,J]

After cut-through we should see -

- inputs: [A,F]
- outputs: [C,J]
- removed: [B,H]
Reading through the impl here -

- We sort inputs and outputs (and check outputs are unique)
- then we iterate over peekable inputs and outputs
- if they are equal we `next()` both and increment `ncut`
- if not equal we `next()` one or the other
- then we `drain` slices of inputs and outputs using `ncut`

Given outputs [B,C,H,J], how do we end up with [C,J] using a single drain with a single slice using `ncut`? What does `ncut` do here? I'm not following this (and missing something obvious).
Yup! Let's go through it step by step:

`^` means the `_insert_idx` iterator will next yield this location,
`*` means the `inps`/`outs` iterator will next yield this location.
```
Step 1:
[^*A,B,F,H]
[^*B,C,H,J]
ncut = 0
A < B

Step 2:
[A,^*B,F,H]
[^*B,C,H,J]
ncut = 0
B == B

Step 3:
[A,^B,*F,H]
[^B,*C,H,J]
ncut = 1
F > C

Step 4:
[A,^B,*F,H]
[C,^C,*H,J]
ncut = 1
F < H

Step 5:
[A,F,^F,*H]
[C,^C,*H,J]
ncut = 1
H == H

Step 6:
[A,F,^F,H] *
[C,^C,H,*J]
ncut = 2
vec 1 has None to compare

Step 7:
vec1.drain(^..^+ncut)
[A,F,_,_]
[A,F]
vec2.drain(^..^+ncut)
[C,_,_,*J]
[C,J]
```
I think I wrote this out correctly... if there are any errors, let me know.
Essentially what we are doing is walking both lists in sorted order with four fingers, and backshifting anything that is not in the other list (the sort order guarantees we can tell). When we finish, draining that particular range cuts out only the stale slots whose values we already inserted earlier in the list. It leaves (and then backshifts) the tail range we've yet to look at (but don't need to, because everything there is greater than the greatest element in the other list).
Technically I think you can compute ncut as (*-^), but I think there is some edge condition, so tracking ncut seemed more straightforward.
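To make that concrete, here's a minimal runnable sketch of the same walk over plain integers (the integers, function name, and variable names are my stand-ins for the real Input/Output types and their hashes); it reproduces the [A,B,F,H] / [B,C,H,J] example above:

```rust
use std::cmp::Ordering;

fn cut_through_sketch(inputs: &mut Vec<u32>, outputs: &mut Vec<u32>) {
    inputs.sort_unstable();
    outputs.sort_unstable();
    let (mut i, mut o) = (0, 0); // read fingers (the `*` markers above)
    let (mut ii, mut oi) = (0, 0); // insert fingers (the `^` markers above)
    let mut ncut = 0;
    while i < inputs.len() && o < outputs.len() {
        match inputs[i].cmp(&outputs[o]) {
            Ordering::Less => {
                // unique to inputs: backshift it to the insert finger
                inputs[ii] = inputs[i];
                i += 1;
                ii += 1;
            }
            Ordering::Greater => {
                // unique to outputs: backshift it to the insert finger
                outputs[oi] = outputs[o];
                o += 1;
                oi += 1;
            }
            Ordering::Equal => {
                // present in both: cut it (skip without backshifting)
                ncut += 1;
                i += 1;
                o += 1;
            }
        }
    }
    // drain the ncut stale slots; the unvisited tail backshifts intact
    inputs.drain(ii..ii + ncut);
    outputs.drain(oi..oi + ncut);
}

fn main() {
    // the walkthrough above, with A..J mapped to integers
    let (a, b, c, f, h, j) = (1u32, 2, 3, 6, 8, 10);
    let mut inputs = vec![a, b, f, h];
    let mut outputs = vec![b, c, h, j];
    cut_through_sketch(&mut inputs, &mut outputs);
    assert_eq!(inputs, vec![a, f]); // [A,F]
    assert_eq!(outputs, vec![c, j]); // [C,J]
}
```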
👍 Thanks for the explanation. Looks good.
I am also unclear whether the `sort_unstable_by_key` I am doing at the beginning is correct -- is sorting by the commitments here the same order as the sort at the end of the function previously produced? It seems like that's not accurate, but nothing came up running tests (perhaps a good test case to add). I think it just turns out that everywhere we call `cut_through`, we also pass `false` for `verify_sort` to `TransactionBody::init`, so it doesn't matter if we're mis-sorted.
I think we are good here. You added an explicit sort to both inputs and outputs at the start of `cut_through`, right? The magic (and somewhat obscure) `hashable_ord!` macro on Inputs/Outputs/TxKernels ensures we sort them consistently by their `hash()`.

Edit: So actually, you're right, the sort is different here.

`outputs.sort_unstable_by(|v1, v2| v1.commitment().0.cmp(&v2.commitment().0));`

should really be -

`outputs.sort_unstable()`

(and let our `hashable_ord!` handle the `cmp` logic). Same for `inputs`. So actually similar to what we're doing in #2564.
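For reference, here's a rough sketch of what a `hashable_ord!`-style ordering gives you (illustrative only -- this is not grin's actual macro, and `ToyOutput` and its `hash()` are made-up stand-ins): `Ord` simply delegates to the hash, so a plain `sort_unstable()` sorts by hash.

```rust
use std::cmp::Ordering;

// Illustrative stand-in, not grin's macro: a toy output whose ordering
// delegates to its (fake) hash, so sort_unstable() sorts by hash.
#[derive(Debug, Clone, Copy)]
struct ToyOutput {
    commitment: [u8; 4], // toy stand-in for a Pedersen commitment
}

impl ToyOutput {
    // stand-in hash; the real type hashes its serialized form
    fn hash(&self) -> u64 {
        self.commitment
            .iter()
            .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
    }
}

impl Ord for ToyOutput {
    fn cmp(&self, other: &Self) -> Ordering {
        self.hash().cmp(&other.hash())
    }
}

impl PartialOrd for ToyOutput {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for ToyOutput {
    fn eq(&self, other: &Self) -> bool {
        self.hash() == other.hash()
    }
}

impl Eq for ToyOutput {}
```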
I'm not sure that's right -- I think they need to be sorted by the commitment because we need to cut the commitments, and the order of `inputs.hash()` will be uncorrelated with the order of `outputs.hash()`? This is a problem for the `verify_cut_through` method, which was assuming a sort based on commitment...
Actually it seems that the hash of an input and the hash of an output are indeed the same, not just the commitment -- the only additional field is the OutputFeatures flag! Therefore sorting the inbound set on either the commitment or the hash should be correct -- which explains why I wasn't able to get errors in either case. That the OutputFeatures flags match for matching commitments is checked in `verify_features`.
Force-pushed from f650b9c to 64d658b.
I pushed up an updated version of the algorithm. In this version I:
Just testing this out locally - but looks good so far. 👍
Yes. Makes sense.
We've actually discussed how useful it would be to have some kind of generic solution to this. There are a lot of places where we do a lot of hashing (appending to MMRs for example). This is good here but we should revisit this at some point when we have a better idea of how to do this in a single place (if we even can).
Merge conflict from the other PR but 👍 once it builds and the tests all pass.
Tested it locally and against mainnet and it looks fine.
Force-pushed from ae2a68a to ea18c1a.
Rebased
To be honest I'm a little torn. I do agree this will likely do less copying and hence use less memory. But we're not sure, and we have no idea by how much; LLVM compiler optimizations can be pretty good too. Now this replaces 25 lines of somewhat idiomatic functional Rust that's easy to read with 75 lines of iterative-looking, much harder to follow code. I agree these functions are likely hot code. But I can't tell if it's a premature optimization or not and it bothers me.
I could introduce a PR that tries to bench the existing function -- it's hard to get a 'whole system' picture though. It's unclear if it is faster, but I think it is a pretty safe assumption that it's not slower. I think the main draw of an algorithm like this over the previous one is predictability from a DoS perspective: a specially crafted set of transactions should perform roughly the same as any other, regardless of cut-through. Whereas in the prior code, bigger cut-through can yield O(N^2) behavior (the quadratic retain; see the scan-based sketch after the code below). I don't really think that's the kind of thing a compiler can optimize.

It's possible to make this code more idiomatic Rust and still get a pretty good win, and give the compiler a much better chance of optimizing it (the clone is only needed here to make the borrow checker happy...). I don't particularly find the version below less offensive though (iterators with `peekable` are tidy, but a bit harder to reason about IMO), and it has the extra copy. (I haven't tested the below.)

```rust
fn verify_cut_through(&self) -> Result<(), Error> {
    let mut inputs = self.inputs.iter().map(|x| x.hash()).peekable();
    let mut outputs = self.outputs.iter().map(|x| x.hash()).peekable();
    loop {
        match (inputs.peek(), outputs.peek()) {
            (Some(ih), Some(oh)) => match ih.cmp(oh) {
                Ordering::Less => {
                    inputs.next();
                }
                Ordering::Greater => {
                    outputs.next();
                }
                Ordering::Equal => {
                    return Err(Error::CutThrough);
                }
            },
            _ => {
                return Ok(());
            }
        }
    }
}
```
```rust
pub fn cut_through(inputs: &mut Vec<Input>, outputs: &mut Vec<Output>) -> Result<(), Error> {
    // assemble output commitments set, checking they're all unique
    outputs.sort_unstable();
    if outputs.windows(2).any(|pair| pair[0] == pair[1]) {
        return Err(Error::AggregationError);
    }
    inputs.sort_unstable();
    let inputs_cpy = inputs.clone();
    let outputs_cpy = outputs.clone();
    let mut input_hashes = inputs_cpy.iter().map(|x| (x, x.hash())).peekable();
    let mut output_hashes = outputs_cpy.iter().map(|x| (x, x.hash())).peekable();
    let mut insert_input = inputs.iter_mut().enumerate();
    let mut insert_output = outputs.iter_mut().enumerate();
    let mut ncut = 0;
    loop {
        match (input_hashes.peek(), output_hashes.peek()) {
            (Some((&i, ih)), Some((&o, oh))) => match ih.cmp(oh) {
                Ordering::Less => {
                    *insert_input.next().unwrap().1 = i;
                    input_hashes.next();
                }
                Ordering::Greater => {
                    *insert_output.next().unwrap().1 = o;
                    output_hashes.next();
                }
                Ordering::Equal => {
                    ncut += 1;
                    output_hashes.next();
                    input_hashes.next();
                }
            },
            _ => {
                // Cut elements that have already been copied
                if let Some((drain_idx, _)) = insert_output.next() {
                    outputs.drain(drain_idx..drain_idx + ncut);
                }
                if let Some((drain_idx, _)) = insert_input.next() {
                    inputs.drain(drain_idx..drain_idx + ncut);
                }
                return Ok(());
            }
        }
    }
}
```
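For contrast, here's a hypothetical sketch (not the actual prior grin code -- the names and the linear `contains` are my assumptions, and plain integers stand in for the real types) of the scan-based retain shape mentioned above:

```rust
// Hypothetical illustration only: retain everything not present in a
// scanned copy of the other side. Each `contains` is a linear scan over
// a Vec, so each retain pass is O(n * m) -- the quadratic behavior a
// crafted transaction set could trigger.
fn cut_through_scan(inputs: &mut Vec<u32>, outputs: &mut Vec<u32>) {
    let inputs_cpy = inputs.clone();
    let outputs_cpy = outputs.clone();
    inputs.retain(|i| !outputs_cpy.contains(i));
    outputs.retain(|o| !inputs_cpy.contains(o));
}
```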
p.s. do we have any benchmarking/profiling tools currently?
I wonder if we punted on the hash caching for now (we know it's a more general issue) - would this make the changes in this PR cleaner? Related - #2584.
It's a little shorter -- if we think we'll solve hash caching elsewhere, then the below is very tolerable:

```rust
pub fn cut_through(inputs: &mut Vec<Input>, outputs: &mut Vec<Output>) -> Result<(), Error> {
    // assemble output commitments set, checking they're all unique
    outputs.sort_unstable();
    if outputs.windows(2).any(|pair| pair[0] == pair[1]) {
        return Err(Error::AggregationError);
    }
    inputs.sort_unstable();
    let mut inputs_idx = 0;
    let mut outputs_idx = 0;
    let mut inputs_insert_idx = 0;
    let mut outputs_insert_idx = 0;
    let mut ncut = 0;
    while inputs_idx < inputs.len() && outputs_idx < outputs.len() {
        // note: hashes are recomputed on every comparison (no caching here)
        match inputs[inputs_idx].hash().cmp(&outputs[outputs_idx].hash()) {
            Ordering::Less => {
                inputs[inputs_insert_idx] = inputs[inputs_idx];
                inputs_idx += 1;
                inputs_insert_idx += 1;
            }
            Ordering::Greater => {
                outputs[outputs_insert_idx] = outputs[outputs_idx];
                outputs_idx += 1;
                outputs_insert_idx += 1;
            }
            Ordering::Equal => {
                ncut += 1;
                inputs_idx += 1;
                outputs_idx += 1;
            }
        }
    }
    // Cut elements that have already been copied
    outputs.drain(outputs_insert_idx..outputs_insert_idx + ncut);
    inputs.drain(inputs_insert_idx..inputs_insert_idx + ncut);
    Ok(())
}
```

For posterity and completeness, here's what a caching version looks like with iterators and no copying, using RefCell to fight the borrow checker (can probably be cleaned up further):

```rust
pub fn cut_through(inputs: &mut Vec<Input>, outputs: &mut Vec<Output>) -> Result<(), Error> {
    // assemble output commitments set, checking they're all unique
    outputs.sort_unstable();
    if outputs.windows(2).any(|pair| pair[0] == pair[1]) {
        return Err(Error::AggregationError);
    }
    inputs.sort_unstable();
    let mut inputs_r = RefCell::new(vec![]);
    let mut outputs_r = RefCell::new(vec![]);
    std::mem::swap(inputs_r.get_mut(), inputs);
    std::mem::swap(outputs_r.get_mut(), outputs);
    // `inputs`/`outputs` now hold the empty vecs we swapped in
    assert!(inputs.len() == 0);
    let (drain_idx_inputs, drain_idx_outputs) = {
        let mut ncut = 0;
        let mut drain_idx_inputs = 0;
        let mut drain_idx_outputs = 0;
        let inputs = &inputs_r;
        let outputs = &outputs_r;
        let mut input_hashes = (0..inputs.borrow().len())
            .map(|x| (x, inputs.borrow()[x].hash()))
            .peekable();
        let mut insert_input = (0..).map(|x| {
            move |y| {
                let mut inputs = inputs.borrow_mut();
                inputs[x] = inputs[y];
                x
            }
        });
        let mut output_hashes = (0..outputs.borrow().len())
            .map(|x| (x, outputs.borrow()[x].hash()))
            .peekable();
        let mut insert_output = (0..).map(|x| {
            move |y| {
                let mut outputs = outputs.borrow_mut();
                outputs[x] = outputs[y];
                x
            }
        });
        while let (Some((get_i, ih)), Some((get_o, oh))) =
            (input_hashes.peek(), output_hashes.peek())
        {
            match ih.cmp(oh) {
                Ordering::Less => {
                    // the drain must start one past the last slot written to
                    drain_idx_inputs = insert_input.next().unwrap()(*get_i) + 1;
                    input_hashes.next();
                }
                Ordering::Greater => {
                    drain_idx_outputs = insert_output.next().unwrap()(*get_o) + 1;
                    output_hashes.next();
                }
                Ordering::Equal => {
                    ncut += 1;
                    output_hashes.next();
                    input_hashes.next();
                }
            }
        }
        (
            drain_idx_inputs..(drain_idx_inputs + ncut),
            drain_idx_outputs..(drain_idx_outputs + ncut),
        )
    };
    // Cut elements that have already been copied
    std::mem::swap(outputs_r.get_mut(), outputs);
    std::mem::swap(inputs_r.get_mut(), inputs);
    outputs.drain(drain_idx_outputs);
    inputs.drain(drain_idx_inputs);
    Ok(())
}
```
I say we go with that most recent example.

Thanks for doing this exploratory work to look into different options here. 👍 And it's not like the "convert them into sets and find the intersection" approach was particularly intuitive in terms of knowing what it was doing at a glance anyway.
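(For reference, that set-based approach reads roughly like the sketch below -- illustrative only, not the exact previous code, again over plain integers. Unlike the linear-scan sketch earlier, lookups here are O(1), but it still allocates.)

```rust
use std::collections::HashSet;

// Illustrative only: build sets, take the intersection, then retain
// everything outside it. Easy to read at the call site, but it
// allocates three sets -- the allocation this PR removes.
fn cut_through_sets(inputs: &mut Vec<u32>, outputs: &mut Vec<u32>) {
    let in_set: HashSet<u32> = inputs.iter().copied().collect();
    let out_set: HashSet<u32> = outputs.iter().copied().collect();
    let to_cut: HashSet<u32> = in_set.intersection(&out_set).copied().collect();
    inputs.retain(|i| !to_cut.contains(i));
    outputs.retain(|o| !to_cut.contains(o));
}
```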
Force-pushed from ea18c1a to b809be0.
Ok. I pushed up new code for both, which I think is even a little simpler.
👍
In our current algorithms for `cut_through` and `verify_cut_through`, we allocate memory and use some relatively inefficient APIs (e.g., `retain`). The algorithms presented should be more time-efficient (this is untested), but are certainly more space-efficient.

Opening a PR now to get feedback on the approach.