Conversation

@jhorstmann (Contributor) commented Sep 24, 2020

@nevi-me this is the chunked-iterator-based approach I mentioned in #8223.

I'm not fully satisfied with the solution yet:

  • I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`.
  • I need to do some benchmarking to see how much the `packed_simd` implementation actually helps. If it's not a big difference, I'd propose removing it to simplify the code.

@jhorstmann jhorstmann force-pushed the ARROW-10040-unaligned-bit-buffers branch 2 times, most recently from b279745 to 68baedc Compare September 24, 2020 18:58
@jhorstmann jhorstmann changed the title from "ARROW-10040 Iterate over and combine boolean buffers with arbitrary offsets" to "ARROW-10040: [Rust] Iterate over and combine boolean buffers with arbitrary offsets" Sep 24, 2020
@jhorstmann jhorstmann force-pushed the ARROW-10040-unaligned-bit-buffers branch from 171b224 to 9f6127f Compare October 6, 2020 18:06
@jhorstmann jhorstmann marked this pull request as ready for review October 6, 2020 20:10
@jhorstmann (Contributor, Author)

@nevi-me @jorgecarleitao I removed the draft status of this PR since I finally managed to make use of this refactoring in an aggregation kernel (#8370). The performance there looks really nice. I'm still open to suggestions about possibly moving these bit-oriented methods out of `Buffer`.

@jorgecarleitao (Member)

Thanks for the PR, @jhorstmann . I will defer to others as bitmaps and offsets are beyond my skills atm.

```
use crate::buffer::Buffer;

#[test]
fn test_iter_aligned() {
```
(Contributor)

I don't understand this test :( Why are the values getting reversed? Does this have to do with endianness?

(Contributor, Author)

Endianness also confuses me, but I think this test is correct. It might have been clearer if I had also written it using binary literals.

We are counting bits starting from the least significant; since the data is stored in bytes, it should not matter whether the machine is big- or little-endian. So bit 0 is the least significant bit of the first byte, and bit 8 is the least significant bit of the second byte. The order in the u64 should be the same, so bit 8, counting from the least significant bit on the right, is the first bit from the second input byte. I hope this makes sense.
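
For example, with binary literals (a standalone sketch, not the test under review; `get_bit` is a hypothetical helper):

```
// LSB-first bit numbering: bit i of the buffer is bit (i % 8) of byte (i / 8).
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 != 0
}

fn main() {
    let bytes = [0b0000_0000u8, 0b0000_0001];
    assert!(!get_bit(&bytes, 0)); // bit 0: least significant bit of the first byte
    assert!(get_bit(&bytes, 8)); // bit 8: least significant bit of the second byte
    // Assembling the bytes into a u64 with from_le_bytes keeps that numbering:
    let v = u64::from_le_bytes([bytes[0], bytes[1], 0, 0, 0, 0, 0, 0]);
    assert_eq!((v >> 8) & 1, 1); // bit 8 still comes from the second input byte
}
```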


```
let remainder_bytes = ceil(left_chunks.remainder_len(), 8);
let rem = op(left_chunks.remainder_bits());
let rem = &rem.to_le_bytes()[0..remainder_bytes];
```
(Contributor, Author)

@nevi-me Speaking of endianness, I'm only about 85% sure that `to_le_bytes` is correct here instead of `to_ne_bytes`.

(Contributor)

We always use `to_le_bytes`, so I think it's fine. We don't yet support big-endian.
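
A quick standalone check of that assumption (not crate code): on a little-endian target the two functions agree, so they can only diverge on big-endian platforms:

```
fn main() {
    let x: u64 = 0x0102_0304_0506_0708;
    // to_le_bytes always emits the least significant byte first:
    assert_eq!(x.to_le_bytes()[0], 0x08);
    // On a little-endian machine, native byte order is already little-endian:
    if cfg!(target_endian = "little") {
        assert_eq!(x.to_le_bytes(), x.to_ne_bytes());
    }
}
```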

@jhorstmann (Contributor, Author)

I ran some benchmarks, and it seems the SIMD versions of `bitwise_bin_op` etc. are only about 20% to 50% faster than the iterator-based ones. I think it would be an acceptable tradeoff to remove them to simplify the code.

@nevi-me nevi-me self-requested a review October 8, 2020 12:29
@nevi-me (Contributor) left a comment

I've finished going through this, and I like the approach you've taken. I had to play around with `std::ptr::read_unaligned()` to better understand what you're doing.

I have a question, but it's just confirming my understanding. It need not hold us back from merging this.
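
For reference, a minimal sketch of what `std::ptr::read_unaligned()` allows (my own illustration, not the PR's implementation): loading a `u64` from a byte offset that is not 8-byte aligned:

```
fn read_u64_at(bytes: &[u8], byte_offset: usize) -> u64 {
    assert!(byte_offset + 8 <= bytes.len());
    // Safety: the bounds check guarantees 8 readable bytes, and
    // read_unaligned places no alignment requirement on the pointer.
    unsafe { std::ptr::read_unaligned(bytes.as_ptr().add(byte_offset) as *const u64) }
}

fn main() {
    let bytes: Vec<u8> = (0u8..16).collect();
    // Offset 3 is not 8-byte aligned, but read_unaligned handles it:
    let v = read_u64_at(&bytes, 3);
    assert_eq!(v, u64::from_ne_bytes([3, 4, 5, 6, 7, 8, 9, 10]));
}
```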

```
let result_chunks = result.typed_data_mut::<u64>().iter_mut();

result_chunks
    .zip(left_chunks.iter().zip(right_chunks.iter()))
```
(Contributor)

Am I right that here we're:

  • indexing into left and right in `u64`-sized chunks
  • creating a `result_chunks` of the same output length in bytes (so len / 8)
  • iterating through the left and right chunks, and writing the result of `op(*)` to the `result_chunks`, which becomes aligned
  • doing the same with the `remainder_bytes`

(Contributor, Author)

Yes, that sounds correct. The result buffer is newly created and properly aligned for the largest primitive types.

The `with_bitset` call sets the len of the buffer to a multiple of 8 bytes / 64 bits, since otherwise the `typed_data_mut` function would complain. The capacity of the buffer is, however, large enough to also hold the remainder bytes.

There are some comparison opcodes left in the assembly; I couldn't find a way to convince the compiler that all iterators have the same length. Still, it seems to be not much slower than the SIMD version.
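
As a minimal sketch of the chunked pattern being discussed (plain slices assumed here instead of the crate's actual `Buffer` API):

```
// Zip u64 chunks of both inputs and apply the op 64 bits at a time.
fn bitwise_and(left: &[u64], right: &[u64], result: &mut [u64]) {
    for (res, (l, r)) in result.iter_mut().zip(left.iter().zip(right.iter())) {
        *res = l & r; // one 64-bit-wide boolean op per iteration
    }
    // A full implementation would also combine the remainder bits that
    // don't fill a whole u64, as in the remainder handling shown earlier.
}

fn main() {
    let left = [u64::MAX, 0b1010];
    let right = [0b1100, u64::MAX];
    let mut result = [0u64; 2];
    bitwise_and(&left, &right, &mut result);
    assert_eq!(result, [0b1100, 0b1010]);
}
```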

(Member)

Out of curiosity, which toolkit or procedure do you guys use for looking at the assembly? I am curious (amazed) how you end up on that level of detail!

(Contributor, Author)


I don't really have fancy tooling for that. For isolated functions, the Compiler Explorer is very useful. When looking at benchmarks with bigger dependencies, I use the `objdump` command, which I think comes with Linux by default. So I run a benchmark, note the executable it is running, and then do something like:

```
objdump -M intel -S target/release/deps/buffer_bit_ops-a5cda2cafee5a9f9 | less
```

And then I try to find my method in there. Sometimes it helps to give the method a unique name (`sum` for a kernel was hard to find) and to mark the method I'm interested in with `#[inline(never)]`. Reading and understanding the assembly is then the tricky part. I usually look for an expected instruction sequence in the innermost loop; here, for example, a combination of left shift / right shift (from the bit slice iterator) and binary AND.
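
A sketch of that trick (hypothetical function name; `black_box` only keeps the call from being optimized away):

```
// A unique name plus #[inline(never)] keeps the symbol visible in objdump output.
#[inline(never)]
fn my_easy_to_find_sum_kernel(values: &[u64]) -> u64 {
    values.iter().sum()
}

fn main() {
    let data: Vec<u64> = (0u64..1024).collect();
    println!("{}", std::hint::black_box(my_easy_to_find_sum_kernel(&data)));
}
```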

Vector instructions are usually a bit easier to spot since they are not that common. A good heuristic is also that simpler and shorter instruction sequences are faster.

I haven't tried it yet, but a reverse engineering tool like Ghidra could be useful to more easily analyze loops.

```
len_in_bits: usize,
) -> Buffer {
    // SIMD implementation if available
    // SIMD implementation if available and byte-aligned
```
(Contributor)

Seems like a reasonable tradeoff: if an array's slice isn't byte-aligned, it might end up being processed on the slower path.


```
impl<'a> BitChunks<'a> {
    pub fn new(buffer: &'a Buffer, offset: usize, len: usize) -> Self {
        assert!(ceil(offset + len, 8) <= buffer.len() * 8);
```
(Contributor)

It might be helpful to add a failure message to this assertion; that also helps with testing sometimes.
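
For example, the message could look something like this (hypothetical wording; `ceil` stands in for the crate's helper, and the condition mirrors the snippet above):

```
fn ceil(value: usize, divisor: usize) -> usize {
    (value + divisor - 1) / divisor
}

fn check_bounds(buffer_len_bytes: usize, offset: usize, len: usize) {
    assert!(
        ceil(offset + len, 8) <= buffer_len_bytes * 8,
        "offset ({}) + len ({}) bits do not fit in a buffer of {} bytes",
        offset,
        len,
        buffer_len_bytes
    );
}

fn main() {
    check_bounds(2, 0, 16); // ok: 16 bits fit into 2 bytes
}
```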

@nevi-me nevi-me closed this in 20f2bd4 Oct 8, 2020
andygrove pushed a commit that referenced this pull request Oct 8, 2020
Built on top of [ARROW-10040][1] (#8262)

Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)

Current master without SIMD:

```
$ cargo bench  --bench aggregate_kernels
sum 512                 time:   [3.9652 us 3.9722 us 3.9819 us]
                        change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [9.4577 us 9.4796 us 9.5112 us]
                        change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
```

This branch without SIMD (speedup probably due to accessing the data via a slice):

```
sum 512                 time:   [1.1066 us 1.1113 us 1.1168 us]
                        change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

sum nulls 512           time:   [1.3279 us 1.3364 us 1.3469 us]
                        change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe
```

This branch with SIMD:

```
sum 512                 time:   [108.58 ns 109.47 ns 110.57 ns]
                        change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [249.95 ns 250.50 ns 251.06 ns]
                        change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```

 [1]: https://issues.apache.org/jira/browse/ARROW-10040

Closes #8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels

Lead-authored-by: Jörn Horstmann <[email protected]>
Co-authored-by: Jörn Horstmann <[email protected]>
Signed-off-by: Andy Grove <[email protected]>