ARROW-10040: [Rust] Iterate over and combine boolean buffers with arbitrary offsets #8262
Conversation
Force-pushed from b279745 to 68baedc
Force-pushed from 171b224 to 9f6127f
@nevi-me @jorgecarleitao I removed the draft status of this PR since I finally managed to make use of this refactoring in an aggregation kernel (#8370). The performance there looks really nice. I'm still open for suggestions about moving these bit-oriented methods, maybe out of …
Thanks for the PR, @jhorstmann. I will defer to others as bitmaps and offsets are beyond my skills at the moment.
```rust
use crate::buffer::Buffer;
```

```rust
#[test]
fn test_iter_aligned() {
```
I don't understand this test :( Why are the values getting reversed? Does this have to do with endianness?
Endianness also confuses me, but I think this test is correct. It might have been clearer if I had also written it using binary literals.

We are counting bits starting from the least significant one. Since the data is stored as bytes, it should not matter whether the machine is big- or little-endian. Bit 0 is the least significant bit in the first byte, and bit 8 is the least significant bit in the second byte. The order in the u64 is the same, so bit 8, counted from the least significant bit on the right, is the first bit from the second input byte. I hope this makes sense.
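To illustrate the numbering, here is a minimal sketch (not part of the PR; the byte values are made up):

```rust
fn main() {
    // Two input bytes: byte 0 is all zeros, byte 1 has its least
    // significant bit set.
    let bytes: [u8; 8] = [0b0000_0000, 0b0000_0001, 0, 0, 0, 0, 0, 0];
    // Reassembled as a u64, bit 8 (counting from the LSB) corresponds to
    // the least significant bit of the second byte.
    let value = u64::from_le_bytes(bytes);
    assert_eq!(value, 1 << 8);
}
```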
```rust
let remainder_bytes = ceil(left_chunks.remainder_len(), 8);
let rem = op(left_chunks.remainder_bits());
let rem = &rem.to_le_bytes()[0..remainder_bytes];
```
@nevi-me Talking about endianness, I'm only about 85% sure that `to_le_bytes` is correct here instead of `to_ne_bytes`.
We always use `to_le_bytes`, so I think it's fine. We don't yet support BE.
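A quick sketch of what the quoted remainder handling does, with hypothetical values (not from the PR):

```rust
fn main() {
    // Suppose op() produced these 16 remainder bits, so remainder_bytes = 2.
    let rem: u64 = 0b1010_1010_1111_0000;
    let remainder_bytes = 2;
    let bytes = rem.to_le_bytes();
    // On the little-endian layout Arrow assumes, the low-order bits land
    // in the first bytes of the slice.
    assert_eq!(&bytes[0..remainder_bytes], &[0b1111_0000, 0b1010_1010]);
}
```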
I ran some benchmarks and it seems the simd versions of …
nevi-me left a comment
I've completed going through this, and I like the approach you've taken. I had to play around with `std::ptr::read_unaligned()` to understand better what you're doing.

I have a question, but it's just confirming my understanding. It need not hold us back from merging this.
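For reference, a minimal sketch (my own, not the PR's code) of how `std::ptr::read_unaligned` reads a u64 starting at an arbitrary byte position:

```rust
fn read_u64_at(bytes: &[u8], byte_offset: usize) -> u64 {
    assert!(byte_offset + 8 <= bytes.len());
    // Safety: the bounds check above guarantees that 8 readable bytes
    // start at `byte_offset`; read_unaligned does not require alignment.
    unsafe {
        std::ptr::read_unaligned(bytes.as_ptr().add(byte_offset) as *const u64)
    }
}
```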
```rust
let result_chunks = result.typed_data_mut::<u64>().iter_mut();

result_chunks
    .zip(left_chunks.iter().zip(right_chunks.iter()))
```
Am I right that here we're:

- indexing into left and right at u64-sized chunks
- creating a `result_chunks` of the same output len in bytes (so len / 8)
- iterating through the l&r chunks, and writing the result of `op(*)` to the `result_chunks`, which becomes aligned
- doing the same with `remainder_bytes`?
Yes, that sounds correct. The result buffer is newly created and properly aligned for the largest primitive types.

The `with_bitset` call sets the len of the buffer to a multiple of 8 bytes / 64 bits, as otherwise the `typed_data_mut` function would complain. The capacity of the buffer is however large enough to also contain the remainder bytes.

There are some comparison opcodes left in the assembly; I couldn't find a way to convince the compiler that all iterators have the same length. Still, it seems to be not much slower than the simd version.
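Putting the pieces together, the overall shape of the kernel is roughly this (a simplified sketch over plain slices, not the PR's actual buffer-based code):

```rust
/// Combine two bit buffers, already split into u64 chunks plus a
/// remainder, using a bitwise operation.
fn combine(
    left: &[u64],
    right: &[u64],
    left_rem: u64,
    right_rem: u64,
    op: impl Fn(u64, u64) -> u64,
) -> (Vec<u64>, u64) {
    // The aligned part: zip the chunk iterators and apply op per chunk.
    let chunks = left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| op(*l, *r))
        .collect();
    // The remainder (fewer than 64 valid bits) is combined the same way;
    // only the low bits are meaningful.
    (chunks, op(left_rem, right_rem))
}
```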
Out of curiosity, which toolkit or procedure do you guys use for looking at the assembly? I am curious (amazed) how you end up on that level of detail!
I don't really have fancy tooling for that. For isolated functions, the Compiler Explorer is very useful. When looking at benchmarks with bigger dependencies I use the `objdump` command, which I think comes with Linux by default. So I run a benchmark, note the executable it is running, and then do something like

```
objdump -M intel -S target/release/deps/buffer_bit_ops-a5cda2cafee5a9f9 | less
```

and then try to find my method in there. Sometimes it helps to give the method a unique name (`sum` for a kernel was hard to find) and to mark the method I'm interested in as `#[inline(never)]`. Reading and understanding the assembly then is the tricky part. I usually expect a certain instruction sequence in the innermost loop I'm looking for, here for example a combination of left shift / right shift (from the bit slice iterator) and binary and.
Vector instructions are usually a bit easier to spot since they are not that common. A good heuristic is also that simpler + shorter instruction sequences = faster.
I haven't tried it yet, but a reverse engineering tool like Ghidra could be useful to more easily analyze loops.
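As a concrete example of that trick (hypothetical function, not from the PR):

```rust
// A uniquely named kernel that is easy to locate in objdump output;
// #[inline(never)] keeps it from being merged into its callers.
#[inline(never)]
fn bit_ops_bench_kernel(left: &[u64], right: &[u64]) -> u64 {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| (l & r).count_ones() as u64)
        .sum()
}
```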
```diff
     len_in_bits: usize,
 ) -> Buffer {
-    // SIMD implementation if available
+    // SIMD implementation if available and byte-aligned
```
Seems like a reasonable tradeoff: if an array's slice isn't byte-aligned, it might end up being processed on the slower path.
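A sketch of the dispatch being discussed (hypothetical names and return value; the real function works on `Buffer`s):

```rust
// Hypothetical dispatch: SIMD only when both inputs are byte-aligned.
fn choose_path(left_offset_in_bits: usize, right_offset_in_bits: usize) -> &'static str {
    if left_offset_in_bits % 8 == 0 && right_offset_in_bits % 8 == 0 {
        "simd" // both inputs start on a byte boundary
    } else {
        "scalar" // arbitrary bit offset: slower chunked-iterator path
    }
}
```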
```rust
impl<'a> BitChunks<'a> {
    pub fn new(buffer: &'a Buffer, offset: usize, len: usize) -> Self {
        assert!(ceil(offset + len, 8) <= buffer.len() * 8);
```
It might be helpful to add a failure message to this assertion; that also helps with testing sometimes.
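For example, something along these lines (a sketch of the suggestion, not code from the PR):

```rust
assert!(
    ceil(offset + len, 8) <= buffer.len() * 8,
    "offset ({}) + len ({}) out of bounds for buffer of {} bytes",
    offset,
    len,
    buffer.len()
);
```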
Built on top of [ARROW-10040][1] (#8262)

Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)

Current master without simd:

```
$ cargo bench --bench aggregate_kernels
sum 512                 time:   [3.9652 us 3.9722 us 3.9819 us]
                        change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [9.4577 us 9.4796 us 9.5112 us]
                        change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
```

This branch without simd (speedup probably due to accessing the data via a slice):

```
sum 512                 time:   [1.1066 us 1.1113 us 1.1168 us]
                        change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

sum nulls 512           time:   [1.3279 us 1.3364 us 1.3469 us]
                        change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe
```

This branch with simd:

```
sum 512                 time:   [108.58 ns 109.47 ns 110.57 ns]
                        change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [249.95 ns 250.50 ns 251.06 ns]
                        change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```

[1]: https://issues.apache.org/jira/browse/ARROW-10040

Closes #8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels

Lead-authored-by: Jörn Horstmann <[email protected]>
Co-authored-by: Jörn Horstmann <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
@nevi-me this is the chunked-iterator-based approach I mentioned in #8223.
I'm not fully satisfied with the solution yet:

- These methods might fit better on `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`.
- I'm not sure whether the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code.
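For anyone reading along, a usage sketch of the chunked-iterator approach, assuming only the `BitChunks` API quoted above (`new`, plus `iter` and `remainder_bits` as used earlier in this PR); the popcount use case is my own example:

```rust
// Count the set bits in a bit buffer with an arbitrary bit offset.
fn count_set_bits(buffer: &Buffer, offset: usize, len: usize) -> usize {
    let chunks = BitChunks::new(buffer, offset, len);
    let mut count: usize = 0;
    // Full 64-bit chunks first...
    for chunk in chunks.iter() {
        count += chunk.count_ones() as usize;
    }
    // ...then the remaining (fewer than 64) bits.
    count += chunks.remainder_bits().count_ones() as usize;
    count
}
```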