ARROW-10040: [Rust] Iterate over and combine boolean buffers with arbitrary offsets #8262
Conversation
Force-pushed from b279745 to 68baedc
Force-pushed from 171b224 to 9f6127f
@nevi-me @jorgecarleitao I removed the draft status of this PR since I finally managed to make use of this refactoring in an aggregation kernel (#8370). The performance there looks really nice. I'm still open for suggestions about moving these bit-oriented methods, maybe out of …
Thanks for the PR, @jhorstmann. I will defer to others as bitmaps and offsets are beyond my skills at the moment.
```rust
use crate::buffer::Buffer;
```

```rust
#[test]
fn test_iter_aligned() {
```
I don't understand this test :( Why are the values getting reversed? Does this have to do with endianness?
Endianness also confuses me, but I think this test is correct. It might have been clearer if I had also written it using binary literals.

We are counting bits starting from the least significant one. Since the data is stored as bytes, it should not matter whether the machine is big- or little-endian. Bit 0 is the least significant bit in the first byte, and bit 8 is the least significant bit in the second byte. The order in the u64 is the same, so bit 8, counted from the least significant bit on the right, is the first bit from the second input byte. I hope this makes sense.
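To illustrate the numbering, here is a minimal sketch (not part of the PR; the byte values are made up):

```rust
fn main() {
    // Two input bytes: byte 0 is all zeros, byte 1 has its least
    // significant bit set.
    let bytes: [u8; 8] = [0b0000_0000, 0b0000_0001, 0, 0, 0, 0, 0, 0];
    // Reassembled as a u64, bit 8 (counting from the LSB) corresponds to
    // the least significant bit of the second byte.
    let value = u64::from_le_bytes(bytes);
    assert_eq!(value, 1 << 8);
}
```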
```rust
let remainder_bytes = ceil(left_chunks.remainder_len(), 8);
let rem = op(left_chunks.remainder_bits());
let rem = &rem.to_le_bytes()[0..remainder_bytes];
```
@nevi-me Talking about endianness, I'm only about 85% sure that `to_le_bytes` is correct here instead of `to_ne_bytes`.
We always use `to_le_bytes`, so I think it's fine. We don't yet support BE.
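A quick sketch of what the quoted remainder handling does, with hypothetical values (not from the PR):

```rust
fn main() {
    // Suppose op() produced these 16 remainder bits, so remainder_bytes = 2.
    let rem: u64 = 0b1010_1010_1111_0000;
    let remainder_bytes = 2;
    let bytes = rem.to_le_bytes();
    // On the little-endian layout Arrow assumes, the low-order bits land
    // in the first bytes of the slice.
    assert_eq!(&bytes[0..remainder_bytes], &[0b1111_0000, 0b1010_1010]);
}
```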
I ran some benchmarks and it seems the simd versions of …
nevi-me left a comment
I've completed going through this, and I like the approach you've taken. I had to play around with `std::ptr::read_unaligned()` to understand better what you're doing.

I have a question, but it's just confirming my understanding. It need not hold us back from merging this.
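For reference, a minimal sketch (my own, not the PR's code) of how `std::ptr::read_unaligned` reads a u64 starting at an arbitrary byte position:

```rust
fn read_u64_at(bytes: &[u8], byte_offset: usize) -> u64 {
    assert!(byte_offset + 8 <= bytes.len());
    // Safety: the bounds check above guarantees that 8 readable bytes
    // start at `byte_offset`; read_unaligned does not require alignment.
    unsafe {
        std::ptr::read_unaligned(bytes.as_ptr().add(byte_offset) as *const u64)
    }
}
```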
```rust
let result_chunks = result.typed_data_mut::<u64>().iter_mut();

result_chunks
    .zip(left_chunks.iter().zip(right_chunks.iter()))
```
Am I right that here we're:

- indexing into left and right at u64-sized chunks
- creating a `result_chunks` of the same output len in bytes (so len / 8)
- iterating through the l&r chunks, and writing the result of `op(*)` to the `result_chunks`, which becomes aligned
- doing the same with `remainder_bytes`?
Yes, that sounds correct. The result buffer is newly created and properly aligned for the largest primitive types.

The `with_bitset` call sets the len of the buffer to a multiple of 8 bytes / 64 bits, as otherwise the `typed_data_mut` function would complain. The capacity of the buffer is however large enough to also contain the remainder bytes.

There are some comparison opcodes left in the assembly; I couldn't find a way to convince the compiler that all iterators have the same length. Still, it seems to be not much slower than the simd version.
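Putting the pieces together, the overall shape of the kernel is roughly this (a simplified sketch over plain slices, not the PR's actual buffer-based code):

```rust
/// Combine two bit buffers, already split into u64 chunks plus a
/// remainder, using a bitwise operation.
fn combine(
    left: &[u64],
    right: &[u64],
    left_rem: u64,
    right_rem: u64,
    op: impl Fn(u64, u64) -> u64,
) -> (Vec<u64>, u64) {
    // The aligned part: zip the chunk iterators and apply op per chunk.
    let chunks = left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| op(*l, *r))
        .collect();
    // The remainder (fewer than 64 valid bits) is combined the same way;
    // only the low bits are meaningful.
    (chunks, op(left_rem, right_rem))
}
```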
Out of curiosity, which toolkit or procedure do you guys use for looking at the assembly? I am curious (amazed) how you end up on that level of detail!
I don't really have fancy tooling for that. For isolated functions, the Compiler Explorer is very useful. When looking at benchmarks with bigger dependencies I use the `objdump` command, which I think comes with Linux by default. So I run a benchmark, note the executable it is running, and then do something like

```
objdump -M intel -S target/release/deps/buffer_bit_ops-a5cda2cafee5a9f9 | less
```

and then try to find my method in there. Sometimes it helps to give the method a unique name (`sum` for a kernel was hard to find) and to mark the method I'm interested in as `#[inline(never)]`. Reading and understanding the assembly then is the tricky part. I usually expect a certain instruction sequence in the innermost loop I'm looking for, here for example a combination of left shift / right shift (from the bit slice iterator) and binary and.
Vector instructions are usually a bit easier to spot since they are not that common. A good heuristic is also that simpler + shorter instruction sequences = faster.
I haven't tried it yet, but a reverse engineering tool like Ghidra could be useful to more easily analyze loops.
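As a concrete example of that trick (hypothetical function, not from the PR):

```rust
// A uniquely named kernel that is easy to locate in objdump output;
// #[inline(never)] keeps it from being merged into its callers.
#[inline(never)]
fn bit_ops_bench_kernel(left: &[u64], right: &[u64]) -> u64 {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| (l & r).count_ones() as u64)
        .sum()
}
```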
```diff
     len_in_bits: usize,
 ) -> Buffer {
-    // SIMD implementation if available
+    // SIMD implementation if available and byte-aligned
```
Seems like a reasonable tradeoff: if an array's slice isn't byte-aligned, it might end up being processed on the slower path.
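A sketch of the dispatch being discussed (hypothetical names and return value; the real function works on `Buffer`s):

```rust
// Hypothetical dispatch: SIMD only when both inputs are byte-aligned.
fn choose_path(left_offset_in_bits: usize, right_offset_in_bits: usize) -> &'static str {
    if left_offset_in_bits % 8 == 0 && right_offset_in_bits % 8 == 0 {
        "simd" // both inputs start on a byte boundary
    } else {
        "scalar" // arbitrary bit offset: slower chunked-iterator path
    }
}
```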
```rust
impl<'a> BitChunks<'a> {
    pub fn new(buffer: &'a Buffer, offset: usize, len: usize) -> Self {
        assert!(ceil(offset + len, 8) <= buffer.len() * 8);
```
It might be helpful to add a failure message to this assertion; that also helps with testing sometimes.
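For example, something along these lines (a sketch of the suggestion, not code from the PR):

```rust
assert!(
    ceil(offset + len, 8) <= buffer.len() * 8,
    "offset ({}) + len ({}) out of bounds for buffer of {} bytes",
    offset,
    len,
    buffer.len()
);
```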
Built on top of [ARROW-10040][1] (#8262)

Benchmarks (run on a Ryzen 3700U laptop with some thermal problems)

Current master without simd:

```
$ cargo bench --bench aggregate_kernels
sum 512                 time:   [3.9652 us 3.9722 us 3.9819 us]
                        change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [9.4577 us 9.4796 us 9.5112 us]
                        change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
```

This branch without simd (speedup probably due to accessing the data via a slice):

```
sum 512                 time:   [1.1066 us 1.1113 us 1.1168 us]
                        change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

sum nulls 512           time:   [1.3279 us 1.3364 us 1.3469 us]
                        change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe
```

This branch with simd:

```
sum 512                 time:   [108.58 ns 109.47 ns 110.57 ns]
                        change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

sum nulls 512           time:   [249.95 ns 250.50 ns 251.06 ns]
                        change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```

[1]: https://issues.apache.org/jira/browse/ARROW-10040

Closes #8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels

Lead-authored-by: Jörn Horstmann <[email protected]>
Co-authored-by: Jörn Horstmann <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
@nevi-me this is the chunked-iterator-based approach I mentioned in #8223.
I'm not fully satisfied with the solution yet:

- These methods might fit better on `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`.
- I'm not sure whether the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code.
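For anyone reading along, a usage sketch of the chunked-iterator approach, assuming only the `BitChunks` API quoted above (`new`, plus `iter` and `remainder_bits` as used earlier in this PR); the popcount use case is my own example:

```rust
// Count the set bits in a bit buffer with an arbitrary bit offset.
fn count_set_bits(buffer: &Buffer, offset: usize, len: usize) -> usize {
    let chunks = BitChunks::new(buffer, offset, len);
    let mut count: usize = 0;
    // Full 64-bit chunks first...
    for chunk in chunks.iter() {
        count += chunk.count_ones() as usize;
    }
    // ...then the remaining (fewer than 64) bits.
    count += chunks.remainder_bits().count_ones() as usize;
    count
}
```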