-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10810: [Rust] Speed up comparison kernels #8837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Really sweet improvement.
| let mut result = BooleanBufferBuilder::new($left.len()); | ||
| let byte_capacity = bit_util::ceil($left.len(), 8); | ||
| let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity); | ||
| let mut buffer = MutableBuffer::new(actual_capacity); | ||
| buffer.resize(byte_capacity); | ||
|
|
||
| let data = buffer.raw_data_mut(); | ||
| for i in 0..$left.len() { | ||
| result.append($op($left.value(i), $right))?; | ||
| if $op($left.value(i), $right) { | ||
| unsafe { | ||
| bit_util::set_bit_raw(data, i); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A side note: In general, if we are bypassing builders for speed, it can mean that builders are not a good abstraction, as they significantly impact performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, would be nice to see if we can get the combination of a clean/safe API and good performance as well.
|
Let's close this for now in favor of #8842 |
This PR creates a new struct `BooleanArray`, that replaces `PrimitiveArray<BooleanType>`, so that we do not have to consider the differences between being bit-packed and non-bit packed. This difference is causing a significant performance degradation described on ARROW-10453 and #8837 . This usage of different logic is already observed in most of our kernels, as the code for byte-width and bit-packed is almost always different, due to how offsets are computed. With this PR, that offset computation no longer depends on bit-packed vs non-bit-packed. IMPORTANT: this removed support from Boolean array to UnionArray, as `UnionArray` currently only supports `PrimitiveType`. Micro benchmarks (worse to best, statistically insignificant ignored): | benchmark | variation | |-------------- | -------------- | | min nulls 512 | 33.7 | | record_batches_to_csv | 23.1 | | array_string_from_vec 256 | 5.6 | | array_string_from_vec 512 | 5.2 | | take bool nulls 512 | 4.9 | | cast int32 to int64 512 | 2.5 | | equal_512 | 2.3 | | filter u8 very low selectivity | 2.2 | | array_slice 512 | 2.1 | | take bool nulls 1024 | 2.0 | | cast int64 to int32 512 | 1.6 | | min 512 | 1.6 | | take i32 512 | 1.1 | | add 512 | 1.1 | | array_slice 2048 | 1.0 | | length | 1.0 | | filter u8 low selectivity | 0.9 | | filter u8 high selectivity | 0.9 | | array_string_from_vec 128 | 0.9 | | cast int32 to float64 512 | 0.9 | | cast timestamp_ms to i64 512 | 0.8 | | take str null indices 512 | 0.6 | | sum 512 | 0.4 | | filter context u8 very low selectivity | -0.7 | | take i32 1024 | -0.9 | | filter context f32 very low selectivity | -0.9 | | cast float64 to float32 512 | -1.0 | | equal_nulls_512 | -1.0 | | cast time32s to time32ms 512 | -1.1 | | sort 2^12 | -1.2 | | struct_array_from_vec 128 | -1.4 | | array_from_vec 256 | -1.4 | | array_from_vec 128 | -1.5 | | filter context u8 high selectivity | -1.6 | | limit 512, 512 | -1.7 | | equal_string_nulls_512 | -1.8 | | take i32 nulls 1024 | -1.8 | | struct_array_from_vec 512 | -1.9 | | filter context f32 high selectivity | -2.0 | | cast timestamp_ms to timestamp_ns 512 | -2.2 | | take i32 nulls 512 | -2.3 | | buffer_bit_ops or | -2.4 | | array_from_vec 512 | -2.6 | | cast float64 to uint64 512 | -2.7 | | take str 512 | -2.8 | | min nulls string 512 | -3.1 | | cast int32 to int32 512 | -3.3 | | array_slice 128 | -3.3 | | filter context u8 w NULLs very low selectivity | -3.3 | | buffer_bit_ops and | -3.4 | | struct_array_from_vec 256 | -4.2 | | cast int32 to uint32 512 | -4.5 | | multiply 512 | -5.2 | | equal_string_512 | -5.5 | | take str null values null indices 1024 | -6.8 | | sum nulls 512 | -13.3 | | add_nulls_512 | -17.6 | | like_utf8 scalar contains | -17.8 | | nlike_utf8 scalar contains | -17.9 | | nlike_utf8 scalar complex | -24.6 | | like_utf8 scalar complex | -25.2 | | cast time64ns to time32s 512 | -42.7 | | cast date64 to date32 512 | -49.1 | | cast date32 to date64 512 | -50.7 | | nlike_utf8 scalar starts with | -51.1 | | nlike_utf8 scalar ends with | -55.1 | | like_utf8 scalar ends with | -55.5 | | like_utf8 scalar starts with | -56.3 | | nlike_utf8 scalar equals | -67.8 | | like_utf8 scalar equals | -74.2 | | eq Float32 | -75.7 | | gt_eq Float32 | -76.1 | | lt_eq Float32 | -76.5 | | not | -77.1 | | and | -78.6 | | or | -78.7 | | lt_eq scalar Float32 | -79.4 | | eq scalar Float32 | -82.1 | | neq Float32 | -82.1 | | lt scalar Float32 | -82.1 | | lt Float32 | -82.3 | | gt Float32 | -82.4 | | gt_eq scalar Float32 | -82.4 | | neq scalar Float32 | -82.6 | | gt scalar Float32 | -84.7 | Closes #8842 from jorgecarleitao/boolean Lead-authored-by: Jorge C. Leitao <[email protected]> Co-authored-by: Jorge Leitao <[email protected]> Signed-off-by: Jorge C. Leitao <[email protected]>
This PR speeds up the (non-simd) comparison kernels by ~8 times:
Together with #8832 brings even more improvements to query 12 (~1400 -> ~1250ms)