ARROW-10810: [Rust] Speed up comparison kernels #8837

Dandandan · 2020-12-05T13:21:37Z

This PR speeds up the (non-SIMD) comparison kernels by ~8 times.

Together with #8832, this brings a further improvement to query 12 (~1400 ms -> ~1250 ms):

Query 12 iteration 0 took 1233 ms
Query 12 iteration 1 took 1233 ms
Query 12 iteration 2 took 1235 ms
Query 12 iteration 3 took 1235 ms
Query 12 iteration 4 took 1297 ms
Query 12 iteration 5 took 1246 ms
Query 12 iteration 6 took 1257 ms
Query 12 iteration 7 took 1250 ms
Query 12 iteration 8 took 1265 ms
Query 12 iteration 9 took 1279 ms

eq Float32              time:   [105.96 us 106.01 us 106.06 us]                       
                        change: [-82.463% -82.423% -82.378%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

eq scalar Float32       time:   [61.439 us 61.530 us 61.662 us]                              
                        change: [-88.282% -88.255% -88.221%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

neq Float32             time:   [71.018 us 71.080 us 71.144 us]                        
                        change: [-86.580% -86.563% -86.546%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

neq scalar Float32      time:   [68.706 us 68.773 us 68.838 us]                               
                        change: [-86.207% -86.188% -86.171%] (p = 0.00 < 0.05)
                        Performance has improved.

lt Float32              time:   [70.655 us 70.703 us 70.753 us]                       
                        change: [-85.629% -85.617% -85.604%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

lt scalar Float32       time:   [50.626 us 50.664 us 50.698 us]                               
                        change: [-89.802% -89.764% -89.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

lt_eq Float32           time:   [101.34 us 101.43 us 101.51 us]                          
                        change: [-82.825% -82.797% -82.767%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [68.894 us 68.913 us 68.933 us]                                 
                        change: [-86.575% -86.557% -86.538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild

gt Float32              time:   [71.260 us 71.332 us 71.400 us]                       
                        change: [-87.481% -87.450% -87.418%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

gt scalar Float32       time:   [38.852 us 38.888 us 38.929 us]                               
                        change: [-91.745% -91.733% -91.721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  11 (11.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [99.404 us 99.451 us 99.503 us]                          
                        change: [-80.870% -80.848% -80.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

gt_eq scalar Float32    time:   [55.892 us 55.926 us 55.963 us]                                 
                        change: [-88.783% -88.751% -88.727%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
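
For readers who want to relate the `change` percentages in the report above to the "~8 times" figure, here is a minimal sketch of the arithmetic. It assumes `change` is the relative change in run time (negative means faster); the function name `speedup_from_change` is made up for this illustration and is not part of the benchmark tooling.

```rust
/// Convert a relative change in run time (in percent, negative = faster)
/// into a speedup factor, i.e. old_time / new_time.
fn speedup_from_change(percent_change: f64) -> f64 {
    // new_time = old_time * (1 + change / 100)
    1.0 / (1.0 + percent_change / 100.0)
}

fn main() {
    // Median `change` values copied from the benchmark report.
    let samples = [
        ("gt_eq Float32", -80.848),
        ("eq Float32", -82.423),
        ("gt scalar Float32", -91.733),
    ];
    for (name, change) in samples.iter() {
        println!("{}: {:.1}x faster", name, speedup_from_change(*change));
    }
}
```

Under that assumption the reported changes correspond to speedups between roughly 5x and 12x, so "~8 times" sits in the middle of the range.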

github-actions · 2020-12-05T13:32:48Z

https://issues.apache.org/jira/browse/ARROW-10810

jorgecarleitao

LGTM. Really sweet improvement.

jorgecarleitao · 2020-12-05T19:16:38Z

rust/arrow/src/compute/kernels/comparison.rs

-        let mut result = BooleanBufferBuilder::new($left.len());
+        let byte_capacity = bit_util::ceil($left.len(), 8);
+        let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+        let mut buffer = MutableBuffer::new(actual_capacity);
+        buffer.resize(byte_capacity);
+
+        let data = buffer.raw_data_mut();
        for i in 0..$left.len() {
-            result.append($op($left.value(i), $right))?;
+            if $op($left.value(i), $right) {
+                unsafe {
+                    bit_util::set_bit_raw(data, i);
+                }
+            }
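
The idea in the hunk above (allocate a zeroed, bit-packed buffer once, then set only the bits whose comparison is true) can be sketched without any Arrow types. This is an illustration in safe Rust using a plain `Vec<u8>`, not the code from this PR; `bytes_for_bits` and `compare_scalar` are hypothetical names, and the bit order is least-significant-bit first, as in Arrow bitmaps.

```rust
/// Number of bytes needed to hold `bits` bits
/// (plays the role of `bit_util::ceil(len, 8)` in the diff).
fn bytes_for_bits(bits: usize) -> usize {
    (bits + 7) / 8
}

/// Compare every element of `left` with the scalar `right` and return an
/// LSB-first bit-packed mask: bit `i` is set when `op(left[i], right)` is true.
fn compare_scalar<T: Copy>(left: &[T], right: T, op: impl Fn(T, T) -> bool) -> Vec<u8> {
    // The buffer starts zeroed, so only `true` results need a write.
    let mut data = vec![0u8; bytes_for_bits(left.len())];
    for (i, &value) in left.iter().enumerate() {
        if op(value, right) {
            data[i / 8] |= 1u8 << (i % 8);
        }
    }
    data
}

fn main() {
    let values = [1.0f32, 5.0, 3.0, 7.0, 2.0, 9.0, 0.5, 4.0, 6.0];
    let mask = compare_scalar(&values, 3.5, |a, b| a > b);
    // Indices 1, 3, 5, 7 land in the first byte, index 8 in the second.
    assert_eq!(mask, vec![0b1010_1010, 0b0000_0001]);
}
```

The sketch uses bounds-checked indexing; the diff instead calls `bit_util::set_bit_raw` on a raw pointer inside `unsafe`, which avoids those checks.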


A side note: In general, if we are bypassing builders for speed, it can mean that builders are not a good abstraction, as they significantly impact performance.

Yes, would be nice to see if we can get the combination of a clean/safe API and good performance as well.
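
One possible shape for such an API, sketched here purely as an illustration: a helper that takes the length and a closure, and packs eight results into each output byte so that callers never touch raw pointers. `collect_bits` and `lt` below are hypothetical names, not Arrow functions, and no performance claim is made for this sketch.

```rust
/// Hypothetical helper (not an Arrow API): build an LSB-first bitmap of
/// `len` bits where bit `i` is `f(i)`.
fn collect_bits(len: usize, mut f: impl FnMut(usize) -> bool) -> Vec<u8> {
    let mut out = Vec::with_capacity((len + 7) / 8);
    let mut start = 0;
    while start < len {
        let end = usize::min(start + 8, len);
        // Assemble one output byte from up to eight results.
        let mut byte = 0u8;
        for i in start..end {
            byte |= (f(i) as u8) << (i - start);
        }
        out.push(byte);
        start = end;
    }
    out
}

/// Element-wise `left < right`, returned as a bit-packed mask.
fn lt(left: &[f32], right: &[f32]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    collect_bits(left.len(), |i| left[i] < right[i])
}

fn main() {
    let mask = lt(&[1.0, 4.0, 2.0], &[2.0, 3.0, 5.0]);
    // Bits 0 and 2 are set: 1.0 < 2.0 and 2.0 < 5.0.
    assert_eq!(mask, vec![0b0000_0101]);
}
```

The kernel body reduces to a single closure, while the bit-packing detail stays in one place.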

Dandandan · 2020-12-06T03:46:19Z

Let's close this for now in favor of #8842

This PR creates a new struct `BooleanArray`, that replaces `PrimitiveArray<BooleanType>`, so that we do not have to consider the differences between being bit-packed and non-bit-packed. This difference is causing a significant performance degradation described in ARROW-10453 and #8837.

This usage of different logic is already observed in most of our kernels, as the code for byte-width and bit-packed is almost always different, due to how offsets are computed. With this PR, that offset computation no longer depends on bit-packed vs non-bit-packed.

IMPORTANT: this removes support for Boolean arrays in `UnionArray`, as `UnionArray` currently only supports `PrimitiveType`.

Micro benchmarks (worst to best, statistically insignificant ignored):

| benchmark | variation (%) |
|-------------- | -------------- |
| min nulls 512 | 33.7 |
| record_batches_to_csv | 23.1 |
| array_string_from_vec 256 | 5.6 |
| array_string_from_vec 512 | 5.2 |
| take bool nulls 512 | 4.9 |
| cast int32 to int64 512 | 2.5 |
| equal_512 | 2.3 |
| filter u8 very low selectivity | 2.2 |
| array_slice 512 | 2.1 |
| take bool nulls 1024 | 2.0 |
| cast int64 to int32 512 | 1.6 |
| min 512 | 1.6 |
| take i32 512 | 1.1 |
| add 512 | 1.1 |
| array_slice 2048 | 1.0 |
| length | 1.0 |
| filter u8 low selectivity | 0.9 |
| filter u8 high selectivity | 0.9 |
| array_string_from_vec 128 | 0.9 |
| cast int32 to float64 512 | 0.9 |
| cast timestamp_ms to i64 512 | 0.8 |
| take str null indices 512 | 0.6 |
| sum 512 | 0.4 |
| filter context u8 very low selectivity | -0.7 |
| take i32 1024 | -0.9 |
| filter context f32 very low selectivity | -0.9 |
| cast float64 to float32 512 | -1.0 |
| equal_nulls_512 | -1.0 |
| cast time32s to time32ms 512 | -1.1 |
| sort 2^12 | -1.2 |
| struct_array_from_vec 128 | -1.4 |
| array_from_vec 256 | -1.4 |
| array_from_vec 128 | -1.5 |
| filter context u8 high selectivity | -1.6 |
| limit 512, 512 | -1.7 |
| equal_string_nulls_512 | -1.8 |
| take i32 nulls 1024 | -1.8 |
| struct_array_from_vec 512 | -1.9 |
| filter context f32 high selectivity | -2.0 |
| cast timestamp_ms to timestamp_ns 512 | -2.2 |
| take i32 nulls 512 | -2.3 |
| buffer_bit_ops or | -2.4 |
| array_from_vec 512 | -2.6 |
| cast float64 to uint64 512 | -2.7 |
| take str 512 | -2.8 |
| min nulls string 512 | -3.1 |
| cast int32 to int32 512 | -3.3 |
| array_slice 128 | -3.3 |
| filter context u8 w NULLs very low selectivity | -3.3 |
| buffer_bit_ops and | -3.4 |
| struct_array_from_vec 256 | -4.2 |
| cast int32 to uint32 512 | -4.5 |
| multiply 512 | -5.2 |
| equal_string_512 | -5.5 |
| take str null values null indices 1024 | -6.8 |
| sum nulls 512 | -13.3 |
| add_nulls_512 | -17.6 |
| like_utf8 scalar contains | -17.8 |
| nlike_utf8 scalar contains | -17.9 |
| nlike_utf8 scalar complex | -24.6 |
| like_utf8 scalar complex | -25.2 |
| cast time64ns to time32s 512 | -42.7 |
| cast date64 to date32 512 | -49.1 |
| cast date32 to date64 512 | -50.7 |
| nlike_utf8 scalar starts with | -51.1 |
| nlike_utf8 scalar ends with | -55.1 |
| like_utf8 scalar ends with | -55.5 |
| like_utf8 scalar starts with | -56.3 |
| nlike_utf8 scalar equals | -67.8 |
| like_utf8 scalar equals | -74.2 |
| eq Float32 | -75.7 |
| gt_eq Float32 | -76.1 |
| lt_eq Float32 | -76.5 |
| not | -77.1 |
| and | -78.6 |
| or | -78.7 |
| lt_eq scalar Float32 | -79.4 |
| eq scalar Float32 | -82.1 |
| neq Float32 | -82.1 |
| lt scalar Float32 | -82.1 |
| lt Float32 | -82.3 |
| gt Float32 | -82.4 |
| gt_eq scalar Float32 | -82.4 |
| neq scalar Float32 | -82.6 |
| gt scalar Float32 | -84.7 |

Closes #8842 from jorgecarleitao/boolean

Lead-authored-by: Jorge C. Leitao <[email protected]>
Co-authored-by: Jorge Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
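
To make the remark about offset computation concrete, here is a small sketch of how reading element `i` of a sliced array differs between a byte-width type and a bit-packed boolean. It works on plain slices rather than Arrow buffers, and `get_i32` and `get_bool` are hypothetical helpers written for this illustration.

```rust
/// Byte-width case: the array offset is counted in elements, so it simply
/// shifts the index into the values slice.
fn get_i32(values: &[i32], offset: usize, i: usize) -> i32 {
    values[offset + i]
}

/// Bit-packed case: the array offset is counted in bits, so the position has
/// to be split into a byte index and a bit index within that byte (LSB first).
fn get_bool(bits: &[u8], offset: usize, i: usize) -> bool {
    let pos = offset + i;
    (bits[pos / 8] & (1u8 << (pos % 8))) != 0
}

fn main() {
    let values = [10, 20, 30, 40];
    assert_eq!(get_i32(&values, 1, 2), 40);

    // Bits 1, 2 and 8 are set.
    let bits = [0b0000_0110u8, 0b0000_0001];
    assert!(get_bool(&bits, 1, 0)); // bit 1
    assert!(!get_bool(&bits, 1, 2)); // bit 3
    assert!(get_bool(&bits, 1, 7)); // bit 8, first bit of the second byte
}
```

Because a bit offset need not be a multiple of 8, a sliced boolean array cannot in general be handled by offsetting a byte pointer, which is one reason the two code paths tend to diverge.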

Dandandan added 3 commits December 5, 2020 13:28

Comparison kernel speedup

32344ef

Make it have same sizing as the bufferbuilder

55f3820

Simplify

62e71a2

github-actions bot added the Component: Rust label Dec 5, 2020

Dandandan changed the title ~~ARROW-10810: [Rust] Speed up comparison kernels~~ ARROW-10810: [Rust] Speed up comparison kernels Dec 5, 2020

apache deleted a comment from github-actions bot Dec 5, 2020

jorgecarleitao approved these changes Dec 5, 2020

jorgecarleitao mentioned this pull request Dec 5, 2020

ARROW-10812: [Rust] Make BooleanArray not a PrimitiveArray #8842

Closed

Dandandan closed this Dec 6, 2020

asfimport mentioned this pull request Dec 15, 2020

[Rust] Speed up comparison kernels #26748

Closed
