@Dandandan Dandandan commented Dec 5, 2020

This PR speeds up the (non-SIMD) comparison kernels by ~8 times (see the benchmark output below).

Together with #8832, it further improves query 12 (~1400 ms -> ~1250 ms):

Query 12 iteration 0 took 1233 ms
Query 12 iteration 1 took 1233 ms
Query 12 iteration 2 took 1235 ms
Query 12 iteration 3 took 1235 ms
Query 12 iteration 4 took 1297 ms
Query 12 iteration 5 took 1246 ms
Query 12 iteration 6 took 1257 ms
Query 12 iteration 7 took 1250 ms
Query 12 iteration 8 took 1265 ms
Query 12 iteration 9 took 1279 ms
eq Float32              time:   [105.96 us 106.01 us 106.06 us]                       
                        change: [-82.463% -82.423% -82.378%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

eq scalar Float32       time:   [61.439 us 61.530 us 61.662 us]                              
                        change: [-88.282% -88.255% -88.221%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

neq Float32             time:   [71.018 us 71.080 us 71.144 us]                        
                        change: [-86.580% -86.563% -86.546%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

neq scalar Float32      time:   [68.706 us 68.773 us 68.838 us]                               
                        change: [-86.207% -86.188% -86.171%] (p = 0.00 < 0.05)
                        Performance has improved.

lt Float32              time:   [70.655 us 70.703 us 70.753 us]                       
                        change: [-85.629% -85.617% -85.604%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

lt scalar Float32       time:   [50.626 us 50.664 us 50.698 us]                               
                        change: [-89.802% -89.764% -89.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

lt_eq Float32           time:   [101.34 us 101.43 us 101.51 us]                          
                        change: [-82.825% -82.797% -82.767%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [68.894 us 68.913 us 68.933 us]                                 
                        change: [-86.575% -86.557% -86.538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild

gt Float32              time:   [71.260 us 71.332 us 71.400 us]                       
                        change: [-87.481% -87.450% -87.418%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

gt scalar Float32       time:   [38.852 us 38.888 us 38.929 us]                               
                        change: [-91.745% -91.733% -91.721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  11 (11.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [99.404 us 99.451 us 99.503 us]                          
                        change: [-80.870% -80.848% -80.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

gt_eq scalar Float32    time:   [55.892 us 55.926 us 55.963 us]                                 
                        change: [-88.783% -88.751% -88.727%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
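
To relate the `change` percentages above to the "~8 times" figure, here is a minimal sketch of the arithmetic. It assumes `change` is the relative change in run time, so the speedup factor is old time divided by new time, i.e. 1 / (1 + change / 100). The function name `speedup_from_change` is hypothetical and not part of the benchmark tooling.

```rust
/// Converts a relative change in run time, given as a percentage
/// (for example -88.255), into a speedup factor old_time / new_time.
/// Assumes new_time = old_time * (1 + change_percent / 100).
fn speedup_from_change(change_percent: f64) -> f64 {
    1.0 / (1.0 + change_percent / 100.0)
}
```

Under that assumption, the reported changes of about -81% to -92% correspond to speedups of roughly 5x to 12x, with `eq scalar Float32` (-88.255%) at about 8.5x.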

@Dandandan Dandandan changed the title ARROW-10810: [Rust] Speed up comparison kernels ARROW-10810: [Rust] Speed up comparison kernels Dec 5, 2020

@apache apache deleted a comment from github-actions bot Dec 5, 2020

@jorgecarleitao jorgecarleitao left a comment

LGTM. Really sweet improvement.

Comment on lines -71 to +91
- let mut result = BooleanBufferBuilder::new($left.len());
+ let byte_capacity = bit_util::ceil($left.len(), 8);
+ let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+ let mut buffer = MutableBuffer::new(actual_capacity);
+ buffer.resize(byte_capacity);

+ let data = buffer.raw_data_mut();
  for i in 0..$left.len() {
-     result.append($op($left.value(i), $right))?;
+     if $op($left.value(i), $right) {
+         unsafe {
+             bit_util::set_bit_raw(data, i);
+         }
+     }

A side note: In general, if we are bypassing builders for speed, it can mean that builders are not a good abstraction, as they significantly impact performance.

Yes, would be nice to see if we can get the combination of a clean/safe API and good performance as well.
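
The trade-off discussed in this thread can be sketched without the arrow crate. The following is a simplified illustration only: `BitBuilder` is a hypothetical stand-in, not arrow's `BooleanBufferBuilder`, and the direct version uses a safe `Vec<u8>` rather than `MutableBuffer` with `bit_util::set_bit_raw`. Both functions produce the same LSB-first bitmap. The builder pays a growth check on every append, while the direct version allocates once and only touches the bits that are set.

```rust
/// Hypothetical, simplified bit builder (not arrow's `BooleanBufferBuilder`).
struct BitBuilder {
    bytes: Vec<u8>,
    len: usize,
}

impl BitBuilder {
    fn new() -> Self {
        BitBuilder { bytes: Vec::new(), len: 0 }
    }

    /// Appends one bit, growing the byte buffer whenever a new byte starts.
    fn append(&mut self, bit: bool) {
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if bit {
            self.bytes[self.len / 8] |= 1u8 << (self.len % 8);
        }
        self.len += 1;
    }

    fn finish(self) -> Vec<u8> {
        self.bytes
    }
}

/// `left < right` for every element, written through the builder.
fn lt_scalar_builder(left: &[f32], right: f32) -> Vec<u8> {
    let mut builder = BitBuilder::new();
    for &v in left {
        builder.append(v < right);
    }
    builder.finish()
}

/// Same result, but the zeroed bitmap is allocated up front and only
/// the bits for `true` results are set.
fn lt_scalar_direct(left: &[f32], right: f32) -> Vec<u8> {
    let mut bytes = vec![0u8; (left.len() + 7) / 8];
    for (i, &v) in left.iter().enumerate() {
        if v < right {
            bytes[i / 8] |= 1u8 << (i % 8);
        }
    }
    bytes
}
```

This sketch makes no claim about the size of the performance difference; the measurements in this PR are the reference for that.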

@Dandandan

Let's close this for now in favor of #8842

@Dandandan Dandandan closed this Dec 6, 2020
jorgecarleitao added a commit that referenced this pull request Dec 11, 2020
This PR creates a new struct, `BooleanArray`, that replaces `PrimitiveArray<BooleanType>`, so that we no longer have to handle the difference between bit-packed and non-bit-packed values inside `PrimitiveArray`.

That difference causes the significant performance degradation described in ARROW-10453 and #8837.

This split in logic already shows up in most of our kernels: the code for byte-width and bit-packed data is almost always different because of how offsets are computed. With this PR, the offset computation in `PrimitiveArray` no longer depends on whether the data is bit-packed.
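
As a rough illustration of why the two layouts need different offset logic, here is a minimal sketch using plain slices instead of arrow's array types. The function names are hypothetical. It assumes Arrow's LSB-first bit numbering for bit-packed booleans.

```rust
/// Byte-width layout: element `i` of a slice that starts at `offset`
/// is found by plain index arithmetic on the typed values.
fn value_i32(values: &[i32], offset: usize, i: usize) -> i32 {
    values[offset + i]
}

/// Bit-packed layout: element `i` is bit `offset + i`, counted LSB-first,
/// so the access splits into a byte index and a bit position in that byte.
fn value_bool(bytes: &[u8], offset: usize, i: usize) -> bool {
    let bit = offset + i;
    (bytes[bit / 8] >> (bit % 8)) & 1 == 1
}
```

An offset that is not a multiple of 8 lands in the middle of a byte for the bit-packed case, which is why generic code had to branch on the layout.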

IMPORTANT: this removes support for Boolean arrays in `UnionArray`, as `UnionArray` currently only supports `PrimitiveType`.

Micro benchmarks (worst to best; negative values are improvements; statistically insignificant results omitted):

| benchmark | variation (%) |
|-------------- | -------------- |
| min nulls 512 | 33.7 |
| record_batches_to_csv | 23.1 |
| array_string_from_vec 256 | 5.6 |
| array_string_from_vec 512 | 5.2 |
| take bool nulls 512 | 4.9 |
| cast int32 to int64 512 | 2.5 |
| equal_512 | 2.3 |
| filter u8 very low selectivity | 2.2 |
| array_slice 512 | 2.1 |
| take bool nulls 1024 | 2.0 |
| cast int64 to int32 512 | 1.6 |
| min 512 | 1.6 |
| take i32 512 | 1.1 |
| add 512 | 1.1 |
| array_slice 2048 | 1.0 |
| length | 1.0 |
| filter u8 low selectivity | 0.9 |
| filter u8 high selectivity | 0.9 |
| array_string_from_vec 128 | 0.9 |
| cast int32 to float64 512 | 0.9 |
| cast timestamp_ms to i64 512 | 0.8 |
| take str null indices 512 | 0.6 |
| sum 512 | 0.4 |
| filter context u8 very low selectivity | -0.7 |
| take i32 1024 | -0.9 |
| filter context f32 very low selectivity | -0.9 |
| cast float64 to float32 512 | -1.0 |
| equal_nulls_512 | -1.0 |
| cast time32s to time32ms 512 | -1.1 |
| sort 2^12 | -1.2 |
| struct_array_from_vec 128 | -1.4 |
| array_from_vec 256 | -1.4 |
| array_from_vec 128 | -1.5 |
| filter context u8 high selectivity | -1.6 |
| limit 512, 512 | -1.7 |
| equal_string_nulls_512 | -1.8 |
| take i32 nulls 1024 | -1.8 |
| struct_array_from_vec 512 | -1.9 |
| filter context f32 high selectivity | -2.0 |
| cast timestamp_ms to timestamp_ns 512 | -2.2 |
| take i32 nulls 512 | -2.3 |
| buffer_bit_ops or | -2.4 |
| array_from_vec 512 | -2.6 |
| cast float64 to uint64 512 | -2.7 |
| take str 512 | -2.8 |
| min nulls string 512 | -3.1 |
| cast int32 to int32 512 | -3.3 |
| array_slice 128 | -3.3 |
| filter context u8 w NULLs very low selectivity | -3.3 |
| buffer_bit_ops and | -3.4 |
| struct_array_from_vec 256 | -4.2 |
| cast int32 to uint32 512 | -4.5 |
| multiply 512 | -5.2 |
| equal_string_512 | -5.5 |
| take str null values null indices 1024 | -6.8 |
| sum nulls 512 | -13.3 |
| add_nulls_512 | -17.6 |
| like_utf8 scalar contains | -17.8 |
| nlike_utf8 scalar contains | -17.9 |
| nlike_utf8 scalar complex | -24.6 |
| like_utf8 scalar complex | -25.2 |
| cast time64ns to time32s 512 | -42.7 |
| cast date64 to date32 512 | -49.1 |
| cast date32 to date64 512 | -50.7 |
| nlike_utf8 scalar starts with | -51.1 |
| nlike_utf8 scalar ends with | -55.1 |
| like_utf8 scalar ends with | -55.5 |
| like_utf8 scalar starts with | -56.3 |
| nlike_utf8 scalar equals | -67.8 |
| like_utf8 scalar equals | -74.2 |
| eq Float32 | -75.7 |
| gt_eq Float32 | -76.1 |
| lt_eq Float32 | -76.5 |
| not | -77.1 |
| and | -78.6 |
| or | -78.7 |
| lt_eq scalar Float32 | -79.4 |
| eq scalar Float32 | -82.1 |
| neq Float32 | -82.1 |
| lt scalar Float32 | -82.1 |
| lt Float32 | -82.3 |
| gt Float32 | -82.4 |
| gt_eq scalar Float32 | -82.4 |
| neq scalar Float32 | -82.6 |
| gt scalar Float32 | -84.7 |

Closes #8842 from jorgecarleitao/boolean

Lead-authored-by: Jorge C. Leitao <[email protected]>
Co-authored-by: Jorge Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>