ARROW-10810: [Rust] Improve comparison kernels performance #8900

Dandandan · 2020-12-12T12:33:50Z

This PR shows that, after #8842, there is still roughly a 2x performance difference (down from ~8x earlier) between using a builder and writing to a mutable buffer directly.
This also accounts for a ~5% difference on some queries in DataFusion (when not using the simd feature; the simd implementation doesn't use the builder). The bounds checks are also somewhat expensive: some value functions explicitly omit them, whereas others (like the one for strings) include them.

I guess there will always be some overhead in the builder, as it needs to do some bookkeeping, but I think it's a good idea to see how we can write kernels without losing too much performance.
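
To make the builder-versus-buffer difference concrete, here is a minimal sketch in plain Rust. It does not use the arrow crate: `BoolBuilder`, `lt_with_builder`, and `lt_with_buffer` are hypothetical names invented for this illustration, and the real kernels may be structured differently. It only shows where the per-element bookkeeping comes from.

```rust
/// Hypothetical, simplified stand-in for a boolean builder (not the arrow
/// API): every `append` does bookkeeping before it sets a bit.
pub struct BoolBuilder {
    bytes: Vec<u8>,
    len: usize,
}

impl BoolBuilder {
    pub fn new() -> Self {
        BoolBuilder { bytes: Vec::new(), len: 0 }
    }

    pub fn append(&mut self, v: bool) {
        // Per-element bookkeeping: grow the byte vector when a new byte starts.
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if v {
            self.bytes[self.len / 8] |= 1 << (self.len % 8);
        }
        self.len += 1;
    }

    pub fn finish(self) -> Vec<u8> {
        self.bytes
    }
}

/// Comparison through the builder: one `append` call per element.
pub fn lt_with_builder(left: &[f32], right: &[f32]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let mut builder = BoolBuilder::new();
    for i in 0..left.len() {
        builder.append(left[i] < right[i]);
    }
    builder.finish()
}

/// Comparison writing into a zeroed buffer that is allocated once up front;
/// only the bits for `true` results are touched.
pub fn lt_with_buffer(left: &[f32], right: &[f32]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let mut bytes = vec![0u8; (left.len() + 7) / 8];
    for i in 0..left.len() {
        if left[i] < right[i] {
            bytes[i / 8] |= 1 << (i % 8);
        }
    }
    bytes
}
```

Both functions produce the same least-significant-bit-first bitmap; the second simply avoids the growth check and length update on every element.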

FYI @jorgecarleitao

Gnuplot not found, using plotters backend
eq Float32              time:   [107.02 us 107.29 us 107.60 us]                       
                        change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [70.271 us 70.356 us 70.446 us]                              
                        change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

neq Float32             time:   [71.580 us 71.655 us 71.732 us]                        
                        change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

neq scalar Float32      time:   [70.011 us 70.079 us 70.155 us]                               
                        change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild

lt Float32              time:   [70.945 us 70.991 us 71.038 us]                       
                        change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [50.708 us 50.789 us 50.882 us]                               
                        change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

lt_eq Float32           time:   [106.29 us 106.40 us 106.52 us]                          
                        change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [71.089 us 71.170 us 71.261 us]                                 
                        change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [71.759 us 71.939 us 72.131 us]                       
                        change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

gt scalar Float32       time:   [38.748 us 38.782 us 38.821 us]                               
                        change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [102.79 us 102.87 us 102.96 us]                          
                        change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

gt_eq scalar Float32    time:   [55.034 us 55.109 us 55.201 us]                                 
                        change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

github-actions · 2020-12-12T12:38:39Z

https://issues.apache.org/jira/browse/ARROW-10810

codecov-io · 2020-12-12T12:40:40Z

Codecov Report

Merging #8900 (23c8ff2) into master (1378c20) will increase coverage by 22.81%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #8900       +/-   ##
===========================================
+ Coverage   53.96%   76.77%   +22.81%     
===========================================
  Files         170      181       +11     
  Lines       30707    41009    +10302     
===========================================
+ Hits        16571    31485    +14914     
+ Misses      14136     9524     -4612

Impacted Files	Coverage Δ
rust/arrow/src/compute/kernels/comparison.rs	`96.27% <ø> (ø)`
rust/arrow/src/ipc/writer.rs	`83.65% <0.00%> (-3.44%)`	⬇️
rust/arrow/src/json/reader.rs	`83.17% <0.00%> (-0.09%)`	⬇️
rust/arrow/src/buffer.rs	`97.96% <0.00%> (ø)`
rust/arrow/src/datatypes.rs	`76.85% <0.00%> (ø)`
rust/arrow-flight/src/utils.rs	`0.00% <0.00%> (ø)`
rust/datafusion/examples/flight_server.rs	`0.00% <0.00%> (ø)`
rust/arrow/src/compute/kernels/arithmetic.rs	`98.45% <0.00%> (ø)`
...ion-testing/src/bin/arrow-json-integration-test.rs	`0.00% <0.00%> (ø)`
...ntegration-testing/src/bin/arrow-stream-to-file.rs	`0.00% <0.00%> (ø)`
... and 60 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1378c20...23c8ff2. Read the comment docs.

andygrove · 2020-12-12T18:53:12Z

rust/arrow/src/array/array_binary.rs

    /// Returns the element at index `i` as a byte slice.
    pub fn value(&self, i: usize) -> &[u8] {
-        assert!(
+        debug_assert!(


Does this change mean that we should mark the function as unsafe? Should we provide safe and unsafe versions of these functions?

I think we should think more about safe/unsafe and be clearer about it. I will move this change out of this PR for now.
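
One way to read the "safe and unsafe versions" question is sketched below. This is an assumption-laden illustration, not the arrow implementation: `BinaryValues` and `value_unchecked` are hypothetical names, and the layout (an offsets vector plus a data vector) is simplified. The point is that swapping `assert!` for `debug_assert!` removes the check in release builds, so the unchecked variant has to be an `unsafe fn` with a documented contract.

```rust
/// Hypothetical, simplified binary array (not the arrow type): `offsets` has
/// one more entry than there are elements, and element `i` spans
/// `offsets[i]..offsets[i + 1]` in `data`.
pub struct BinaryValues {
    offsets: Vec<usize>,
    data: Vec<u8>,
}

impl BinaryValues {
    pub fn new(items: &[&[u8]]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for item in items {
            data.extend_from_slice(item);
            offsets.push(data.len());
        }
        BinaryValues { offsets, data }
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Safe accessor: panics on an out-of-bounds index in every build profile.
    pub fn value(&self, i: usize) -> &[u8] {
        assert!(i < self.len(), "index {} out of bounds", i);
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Unchecked accessor.
    ///
    /// # Safety
    /// The caller must guarantee `i < self.len()`. The `debug_assert!` only
    /// fires in debug builds; in release builds a bad index is undefined
    /// behavior.
    pub unsafe fn value_unchecked(&self, i: usize) -> &[u8] {
        debug_assert!(i < self.len());
        unsafe {
            let start = *self.offsets.get_unchecked(i);
            let end = *self.offsets.get_unchecked(i + 1);
            self.data.get_unchecked(start..end)
        }
    }
}
```

With this split, callers that have already validated their indices (for example a loop over `0..len()`) can opt into the unchecked path explicitly, while everyone else keeps the panicking behavior.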

andygrove · 2020-12-12T18:54:21Z

rust/arrow/src/compute/kernels/comparison.rs

        for i in 0..$left.len() {
-            result.append($op($left.value(i), $right.value(i)))?;
+            if $op($left.value(i), $right.value(i)) {
+                unsafe {


Functions that use this macro should now be declared unsafe?

I will revert the change with the asserts for now. I think the value functions that don't perform bounds checks, and other functions that can trigger UB, should be marked unsafe. The macros themselves should be "relatively" safe, I think, as long as the .len() value is correct.
I think a better solution for the future would be a safe iterator that doesn't do bounds checking, so it's better to move that particular change out of this PR for now.
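
As a rough illustration of the "safe iterator that doesn't do bounds checking" idea, the sketch below compares an indexed loop with an iterator-based one over plain slices. The function names are hypothetical and nothing here is arrow API; it only shows that iterating with `zip` needs no index, so there is no per-element index check to perform and no `unsafe` is required.

```rust
/// Indexed loop: each `left[i]` / `right[i]` is a checked access unless the
/// optimizer can prove the index is in range.
pub fn neq_indexed(left: &[i32], right: &[i32]) -> Vec<bool> {
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    for i in 0..left.len() {
        out.push(left[i] != right[i]);
    }
    out
}

/// Iterator-based loop: the slice iterators walk both inputs in lockstep,
/// so no indexing (and no index check) appears in the loop body.
pub fn neq_zipped(left: &[i32], right: &[i32]) -> Vec<bool> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right.iter()).map(|(l, r)| l != r).collect()
}
```

Whether the indexed version is actually slower depends on what the optimizer can prove, so any difference would need to be measured.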

I agree, @Dandandan . Note that all primitives and strings have iterators and FromIterator: https://github.com/apache/arrow/blob/master/rust/arrow/src/array/iterator.rs , but they are for Option<T>, not T.

I agree with you that we should mark that fn value as unsafe and offer an iterator over T (besides the one over Option<T>). That UB is really obvious, and it is also a security vulnerability: it can cause an escalation of privileges, as out-of-bounds accesses allow privileged access to the application's memory.

I usually see the iterator over T used when the kernel can mask the result or OR / AND the null bitmaps, while Option<T> is used when that is not possible / useful.
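
The distinction between iterating over `Option<T>` and iterating over `T` with the null bitmaps combined separately can be sketched as follows. This is a simplified illustration using plain slices and byte-packed validity bitmaps (bit set = valid); `lt_options` and `lt_values` are hypothetical names, not functions from the arrow crate.

```rust
/// Option-based path: every element carries its own validity, so each step
/// branches on whether both sides are present.
pub fn lt_options(left: &[Option<i32>], right: &[Option<i32>]) -> Vec<Option<bool>> {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(a < b),
            _ => None,
        })
        .collect()
}

/// Value-based path: compare all slots unconditionally (slots that are null
/// hold arbitrary values), then AND the two validity bitmaps byte by byte so
/// that a result is valid only where both inputs were valid.
pub fn lt_values(
    left: &[i32],
    left_valid: &[u8],
    right: &[i32],
    right_valid: &[u8],
) -> (Vec<bool>, Vec<u8>) {
    let values: Vec<bool> = left.iter().zip(right.iter()).map(|(a, b)| a < b).collect();
    let valid: Vec<u8> = left_valid
        .iter()
        .zip(right_valid.iter())
        .map(|(a, b)| a & b)
        .collect();
    (values, valid)
}
```

In the value-based version the comparison loop has no validity branch at all; the null handling is reduced to one bitwise AND per eight elements.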

Dandandan · 2020-12-12T20:29:19Z

I will revert the assert changes tomorrow and create a ticket for iterators over T that avoid bounds checking.

Dandandan · 2020-12-13T08:40:00Z

@andygrove @jorgecarleitao
Reverted the assert changes and created
https://issues.apache.org/jira/browse/ARROW-10892 for the iterator.

alamb · 2020-12-13T12:07:47Z

FWIW I tested this branch with a local micro benchmark I had and it shows significant improvement. I wrote up some details here: https://docs.google.com/document/d/15DRpIr1EUqo7zVR1psBhoX95ML1PO55kkHwt2EasM88/edit#heading=h.pfa677dojqtv

Backstory was I was looking at the performance of these kernels (inspired by @rdettai ) as utf8_neq_scalar came up in a talk I was giving last week. What a wonderful surprise that there is a PR actively improving them (and it happens to also fit the narrative of my talk well!)

Dandandan · 2020-12-13T15:35:56Z

Cool to read, @alamb, that we are getting closer to "optimal" performance on a single thread. The PRs I create are also based on profiling DataFusion, which shows that the changes bring "real world" improvements as well as good benchmark results. I hope they are useful for influxdb as well.

The docs / presentation look very interesting; was the talk recorded?

As mentioned, for some datatypes (like strings) there is still some overhead related to the assert!s (I reverted that change for now). The bitmap code could probably also be improved, which could close the gap. I think it is cool to compare it against a "native" Rust implementation.

Dandandan · 2020-12-13T15:39:23Z

Ah just saw the link to the YouTube video, thanks @alamb

Dandandan · 2020-12-13T16:35:36Z

@alamb

I took it a bit further, and the assert! has the same effect on your benchmark:

commit 23c8ff2

Hello, world!
example_with_vec
created array with 20000000 elements in 625.670877ms
Completed finding bitset: 20000000 elements in 62.028751ms
Completed finding bitset: 20000000 elements in 61.464365ms
Completed finding bitset: 20000000 elements in 59.990842ms
Completed finding bitset: 20000000 elements in 56.594087ms
Completed finding bitset: 20000000 elements in 55.969562ms
Completed finding bitset: 20000000 elements in 56.594535ms
Completed finding bitset: 20000000 elements in 55.987112ms
Completed finding bitset: 20000000 elements in 56.379664ms
Completed finding bitset: 20000000 elements in 56.47164ms
Completed finding bitset: 20000000 elements in 56.342637ms
example_with_arrow
created array with 20000000 elements in 773.245663ms
Found 20000000 not in west in 72.449124ms
Found 20000000 not in west in 74.481405ms
Found 20000000 not in west in 75.160204ms
Found 20000000 not in west in 75.033379ms
Found 20000000 not in west in 75.064686ms
Found 20000000 not in west in 75.221221ms
Found 20000000 not in west in 75.979403ms
Found 20000000 not in west in 76.288346ms
Found 20000000 not in west in 74.639085ms
Found 20000000 not in west in 73.664288ms

commit 3b67f70 :

Hello, world!
example_with_vec
created array with 20000000 elements in 550.194933ms
Completed finding bitset: 20000000 elements in 58.936386ms
Completed finding bitset: 20000000 elements in 57.900239ms
Completed finding bitset: 20000000 elements in 54.868759ms
Completed finding bitset: 20000000 elements in 52.913171ms
Completed finding bitset: 20000000 elements in 52.707473ms
Completed finding bitset: 20000000 elements in 53.600913ms
Completed finding bitset: 20000000 elements in 53.315755ms
Completed finding bitset: 20000000 elements in 53.188747ms
Completed finding bitset: 20000000 elements in 52.896293ms
Completed finding bitset: 20000000 elements in 54.134426ms
example_with_arrow
created array with 20000000 elements in 666.080968ms
Found 20000000 not in west in 58.209991ms
Found 20000000 not in west in 58.496839ms
Found 20000000 not in west in 58.294964ms
Found 20000000 not in west in 57.771822ms
Found 20000000 not in west in 58.266169ms
Found 20000000 not in west in 58.028011ms
Found 20000000 not in west in 58.431954ms
Found 20000000 not in west in 58.545879ms
Found 20000000 not in west in 58.131794ms
Found 20000000 not in west in 59.211415ms

alamb

@andygrove and @jorgecarleitao -- I think this PR is looking ready to merge. Are you satisfied with the resolution of the unsafe discussion?

Dandandan added 3 commits December 12, 2020 13:02

Some minor perf tweaks

9e38fe9

Also do for scalar

c5749d9

Fix buffer code

c64e6f0

github-actions bot added the Component: Rust label Dec 12, 2020

fmt

3b67f70

Dandandan changed the title ~~ARROW-10810: Comparison kernels performance~~ ARROW-10810: [Rust] Comparison kernels performance Dec 12, 2020

andygrove reviewed Dec 12, 2020

Revert debug asserts, add comment about safety

23c8ff2

alamb approved these changes Dec 13, 2020

alamb changed the title ~~ARROW-10810: [Rust] Comparison kernels performance~~ ARROW-10810: [Rust] Improve comparison kernels performance Dec 13, 2020

alamb approved these changes Dec 14, 2020

Add safety comment for scalar kenel as well

ef3bffa

jorgecarleitao approved these changes Dec 15, 2020

nevi-me closed this in 408e5be Dec 15, 2020

Dandandan mentioned this pull request Dec 15, 2020

ARROW-10889: [Rust] [Proposal] Add guidelines about usage of unsafe #8901

Closed

asfimport mentioned this pull request Dec 15, 2020

[Rust] Speed up comparison kernels #26748

Closed
