Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Dec 12, 2020

This PR shows that there is still about a ~2x performance (compared to ~8x earlier) difference between using a builder vs using a mutable buffer directly after #8842 .
This also accounts for a ~5% difference on some queries in DataFusion (when not using the simd feature, where the implementation doesn't use the builder). Also the bounds checks are a bit expensive. In some value functions they are explicitly not there whereas in other (like for string) they are there.

I guess there will be always some overhead in the builder as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels while not losing too much performance.

FYI @jorgecarleitao

Gnuplot not found, using plotters backend
eq Float32              time:   [107.02 us 107.29 us 107.60 us]                       
                        change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [70.271 us 70.356 us 70.446 us]                              
                        change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

neq Float32             time:   [71.580 us 71.655 us 71.732 us]                        
                        change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

neq scalar Float32      time:   [70.011 us 70.079 us 70.155 us]                               
                        change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  3 (3.00%) high mild

lt Float32              time:   [70.945 us 70.991 us 71.038 us]                       
                        change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [50.708 us 50.789 us 50.882 us]                               
                        change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

lt_eq Float32           time:   [106.29 us 106.40 us 106.52 us]                          
                        change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

lt_eq scalar Float32    time:   [71.089 us 71.170 us 71.261 us]                                 
                        change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [71.759 us 71.939 us 72.131 us]                       
                        change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

gt scalar Float32       time:   [38.748 us 38.782 us 38.821 us]                               
                        change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

gt_eq Float32           time:   [102.79 us 102.87 us 102.96 us]                          
                        change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe

gt_eq scalar Float32    time:   [55.034 us 55.109 us 55.201 us]                                 
                        change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Dec 12, 2020

Codecov Report

Merging #8900 (23c8ff2) into master (1378c20) will increase coverage by 22.81%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #8900       +/-   ##
===========================================
+ Coverage   53.96%   76.77%   +22.81%     
===========================================
  Files         170      181       +11     
  Lines       30707    41009    +10302     
===========================================
+ Hits        16571    31485    +14914     
+ Misses      14136     9524     -4612     
Impacted Files Coverage Δ
rust/arrow/src/compute/kernels/comparison.rs 96.27% <ø> (ø)
rust/arrow/src/ipc/writer.rs 83.65% <0.00%> (-3.44%) ⬇️
rust/arrow/src/json/reader.rs 83.17% <0.00%> (-0.09%) ⬇️
rust/arrow/src/buffer.rs 97.96% <0.00%> (ø)
rust/arrow/src/datatypes.rs 76.85% <0.00%> (ø)
rust/arrow-flight/src/utils.rs 0.00% <0.00%> (ø)
rust/datafusion/examples/flight_server.rs 0.00% <0.00%> (ø)
rust/arrow/src/compute/kernels/arithmetic.rs 98.45% <0.00%> (ø)
...ion-testing/src/bin/arrow-json-integration-test.rs 0.00% <0.00%> (ø)
...ntegration-testing/src/bin/arrow-stream-to-file.rs 0.00% <0.00%> (ø)
... and 60 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1378c20...23c8ff2. Read the comment docs.

@Dandandan Dandandan changed the title ARROW-10810: Comparison kernels performance ARROW-10810: [Rust] Comparison kernels performance Dec 12, 2020
/// Returns the element at index `i` as a byte slice.
pub fn value(&self, i: usize) -> &[u8] {
assert!(
debug_assert!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change mean that we should mark the function as unsafe? Should we provide safe and unsafe versions of these functions?

Copy link
Contributor Author

@Dandandan Dandandan Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can think more about safe/unsafe and being more clear about it. I will move this change from this PR for now.

for i in 0..$left.len() {
result.append($op($left.value(i), $right.value(i)))?;
if $op($left.value(i), $right.value(i)) {
unsafe {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions that use this macro should now be declared unsafe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will revert the change with the asserts for now. I think the value functions should be marked unsafe that don't perform bound checks, and other functions that can trigger UB. The macro's themselves should be "relatively" safe I think, as long as the .len() value is correct.
I think a better solution for the future would be to have a safe iterator that doesn't do bound checking, so I think it's better to move the particular change out of this PR for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, @Dandandan . Note that all primitives and strings have iterators and FromIterator: https://github.com/apache/arrow/blob/master/rust/arrow/src/array/iterator.rs , but they are for Option<T>, not T.

I agree with you that we should mark that fn value as unsafe and offer an iterator over T (besides the one over Option<T>). That UB is really obvious and it is also a security vulnerability causing an escalation of privileges as it allows privileged access to the application's memory via out of bounds accesses.

I usually see the iterator over T when they can mask the result or OR / AND the null bitmaps, while Option<T> is used when that is not possible / useful.

@Dandandan
Copy link
Contributor Author

Will revert the assert changes tomorrow, will create a ticket for the iterators over T to avoid bounds checking.

@Dandandan
Copy link
Contributor Author

@andygrove @jorgecarleitao
Reverted the assert changes and created
https://issues.apache.org/jira/browse/ARROW-10892 for the iterator.

@alamb
Copy link
Contributor

alamb commented Dec 13, 2020

FWIW I tested this branch with a local micro benchmark I had and it shows significant improvement. I wrote up some details here: https://docs.google.com/document/d/15DRpIr1EUqo7zVR1psBhoX95ML1PO55kkHwt2EasM88/edit#heading=h.pfa677dojqtv

Backstory was I was looking at the performance of these kernels (inspired by @rdettai ) as utf8_neq_scalar came up in a talk I was giving last week. What a wonderful surprise that there is a PR actively improving them (and it happens to also fit the narrative of my talk well!)

@alamb alamb changed the title ARROW-10810: [Rust] Comparison kernels performance ARROW-10810: [Rust] Improve comparison kernels performance Dec 13, 2020
@Dandandan
Copy link
Contributor Author

Dandandan commented Dec 13, 2020

Cool to know / read @alamb to see that we are getting closer to "optimal" performance on a single thread. The PRs I create are also based on profiling DataFusion, showing that it has "real world" improvement on those changes as well besides getting good result on the benchmarks. I hope they are useful for influxdb as well.

The docs / presentation are looking very interesting, it probably isn't recorded?

As mentioned there is for some datatypes (like strings) there is some overhead still related to the assert!s that I reverted for now. Probably the bitmap code also could be improved which could close the gap. I think it is cool to compare it against a "native" rust implementation.

@Dandandan
Copy link
Contributor Author

Ah just saw the link to the YouTube video, thanks @alamb

@Dandandan
Copy link
Contributor Author

@alamb

Took it a bit further and the assert! is also having the same effect on your benchmark:

commit 23c8ff2

Hello, world!
example_with_vec
created array with 20000000 elements in 625.670877ms
Completed finding bitset: 20000000 elements in 62.028751ms
Completed finding bitset: 20000000 elements in 61.464365ms
Completed finding bitset: 20000000 elements in 59.990842ms
Completed finding bitset: 20000000 elements in 56.594087ms
Completed finding bitset: 20000000 elements in 55.969562ms
Completed finding bitset: 20000000 elements in 56.594535ms
Completed finding bitset: 20000000 elements in 55.987112ms
Completed finding bitset: 20000000 elements in 56.379664ms
Completed finding bitset: 20000000 elements in 56.47164ms
Completed finding bitset: 20000000 elements in 56.342637ms
example_with_arrow
created array with 20000000 elements in 773.245663ms
Found 20000000 not in west in 72.449124ms
Found 20000000 not in west in 74.481405ms
Found 20000000 not in west in 75.160204ms
Found 20000000 not in west in 75.033379ms
Found 20000000 not in west in 75.064686ms
Found 20000000 not in west in 75.221221ms
Found 20000000 not in west in 75.979403ms
Found 20000000 not in west in 76.288346ms
Found 20000000 not in west in 74.639085ms
Found 20000000 not in west in 73.664288ms

commit 3b67f70 :

Hello, world!
example_with_vec
created array with 20000000 elements in 550.194933ms
Completed finding bitset: 20000000 elements in 58.936386ms
Completed finding bitset: 20000000 elements in 57.900239ms
Completed finding bitset: 20000000 elements in 54.868759ms
Completed finding bitset: 20000000 elements in 52.913171ms
Completed finding bitset: 20000000 elements in 52.707473ms
Completed finding bitset: 20000000 elements in 53.600913ms
Completed finding bitset: 20000000 elements in 53.315755ms
Completed finding bitset: 20000000 elements in 53.188747ms
Completed finding bitset: 20000000 elements in 52.896293ms
Completed finding bitset: 20000000 elements in 54.134426ms
example_with_arrow
created array with 20000000 elements in 666.080968ms
Found 20000000 not in west in 58.209991ms
Found 20000000 not in west in 58.496839ms
Found 20000000 not in west in 58.294964ms
Found 20000000 not in west in 57.771822ms
Found 20000000 not in west in 58.266169ms
Found 20000000 not in west in 58.028011ms
Found 20000000 not in west in 58.431954ms
Found 20000000 not in west in 58.545879ms
Found 20000000 not in west in 58.131794ms
Found 20000000 not in west in 59.211415ms

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove and @jorgecarleitao -- I think this PR is looking ready to merge. Are you satisfied with the resolution of the unsafe discussion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants