
@vertexclique vertexclique commented Nov 15, 2020

  • Up to 95% improvement on many kernels.
  • Implements safe bit operations for Arrow
  • Implements typed_bits to get bits as a Vec
  • Implements various bit operations for use with Arrow arrays
  • Adjusts parquet array reader to use Arrow bit operations
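The bit operations above work over Arrow's bit-packed buffers, which store 8 boolean slots per byte. A minimal sketch of the idea (the function names and signatures here are illustrative assumptions, not the PR's actual API):

```rust
// Hypothetical sketch: combine two bit-packed buffers byte-by-byte,
// without unsafe pointer arithmetic.
fn bitwise_and(left: &[u8], right: &[u8]) -> Vec<u8> {
    left.iter().zip(right.iter()).map(|(l, r)| l & r).collect()
}

// Unpack a bit-packed buffer into a Vec<bool>, one entry per logical slot
// (LSB-first within each byte, matching Arrow's bit layout).
fn typed_bits(buf: &[u8], len: usize) -> Vec<bool> {
    (0..len).map(|i| buf[i / 8] & (1 << (i % 8)) != 0).collect()
}

fn main() {
    let a = vec![0b1010_1010u8];
    let b = vec![0b1100_1100u8];
    let anded = bitwise_and(&a, &b);
    assert_eq!(anded, vec![0b1000_1000]);
    // Slots 3 and 7 remain set after the AND.
    assert_eq!(
        typed_bits(&anded, 8),
        vec![false, false, false, true, false, false, false, true]
    );
}
```

Operating a whole byte (8 slots) per iteration is what drives the speedups on the kernels listed below; the regressions come from paths that still unpack bit-by-bit.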

I squashed all my work into a single commit and am closing PR #8598 with this comment: #8598 (comment)

Benchmarks

sum 2^20                time:   [160.84 us 161.83 us 163.10 us]                     
                        change: [-88.535% -88.450% -88.365%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking min 2^20: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
min 2^20                time:   [1.4026 ms 1.4027 ms 1.4028 ms]                      
                        change: [-0.0516% -0.0286% -0.0098%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  8 (8.00%) high mild
  3 (3.00%) high severe

sum nulls 2^20          time:   [783.74 us 797.75 us 812.02 us]                           
                        change: [-80.849% -80.392% -79.903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

min nulls 2^20          time:   [12.366 ms 12.375 ms 12.389 ms]                           
                        change: [+24.048% +24.202% +24.381%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

min string 2^20         time:   [9.9124 ms 9.9145 ms 9.9174 ms]                            
                        change: [+1.9874% +2.0847% +2.1694%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 22 outliers among 100 measurements (22.00%)
  2 (2.00%) low severe
  5 (5.00%) low mild
  8 (8.00%) high mild
  7 (7.00%) high severe

min nulls string 2^20   time:   [19.348 ms 19.352 ms 19.357 ms]                                  
                        change: [+3.4758% +3.6867% +3.8658%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/arithmetic_kernels-ce51190c627182f2
add 2^20                time:   [971.06 us 972.57 us 974.19 us]                     
                        change: [-58.352% -58.310% -58.264%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild

subtract 2^20           time:   [970.69 us 972.03 us 973.30 us]                          
                        change: [-58.416% -58.371% -58.327%] (p = 0.00 < 0.05)
                        Performance has improved.

multiply 2^20           time:   [977.12 us 978.10 us 979.03 us]                          
                        change: [-58.449% -58.374% -58.308%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking divide 2^20: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
divide 2^20             time:   [1.7273 ms 1.7275 ms 1.7278 ms]                         
                        change: [-53.618% -53.550% -53.507%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  12 (12.00%) high mild
  2 (2.00%) high severe

limit 2^20, 512         time:   [114.62 ns 114.65 ns 114.69 ns]                            
                        change: [+4.5154% +4.6662% +4.7672%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  1 (1.00%) high severe

add_nulls_2^20          time:   [980.77 us 982.49 us 984.18 us]                           
                        change: [-58.295% -58.224% -58.159%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking divide_nulls_2^20: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
divide_nulls_2^20       time:   [1.7293 ms 1.7295 ms 1.7297 ms]                               
                        change: [-54.156% -54.132% -54.101%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe


     Running /home/vertexclique/projects/arrow/rust/target/release/deps/array_from_vec-45cab0a3ecf39249
array_from_vec 128      time:   [484.23 ns 484.68 ns 485.12 ns]                               
                        change: [+2.1162% +2.2409% +2.3687%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

array_from_vec 256      time:   [712.61 ns 713.50 ns 714.36 ns]                                
                        change: [+2.1834% +2.3743% +2.5626%] (p = 0.00 < 0.05)
                        Performance has regressed.

array_from_vec 512      time:   [1.1879 us 1.1892 us 1.1906 us]                                
                        change: [-4.2637% -4.1494% -4.0486%] (p = 0.00 < 0.05)
                        Performance has improved.

array_string_from_vec 128                                                                             
                        time:   [2.2701 us 2.2709 us 2.2716 us]
                        change: [+6.6458% +6.7030% +6.7594%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

array_string_from_vec 256                                                                             
                        time:   [3.6563 us 3.6570 us 3.6577 us]
                        change: [+5.7356% +5.7737% +5.8114%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

array_string_from_vec 512                                                                             
                        time:   [6.1225 us 6.1246 us 6.1268 us]
                        change: [+5.6870% +5.7640% +5.8419%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 128                                                                            
                        time:   [117.76 us 118.15 us 118.53 us]
                        change: [+3455.9% +3472.4% +3491.6%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 256                                                                            
                        time:   [105.76 us 106.43 us 107.21 us]
                        change: [+2015.8% +2029.9% +2045.7%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [9.0257 us 9.0293 us 9.0328 us]
                        change: [+23.421% +23.496% +23.570%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [27.670 us 27.759 us 27.859 us]
                        change: [+134.39% +135.24% +136.13%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/array_slice-49615c84b19508ee
array_slice 128         time:   [163.62 ns 163.71 ns 163.80 ns]                            
                        change: [+28.932% +28.988% +29.037%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe

array_slice 512         time:   [265.55 ns 265.63 ns 265.71 ns]                            
                        change: [+98.289% +98.369% +98.439%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

array_slice 2048        time:   [649.40 ns 650.84 ns 652.87 ns]                              
                        change: [+319.75% +320.19% +320.76%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/boolean_kernels-6fc1ad233dfa3072
and                     time:   [34.437 us 34.447 us 34.457 us]                 
                        change: [+39.672% +39.729% +39.788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

or                      time:   [34.540 us 34.550 us 34.560 us]                
                        change: [+39.274% +39.429% +39.552%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

not                     time:   [17.377 us 17.386 us 17.393 us]                 
                        change: [+38.007% +38.289% +38.681%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/buffer_bit_ops-7f2013fd9e28372f
buffer_bit_ops and      time:   [28.631 us 28.749 us 28.893 us]                                
                        change: [+4064.1% +4101.4% +4148.8%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/builder-b9aacab6b54f093d
Benchmarking bench_primitive: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.7s, enable flat sampling, or reduce sample count to 50.
bench_primitive         time:   [1.5190 ms 1.5193 ms 1.5195 ms]                             
                        thrpt:  [2.5707 GiB/s 2.5711 GiB/s 2.5715 GiB/s]
                 change:
                        time:   [+20.305% +20.339% +20.372%] (p = 0.00 < 0.05)
                        thrpt:  [-16.924% -16.902% -16.878%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

bench_bool              time:   [3.7009 ms 3.7045 ms 3.7080 ms]                        
                        thrpt:  [134.84 MiB/s 134.97 MiB/s 135.10 MiB/s]
                 change:
                        time:   [+17.329% +17.443% +17.585%] (p = 0.00 < 0.05)
                        thrpt:  [-14.955% -14.853% -14.770%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/cast_kernels-93bbf3677ef76b84
cast int32 to int32 512 time:   [29.420 ns 29.421 ns 29.423 ns]                                     
                        change: [-0.5265% -0.4906% -0.4550%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

cast int32 to uint32 512                                                                             
                        time:   [12.053 us 12.055 us 12.057 us]
                        change: [+126.57% +126.69% +126.87%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

cast int32 to float32 512                                                                             
                        time:   [12.306 us 12.309 us 12.312 us]
                        change: [+115.72% +115.86% +115.97%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast int32 to float64 512                                                                             
                        time:   [12.389 us 12.391 us 12.393 us]
                        change: [+133.82% +133.93% +134.04%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

cast int32 to int64 512 time:   [12.402 us 12.405 us 12.407 us]                                     
                        change: [+126.67% +126.81% +126.92%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

cast float32 to int32 512                                                                             
                        time:   [12.765 us 12.767 us 12.768 us]
                        change: [+114.01% +114.11% +114.20%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild

cast float64 to float32 512                                                                             
                        time:   [12.522 us 12.524 us 12.526 us]
                        change: [+133.22% +133.35% +133.47%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

cast float64 to uint64 512                                                                             
                        time:   [13.187 us 13.191 us 13.194 us]
                        change: [+100.11% +100.20% +100.30%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

cast int64 to int32 512 time:   [11.755 us 11.758 us 11.761 us]                                     
                        change: [+122.67% +122.77% +122.87%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast date64 to date32 512                                                                             
                        time:   [23.245 us 23.248 us 23.252 us]
                        change: [+37.888% +37.929% +37.969%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

cast date32 to date64 512                                                                             
                        time:   [22.948 us 22.951 us 22.954 us]
                        change: [+35.232% +35.391% +35.493%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

cast time32s to time32ms 512                                                                             
                        time:   [1.1747 us 1.1750 us 1.1753 us]
                        change: [-34.441% -34.422% -34.402%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

cast time32s to time64us 512                                                                             
                        time:   [13.960 us 13.961 us 13.962 us]
                        change: [+79.328% +79.399% +79.468%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

cast time64ns to time32s 512                                                                             
                        time:   [25.761 us 25.765 us 25.768 us]
                        change: [+19.682% +19.723% +19.766%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

cast timestamp_ns to timestamp_s 512                                                                             
                        time:   [29.816 ns 29.825 ns 29.834 ns]
                        change: [+0.3917% +0.4310% +0.4744%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

cast timestamp_ms to timestamp_ns 512                                                                             
                        time:   [1.5295 us 1.5299 us 1.5304 us]
                        change: [-17.402% -17.337% -17.272%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

cast timestamp_ms to i64 512                                                                            
                        time:   [176.67 ns 176.84 ns 177.19 ns]
                        change: [-1.7586% -1.6436% -1.4924%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/comparison_kernels-f8939ec12975f45e
eq Float32              time:   [36.413 us 36.429 us 36.447 us]                        
                        change: [-95.357% -95.353% -95.349%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

eq scalar Float32       time:   [33.540 us 33.551 us 33.562 us]                               
                        change: [-94.698% -94.690% -94.684%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

neq Float32             time:   [36.756 us 36.768 us 36.781 us]                         
                        change: [-94.049% -94.047% -94.045%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

neq scalar Float32      time:   [30.706 us 30.718 us 30.734 us]                                
                        change: [-95.092% -95.089% -95.086%] (p = 0.00 < 0.05)
                        Performance has improved.

lt Float32              time:   [36.489 us 36.498 us 36.509 us]                        
                        change: [-94.514% -94.504% -94.495%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

lt scalar Float32       time:   [30.855 us 30.871 us 30.892 us]                               
                        change: [-94.996% -94.993% -94.990%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

lt_eq Float32           time:   [36.478 us 36.492 us 36.508 us]                           
                        change: [-94.700% -94.697% -94.695%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

lt_eq scalar Float32    time:   [32.641 us 32.653 us 32.668 us]                                  
                        change: [-95.305% -95.299% -95.292%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [36.647 us 36.658 us 36.672 us]                        
                        change: [-94.121% -94.119% -94.116%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

gt scalar Float32       time:   [36.541 us 36.562 us 36.583 us]                               
                        change: [-94.288% -94.280% -94.273%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

gt_eq Float32           time:   [36.510 us 36.524 us 36.540 us]                           
                        change: [-95.402% -95.396% -95.390%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

gt_eq scalar Float32    time:   [32.121 us 32.141 us 32.163 us]                                  
                        change: [-94.877% -94.875% -94.873%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  12 (12.00%) low mild
  1 (1.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/csv_writer-25ed34d22309b4ac
record_batches_to_csv   time:   [80.056 us 80.180 us 80.318 us]                                  
                        change: [+0.3760% +0.6016% +0.8372%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/equal-df2ef777e50b709e
equal_512               time:   [42.140 ns 42.144 ns 42.150 ns]                       
                        change: [-0.2711% -0.2454% -0.2186%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

equal_nulls_512         time:   [5.1449 us 5.1573 us 5.1706 us]                             
                        change: [+44.552% +44.886% +45.227%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

equal_string_512        time:   [61.514 ns 61.519 ns 61.525 ns]                             
                        change: [+0.4486% +0.5006% +0.5374%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  5 (5.00%) high mild
  7 (7.00%) high severe

equal_string_nulls_512  time:   [5.6210 us 5.6224 us 5.6246 us]                                    
                        change: [+21.881% +21.981% +22.074%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/filter_kernels-2401e6818fe1ed21
filter u8 low selectivity                                                                            
                        time:   [166.39 us 166.42 us 166.45 us]
                        change: [+48.953% +49.302% +49.925%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

filter u8 high selectivity                                                                             
                        time:   [22.850 us 22.854 us 22.858 us]
                        change: [+456.68% +457.63% +459.36%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

filter u8 very low selectivity                                                                             
                        time:   [29.303 us 29.309 us 29.316 us]
                        change: [+178.25% +178.35% +178.44%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

filter context u8 low selectivity                                                                            
                        time:   [146.14 us 146.35 us 146.58 us]
                        change: [+32.621% +32.779% +32.988%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) high mild
  15 (15.00%) high severe

filter context u8 high selectivity                                                                             
                        time:   [2.8556 us 2.8600 us 2.8651 us]
                        change: [+9.7379% +9.9672% +10.234%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 25 outliers among 100 measurements (25.00%)
  4 (4.00%) low mild
  10 (10.00%) high mild
  11 (11.00%) high severe

filter context u8 very low selectivity                                                                             
                        time:   [9.2290 us 9.2300 us 9.2312 us]
                        change: [+2.5952% +2.6384% +2.7096%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

filter context u8 w NULLs low selectivity                                                                            
                        time:   [188.52 us 188.57 us 188.62 us]
                        change: [+41.939% +41.998% +42.073%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  1 (1.00%) high severe

filter context u8 w NULLs high selectivity                                                                             
                        time:   [3.2570 us 3.2574 us 3.2578 us]
                        change: [+9.3361% +9.3846% +9.4290%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

filter context u8 w NULLs very low selectivity                                                                            
                        time:   [177.11 us 177.14 us 177.17 us]
                        change: [+13.701% +13.737% +13.774%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

filter context f32 low selectivity                                                                            
                        time:   [170.99 us 171.02 us 171.05 us]
                        change: [+17.562% +17.610% +17.667%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

filter context f32 high selectivity                                                                             
                        time:   [3.1712 us 3.1720 us 3.1730 us]
                        change: [+6.1956% +6.3237% +6.4579%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

filter context f32 very low selectivity                                                                             
                        time:   [21.043 us 21.046 us 21.048 us]
                        change: [-1.2631% -1.2246% -1.1980%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/length_kernel-4e640ebc955ede9d
length                  time:   [34.337 us 34.348 us 34.359 us]                    
                        change: [+9.6250% +9.6772% +9.7281%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/sort_kernel-aeee7d9710cfd390
sort 2^10               time:   [424.17 us 424.21 us 424.25 us]                      
                        change: [+150.38% +150.44% +150.48%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sort 2^12               time:   [2.0103 ms 2.0105 ms 2.0108 ms]                       
                        change: [+134.93% +134.98% +135.03%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

sort nulls 2^10         time:   [517.68 us 517.70 us 517.73 us]                            
                        change: [+234.77% +234.84% +234.90%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  9 (9.00%) high mild
  2 (2.00%) high severe

sort nulls 2^12         time:   [2.4546 ms 2.4548 ms 2.4551 ms]                             
                        change: [+232.12% +232.39% +232.56%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/take_kernels-87f6d0017a97f9d0
take i32 512            time:   [9.8257 us 9.8271 us 9.8287 us]                          
                        change: [+302.80% +302.94% +303.11%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

take i32 1024           time:   [30.237 us 30.365 us 30.539 us]                           
                        change: [+583.96% +585.97% +588.58%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

take i32 nulls 512      time:   [9.8708 us 9.8723 us 9.8738 us]                                
                        change: [+280.95% +281.07% +281.18%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

take i32 nulls 1024     time:   [30.473 us 30.540 us 30.616 us]                                 
                        change: [+605.48% +606.87% +608.70%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

take bool 512           time:   [11.558 us 11.561 us 11.563 us]                           
                        change: [+330.29% +332.21% +333.55%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  12 (12.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

take bool 1024          time:   [34.171 us 34.219 us 34.275 us]                            
                        change: [+611.75% +613.60% +616.34%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

take bool nulls 512     time:   [12.292 us 12.295 us 12.298 us]                                 
                        change: [+169.77% +170.18% +170.59%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

take bool nulls 1024    time:   [32.251 us 32.353 us 32.466 us]                                  
                        change: [+222.43% +223.62% +224.93%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  7 (7.00%) high mild
  5 (5.00%) high severe

take str 512            time:   [18.346 us 18.347 us 18.349 us]                          
                        change: [+380.69% +380.91% +381.13%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

take str 1024           time:   [48.017 us 48.128 us 48.283 us]                           
                        change: [+611.88% +613.44% +615.32%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

take str nulls 512      time:   [20.669 us 20.673 us 20.678 us]                                
                        change: [+296.16% +296.32% +296.48%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

take str nulls 1024     time:   [50.519 us 50.588 us 50.667 us]                                 
                        change: [+369.23% +370.23% +371.36%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

@vertexclique did you work on this on top of #8663 from @jhorstmann (or at least both PRs remove the popcnt table)?
The changes here look reasonable to me, so I can review after a rebase, as I've just merged #8663

@nevi-me nevi-me added the needs-rebase A PR that needs to be rebased by the author label Nov 15, 2020
@vertexclique
Contributor Author

vertexclique commented Nov 15, 2020

This uses the safe bitvec interface to manage bits. I worked on top of my existing PR. Since that one didn't change all the code to use safe operations, this one does. I squashed all my work into a single commit and am closing PR: #8598

Closed that with comment: #8598 (comment)

bitvec also issues the popcnt instruction, like the other PR that just got merged. So I didn't put effort into those paths; the dependency handles them safely.

@nevi-me nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Nov 15, 2020
@vertexclique vertexclique force-pushed the ARROW-10588-safe-bit-operations-for-arrow branch from b5d3219 to 4948aa0 Compare November 15, 2020 13:15
Contributor

@alamb alamb left a comment

I plan to review this PR carefully tomorrow morning

Contributor

@alamb alamb left a comment

All in all I like this PR. Nice work @vertexclique -- it removes a bunch of unsafe calls and I think the structure is an improvement.

I carefully went through this PR, and I think it would be OK to merge as-is and address comments as a follow-on. However, given that it needs a rebase anyway, I would suggest looking at some of the comment suggestions, as they may help future readers / reviewers. Removing bit_utils.rs would be good too.

I also ran this code under valgrind and it did not find any errors 👍

As feedback, this PR might have been less intimidating (and thus I would have been more likely to be able to review it earlier and more quickly) if it had been broken into smaller pieces. Some potential ways to split it up could have been:

  1. Rename of bit_utils.rs to utils.rs
  2. Rename of Buffer functions like count_set_bits --> count_ones
  3. Introduction of bit_ops / rewrite to use BitBufferSlice*

@jorgecarleitao
Member

Super cool. Great work @vertexclique! 💯 I am all in on removing that unsafe code. 👍

Just curious, do you know why it has a 120% hit in performance on a filtering op?

filter u8 high selectivity                                                                             
                        time:   [18.523 us 18.529 us 18.535 us]
                        change: [+122.09% +122.22% +122.33%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

@vertexclique
Contributor Author

Weird; on my machine, when I pushed the initial implementation of this PR, I got the numbers above. Seems like it has regressed for me too.

@vertexclique
Contributor Author

So, any idea how to proceed with this? Based on that, I will close all related PRs. Though it seems like #8688 could use these changes to build slice alignment on top.

@nevi-me
Contributor

nevi-me commented Nov 19, 2020

If this is not stalling any other PRs, you can leave it open. I've started looking into what's causing the regression, but I'll only finish on the weekend (I'm mostly working on the Parquet writer with my evening time).

I'm also seeing big regressions of 400%+

@vertexclique
Contributor Author

vertexclique commented Nov 24, 2020

@nevi-me @alamb @jorgecarleitao
Good news: I have found a solution for the performance concerns. I experimented on sum, and my roofline analysis brought good results. The criterion benches are here:

Before (current master):

sum 2^20                time:   [900.13 us 902.01 us 904.02 us]                     
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

sum nulls 2^20          time:   [2.5859 ms 2.5909 ms 2.5967 ms]                            
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

After:

sum 2^20                time:   [236.61 us 238.02 us 239.58 us]                     
                        change: [-73.888% -73.699% -73.493%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

sum nulls 2^20          time:   [549.14 us 551.39 us 554.07 us]                           
                        change: [-78.784% -78.671% -78.548%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

Since it is a time-consuming task, I am not going to perform a full rewrite until we agree that this performance improvement is enough. Looking forward to receiving your feedback.

@nevi-me
Contributor

nevi-me commented Nov 24, 2020

Thanks @vertexclique, that looks great. I'm on board with the approach.

@vertexclique
Contributor Author

vertexclique commented Nov 24, 2020

OK, so the crucial operations are improved, and I've updated the benchmarks. Since benchmarks with 512 elements fit in most caches, they are unstable. The kernels can only get better once this PR is merged and they are rewritten with parallel iterators. Feel free to benchmark this PR.
Things to do after this PR:

  • Other kernels can be improved in separate PRs.
  • Some code can be removed, e.g. mask_from_u64; it is not needed, and removing it will improve performance.
  • Most of the operations do not operate on larger data, so the benchmarks are not fully reliable; these can also be changed in yet another PR.
  • We can start writing parallel code.

@nevi-me @jorgecarleitao @alamb

Because bit-op performance has improved, these benchmarks have improved significantly. The rest of the improvements are in the PR description:


     Running /home/vertexclique/projects/arrow/rust/target/release/deps/comparison_kernels-f8939ec12975f45e
eq Float32              time:   [36.413 us 36.429 us 36.447 us]                        
                        change: [-95.357% -95.353% -95.349%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

eq scalar Float32       time:   [33.540 us 33.551 us 33.562 us]                               
                        change: [-94.698% -94.690% -94.684%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

neq Float32             time:   [36.756 us 36.768 us 36.781 us]                         
                        change: [-94.049% -94.047% -94.045%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

neq scalar Float32      time:   [30.706 us 30.718 us 30.734 us]                                
                        change: [-95.092% -95.089% -95.086%] (p = 0.00 < 0.05)
                        Performance has improved.

lt Float32              time:   [36.489 us 36.498 us 36.509 us]                        
                        change: [-94.514% -94.504% -94.495%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

lt scalar Float32       time:   [30.855 us 30.871 us 30.892 us]                               
                        change: [-94.996% -94.993% -94.990%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

lt_eq Float32           time:   [36.478 us 36.492 us 36.508 us]                           
                        change: [-94.700% -94.697% -94.695%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

lt_eq scalar Float32    time:   [32.641 us 32.653 us 32.668 us]                                  
                        change: [-95.305% -95.299% -95.292%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

gt Float32              time:   [36.647 us 36.658 us 36.672 us]                        
                        change: [-94.121% -94.119% -94.116%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

gt scalar Float32       time:   [36.541 us 36.562 us 36.583 us]                               
                        change: [-94.288% -94.280% -94.273%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

gt_eq Float32           time:   [36.510 us 36.524 us 36.540 us]                           
                        change: [-95.402% -95.396% -95.390%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

gt_eq scalar Float32    time:   [32.121 us 32.141 us 32.163 us]                                  
                        change: [-94.877% -94.875% -94.873%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  12 (12.00%) low mild
  1 (1.00%) high mild

@vertexclique vertexclique changed the title ARROW-10588: [Rust] Safe bit operations for Arrow ARROW-10588: [Rust] Safe and parallel bit operations for Arrow Nov 24, 2020
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
Contributor

@nevi-me nevi-me left a comment

LGTM.

@alamb @andygrove @jorgecarleitao @vertexclique

I'd like to propose that we merge this (pending rebase and CI), and address any extra follow-up as separate PRs.

From my side:

  • Addition of rayon: I think this might cause issues with wasm32 (IIRC the threading might be an issue, but my knowledge might be dated). We've had some small interest in supporting wasm32, and there might be a few users of Arrow with wasm already; but as we don't yet have the target in CI, we can't reliably tell if something could affect that target, and thus need gating. We might have to put rayon behind a feature flag, but we can worry about this later (or whoever wants to use wasm32 can contribute this work?).
  • Introducing thread parallelism at a compute kernel level: my previous rudimentary experiments over a year ago weren't showing much improvement, but our codebase has changed a lot, so it's great that we're seeing the speed-ups. It'd be great if we can confirm if the speedups in the microbenchmarks also get realised by datafusion. This is not a concern per se, but just noting it.

Thanks for the great work @vertexclique!
I'm currently running the benchmarks, I'll post an update when they're done


[features]
default = []
default = ["simd"]
Contributor

we probably shouldn't change the default to include simd, as we'd like the default features to allow users to compile with stable.

Contributor Author

Oh forgot to set it back.

accumulator + *value
});
let total = data
.par_iter()
Contributor

It's interesting that this is yielding better results. I would have thought that rayon, introduced at this level, would incur enough overhead to slow the kernels down. I've previously applied parallelism at an array level instead.

Contributor Author

@vertexclique vertexclique Nov 25, 2020

This is what MKL suggests: array-level non-primitive parallelism has shared-pointer overhead; that's the reason. I've had a bad time with Arc overhead in my projects before.

Member

rayon is pretty smart in this respect, as its execution model starts the first part of the iterator immediately.

I am thankful that @vertexclique introduced this, as I also believe that we greatly benefit from multi-threading at this level.

What I am a bit unsure about is whether there is any issue with nested spawning: we may want to fork in other places that are nested within this op. But I am confident that someone knows the answer (👀 to @vertexclique and @alamb).

Contributor Author

Rayon initializes its global pool with 1.5 × the number of logical cores. Since these are all spawned but not joined threads, this will easily work for direct users of Arrow. Closures spawned onto the pool, though, execute fork-join style. From a process-forking point of view, using this library as-is won't cause problems. In the nested-operations case, work is queued and spawned to any free slot; all operations go through the global producer-consumer hub inside rayon (take a look at the bit-ops code in this PR). If a contributor later says, "I want to configure thread-level parallelism", we can simply expose the pool config through a method. But since rayon configures itself for the machine it runs on, we can skip that part for now.

where
T::Native: Add<Output = T::Native>,
T: ArrowNumericType,
// T::Native: Add<Output = T::Native> + Sum + Sum<T::Simd>,
Contributor

nit: should the comment be removed?

Contributor Author

Yeah, eagle eyes.

@jorgecarleitao
Member

Really impressive improvement, @vertexclique .

@nevi-me , can I have 12h to review it?

@nevi-me
Contributor

nevi-me commented Nov 25, 2020

@nevi-me , can I have 12h to review it?

Yup, I more meant that if someone else picks up things they'd like addressed, we could open JIRAs for them instead of trying to address them as part of this PR (unless they're clear blockers). In any case, we're still weeks or at least 2 months before the next release, so we still have time even for potential blockers.

Member

@jorgecarleitao jorgecarleitao left a comment

I went through this without going line by line and I am already convinced:

  • the CI passes
  • new code contains good coverage
  • the amount of unsafe removed ❤️
  • the cleanness of the APIs

:shipit:

Impressive work, @vertexclique . Really well done. Thank you so much for this.

///
/// Get bit value at the given index in this bit view
#[inline]
pub fn get_bit(&self, index: usize) -> bool {
Member

Can't this go out of bounds? What do you think about making this unsafe and exposing a safe version with a check?
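The reviewer's suggestion could look roughly like this (a sketch with hypothetical names; `BitView`, `get_bit_unchecked`, and the panic-on-overflow policy are all illustrative, not the PR's actual code):

```rust
/// Hypothetical bit view over a byte buffer.
pub struct BitView<'a> {
    bytes: &'a [u8],
}

impl<'a> BitView<'a> {
    /// Unchecked read: the caller must guarantee `index < bytes.len() * 8`.
    #[inline]
    pub unsafe fn get_bit_unchecked(&self, index: usize) -> bool {
        (self.bytes.get_unchecked(index / 8) & (1 << (index % 8))) != 0
    }

    /// Safe wrapper: bounds-checked, panics on an out-of-range index
    /// (returning Option<bool> would be another reasonable design).
    #[inline]
    pub fn get_bit(&self, index: usize) -> bool {
        assert!(index < self.bytes.len() * 8, "bit index out of bounds");
        unsafe { self.get_bit_unchecked(index) }
    }
}

fn main() {
    let view = BitView { bytes: &[0b0000_0101] };
    assert!(view.get_bit(0));
    assert!(!view.get_bit(1));
    assert!(view.get_bit(2));
    println!("ok");
}
```

Hot loops that have already validated a range can call the unsafe variant, while general callers get the checked path.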

@jhorstmann
Contributor

Introducing thread parallelism at a compute kernel level

Conceptually this is not a small change. Personally I think parallelizing on the datafusion level and keeping kernels single-threaded is the better model.

The benchmark is now also testing a much larger array (2^20 elements) than what is usually used as a chunk size, so in reality the speedup due to parallelism would be much smaller.

I'm totally fine with a small slowdown if that leads to cleaner and safer code.

@vertexclique
Contributor Author

Conceptually this is not a small change. Personally I think parallelizing on the datafusion level and keeping kernels single-threaded is the better model.

I don't think personally; I experiment and work on this based on evidence. It is not a better model; even Intel says that. So there is still no conflict with further parallelizing in DataFusion (which is already done via the reactor). We are using Arrow in the company, and yet we still haven't parallelized chunks, which is also wrong in many directions.

The benchmark is now also testing a much larger array (2^20 elements) than what is usually used as a chunk size, so in reality the speedup due to parallelism would be much smaller.

There are no predetermined chunk sizes, and benchmarks shouldn't work on cache-fittable data at all; that's against the rules of benchmarking. Otherwise you don't see the actual processing time at all. Moreover, if this data format can't process large data on demand, there is no point in having it. (Users will think like that even if I don't, and that is also the most valid concern out there.) Also, please don't alter my comments/words or use them against the PR.

Please also stop undermining others' work, and be honest about what has been done. I already told you to do it this way, but you preferred sharing raw pointers in an unsafe context, which is never going to be parallelized without a full rewrite. This is that full rewrite.

Finally, it is entirely unethical and dishonest to open PRs based on things I told/taught you. I am not your rival; I don't see myself as a rival to anyone in any form, because there is no rivalry. This is an open-source environment. I don't see goodwill in your comments either. Also, some of your comments give false information (on which I stopped giving feedback). I would prefer productive comments from you instead of counterproductive ones. Even though I have said this plenty of times, it seems that's not going to happen. So please don't comment on or review my PRs. Thanks.

Member

@jorgecarleitao jorgecarleitao left a comment

Now that we have the remaining benchmarks, I think that we need to discuss this further:

All large-scale operations (take, filter, sort) show a 100-500% performance degradation. These are the most important kernels and are where we verify that the operations are in place.

I admit I was a bit blindsided by the statement

Up to %95 improvements on many kernels.

which makes no reference to the degradation on other (often more important) benchmarks.

@vertexclique
Contributor Author

Things to do after this pr:

  • Other kernels can be improved by different prs

I have already mentioned that here I think: #8664 (comment)

@alamb
Contributor

alamb commented Nov 25, 2020

Conceptually this is not a small change. Personally I think parallelizing on the datafusion level and keeping kernels single-threaded is the better model.

I agree with @jhorstmann's opinion here -- I think we should keep Arrow single-threaded (parallelizing kernel invocations across record batches can be done at a higher level).

One challenge of adding parallelism like rayon at a lower level such as the kernels is that higher-level libraries or applications lose control of resources (for example, if an app wanted to run a particular background operation on only a single core of a multi-core machine, it couldn't easily do that with the code in this PR anymore).

Copy link
Contributor

@alamb alamb left a comment

I am not a fan of introducing parallelism into the lowest level aggregate kernels.

I would like to request we remove the changes to the aggregate kernels out of this PR and keep it focused on increasing the safety of bit operations

Given the large variety of opinions regarding using rayon it is probably best to have those conversations in a separate PR.

@andygrove
Member

Also, some of your comments are giving false information (which I stopped giving feedback). I prefer instead of having counterproductive comments, productive comments from you. Even I told this plenty of times seems like that's not going to happen. So, please don't comment and review my PRs. Thanks.

@vertexclique This really isn't acceptable behavior. It is against the project's code of conduct [1] to insult or harass other contributors. We don't do that here.

[1] https://www.apache.org/foundation/policies/conduct

@vertexclique
Contributor Author

Personally, I think the violations against me also fall under the code of conduct, in two respects:

Be empathetic, welcoming, friendly, and patient. We work together to resolve conflict, assume good intentions, and do our best to act in an empathetic fashion. We may all experience some frustration from time to time, but we do not allow frustration to turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. We should be respectful when dealing with other community members as well as with people outside our community.

I tried my best to assume good intentions. Even when there is no problem with a PR, I have received comments that are not meant to bring any productivity but to change the entire aim of the PR.
#8609 (comment)
#8665 (comment)
But blocking PRs and repeatedly raising debates (unrelated to the aim of the PR) in every PR is neither empathetic nor welcoming behavior, and not what I expect.

Be collaborative. Our work will be used by other people, and in turn we will depend on the work of others. When we make something for the benefit of the project, we are willing to explain to others how it works, so that they can build on the work to make it even better. Any decision we make will affect users and colleagues, and we take those consequences seriously when making decisions.

Which I did:

Three months ago I thought the project was going in a nice direction; I think that got derailed somehow, and now it is not going very well from my point of view.

Moreover, in my opinion the Apache Foundation's values described here are not quite upheld either: https://www.apache.org/foundation/how-it-works.html#meritocracy

So, finally, I would like to say that I am totally OK with not committing to this project. Feel free to close this PR.

@paddyhoran
Contributor

Hi @vertexclique.

All your contributions are very much appreciated. You are one of the most advanced contributors to the project, meaning that it's important for the other members of the community to be able to review and ask questions. This may be frustrating for you, but it is necessary.

In general, smaller, more focused PRs will be easier to merge. Larger PRs, or ones that are low-level in nature, need to be reviewed in more depth, and we should be glad to have @jhorstmann and others provide constructive feedback. Also, you may be asked to make changes to your PRs, and you may not always agree. This can always happen in a community-run project.

In general, you should try not to take feedback so personally. I have re-read the interactions you mentioned, and I'm sorry, but I can't see the issues you are referring to; I think you were wrong to call out @jhorstmann. On this PR I agreed with @jhorstmann's perspective, as did @alamb. At the least, I thought the question was a good one to ask.

On the other PR's you linked to it's clear that @jhorstmann is doing his best to be polite while providing feedback, for example:

  • Just an idea, ...
  • Nice performance improvement! I'm a bit surprised...

Hopefully, you can move past this and keep contributing to the project.

@Dandandan
Contributor

I think there are some very interesting things in this PR:

  • Usage of bitvec / new structure for null buffer. I think it makes sense to use this library here rather than reinvent it.
  • For the benchmarks, it makes sense to have some bigger / more realistic ones as well; 2^20 is maybe a bit big. We also have some benchmarks in the datafusion/benchmarks directory which can be extended to cover more realistic scenarios.
  • For parallelism, I am also not convinced that it's a good idea to introduce rayon without being able to turn it off / control it. For big arrays it can be a good idea, but for smaller arrays, projects like datafusion, and libraries, it can actually slow things down and/or use more resources overall. It might be nice to revisit this sometime and see if we can make it an optional dependency (it's pretty big) that users opt into for some kernels / ops.

I would love for this PR to be continued, maybe in a slimmed-down form.
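A hedged sketch of what the opt-in suggestion above could look like in the crate's Cargo.toml (the `parallel` feature name is illustrative, not an actual Arrow feature):

```toml
[dependencies]
# Pulled in only when the (hypothetical) "parallel" feature is enabled.
rayon = { version = "1", optional = true }

[features]
default = []
# Opt-in parallel kernels; code would gate rayon use
# behind #[cfg(feature = "parallel")].
parallel = ["rayon"]
```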

@alamb
Contributor

alamb commented Nov 30, 2020

I agree with @Dandandan 's comments 👍

@alamb
Contributor

alamb commented Jan 13, 2021

@vertexclique -- Given the imminent Arrow 3.0 release, I am trying to clean up older Rust PRs and see if the authors have plans to move them forward.

Do you plan on working on this PR in the near future? If not, should we close this PR until there is time to make progress? Thanks again for your contributions so far.

@vertexclique
Contributor Author

@alamb Thanks for reaching out! I don't have time to work on these PRs. Closing.
