Contributor

@vertexclique vertexclique commented Nov 5, 2020

Currently, the bit slice, bit view, and bit operations look blurry. This PR:

  • Supports native endianness
  • Fixes problems related to bit operations
  • Adds method docs
  • Allows any primitive interpretation of an underlying bit field instead of forcing a concrete u64, so the different offset sizes in offset buffers are also reinterpretable
  • Stores bitfield tags in pointers, and this is not an unsafe implementation
  • Separates bit view and bit operation
  • Still has good benchmarks
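
To illustrate the "any primitive interpretation" point, here is a minimal sketch, not the code from this PR. The function names `count_ones_as_u64` and `count_ones_as_u16` are hypothetical. It reads the same bytes as native-endian words of two different widths and shows that a per-bit property (the number of set bits) does not depend on the word type chosen.

```rust
/// Hypothetical helper: counts set bits by viewing `bytes` as native-endian u64 words.
fn count_ones_as_u64(bytes: &[u8]) -> u32 {
    let chunks = bytes.chunks_exact(8);
    // Bytes that do not fill a whole word are counted one by one.
    let tail: u32 = chunks.remainder().iter().map(|b| b.count_ones()).sum();
    let body: u32 = chunks
        .map(|c| {
            let mut word = [0u8; 8];
            word.copy_from_slice(c);
            u64::from_ne_bytes(word).count_ones()
        })
        .sum();
    body + tail
}

/// Hypothetical helper: the same computation with u16 words.
fn count_ones_as_u16(bytes: &[u8]) -> u32 {
    let chunks = bytes.chunks_exact(2);
    let tail: u32 = chunks.remainder().iter().map(|b| b.count_ones()).sum();
    let body: u32 = chunks
        .map(|c| {
            let mut word = [0u8; 2];
            word.copy_from_slice(c);
            u16::from_ne_bytes(word).count_ones()
        })
        .sum();
    body + tail
}
```

Both functions return the same count for any input, on little-endian and big-endian machines alike, because `from_ne_bytes` only changes how the bytes are grouped into a word, not which bits are set.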
Benchmarks are here.


     Running /home/vertexclique/projects/arrow/rust/target/release/deps/aggregate_kernels-3e042984093382ca
sum 512                 time:   [433.35 ns 434.31 ns 435.26 ns]                    
                        change: [-0.3329% +0.0366% +0.3572%] (p = 0.83 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

min 512                 time:   [643.32 ns 644.21 ns 645.17 ns]                     
                        change: [-0.7371% -0.2747% +0.1556%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild

sum nulls 512           time:   [305.83 ns 306.31 ns 306.82 ns]                          
                        change: [+25.194% +25.552% +25.936%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

min nulls 512           time:   [1.7087 us 1.7140 us 1.7202 us]                           
                        change: [+28.765% +29.314% +29.800%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/arithmetic_kernels-f72551135f7f2174
add 512                 time:   [863.39 ns 864.53 ns 865.68 ns]                     
                        change: [-1.5504% -1.0396% -0.5536%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) low mild
  1 (1.00%) high mild

subtract 512            time:   [993.75 ns 995.69 ns 997.80 ns]                          
                        change: [-1.7588% -1.3428% -0.9414%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

multiply 512            time:   [937.78 ns 940.16 ns 942.80 ns]                          
                        change: [-3.5003% -3.1603% -2.8236%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

divide 512              time:   [1.2736 us 1.2764 us 1.2799 us]                        
                        change: [-5.3774% -5.0102% -4.6537%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

limit 512, 512          time:   [87.252 ns 87.369 ns 87.490 ns]                           
                        change: [-5.3767% -4.9759% -4.5970%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  1 (1.00%) high severe

add_nulls_512           time:   [928.68 ns 930.17 ns 931.79 ns]                           
                        change: [-4.9229% -4.5627% -4.1789%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild

divide_nulls_512        time:   [1.6221 us 1.6245 us 1.6270 us]                              
                        change: [+20.220% +20.707% +21.194%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/array_from_vec-e9623439ba2f607b
array_from_vec 128      time:   [316.35 ns 316.93 ns 317.55 ns]                               
                        change: [-0.3248% +0.1281% +0.5785%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

array_from_vec 256      time:   [486.03 ns 488.46 ns 491.22 ns]                               
                        change: [-1.2414% -0.4064% +0.2912%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

array_from_vec 512      time:   [810.71 ns 812.36 ns 814.09 ns]                                
                        change: [-2.8851% -2.4780% -2.0935%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

array_string_from_vec 128                                                                             
                        time:   [3.2412 us 3.2469 us 3.2531 us]
                        change: [-5.1256% -4.7502% -4.3519%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

array_string_from_vec 256                                                                             
                        time:   [5.9510 us 5.9714 us 5.9957 us]
                        change: [-2.7639% -2.2907% -1.8394%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe

array_string_from_vec 512                                                                             
                        time:   [11.147 us 11.162 us 11.177 us]
                        change: [-1.2898% -0.7899% -0.2991%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe

struct_array_from_vec 128                                                                             
                        time:   [4.6671 us 4.6752 us 4.6842 us]
                        change: [+1.3235% +1.6633% +1.9898%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 256                                                                             
                        time:   [7.8150 us 7.8344 us 7.8540 us]
                        change: [+0.4179% +0.7983% +1.1630%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 512                                                                             
                        time:   [14.048 us 14.082 us 14.134 us]
                        change: [+1.0522% +1.4037% +1.7699%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

struct_array_from_vec 1024                                                                             
                        time:   [26.091 us 26.126 us 26.160 us]
                        change: [+1.0905% +1.4287% +1.7556%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/boolean_kernels-edd14f2e1fbef932
and                     time:   [26.883 us 26.932 us 26.983 us]                 
                        change: [+2.1206% +2.4766% +2.9152%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

or                      time:   [26.991 us 27.029 us 27.071 us]                
                        change: [+1.5021% +1.8273% +2.1485%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild

not                     time:   [13.515 us 13.535 us 13.556 us]                 
                        change: [-1.1505% -0.3964% +0.2990%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/buffer_bit_ops-7f780c65b1d8eaab
buffer_bit_ops and      time:   [1.1393 us 1.1413 us 1.1433 us]                                
                        change: [+889.05% +892.72% +896.41%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/builder-e60ce23eee65a1bb
bench_primitive         time:   [647.05 us 647.45 us 647.87 us]                            
                        thrpt:  [6.0293 GiB/s 6.0333 GiB/s 6.0370 GiB/s]
                 change:
                        time:   [+0.4803% +0.5896% +0.7038%] (p = 0.00 < 0.05)
                        thrpt:  [-0.6989% -0.5861% -0.4780%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

Benchmarking bench_bool: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s, enable flat sampling, or reduce sample count to 50.
bench_bool              time:   [1.7426 ms 1.7466 ms 1.7499 ms]                        
                        thrpt:  [285.73 MiB/s 286.28 MiB/s 286.93 MiB/s]
                 change:
                        time:   [+33.359% +33.822% +34.309%] (p = 0.00 < 0.05)
                        thrpt:  [-25.545% -25.274% -25.015%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/cast_kernels-400ed028b73d2deb
cast int32 to int32 512 time:   [19.370 ns 19.419 ns 19.476 ns]                                     
                        change: [-0.9568% -0.2378% +0.4532%] (p = 0.52 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

cast int32 to uint32 512                                                                             
                        time:   [3.6040 us 3.6103 us 3.6177 us]
                        change: [-12.943% -12.451% -11.939%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

cast int32 to float32 512                                                                             
                        time:   [3.8512 us 3.8585 us 3.8659 us]
                        change: [-3.5584% -3.1973% -2.8752%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild

cast int32 to float64 512                                                                             
                        time:   [3.8591 us 3.8676 us 3.8768 us]
                        change: [-2.9339% -2.6433% -2.3617%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

cast int32 to int64 512 time:   [3.8327 us 3.8395 us 3.8464 us]                                     
                        change: [-4.9455% -4.6372% -4.3215%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild

cast float32 to int32 512                                                                             
                        time:   [4.5197 us 4.5256 us 4.5316 us]
                        change: [-3.4145% -3.0686% -2.6943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  4 (4.00%) high mild
  2 (2.00%) high severe

cast float64 to float32 512                                                                             
                        time:   [4.0178 us 4.0397 us 4.0624 us]
                        change: [+0.8169% +1.2843% +1.8100%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe

cast float64 to uint64 512                                                                             
                        time:   [4.8151 us 4.8231 us 4.8319 us]
                        change: [-2.6650% -2.3405% -2.0186%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe

cast int64 to int32 512 time:   [3.9086 us 3.9171 us 3.9256 us]                                     
                        change: [-5.8420% -5.5229% -5.2079%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

cast date64 to date32 512                                                                             
                        time:   [10.490 us 10.534 us 10.584 us]
                        change: [-11.948% -11.631% -11.351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

cast date32 to date64 512                                                                             
                        time:   [10.353 us 10.373 us 10.396 us]
                        change: [+0.4349% +0.7733% +1.1149%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

cast time32s to time32ms 512                                                                             
                        time:   [1.3260 us 1.3277 us 1.3296 us]
                        change: [-5.8847% -5.5440% -5.2058%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

cast time32s to time64us 512                                                                             
                        time:   [5.6457 us 5.6540 us 5.6631 us]
                        change: [-0.3788% +0.0713% +0.4866%] (p = 0.75 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

cast time64ns to time32s 512                                                                             
                        time:   [13.031 us 13.049 us 13.067 us]
                        change: [-0.1244% +0.2410% +0.6103%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high severe

cast timestamp_ns to timestamp_s 512                                                                             
                        time:   [21.281 ns 21.313 ns 21.348 ns]
                        change: [-0.4576% -0.0328% +0.3543%] (p = 0.88 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  6 (6.00%) high mild

cast timestamp_ms to timestamp_ns 512                                                                             
                        time:   [1.5613 us 1.5675 us 1.5747 us]
                        change: [+4.4836% +4.9421% +5.4441%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cast timestamp_ms to i64 512                                                                            
                        time:   [124.79 ns 124.96 ns 125.14 ns]
                        change: [-0.0308% +0.3341% +0.7194%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/comparison_kernels-126efdff90a59970
eq Float32              time:   [910.04 us 910.85 us 911.66 us]                       
                        change: [+0.3318% +0.5602% +0.7961%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild

eq scalar Float32       time:   [857.78 us 858.90 us 860.00 us]                              
                        change: [-0.2335% +0.0622% +0.3295%] (p = 0.68 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

neq Float32             time:   [860.97 us 862.66 us 864.49 us]                        
                        change: [-0.9691% -0.6156% -0.2754%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

neq scalar Float32      time:   [859.15 us 861.81 us 865.05 us]                               
                        change: [-0.0690% +0.2563% +0.5652%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

lt Float32              time:   [831.21 us 832.13 us 832.99 us]                       
                        change: [-0.5075% -0.1969% +0.0800%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild

lt scalar Float32       time:   [841.17 us 842.20 us 843.20 us]                              
                        change: [-0.6568% -0.2773% +0.0582%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

lt_eq Float32           time:   [875.15 us 876.07 us 877.02 us]                          
                        change: [-0.5285% -0.2381% +0.0487%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low mild
  6 (6.00%) high mild

lt_eq scalar Float32    time:   [849.93 us 851.46 us 853.36 us]                                 
                        change: [-0.7749% -0.3402% +0.0591%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

gt Float32              time:   [814.51 us 816.63 us 819.22 us]                       
                        change: [+0.1219% +0.5652% +1.0208%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

gt scalar Float32       time:   [805.43 us 806.41 us 807.40 us]                              
                        change: [-0.6462% -0.3443% -0.0458%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

gt_eq Float32           time:   [861.43 us 864.94 us 868.90 us]                          
                        change: [-1.0678% -0.6583% -0.2618%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

gt_eq scalar Float32    time:   [828.43 us 829.87 us 831.27 us]                                 
                        change: [-2.1827% -1.7063% -1.2261%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/csv_writer-e570bd44f0db88a4
record_batches_to_csv   time:   [62.303 us 63.088 us 64.063 us]                                  
                        change: [+1.2521% +5.7178% +9.8717%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/filter_kernels-43f0ed7006b2704e
filter u8 low selectivity                                                                            
                        time:   [94.045 us 94.399 us 94.769 us]
                        change: [-0.3916% +0.1902% +0.7859%] (p = 0.53 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

filter u8 high selectivity                                                                             
                        time:   [5.1268 us 5.1350 us 5.1432 us]
                        change: [-1.3219% -1.0230% -0.7251%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild

filter u8 very low selectivity                                                                             
                        time:   [9.5795 us 9.6086 us 9.6447 us]
                        change: [-0.8019% -0.3305% +0.1971%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

filter context u8 low selectivity                                                                            
                        time:   [88.982 us 89.197 us 89.475 us]
                        change: [-0.5393% -0.2111% +0.1700%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

filter context u8 high selectivity                                                                             
                        time:   [1.8202 us 1.8233 us 1.8265 us]
                        change: [-1.0453% -0.5680% -0.1510%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild

filter context u8 very low selectivity                                                                             
                        time:   [6.2812 us 6.2902 us 6.3008 us]
                        change: [+1.4431% +2.0101% +2.5168%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

filter context u8 w NULLs low selectivity                                                                            
                        time:   [105.17 us 105.59 us 106.07 us]
                        change: [+0.4925% +0.8519% +1.2438%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low mild
  7 (7.00%) high mild
  5 (5.00%) high severe

filter context u8 w NULLs high selectivity                                                                             
                        time:   [2.0865 us 2.0889 us 2.0915 us]
                        change: [-0.7819% -0.4249% -0.0307%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

filter context u8 w NULLs very low selectivity                                                                            
                        time:   [103.79 us 103.89 us 103.99 us]
                        change: [-0.3091% +0.0545% +0.4393%] (p = 0.78 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high mild

filter context f32 low selectivity                                                                            
                        time:   [106.97 us 107.14 us 107.32 us]
                        change: [-0.1227% +0.1939% +0.5226%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

filter context f32 high selectivity                                                                             
                        time:   [2.2141 us 2.2171 us 2.2202 us]
                        change: [+1.4120% +1.8660% +2.3514%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

filter context f32 very low selectivity                                                                             
                        time:   [14.434 us 14.458 us 14.482 us]
                        change: [-0.7158% -0.2419% +0.2242%] (p = 0.32 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/length_kernel-f9b5d42ca44f8471
length                  time:   [26.904 us 26.949 us 26.994 us]                    
                        change: [+5.3416% +5.7166% +6.0977%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/sort_kernel-5075ab6a937b1d1c
sort 2^10               time:   [133.99 us 134.20 us 134.43 us]                      
                        change: [-1.1862% -0.8625% -0.5382%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild

sort 2^12               time:   [648.98 us 649.82 us 650.68 us]                      
                        change: [-0.1925% +0.1137% +0.3914%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild

sort nulls 2^10         time:   [162.81 us 163.30 us 163.87 us]                            
                        change: [-0.3715% -0.0518% +0.3043%] (p = 0.76 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

sort nulls 2^12         time:   [784.42 us 785.56 us 786.71 us]                            
                        change: [-1.5899% -1.2958% -1.0139%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

     Running /home/vertexclique/projects/arrow/rust/target/release/deps/take_kernels-5681334f96563498
take i32 512            time:   [1.4828 us 1.4850 us 1.4873 us]                          
                        change: [+10.283% +10.716% +11.109%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  1 (1.00%) high severe

take i32 1024           time:   [2.6363 us 2.6452 us 2.6556 us]                           
                        change: [+20.307% +20.947% +21.535%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

take bool 512           time:   [1.4307 us 1.4324 us 1.4341 us]                           
                        change: [+34.741% +35.165% +35.642%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

take bool 1024          time:   [2.4476 us 2.4516 us 2.4560 us]                            
                        change: [+38.662% +39.280% +39.922%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

take str 512            time:   [4.2765 us 4.2836 us 4.2904 us]                          
                        change: [-6.6173% -6.2652% -5.9479%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild

take str 1024           time:   [7.6875 us 7.6992 us 7.7113 us]                           
                        change: [-4.3967% -4.0820% -3.7321%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild

@github-actions

github-actions bot commented Nov 5, 2020

@jhorstmann
Contributor

The test failure seems to be because the sum kernel is now adding the remainder elements first, resulting in slightly different rounding. This might be one example where we should actually assert with some epsilon value.

Was there a specific test failure on a big endian machine with the previous code?
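To illustrate the rounding remark above, here is a minimal sketch of how summation order changes a floating-point result. The function names and the `remainder_len` parameter are hypothetical and are not taken from the sum kernel in this PR.

```rust
/// Sums the slice strictly left to right.
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

/// Sums the trailing `remainder_len` elements first and then adds the
/// leading elements, mimicking a kernel that handles the remainder
/// before the full chunks. (Hypothetical helper for illustration.)
fn sum_remainder_first(values: &[f64], remainder_len: usize) -> f64 {
    let (head, tail) = values.split_at(values.len() - remainder_len);
    let tail_sum = tail.iter().fold(0.0, |acc, v| acc + v);
    head.iter().fold(tail_sum, |acc, v| acc + v)
}

fn main() {
    let values = [0.1_f64, 0.2, 0.3];
    let a = sum_in_order(&values);
    let b = sum_remainder_first(&values, 2);
    // The two results agree only up to rounding error.
    println!("in order: {:?}, remainder first: {:?}", a, b);
    println!("bitwise equal: {}", a == b);
}
```

Both orders are valid sums of the same elements, which is why a test that compares against an exact expected value can start failing after such a reordering.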

Contributor

I wasn't familiar with the term "bit slice" before reading this PR. (It is mentioned in the docs of https://docs.rs/bitvec/0.19.4/bitvec/, which is perhaps where the term came from.)

Suggested change
- /// Bit slice representation of buffer data
+ /// Bit slice representation of buffer data. A bit slice is a
+ /// view on top of a buffer of bytes which can be used to
+ /// access each bit.

Contributor

Suggested change
- /// Returns immutable view with the given offset in bits and length in bits.
+ /// Returns a new bit slice relative to self, with the given offset in bits and length in bits.
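As a rough illustration of what "a view on top of a buffer of bytes" with a relative offset and length in bits could look like, here is a simplified sketch. `BitView`, `view`, and `get` are hypothetical names; this is not the type introduced by the PR, and it assumes least-significant-bit-first numbering within each byte.

```rust
/// Hypothetical, simplified bit view over a byte buffer.
#[derive(Clone, Copy, Debug)]
struct BitView<'a> {
    bytes: &'a [u8],
    offset_in_bits: usize,
    len_in_bits: usize,
}

impl<'a> BitView<'a> {
    /// Creates a view of `len_in_bits` bits starting at `offset_in_bits`.
    fn new(bytes: &'a [u8], offset_in_bits: usize, len_in_bits: usize) -> Self {
        assert!(offset_in_bits + len_in_bits <= bytes.len() * 8);
        BitView { bytes, offset_in_bits, len_in_bits }
    }

    /// Returns a new view relative to `self`.
    fn view(&self, offset_in_bits: usize, len_in_bits: usize) -> Self {
        assert!(offset_in_bits + len_in_bits <= self.len_in_bits);
        BitView {
            bytes: self.bytes,
            offset_in_bits: self.offset_in_bits + offset_in_bits,
            len_in_bits,
        }
    }

    /// Reads bit `i` of the view (least significant bit of each byte first).
    fn get(&self, i: usize) -> bool {
        assert!(i < self.len_in_bits);
        let pos = self.offset_in_bits + i;
        (self.bytes[pos / 8] >> (pos % 8)) & 1 == 1
    }
}

fn main() {
    let bytes = [0b0000_0000u8, 0b0001_0000];
    let whole = BitView::new(&bytes, 0, 16);
    let second_byte = whole.view(8, 8);
    let nested = second_byte.view(4, 4);
    // All three expressions address the same underlying bit.
    println!("{} {} {}", whole.get(12), second_byte.get(4), nested.get(0));
}
```

Because `view` adds its offset to the parent's offset, nested views never copy data; they only narrow the window over the same bytes.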

Contributor

I don't fully understand why this doesn't need to refer to len_in_bits -- how do we know that len_in_bits covers the entire buffer? Maybe this should be self.slice(len_in_bits/8)?

Contributor Author

That is the idea: a bit view doesn't have to cover the whole Buffer. If you give the whole buffer's length in bits and start offset as 0 then it will cover the whole buffer. Otherwise, we can use a partial bit view on the Buffer.

Contributor

If you give the whole buffer's length in bits and start offset as 0 then it will cover the whole buffer

Right, what I don't understand is how the test for len_in_bits % 8 == 0 is checking for the whole buffer length. It seems like it is checking that len_in_bits is a multiple of 8 (aka represents whole bytes)

Maybe there is some assumption here like self.len_in_bits < 8?
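The distinction raised in this question can be made concrete with a small sketch. Both helper functions are hypothetical and only illustrate that "length is a multiple of 8" and "view spans the entire buffer" are different conditions; they do not reproduce the code under review.

```rust
/// True when the length describes a whole number of bytes.
fn is_whole_bytes(len_in_bits: usize) -> bool {
    len_in_bits % 8 == 0
}

/// True when a view with the given offset and length spans every bit
/// of a buffer that is `buffer_len_bytes` long.
fn covers_whole_buffer(
    offset_in_bits: usize,
    len_in_bits: usize,
    buffer_len_bytes: usize,
) -> bool {
    offset_in_bits == 0 && len_in_bits == buffer_len_bytes * 8
}

fn main() {
    // A 16-bit view starting at bit 8 of a 4-byte buffer:
    // byte-aligned length, but clearly a partial view.
    let (offset, len, buffer_len) = (8, 16, 4);
    println!("whole bytes: {}", is_whole_bytes(len));
    println!("covers buffer: {}", covers_whole_buffer(offset, len, buffer_len));
}
```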

Contributor

@alamb alamb left a comment

Thanks @vertexclique!

In your description, you mention that this PR does the following

Support native endianness

It wasn't obvious to me that this was the case (probably because I don't fully understand the issue) but I also didn't see any test cases

Fix problems related to bit operations

Could you possibly explain what problem you saw with native endianness (maybe even with a test case) as well as what problems with bit operations you were fixing?

Gives any primitive interpretation for an underlying bit field, doesn't force concrete u64. So different offset sizes in offset buffers are also reinterpretable.

I suggest adding test cases demonstrating using an offset size other than u64

Contributor

Can you explain the rationale for this change? It seems to use more code to accomplish the same functionality without any performance improvement. I am likely missing something

Contributor Author

I have moved the remainder calculation here; it used to be at the end of this method. It looks like more code because we now explicitly distinguish the bit view from the chunk iterator, which was quite blurry before. Now you can have a bit view over the buffer while the chunks and the remaining bits do their work. The change is for readability, and it avoids consuming the chunk iterator.
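The idea of obtaining the remainder up front without consuming the chunk iterator can be sketched with the standard library alone. This is an assumption-laden illustration using `slice::chunks_exact` on byte-aligned data, not the bitvec-based code in the PR; `count_set_bits` is a hypothetical name.

```rust
/// Counts the set bits in `bytes`, processing 8-byte chunks as u64 words
/// and the leftover bytes separately.
fn count_set_bits(bytes: &[u8]) -> u32 {
    let chunks = bytes.chunks_exact(8);
    // `remainder()` can be read before iterating; it does not consume
    // or advance the chunk iterator.
    let remainder = chunks.remainder();
    let mut total: u32 = remainder.iter().map(|b| b.count_ones()).sum();

    for chunk in chunks {
        let mut word_bytes = [0u8; 8];
        word_bytes.copy_from_slice(chunk);
        total += u64::from_le_bytes(word_bytes).count_ones();
    }
    total
}

fn main() {
    // 11 bytes: one full 8-byte chunk plus a 3-byte remainder.
    let data = [0xFFu8; 11];
    println!("set bits: {}", count_set_bits(&data));
}
```

For a bit count the processing order does not matter; for floating-point sums it does, which is the concern raised elsewhere in this thread.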

Contributor Author

@vertexclique vertexclique left a comment

It wasn't obvious to me that this was the case (probably because I don't fully understand the issue) but I also didn't see any test cases

The previous implementation forces little-endian byte order on Buffer data. From a writer's point of view, native endianness might be preferable, for example on ARMv7 (or more likely MIPS): writers then don't need to reverse every byte they write into a buffer in little-endian format just so that reading works too. It is a small consideration.

Could you possibly explain what problem you saw with native endianness (maybe even with a test case) as well as what problems with bit operations you were fixing?

Sure. The problems were not about native endianness; they were about the previous implementation. In production code I see problems with the current implementation, such as https://issues.apache.org/jira/browse/ARROW-10461. So I wanted to prevent them once and for all, and to adapt the current code to interpret data not only as u64 but also as u32, u16, u8, and, when LLVM allows it, u1.

About the test cases: I am going to work on ARMv7 integration soon, and I think I will touch most of this code for that anyway. If wanted, I can add some MSB tests here too.

I suggest adding test cases demonstrating using an offset size other than u64

Sure, u32 interpretation, I can do that.
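As a hedged sketch of what interpreting the same bytes with a chunk width other than u64 might look like, the following uses only the standard library. `as_u32_words` and `as_u64_words` are hypothetical helpers, and the little-endian interpretation is an assumption made for the example; the PR itself relies on bitvec for this.

```rust
/// Interprets `bytes` as little-endian u32 words; returns the words and
/// the bytes that did not fill a whole word.
fn as_u32_words(bytes: &[u8]) -> (Vec<u32>, &[u8]) {
    let chunks = bytes.chunks_exact(4);
    let rest = chunks.remainder();
    let words = chunks
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    (words, rest)
}

/// Same as above, but with u64 words.
fn as_u64_words(bytes: &[u8]) -> (Vec<u64>, &[u8]) {
    let chunks = bytes.chunks_exact(8);
    let rest = chunks.remainder();
    let words = chunks
        .map(|c| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(c);
            u64::from_le_bytes(buf)
        })
        .collect();
    (words, rest)
}

fn main() {
    let bytes = [1u8, 0, 0, 0, 2, 0, 0, 0, 9];
    let (w32, rest32) = as_u32_words(&bytes);
    let (w64, rest64) = as_u64_words(&bytes);
    println!("u32: {:?} + {:?}", w32, rest32);
    println!("u64: {:x?} + {:?}", w64, rest64);
}
```

A test for a non-u64 width could compare the two interpretations: each u64 word must equal the two corresponding u32 words combined.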

@vertexclique
Contributor Author

I am going to change all of the DataFusion tests to take machine epsilon into consideration. Didn't see that coming.
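For readers unfamiliar with epsilon-based comparison, a minimal sketch follows. `approx_eq` and `assert_approx_eq` are hypothetical helpers written for this illustration; they are not the assertion method that was later added to the DataFusion tests, and the tolerance of a few multiples of `f64::EPSILON` is an assumption.

```rust
/// Returns true when `left` and `right` differ by at most `tolerance`,
/// scaled by the larger magnitude (with a floor of 1.0).
fn approx_eq(left: f64, right: f64, tolerance: f64) -> bool {
    let scale = left.abs().max(right.abs()).max(1.0);
    (left - right).abs() <= tolerance * scale
}

/// Panics with a readable message when the values are not approximately equal.
fn assert_approx_eq(left: f64, right: f64, tolerance: f64) {
    assert!(
        approx_eq(left, right, tolerance),
        "values differ by more than the tolerance: {:?} vs {:?}",
        left,
        right
    );
}

fn main() {
    let sum = 0.1_f64 + 0.2;
    // An exact comparison against 0.3 is false here, although the
    // values are equal up to rounding error.
    println!("exact: {}", sum == 0.3);
    assert_approx_eq(sum, 0.3, 4.0 * f64::EPSILON);
    println!("approximately equal");
}
```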

@alamb
Contributor

alamb commented Nov 6, 2020

It seems like #8571 may conflict with this PR as well

@vertexclique
Contributor Author

This PR solves that problem intrinsically. Yes, it is a conflict.

@alamb
Contributor

alamb commented Nov 6, 2020

So my feedback here is that it is not clear to me what this PR is trying to accomplish (aka answer the question of why make the changes in this PR) and thus it is not clear how to review / evaluate it.

If the PR's aim is to add support for endianness, I would expect some demonstration that the new code can do something that the old code can't (aka tests)

If the PR's aim is to fix a bug, I would expect some explanation / demonstration of something that fails without the changes in the PR and passes with changes in the PR. The bugs this PR's changes fix are probably obvious to you, but sadly they are not to me :(

If the PR's aim is to make the code easier to understand, I would expect some description of why the new code is easier to understand than the old (which will be a subjective judgement, for sure).

Since this PR seems to have elements of all three goals, but is light on the explanation, I am struggling to evaluate it concisely

Contributor

@alamb alamb left a comment

I think with the tests this PR looks good. I still don't fully understand https://github.com/apache/arrow/pull/8598/files#r518733422, so I would appreciate some additional clarifications

Contributor

👍

@vertexclique
Contributor Author

vertexclique commented Nov 6, 2020

So my feedback here is that it is not clear to me what this PR is trying to accomplish (aka answer the question of why make the changes in this pr) and thus it is not clear how to review / evaluate it.

Here comes the explanation @alamb and team:

If the PR's aim is to fix a bug, I would expect some explanation / demonstration of something that fails without the changes in the PR and passes with changes in the PR. The bugs this PR's changes fix are probably obvious to you, but sadly they are not to me :(

  1. Bugfix: The other PR (ARROW-10461: [Rust] Fix offset bug in remainder bits #8571) fixes one offset problem. This PR fixes it intrinsically; tests for that are added in commit d6e4744. Moreover, the DataFusion tests were not using machine epsilon (short read: https://floating-point-gui.de/errors/comparison/), so I have implemented an assertion method for them in c108026. That was yet another bug. A small note: I didn't add these tests initially so as not to come across as rude towards @jhorstmann and the PR opened there, but as you can see I have committed the exact tests in d6e4744 with the co-authoring feature to make it cumulative effort.

  2. Extensibility: Iterators can now be extended with different iterator types. If you want, the bit view can hand out a non-exact-size chunk iterator or a bit-by-bit iterator, whatever you like. Adding a wrapping iterator for each case makes that easier.

If the PR's aim is to add support for endianness, I would expect some demonstration that the new code can do something that the old code can't (aka tests)

  1. Architecture support: It can now compile and run on big-endian architectures. We still have work to do there, but we will get there eventually. Big-endian tests are written in c7428fb. Moreover, I think we should write more generic implementations, as we have been doing over the last 4 months, and still support the platforms that we have promised. Personally, I don't want to write too much architecture-specific Rust code to make this work over the coming months, and I would advocate the same to the members of the Arrow Rust team. In the C++ version I saw the following, from which you can infer how hard it is to support multiple platforms: https://github.com/apache/arrow/pull/7507/files#diff-c3b0484ad8586ff46fa035d446a7d1c3a30cd35d13cd05678c99814938e07d5bR78-R214

Here you can see mips (be) test results:

   Compiling arrow v3.0.0-SNAPSHOT (/project)
    Finished test [unoptimized + debuginfo] target(s) in 5.64s
     Running /target/mips-unknown-linux-gnu/debug/deps/arrow-ba04cf069343d58e

running 6 tests
test util::bit_slice_iterator::tests_bit_slices_big_endian::test_bit_slice_iter_aligned ... ok
test util::bit_slice_iterator::tests_bit_slices_big_endian::test_bit_slice_iter_reinterpret ... ok
test util::bit_slice_iterator::tests_bit_slices_big_endian::test_bit_slice_iter_unaligned ... ok
test util::bit_slice_iterator::tests_bit_slices_big_endian::test_bit_slice_iter_unaligned_remainder_1_byte ... ok
test util::bit_slice_iterator::tests_bit_slices_big_endian::test_bit_slice_iter_unaligned_remainder_bits_across_bytes ... ok
test util::bit_slice_iterator::tests_bit_slices_big_endian::test_bit_slice_iter_unaligned_remainder_bits_large ... ok

Here you can see armv7 (le) test results:

    Finished test [unoptimized + debuginfo] target(s) in 0.11s
     Running /target/armv7-unknown-linux-gnueabihf/debug/deps/arrow-6ffb743de7744875

running 6 tests
test util::bit_slice_iterator::tests_bit_slices_little_endian::test_bit_slice_iter_aligned ... ok
test util::bit_slice_iterator::tests_bit_slices_little_endian::test_bit_slice_iter_reinterpret ... ok
test util::bit_slice_iterator::tests_bit_slices_little_endian::test_bit_slice_iter_unaligned ... ok
test util::bit_slice_iterator::tests_bit_slices_little_endian::test_bit_slice_iter_unaligned_remainder_1_byte ... ok
test util::bit_slice_iterator::tests_bit_slices_little_endian::test_bit_slice_iter_unaligned_remainder_bits_across_bytes ... ok
test util::bit_slice_iterator::tests_bit_slices_little_endian::test_bit_slice_iter_unaligned_remainder_bits_large ... ok

  1. Preventing bugs: As the sophisticated code in the C++ implementation shows, it is really easy to make mistakes in this area while doing bit shaking, bit twiddling, etc. You might carry one bit to the right without considering the carry, and the code works for a long time until we realize that it is not working anymore. So abstracting some of these things away from development is always good from my point of view, and I find it pragmatically correct in this case.

If the PR's aim is to make the code easier to understand, I would expect some description of why the new code is easier to understand than the old (which will be a subjective judgement, for sure).

  1. Flexibility: You can see the byte reinterpretation that I mentioned/promised before in the tests contained in commit c7428fb. Moreover, the new implementation is exactly 100 lines without comments. Also, views, buffers, iterators, and bit sequence interpretation are completely separate. Obviously, as you said, that is subjective to the reader. I find the separation better atm.

Since this PR seems to have elements of all three goals, but is light on the explanation, I am struggling to evaluate it concisely

I hope I have answered all your questions.

@alamb
Contributor

alamb commented Nov 6, 2020

fyi @jhorstmann this PR likely would cause conflicts with #8571 -- I wonder if you have any opinions on how to proceed

@alamb
Contributor

alamb commented Nov 6, 2020

but as you can see I have committed the exact tests in d6e4744 with the co-authoring feature to make it cumulative effort.

Thank you!

I think with the additional tests demonstrating bug fixes and features, this PR is a good step forward and I would be amenable to merging it

Contributor

Can we move the remainder loop down again so that we are summing elements in the order that they are in the array?

Contributor

I'm wondering whether this is really correct, the way I understood it is that little/big endian only affect the layout of bytes in memory, not how individual bits are accessed in a number. In this testcase the least significant bit of the first byte is zero and would be considered the first value if this was a boolean array or null bitmap. Same for the 4th least significant bit, which is where the slice here should start. This means the least significant bit of the chunk should be zero.

Or am I missing something?

@jhorstmann
Contributor

When I introduced this initially in ARROW-10040, one piece of feedback was that big endian was not supported yet anyway, so it would not be necessary to worry about it for now. I think it could be made to work rather easily by calling to_le in 2-3 places, if I had access to a big-endian test machine or CI pipeline.

Adding a dependency that already implements the chunking and remainder logic is nice. I would have expected that to reduce the code size though.

The buffer_bit_ops microbenchmark seems to be affected quite a bit:

buffer_bit_ops and      time:   [1.1393 us 1.1413 us 1.1433 us]                                
                        change: [+889.05% +892.72% +896.41%] (p = 0.00 < 0.05)
                        Performance has regressed.

The sum aggregation kernel is another bigger user of the bit slice functions and also regressed a bit:

sum nulls 512           time:   [305.83 ns 306.31 ns 306.82 ns]                          
                        change: [+25.194% +25.552% +25.936%] (p = 0.00 < 0.05)
                        Performance has regressed.

Most benchmarks don't seem to be affected much, probably because there is some other overhead or they are not using the chunked functions. Cast kernels for example are implemented using iterators of optional values and so use a different code path.

Contributor

@alamb alamb left a comment

FYI This PR may also be related / partially conflict with #8541

I think we should address @jhorstmann's measurements of performance regressions before this PR is merged.

@vertexclique
Contributor Author

vertexclique commented Nov 7, 2020

I don't think that's needed at all. Optimizations were extremely premature. We should have thought about the future beforehand. I think we are prolonging the discussion without any good point in it. Performance can improve later when we have correct code first.

+ This code is already improved a lot. I find all the comments kind of blocking after the effort I put into it. So ping me when you want to merge this. We need these changes anyway to run on other platforms.

Member

@jorgecarleitao jorgecarleitao left a comment

@vertexclique, thanks a lot for this, and especially for bringing in the bitvec lib.

I went through this PR and I believe that the idea is good, especially that we replace unsafe code, that we use a dependency to solve this problem for us, and that we are not limited to a single chunk size.

Furthermore, I am also of the opinion that correct code is better than fast code.

Like @alamb and @jhorstmann, I am a bit concerned about the performance regression that this entails. 8x on buffer_bit_ops is significant, IMO, and 25% on a horizontal aggregation also seems significant to me.

@vertexclique , no one is trying to block this PR per se, I think that folks are just a bit concerned over the performance regression. Do you have any idea of where this performance hit could have come from?

For example, I would be fine with it if the reason is that we now perform checks to guarantee no undefined behavior (UB), because in that case the comparison is not really fair.

One thing that I think would help this PR get merged would be to separate it into two: one where we refactor the code using bitvec, and another where we go through the endianness.

IMO endianness is a feature sufficiently large to justify a discussion on the mailing list: we need to understand whether the maintainers are willing to take it. We would also need to set up the CI, understand which tests we need to generalize to run on the two platforms, etc.

@nevi-me
Contributor

nevi-me commented Nov 9, 2020

A few weeks/months ago when I was trying to work on bit slicing, bit_vec was recommended over at Reddit, so I think it's generally a good dependency for us to carry along, as it'll also have more eyes over it, and more time being used.

I'm going to look at what the reason for the perf regressions is, during the week in my evenings. I haven't been able to profile criterion benchmarks with the tool that I use (https://superluminal.eu/), so I'll write some small application(s) so I can get the perf profiles for this.
I'll report back after doing this, as I'd also like to understand the cause of the regression before continuing with this PR :)

@vertexclique @jorgecarleitao there's been work on Java and C++ to support big-endian architectures; so maybe we can check in on previous mailing list discussions for guidance. I think CI might be the main concern (incl for arm-v7 support).

I'd also prefer if we separate the big-endian functionality, and work on it as a separate PR.

@vertexclique
Contributor Author

I have created https://issues.apache.org/jira/browse/ARROW-10535. I can only move the tests to another PR, because the code is generic and won't compile without the generic parts. I can remove the big-endian tests and, after this PR is merged, create another PR from master containing them.

@andygrove andygrove added the needs-rebase A PR that needs to be rebased by the author label Nov 10, 2020
@vertexclique vertexclique force-pushed the ARROW-10500-refactor-bitslice-iterator branch 2 times, most recently from 801ca15 to d3c6305 Compare November 10, 2020 22:46
@nevi-me nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Nov 10, 2020
@vertexclique vertexclique force-pushed the ARROW-10500-refactor-bitslice-iterator branch from 4f5e1cf to 7ca1bc5 Compare November 11, 2020 01:07
@vertexclique
Contributor Author

So, I have fixed all the unreproducible benchmarks that we are running in #8635. I have also open-sourced a benchmark utility, https://github.com/vertexclique/zor, to always create the same baseline for everyone who runs cargo bench.

Running master with reproducible benches and doing the same for this PR gives these results:
https://gist.github.com/vertexclique/b5860ad836e78044d331cd3bb93fcf20
So far this PR is giving a boost to many operations.

I think we should address jhorstmann's measurements of performance regressions before this PR is merged.

I measured the performance. 🙃 It is in the PR description.

I'm going to look at what the reason for the perf regressions is, during the week in my evenings. I haven't been able to profile criterion benchmarks with the tool that I use (https://superluminal.eu/), so I'll write some small application(s) so I can get the perf profiles for this.

Would be nice! So far I have used the perf counters and the disassembly, and didn't see anything except bounds checks and proper alignment checks. In the attachment, the left side is master's code and the right side is this PR. Mind that the left side doesn't jump to the remainder checks, because that jump is erased by the const propagation in remainder bits that I applied with 7696b89#diff-4eec3bf3d3993d5a6a7fa0b6a0b057dc5b517b7737eb47e00a887e7a3dcb1c37R65

[Attached image: "ss"]

@vertexclique @jorgecarleitao there's been work on Java and C++ to support big-endian architectures; so maybe we can check in on previous mailing list discussions for guidance. I think CI might be the main concern (incl for arm-v7 support).

Definitely, I will start a thread tomorrow. Also opened: #8634

@jhorstmann
Contributor

I think we should address jhorstmann's measurements of performance regressions before this PR is merged.

I measured the performance. 🙃 It is in the PR description.

That's exactly where I took the benchmark results from. But yes, the regression in the "buffer_bit_ops and" benchmark does not seem to have any big effect.

I had one other comment about the separate testcases for big-endian architectures, or restricting tests to little-endian, that was not yet addressed:

I'm wondering whether this is really correct, the way I understood it is that little/big endian only affect the layout of bytes in memory, not how individual bits are accessed in a number. In this testcase the least significant bit of the first byte is zero and would be considered the first value if this was a boolean array or null bitmap. Same for the 4th least significant bit, which is where the slice here should start. This means the least significant bit of the chunk should be zero.

Consider the following buffer of u8, used as bit-packed data, with the indices of bytes and bits written below

00000000 00010000
       0        1
76543210 76543210

To get the value of the 12th bit we would check bit (12%8) of byte (12/8). Viewing this as a larger type (u16 for simplification):

0001000000000000
               0
111111
5432109876543210

To check the same bit we would need to check bit (12%16) of word (12/16). So the value as u16 would be 4096 and this should be independent of the machine-endianness. Endianness only influences how the u16 would be stored in memory, but our underlying data consists of u8 in memory.
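The worked example above can be checked with a short sketch. The helper names are hypothetical; `u16::from_le_bytes` and `u16::from_be_bytes` are standard library functions, and `from_be_bytes` is used here only to show the value a naive native-order read would produce on a big-endian machine.

```rust
/// Reads bit `i` from bit-packed bytes (bit 0 is the least significant
/// bit of byte 0).
fn get_bit_u8(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

/// Reads bit `i` from bit-packed 16-bit words.
fn get_bit_u16(words: &[u16], i: usize) -> bool {
    (words[i / 16] >> (i % 16)) & 1 == 1
}

fn main() {
    // The buffer from the example: byte 0 = 00000000, byte 1 = 00010000.
    let bytes = [0b0000_0000u8, 0b0001_0000];

    // Interpreting the bytes as little-endian gives the same logical
    // value on every machine.
    let word = u16::from_le_bytes(bytes);
    assert_eq!(word, 4096);
    assert!(get_bit_u8(&bytes, 12));
    assert!(get_bit_u16(&[word], 12));

    // Reading the same two bytes in big-endian order yields a different
    // number, and bit 12 is no longer set in it.
    let swapped = u16::from_be_bytes(bytes);
    assert_eq!(swapped, 16);
    assert!(!get_bit_u16(&[swapped], 12));

    println!("le word = {}, be word = {}", word, swapped);
}
```

This is why an explicit little-endian conversion when widening the bytes keeps bit indices stable regardless of the host byte order.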

@alamb
Contributor

alamb commented Nov 13, 2020

Some additional data: I ran the tests under valgrind (as described in #8645 (comment)) on this branch after rebasing against master.

This branch appears to have the same errors as reported on master (it did not introduce anything new, but neither does it fix the issue that #8645 appears to fix):

I did:

git checkout  ARROW-10500-refactor-bitslice-iterator
git rebase apache/master
PARQUET_TEST_DATA=`pwd`/../../cpp/submodules/parquet-testing/data ARROW_TEST_DATA=`pwd`/../../testing/data ~/Software/valgrind/bin/valgrind /home/andrew/Software/arrow/rust/target/debug/deps/arrow-175722a3038eb7da  --test-threads=1

Which then reported:

test compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8 ... ==11668== Invalid read of size 1
==11668==    at 0x2D63A2: arrow::util::bit_util::set_bits_raw (bit_util.rs:128)
==11668==    by 0x6C1C49: <arrow::array::builder::BufferBuilder<T> as arrow::array::builder::BufferBuilderTrait<T>>::append_n (builder.rs:339)
==11668==    by 0x6E9054: arrow::array::builder::PrimitiveBuilder<T>::append_slice (builder.rs:591)
==11668==    by 0x8CA316: arrow::array::builder::StringBuilder::append_value (builder.rs:1781)
==11668==    by 0x75DEFB: arrow::array::builder::StringDictionaryBuilder<K>::append (builder.rs:2435)
==11668==    by 0x38CFB3: arrow::compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8 (cast.rs:2612)
==11668==    by 0x347D69: arrow::compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8::{{closure}} (cast.rs:2598)
==11668==    by 0xAD460D: core::ops::function::FnOnce::call_once (function.rs:232)
==11668==    by 0xB9EEB5: call_once<(),FnOnce<()>> (boxed.rs:1008)
==11668==    by 0xB9EEB5: call_once<(),alloc::boxed::Box<FnOnce<()>>> (panic.rs:318)
==11668==    by 0xB9EEB5: do_call<std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>,()> (panicking.rs:331)
==11668==    by 0xB9EEB5: try<(),std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>> (panicking.rs:274)
==11668==    by 0xB9EEB5: catch_unwind<std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>,()> (panic.rs:394)
==11668==    by 0xB9EEB5: run_test_in_process (lib.rs:541)
==11668==    by 0xB9EEB5: test::run_test::run_test_inner::{{closure}} (lib.rs:450)
==11668==    by 0xB9E548: test::run_test::run_test_inner (lib.rs:475)
==11668==    by 0xB9C739: test::run_test (lib.rs:505)
==11668==    by 0xB8A528: run_tests<closure-2> (lib.rs:284)
==11668==    by 0xB8A528: test::console::run_tests_console (console.rs:280)
==11668==  Address 0x65c1900 is 0 bytes after a block of size 128 alloc'd
==11668==    at 0x4C34443: memalign (vg_replace_malloc.c:906)
==11668==    by 0x4C34546: posix_memalign (vg_replace_malloc.c:1070)
==11668==    by 0xEBC083: aligned_malloc (alloc.rs:95)
==11668==    by 0xEBC083: alloc (alloc.rs:22)
==11668==    by 0xEBC083: realloc_fallback (alloc.rs:39)
==11668==    by 0xEBC083: realloc (alloc.rs:50)
==11668==    by 0xEBC083: __rdl_realloc (alloc.rs:320)
==11668==    by 0x33E2FC: alloc::alloc::realloc (alloc.rs:124)
==11668==    by 0x3FB608: arrow::memory::reallocate (memory.rs:187)
==11668==    by 0x2BEFE6: arrow::buffer::MutableBuffer::reserve (buffer.rs:686)
==11668==    by 0x6BBD95: <arrow::array::builder::BufferBuilder<T> as arrow::array::builder::BufferBuilderTrait<T>>::reserve (builder.rs:307)
==11668==    by 0x6C1AEF: <arrow::array::builder::BufferBuilder<T> as arrow::array::builder::BufferBuilderTrait<T>>::append_n (builder.rs:335)
==11668==    by 0x6E9054: arrow::array::builder::PrimitiveBuilder<T>::append_slice (builder.rs:591)
==11668==    by 0x8CA316: arrow::array::builder::StringBuilder::append_value (builder.rs:1781)
==11668==    by 0x75DEFB: arrow::array::builder::StringDictionaryBuilder<K>::append (builder.rs:2435)
==11668==    by 0x38CFB3: arrow::compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8 (cast.rs:2612)
==11668== 
==11668== Invalid write of size 1
==11668==    at 0x2D63A4: arrow::util::bit_util::set_bits_raw (bit_util.rs:128)
==11668==    by 0x6C1C49: <arrow::array::builder::BufferBuilder<T> as arrow::array::builder::BufferBuilderTrait<T>>::append_n (builder.rs:339)
==11668==    by 0x6E9054: arrow::array::builder::PrimitiveBuilder<T>::append_slice (builder.rs:591)
==11668==    by 0x8CA316: arrow::array::builder::StringBuilder::append_value (builder.rs:1781)
==11668==    by 0x75DEFB: arrow::array::builder::StringDictionaryBuilder<K>::append (builder.rs:2435)
==11668==    by 0x38CFB3: arrow::compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8 (cast.rs:2612)
==11668==    by 0x347D69: arrow::compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8::{{closure}} (cast.rs:2598)
==11668==    by 0xAD460D: core::ops::function::FnOnce::call_once (function.rs:232)
==11668==    by 0xB9EEB5: call_once<(),FnOnce<()>> (boxed.rs:1008)
==11668==    by 0xB9EEB5: call_once<(),alloc::boxed::Box<FnOnce<()>>> (panic.rs:318)
==11668==    by 0xB9EEB5: do_call<std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>,()> (panicking.rs:331)
==11668==    by 0xB9EEB5: try<(),std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>> (panicking.rs:274)
==11668==    by 0xB9EEB5: catch_unwind<std::panic::AssertUnwindSafe<alloc::boxed::Box<FnOnce<()>>>,()> (panic.rs:394)
==11668==    by 0xB9EEB5: run_test_in_process (lib.rs:541)
==11668==    by 0xB9EEB5: test::run_test::run_test_inner::{{closure}} (lib.rs:450)
==11668==    by 0xB9E548: test::run_test::run_test_inner (lib.rs:475)
==11668==    by 0xB9C739: test::run_test (lib.rs:505)
==11668==    by 0xB8A528: run_tests<closure-2> (lib.rs:284)
==11668==    by 0xB8A528: test::console::run_tests_console (console.rs:280)
==11668==  Address 0x65c1900 is 0 bytes after a block of size 128 alloc'd
==11668==    at 0x4C34443: memalign (vg_replace_malloc.c:906)
==11668==    by 0x4C34546: posix_memalign (vg_replace_malloc.c:1070)
==11668==    by 0xEBC083: aligned_malloc (alloc.rs:95)
==11668==    by 0xEBC083: alloc (alloc.rs:22)
==11668==    by 0xEBC083: realloc_fallback (alloc.rs:39)
==11668==    by 0xEBC083: realloc (alloc.rs:50)
==11668==    by 0xEBC083: __rdl_realloc (alloc.rs:320)
==11668==    by 0x33E2FC: alloc::alloc::realloc (alloc.rs:124)
==11668==    by 0x3FB608: arrow::memory::reallocate (memory.rs:187)
==11668==    by 0x2BEFE6: arrow::buffer::MutableBuffer::reserve (buffer.rs:686)
==11668==    by 0x6BBD95: <arrow::array::builder::BufferBuilder<T> as arrow::array::builder::BufferBuilderTrait<T>>::reserve (builder.rs:307)
==11668==    by 0x6C1AEF: <arrow::array::builder::BufferBuilder<T> as arrow::array::builder::BufferBuilderTrait<T>>::append_n (builder.rs:335)
==11668==    by 0x6E9054: arrow::array::builder::PrimitiveBuilder<T>::append_slice (builder.rs:591)
==11668==    by 0x8CA316: arrow::array::builder::StringBuilder::append_value (builder.rs:1781)
==11668==    by 0x75DEFB: arrow::array::builder::StringDictionaryBuilder<K>::append (builder.rs:2435)
==11668==    by 0x38CFB3: arrow::compute::kernels::cast::tests::test_cast_dict_to_dict_bad_index_value_utf8 (cast.rs:2612)
==11668== 
..

==11668== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

@vertexclique
Contributor Author

vertexclique commented Nov 13, 2020

Yes, it won't until that method is rewritten using the bit-slice iterator :) as written here: #8645 (comment)
P.S. Totally unrelated topic: how do you run Valgrind on a Mac?

@vertexclique vertexclique force-pushed the ARROW-10500-refactor-bitslice-iterator branch from 82171cc to 54c7056 Compare November 14, 2020 18:40
@vertexclique
Contributor Author

Closing in favor of ARROW-10588 at #8664
