
Speed up pad_nulls for FixedLenByteArrayBuffer #6297

Merged (13 commits, Oct 2, 2024)

Conversation

@etseidl (Contributor) commented Aug 23, 2024

Which issue does this PR close?

Closes #6296.

Rationale for this change

See issue.

What changes are included in this PR?

Changes pad_nulls for FixedLenByteArrayBuffer as described in the issue. Benchmarking seems to indicate that using the new approach for byte_length > 4 is optimal. The loop is better for byte_length <= 4 because the compiler will eliminate the loop via unrolling.
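To make the trade-off concrete, here is a minimal sketch of the two copy strategies being weighed (a hypothetical helper, not the arrow-rs source; in the real reader the value width is fixed per column, which is what lets the compiler unroll the loop form for small widths):

```rust
/// Hypothetical sketch: copy one fixed-length value within `buffer`
/// from byte offset `src` to byte offset `dst`, picking a strategy
/// by value width.
fn copy_value(buffer: &mut [u8], byte_length: usize, src: usize, dst: usize) {
    if byte_length <= 4 {
        // Loop form: for small, effectively constant widths the compiler
        // can unroll this into a few direct byte moves.
        for i in 0..byte_length {
            buffer[dst + i] = buffer[src + i];
        }
    } else {
        // Bulk form: copy_within lowers to a memmove, which wins for
        // wider values such as 8- and 16-byte decimals.
        buffer.copy_within(src..src + byte_length, dst);
    }
}
```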

This PR also includes extensive changes to the arrow_reader bench to allow benchmarking these changes.

Are there any user-facing changes?

No

@github-actions bot added the parquet (Changes to the parquet crate) label on Aug 23, 2024
@etseidl (Contributor, Author) commented Aug 23, 2024

Benchmark results for these changes (pad loop is as in master branch):

group                                                                                                      pad_loop                               pad_nulls_vec2
-----                                                                                                      --------                               --------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.45    558.3±4.21µs        ? ?/sec    1.00    385.9±3.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.67    422.5±4.04µs        ? ?/sec    1.00    253.3±4.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.03    119.9±0.74µs        ? ?/sec    1.00    116.5±1.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.04    105.7±1.24µs        ? ?/sec    1.00    101.2±1.09µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.63    440.0±1.65µs        ? ?/sec    1.00    269.5±1.76µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.17    297.8±3.19µs        ? ?/sec    1.00    137.1±3.25µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    108.4±4.14µs        ? ?/sec    1.00    101.2±1.03µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.00     86.8±1.53µs        ? ?/sec    1.00     86.8±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.01    148.4±1.92µs        ? ?/sec    1.00    146.9±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.02    117.0±0.40µs        ? ?/sec    1.00    115.2±1.27µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.33    241.4±2.93µs        ? ?/sec    1.00    181.5±2.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.49    174.5±3.44µs        ? ?/sec    1.00    116.9±1.26µs        ? ?/sec

@etseidl (Contributor, Author) commented Aug 23, 2024

I must have hallucinated last night... turns out copy_from_slice actually is a bit faster for the long arrays. Master vs 1a1e967:

Linux with a newer Xeon:
group                                                                                                      pad_loop                               pad_nulls_slice
-----                                                                                                      --------                               ---------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.52    558.3±4.21µs        ? ?/sec    1.00    367.8±3.17µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.85    422.5±4.04µs        ? ?/sec    1.00    228.8±3.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.07    119.9±0.74µs        ? ?/sec    1.00    112.3±1.55µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.16    105.7±1.24µs        ? ?/sec    1.00     91.3±0.85µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.69    440.0±1.65µs        ? ?/sec    1.00    259.7±2.18µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.36    297.8±3.19µs        ? ?/sec    1.00    126.3±1.59µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.20    108.4±4.14µs        ? ?/sec    1.00     90.6±0.81µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.16     86.8±1.53µs        ? ?/sec    1.00     74.9±0.63µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.06    148.4±1.92µs        ? ?/sec    1.00    140.4±2.67µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.08    117.0±0.40µs        ? ?/sec    1.00    108.0±1.19µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.39    241.4±2.93µs        ? ?/sec    1.00    173.1±2.77µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.60    174.5±3.44µs        ? ?/sec    1.00    109.0±1.27µs        ? ?/sec

Hmm, this is pretty system-specific.

Old Mac laptop:
group                                                                                                      pad_loop                               pad_slice                               pad_vec
-----                                                                                                      --------                               ---------                               -------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.15  1373.8±32.79µs        ? ?/sec    1.02  1218.4±282.47µs        ? ?/sec    1.00  1199.0±38.80µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.32  1035.7±27.45µs        ? ?/sec    1.00   785.5±16.07µs        ? ?/sec     1.02   804.6±25.53µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.00    308.7±6.87µs        ? ?/sec    1.00   309.2±14.52µs        ? ?/sec     1.04    320.1±9.10µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.07    266.8±8.11µs        ? ?/sec    1.00    249.0±6.98µs        ? ?/sec     1.12    278.7±8.53µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.32  1041.8±28.99µs        ? ?/sec    1.00   790.8±19.98µs        ? ?/sec     1.01   795.1±16.92µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               1.55   697.3±13.37µs        ? ?/sec    1.00   449.4±11.70µs        ? ?/sec     1.01    456.0±9.74µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    279.4±5.72µs        ? ?/sec    1.00   260.4±16.31µs        ? ?/sec     1.11    289.6±5.28µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.08    236.6±4.57µs        ? ?/sec    1.00    218.3±4.90µs        ? ?/sec     1.14    249.2±9.71µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.00    371.7±7.78µs        ? ?/sec    1.04    386.3±7.68µs        ? ?/sec     1.11   410.9±10.57µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.00    287.8±6.90µs        ? ?/sec    1.05    303.1±6.91µs        ? ?/sec     1.13    323.9±6.61µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.15   594.5±16.03µs        ? ?/sec    1.04   535.7±11.47µs        ? ?/sec     1.00   517.0±13.36µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.22   430.9±11.69µs        ? ?/sec    1.12   396.6±21.44µs        ? ?/sec     1.00   354.6±10.20µs        ? ?/sec

@alamb (Contributor) commented Aug 28, 2024

> I must have hallucinated last night... turns out copy_from_slice actually is a bit faster for the long arrays.

Sorry, I didn't follow all the nuance of the results here. It seems like you have concluded that which approach is faster depends on the target architecture?

If this is the case, I think my preference would be to go with the simplest code (e.g. copy_from_slice?); otherwise we run the risk of overfitting / over-optimizing for some specific conditions.

@etseidl (Contributor, Author) commented Aug 28, 2024

> I must have hallucinated last night... turns out copy_from_slice actually is a bit faster for the long arrays.
>
> Sorry, I didn't follow all the nuance of the results here. It seems like you have concluded that which approach is faster depends on the target architecture?

Sorry, I tend to ramble... I concluded that copy_from_slice is the way to go since it's better on modern CPUs. I only added the Mac results to show I hadn't imagined the other approach being faster when I did my initial testing. 😅

> If this is the case, I think my preference would be to go with the simplest code (e.g. copy_from_slice?); otherwise we run the risk of overfitting / over-optimizing for some specific conditions.

agreed!

@alamb (Contributor) commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@wiedld (Contributor) commented Sep 27, 2024

I'm starting to review this one. Just give me a little bit to dig into the decoding.

Review thread on this hunk:

) where
    F: FnMut(&mut Vec<u8>, usize, usize, usize),
{
    for (value_pos, level_pos) in values_range.rev().zip(iter_set_bits_rev(valid_mask)) {
@wiedld (Contributor):

Learning question: confirming that the iter_set_bits_rev (reversing) is due to little-endianness?

@etseidl (Contributor, Author):

IIUC iter_set_bits_rev is returning an iterator of valid slot indexes for the array, but in reverse order, to match the values_range iterator which is also reversed (which I believe is just a monotonically decreasing counter from num_non_null-1 to 0). We're turning a dense array of non-null values into a sparse one where null slots are left undefined, but doing it in place, which is why it's done from the tail.
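To illustrate the tail-first expansion, here is a minimal sketch of the idea (hypothetical signature and helper names, not the actual arrow-rs implementation; `valid_slots_rev` plays the role of iter_set_bits_rev):

```rust
/// Hypothetical sketch: spread `values_read` densely packed fixed-length
/// values across `levels_read` slots, leaving null slots undefined.
/// `valid_slots_rev` yields the slot indexes of set validity bits,
/// highest index first.
fn pad_nulls_sketch(
    buffer: &mut Vec<u8>,
    byte_length: usize,
    values_read: usize,
    levels_read: usize,
    valid_slots_rev: impl Iterator<Item = usize>,
) {
    // Make room for every slot, null or not.
    buffer.resize(levels_read * byte_length, 0);

    // Walk the dense positions (values_read-1 down to 0) and the valid
    // slot positions together from the tail. Each value moves to an index
    // >= its current one, so nothing is overwritten before it is copied.
    for (value_pos, level_pos) in (0..values_read).rev().zip(valid_slots_rev) {
        if level_pos > value_pos {
            let src = value_pos * byte_length;
            // copy_within handles the (potentially overlapping) move.
            buffer.copy_within(src..src + byte_length, level_pos * byte_length);
        }
    }
}
```

For example, with byte_length = 2, two values [A, B] packed at the front, and slots 0 and 2 valid out of four, the loop moves B from slot 1 to slot 2 and leaves A in slot 0; slots 1 and 3 keep undefined bytes, masked out by the validity buffer.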

Comment on lines +1096 to +1097
// byte_stream_split encoded, no NULLs
let data = build_encoded_primitive_page_iterator::<T>(
@wiedld (Contributor):

Nice additional coverage. 👍🏼

@wiedld (Contributor):

Not related to a change you made: under one of the bench_primitive test scenarios, I think the label should be "binary packed skip" and it should use bench_array_reader_skip. Do you agree?

@etseidl (Contributor, Author):

Binary packed skip is handled here and here. I think the test you highlighted is correct, but it's quite possible I'm missing something.

@wiedld (Contributor):

I'll put up the suggested changes in a draft PR, since it's unrelated, and then ping you. Thanks for considering!


Review thread on this hunk:

    decimal_benches,
    byte_stream_split_benches,
);
criterion_group!(benches, add_benches, decimal_benches, float16_benches,);
@wiedld (Contributor):
Very nice refactoring of the tests. TY.
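For context on the registration pattern being consolidated here, criterion's macros compose like this (a minimal standalone sketch with a placeholder benchmark, not the actual bench file):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder benchmark; the real file registers the arrow_array_reader
// benchmarks shown in the tables above.
fn add_benches(c: &mut Criterion) {
    c.bench_function("example", |b| b.iter(|| std::hint::black_box(1 + 1)));
}

// criterion_group! bundles benchmark functions under one group name;
// criterion_main! generates the main() that runs every listed group.
criterion_group!(benches, add_benches);
criterion_main!(benches);
```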

@wiedld (Contributor) left a review:

LGTM

@alamb (Contributor) left a review:

Thank you @etseidl and thank you very much for the review @wiedld

@alamb (Contributor) commented Oct 2, 2024

Thanks again @etseidl -- sorry this took so long to review

@etseidl deleted the bss_pad_nulls branch on October 2, 2024
@alamb added the enhancement (Any new improvement worthy of an entry in the changelog) and performance labels on Oct 2, 2024
Linked issue closed by this PR: Reading FIXED_LEN_BYTE_ARRAY columns with nulls is inefficient (#6296)