
Speed up pad_nulls for FixedLenByteArrayBuffer #6297

Merged (13 commits, Oct 2, 2024)

Conversation

@etseidl (Contributor) commented Aug 23, 2024

Which issue does this PR close?

Closes #6296.

Rationale for this change

See issue.

What changes are included in this PR?

Changes pad_nulls for FixedLenByteArrayBuffer as described in the issue. Benchmarking seems to indicate that using the new approach for byte_length > 4 is optimal. The loop is better for byte_length <= 4 because the compiler will eliminate the loop via unrolling.
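To make the trade-off concrete, here is a minimal sketch of the two copy strategies being weighed (a hypothetical helper, not the arrow-rs source; in the real reader the value width is fixed per column, which is what lets the compiler unroll the loop form for small widths):

```rust
/// Hypothetical sketch: copy one fixed-length value within `buffer`
/// from byte offset `src` to byte offset `dst`, picking a strategy
/// by value width.
fn copy_value(buffer: &mut [u8], byte_length: usize, src: usize, dst: usize) {
    if byte_length <= 4 {
        // Loop form: for small, effectively constant widths the compiler
        // can unroll this into a few direct byte moves.
        for i in 0..byte_length {
            buffer[dst + i] = buffer[src + i];
        }
    } else {
        // Bulk form: copy_within lowers to a memmove, which wins for
        // wider values such as 8- and 16-byte decimals.
        buffer.copy_within(src..src + byte_length, dst);
    }
}
```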

This PR also includes extensive changes to the arrow_reader bench to allow benchmarking these changes.

Are there any user-facing changes?

No

@github-actions bot added the parquet (Changes to the parquet crate) label on Aug 23, 2024
@etseidl (Contributor, Author) commented Aug 23, 2024

Benchmark results for these changes (pad loop is as in master branch):

group                                                                                                      pad_loop                               pad_nulls_vec2
-----                                                                                                      --------                               --------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.45    558.3±4.21µs        ? ?/sec    1.00    385.9±3.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.67    422.5±4.04µs        ? ?/sec    1.00    253.3±4.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.03    119.9±0.74µs        ? ?/sec    1.00    116.5±1.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.04    105.7±1.24µs        ? ?/sec    1.00    101.2±1.09µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.63    440.0±1.65µs        ? ?/sec    1.00    269.5±1.76µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.17    297.8±3.19µs        ? ?/sec    1.00    137.1±3.25µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    108.4±4.14µs        ? ?/sec    1.00    101.2±1.03µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.00     86.8±1.53µs        ? ?/sec    1.00     86.8±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.01    148.4±1.92µs        ? ?/sec    1.00    146.9±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.02    117.0±0.40µs        ? ?/sec    1.00    115.2±1.27µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.33    241.4±2.93µs        ? ?/sec    1.00    181.5±2.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.49    174.5±3.44µs        ? ?/sec    1.00    116.9±1.26µs        ? ?/sec

@etseidl (Contributor, Author) commented Aug 23, 2024

I must have hallucinated last night... turns out copy_from_slice actually is a bit faster for the long arrays. Master vs 1a1e967:

Linux with a newer Xeon:
group                                                                                                      pad_loop                               pad_nulls_slice
-----                                                                                                      --------                               ---------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.52    558.3±4.21µs        ? ?/sec    1.00    367.8±3.17µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.85    422.5±4.04µs        ? ?/sec    1.00    228.8±3.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.07    119.9±0.74µs        ? ?/sec    1.00    112.3±1.55µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.16    105.7±1.24µs        ? ?/sec    1.00     91.3±0.85µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.69    440.0±1.65µs        ? ?/sec    1.00    259.7±2.18µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.36    297.8±3.19µs        ? ?/sec    1.00    126.3±1.59µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.20    108.4±4.14µs        ? ?/sec    1.00     90.6±0.81µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.16     86.8±1.53µs        ? ?/sec    1.00     74.9±0.63µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.06    148.4±1.92µs        ? ?/sec    1.00    140.4±2.67µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.08    117.0±0.40µs        ? ?/sec    1.00    108.0±1.19µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.39    241.4±2.93µs        ? ?/sec    1.00    173.1±2.77µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.60    174.5±3.44µs        ? ?/sec    1.00    109.0±1.27µs        ? ?/sec

Hmm, this is pretty system-specific.

Old Mac laptop:
group                                                                                                      pad_loop                               pad_slice                               pad_vec
-----                                                                                                      --------                               ---------                               -------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.15  1373.8±32.79µs        ? ?/sec    1.02  1218.4±282.47µs        ? ?/sec    1.00  1199.0±38.80µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.32  1035.7±27.45µs        ? ?/sec    1.00   785.5±16.07µs        ? ?/sec     1.02   804.6±25.53µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.00    308.7±6.87µs        ? ?/sec    1.00   309.2±14.52µs        ? ?/sec     1.04    320.1±9.10µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.07    266.8±8.11µs        ? ?/sec    1.00    249.0±6.98µs        ? ?/sec     1.12    278.7±8.53µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.32  1041.8±28.99µs        ? ?/sec    1.00   790.8±19.98µs        ? ?/sec     1.01   795.1±16.92µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               1.55   697.3±13.37µs        ? ?/sec    1.00   449.4±11.70µs        ? ?/sec     1.01    456.0±9.74µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    279.4±5.72µs        ? ?/sec    1.00   260.4±16.31µs        ? ?/sec     1.11    289.6±5.28µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.08    236.6±4.57µs        ? ?/sec    1.00    218.3±4.90µs        ? ?/sec     1.14    249.2±9.71µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.00    371.7±7.78µs        ? ?/sec    1.04    386.3±7.68µs        ? ?/sec     1.11   410.9±10.57µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.00    287.8±6.90µs        ? ?/sec    1.05    303.1±6.91µs        ? ?/sec     1.13    323.9±6.61µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.15   594.5±16.03µs        ? ?/sec    1.04   535.7±11.47µs        ? ?/sec     1.00   517.0±13.36µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.22   430.9±11.69µs        ? ?/sec    1.12   396.6±21.44µs        ? ?/sec     1.00   354.6±10.20µs        ? ?/sec

@alamb (Contributor) commented Aug 28, 2024

> I must have hallucinated last night... turns out copy_from_slice actually is a bit faster for the long arrays.

Sorry, I didn't follow all the nuance of the results here. It seems like you have concluded that which approach is faster depends on the target architecture?

If this is the case, I think my preference would be to go with the simplest code (e.g. copy_from_slice?); otherwise we run the risk of overfitting / over-optimizing for some specific conditions.

@etseidl (Contributor, Author) commented Aug 28, 2024

> I must have hallucinated last night... turns out copy_from_slice actually is a bit faster for the long arrays.
>
> Sorry, I didn't follow all the nuance of the results here. It seems like you have concluded that which approach is faster depends on the target architecture?

Sorry, I tend to ramble... I concluded that copy_from_slice is the way to go since it's better on modern CPUs. I only added the Mac results to show I hadn't imagined the other approach being faster when I did my initial testing. 😅

> If this is the case, I think my preference would be to go with the simplest code (e.g. copy_from_slice?); otherwise we run the risk of overfitting / over-optimizing for some specific conditions.

agreed!

@alamb (Contributor) commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@wiedld (Contributor) commented Sep 27, 2024

I'm starting to review this one. Just give me a little bit to dig into the decoding.

Review thread on this hunk:

) where
    F: FnMut(&mut Vec<u8>, usize, usize, usize),
{
    for (value_pos, level_pos) in values_range.rev().zip(iter_set_bits_rev(valid_mask)) {
@wiedld (Contributor):

Learning question: confirming that the iter_set_bits_rev (reversing) is due to little-endianness?

@etseidl (Contributor, Author):

IIUC iter_set_bits_rev is returning an iterator of valid slot indexes for the array, but in reverse order, to match the values_range iterator which is also reversed (which I believe is just a monotonically decreasing counter from num_non_null-1 to 0). We're turning a dense array of non-null values into a sparse one where null slots are left undefined, but doing it in place, which is why it's done from the tail.
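To illustrate the tail-first expansion, here is a minimal sketch of the idea (hypothetical signature and helper names, not the actual arrow-rs implementation; `valid_slots_rev` plays the role of iter_set_bits_rev):

```rust
/// Hypothetical sketch: spread `values_read` densely packed fixed-length
/// values across `levels_read` slots, leaving null slots undefined.
/// `valid_slots_rev` yields the slot indexes of set validity bits,
/// highest index first.
fn pad_nulls_sketch(
    buffer: &mut Vec<u8>,
    byte_length: usize,
    values_read: usize,
    levels_read: usize,
    valid_slots_rev: impl Iterator<Item = usize>,
) {
    // Make room for every slot, null or not.
    buffer.resize(levels_read * byte_length, 0);

    // Walk the dense positions (values_read-1 down to 0) and the valid
    // slot positions together from the tail. Each value moves to an index
    // >= its current one, so nothing is overwritten before it is copied.
    for (value_pos, level_pos) in (0..values_read).rev().zip(valid_slots_rev) {
        if level_pos > value_pos {
            let src = value_pos * byte_length;
            // copy_within handles the (potentially overlapping) move.
            buffer.copy_within(src..src + byte_length, level_pos * byte_length);
        }
    }
}
```

For example, with byte_length = 2, two values [A, B] packed at the front, and slots 0 and 2 valid out of four, the loop moves B from slot 1 to slot 2 and leaves A in slot 0; slots 1 and 3 keep undefined bytes, masked out by the validity buffer.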

Comment on lines +1096 to +1097
// byte_stream_split encoded, no NULLs
let data = build_encoded_primitive_page_iterator::<T>(
@wiedld (Contributor):

Nice additional coverage. 👍🏼

@wiedld (Contributor):

Not related to a change you made: under one of the bench_primitive test scenarios, I think the label should be "binary packed skip" and it should use bench_array_reader_skip. Do you agree?

@etseidl (Contributor, Author):

Binary packed skip is handled here and here. I think the test you highlighted is correct, but it's quite possible I'm missing something.

@wiedld (Contributor):

I'll put up the suggested changes in a draft PR, since it's unrelated, and then ping you. Thanks for considering!


Review thread on this hunk:

    decimal_benches,
    byte_stream_split_benches,
);
criterion_group!(benches, add_benches, decimal_benches, float16_benches,);
@wiedld (Contributor):
Very nice refactoring of the tests. TY.
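For context on the registration pattern being consolidated here, criterion's macros compose like this (a minimal standalone sketch with a placeholder benchmark, not the actual bench file):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder benchmark; the real file registers the arrow_array_reader
// benchmarks shown in the tables above.
fn add_benches(c: &mut Criterion) {
    c.bench_function("example", |b| b.iter(|| std::hint::black_box(1 + 1)));
}

// criterion_group! bundles benchmark functions under one group name;
// criterion_main! generates the main() that runs every listed group.
criterion_group!(benches, add_benches);
criterion_main!(benches);
```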

@wiedld (Contributor) left a review:

LGTM

@alamb (Contributor) left a review:

Thank you @etseidl and thank you very much for the review @wiedld

@alamb (Contributor) commented Oct 2, 2024

Thanks again @etseidl -- sorry this took so long to review

@etseidl deleted the bss_pad_nulls branch on October 2, 2024
@alamb added the enhancement (Any new improvement worthy of an entry in the changelog) and performance labels on Oct 2, 2024
Linked issue closed by this PR: Reading FIXED_LEN_BYTE_ARRAY columns with nulls is inefficient (#6296)