Implement directly build byte view array on top of parquet buffer #5972

XiangpengHao · 2024-06-27T20:41:02Z

Which issue does this PR close?

~~This PR is not ready to review until we merged #5970 .~~

Part of #5904 , sequel to #5968 and #5970

Rationale for this change

This PR does the real work: it transforms the parquet page buffer directly into a view buffer without an extra copy.
To see the performance difference, you can run:

cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/Binary.*Array/plain encoded"

arrow_array_reader/BinaryArray/plain encoded, mandatory, no NULLs
                        time:   [321.33 µs 322.06 µs 323.48 µs]
                        change: [-1.1124% -0.8542% -0.4913%] (p = 0.00 < 0.05)
                        Change within noise threshold.
arrow_array_reader/BinaryArray/plain encoded, optional, no NULLs
                        time:   [327.34 µs 327.65 µs 328.06 µs]
                        change: [-0.8087% -0.7037% -0.5861%] (p = 0.00 < 0.05)
                        Change within noise threshold.
arrow_array_reader/BinaryArray/plain encoded, optional, half NULLs
                        time:   [416.00 µs 416.32 µs 416.65 µs]
                        change: [+0.4281% +0.6535% +0.8394%] (p = 0.00 < 0.05)
                        Change within noise threshold.

arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs
                        time:   [238.51 µs 239.00 µs 239.40 µs]
                        change: [-3.9669% -3.6590% -3.3511%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs
                        time:   [243.73 µs 243.82 µs 243.91 µs]
                        change: [+1.3771% +1.5136% +1.6413%] (p = 0.00 < 0.05)
                        Performance has regressed.
arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs
                        time:   [179.69 µs 180.14 µs 180.87 µs]
                        change: [-0.7692% -0.4983% -0.1523%] (p = 0.00 < 0.05)
                        Change within noise threshold.

You should find that with this PR, reading BinaryViewArray is faster than reading BinaryArray -- a milestone toward making StringViewArray faster than StringArray.

Once this set of PRs is merged, the last piece is to make UTF-8 validation fast, so that StringView can keep the advantage.

What changes are included in this PR?

This PR only includes decoding plain-encoded data, for ease of review. Support for RLE/dictionary encoding will be filed soon.
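For readers unfamiliar with the idea, here is a minimal sketch of what building views on top of a plain-encoded page can look like. It is not the code in this PR: `make_view` and `decode_plain_views` are hypothetical names, the sketch uses only the standard library, and it assumes the Arrow view layout (4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index and offset) and the Parquet PLAIN layout for BYTE_ARRAY (4-byte little-endian length followed by the value bytes).

```rust
/// Packs one 16-byte view. Hypothetical helper written for this sketch.
/// Bytes 0..4 hold the length; values of at most 12 bytes are stored inline,
/// longer ones keep a 4-byte prefix, a buffer index and an offset.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Walks a PLAIN-encoded BYTE_ARRAY page and emits one view per value.
/// Long values are referenced by their offset into `page`; their bytes
/// are never copied into a separate values buffer.
fn decode_plain_views(page: &[u8], buffer_index: u32) -> Result<Vec<u128>, String> {
    let mut views = Vec::new();
    let mut pos = 0usize;
    while pos < page.len() {
        // Read the 4-byte little-endian length prefix.
        let prefix = page
            .get(pos..pos + 4)
            .ok_or_else(|| "truncated length prefix".to_string())?;
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        let start = pos + 4;
        let end = start
            .checked_add(len)
            .ok_or_else(|| "length overflow".to_string())?;
        let value = page
            .get(start..end)
            .ok_or_else(|| "value runs past end of page".to_string())?;
        views.push(make_view(value, buffer_index, start as u32));
        pos = end;
    }
    Ok(views)
}
```

The point of the sketch is that the page itself serves as the data buffer, so the only per-value work is writing 16 bytes of view.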

Are there any user-facing changes?


alamb

Looks good to me -- thank you @XiangpengHao -- let me know if you think this PR is ready to merge

alamb · 2024-07-01T19:12:00Z

parquet/src/arrow/array_reader/byte_view_array.rs

+        }
+
+        let mut buffer = ViewBuffer::default();
+        let mut decoder = ByteViewArrayDecoderPlain::new(


does this mean we could have DictionaryArray<Int32, StringView> (as in a dictionary array whose value array is a dictionary?)

Do you mean "whose value array is a string view"?
Yes, I think for optimal performance, the value buffer of the dictionary should also be in string view type to avoid double copying.
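To illustrate why view-typed dictionary values avoid the second copy, here is a small sketch under stated assumptions: `gather_views` is a hypothetical function, views are modelled as plain `u128` values, and keys are `i32`. Expanding keys then only copies 16-byte views and never touches the string bytes.

```rust
use std::convert::TryFrom;

/// Expands dictionary keys into views by copying the 16-byte views only.
/// Hypothetical helper for illustration; not an arrow-rs API.
fn gather_views(dict_views: &[u128], keys: &[i32]) -> Result<Vec<u128>, String> {
    keys.iter()
        .map(|&k| {
            // Reject negative keys and keys past the end of the dictionary.
            usize::try_from(k)
                .ok()
                .and_then(|i| dict_views.get(i).copied())
                .ok_or_else(|| format!("dictionary key {k} out of range"))
        })
        .collect()
}
```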

alamb · 2024-07-01T19:16:14Z

parquet/src/arrow/array_reader/byte_view_array.rs

+            if self.validate_utf8 {
+                check_valid_utf8(unsafe { buf.get_unchecked(start_offset..end_offset) })?;
+            }


Another approach that might be faster is to defer the check until after all the views have been built, and then validate the views in a second pass. That may be faster than doing it inline here.

I have a plan for very fast UTF-8 validation, but I don't want to complicate this PR. I'll file a follow-up PR that addresses the validation issue; hopefully we will then see that loading StringViewArray performs similarly to loading BinaryViewArray.
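For reference, the two-pass idea suggested above could be sketched roughly as follows. This is neither the code in this PR nor the planned follow-up: `collect_ranges` and `validate_ranges` are hypothetical names, and the sketch assumes a PLAIN-encoded page (4-byte little-endian length, then the value bytes) and relies only on `std::str::from_utf8`.

```rust
use std::str;

/// First pass: record the (start, end) byte range of every value in a
/// PLAIN-encoded page without validating anything.
fn collect_ranges(page: &[u8]) -> Result<Vec<(usize, usize)>, String> {
    let mut ranges = Vec::new();
    let mut pos = 0usize;
    while pos < page.len() {
        let prefix = page
            .get(pos..pos + 4)
            .ok_or_else(|| "truncated length prefix".to_string())?;
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        let start = pos + 4;
        let end = start
            .checked_add(len)
            .filter(|&e| e <= page.len())
            .ok_or_else(|| "value runs past end of page".to_string())?;
        ranges.push((start, end));
        pos = end;
    }
    Ok(ranges)
}

/// Second pass: validate every recorded range as UTF-8 in one tight loop.
fn validate_ranges(page: &[u8], ranges: &[(usize, usize)]) -> Result<(), String> {
    for &(start, end) in ranges {
        str::from_utf8(&page[start..end])
            .map_err(|e| format!("invalid UTF-8 in value at offset {start}: {e}"))?;
    }
    Ok(())
}
```

Whether this beats the inline check would need to be measured with the benchmark from the PR description.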


XiangpengHao · 2024-07-02T02:52:01Z

I think this PR is good to go now @alamb


alamb · 2024-07-02T10:37:18Z

Thanks @XiangpengHao

XiangpengHao added 7 commits June 26, 2024 00:35

implement sort for view types

f32aabc

add bench for binary/binary view

8f1c887

Merge branch 'apache:master' into master

7a7a246

Merge remote-tracking branch 'origin/master' into string-view-bench

6b3f1b9

add view buffer, prepare for byte_view_array reader

45d7752

make clippy happy

3e243ad

add byte view array reader

1f3d7ca

github-actions bot added the parquet Changes to the parquet crate label Jun 27, 2024

fix doc link

e5c7bde

XiangpengHao mentioned this pull request Jun 28, 2024

Implement dictionary support for reading ByteView from parquet #5973

Merged

alamb and others added 5 commits June 28, 2024 07:08

Merge remote-tracking branch 'apache/master' into parquet-string-view

1b45c91

reuse make_view_unchecked

25ad3c2

Update parquet/src/arrow/buffer/view_buffer.rs

002b73d

Co-authored-by: Andrew Lamb <[email protected]>

update

7e8ff6a

Merge branch 'parquet-string-view' into parquet-string-view-2

3068578

github-actions bot added the arrow Changes to the arrow crate label Jun 28, 2024

XiangpengHao mentioned this pull request Jun 28, 2024

Add view buffer for parquet reader #5970

Merged

XiangpengHao added 2 commits June 28, 2024 16:36

rename and inline

5846ff0

Merge branch 'parquet-string-view' into parquet-string-view-2

af635e5

XiangpengHao marked this pull request as ready for review July 1, 2024 12:26

Merge branch 'apache:master' into parquet-string-view-2

ab90310

github-actions bot removed the arrow Changes to the arrow crate label Jul 1, 2024

alamb mentioned this pull request Jul 1, 2024

DataFusion weekly project plan (Andrew Lamb) - July 1, 2024 apache/datafusion#11190

Closed

10 tasks

alamb approved these changes Jul 1, 2024

XiangpengHao and others added 3 commits July 1, 2024 22:41

Update parquet/src/arrow/array_reader/byte_view_array.rs

4bb9988

Co-authored-by: Andrew Lamb <[email protected]>

use unused

5e68870

update

f1c33d7

Revert "use unused"

9c5972f

This reverts commit 5e68870.

alamb merged commit 859c4ad into apache:master Jul 2, 2024
16 checks passed

XiangpengHao mentioned this pull request Jul 9, 2024

Improve performance reading ByteViewArray from parquet by removing an implicit copy #6031

Merged

This pull request was closed.
