Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement directly build byte view array on top of parquet buffer #5972

Merged
merged 20 commits into from
Jul 2, 2024

Conversation

XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Jun 27, 2024

Which issue does this PR close?

This PR is not ready to review until we merged #5970 .

Part of #5904 , sequel to #5968 and #5970

Rationale for this change

This PR has the real work to directly transform the parquet page buffer into a view buffer without extra copy.
To see the performance difference, you can run:

cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/Binary.*Array/plain encoded"
arrow_array_reader/BinaryArray/plain encoded, mandatory, no NULLs
                        time:   [321.33 µs 322.06 µs 323.48 µs]
                        change: [-1.1124% -0.8542% -0.4913%] (p = 0.00 < 0.05)
                        Change within noise threshold.
arrow_array_reader/BinaryArray/plain encoded, optional, no NULLs
                        time:   [327.34 µs 327.65 µs 328.06 µs]
                        change: [-0.8087% -0.7037% -0.5861%] (p = 0.00 < 0.05)
                        Change within noise threshold.
arrow_array_reader/BinaryArray/plain encoded, optional, half NULLs
                        time:   [416.00 µs 416.32 µs 416.65 µs]
                        change: [+0.4281% +0.6535% +0.8394%] (p = 0.00 < 0.05)
                        Change within noise threshold.

arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs
                        time:   [238.51 µs 239.00 µs 239.40 µs]
                        change: [-3.9669% -3.6590% -3.3511%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs
                        time:   [243.73 µs 243.82 µs 243.91 µs]
                        change: [+1.3771% +1.5136% +1.6413%] (p = 0.00 < 0.05)
                        Performance has regressed.
arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs
                        time:   [179.69 µs 180.14 µs 180.87 µs]
                        change: [-0.7692% -0.4983% -0.1523%] (p = 0.00 < 0.05)
                        Change within noise threshold.

You should find that with this PR, reading BinaryViewArray is faster than reading BinaryArray -- a milestone from making StringViewArray faster than StringArray.

When this (set of) PR is merged, the last piece is to make utf8 validation fast, so that string view can maintain the advantage.

What changes are included in this PR?

This PR only includes decoding plain data for ease of review. Supporting for RLE/dictionary will be filed soon.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 27, 2024
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 28, 2024
@XiangpengHao XiangpengHao marked this pull request as ready for review July 1, 2024 12:26
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Jul 1, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @XiangpengHao -- let me know if you think this PR is ready to merge

}

let mut buffer = ViewBuffer::default();
let mut decoder = ByteViewArrayDecoderPlain::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean we could have DictionaryArray<Int32, StringView> (as in a dictionary array whose value array is a dictionary?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean "whose value array is a string view"?
Yes, I think for optimal performance, the value buffer of the dictionary should also be in string view type to avoid double copying.

Comment on lines +317 to +319
if self.validate_utf8 {
check_valid_utf8(unsafe { buf.get_unchecked(start_offset..end_offset) })?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way that might be faster would be to defer the checking until after all the views were made. Then, you could take a second pass through for view validation. Maybe that would be faster than doing it inlined here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a plan for very fast utf8 vailidation, but don't want to complicate this PR here. I'll file a follow up PR that addresses the validation issue, we will hopefully see that loading StringViewArray is similar to loading BinaryViewArray.

@XiangpengHao
Copy link
Contributor Author

I think this PR is good to go now @alamb

This reverts commit 5e68870.
@alamb alamb merged commit 859c4ad into apache:master Jul 2, 2024
16 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

Thanks @XiangpengHao

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants