Look into optimizing reading FixedSizeBinary arrays from parquet #6219
Comments
I think there might be some confusion here: apache/datafusion#11170 (comment) appears to be a misreading of a profile, whereas #6159 (comment) concerns the non-arrow codepaths, which are not optimised and perform an allocation for each value |
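To illustrate the per-value allocation cost mentioned above, here is a minimal std-only sketch. It is not the actual parquet reader code: `decode_per_value` and `FixedLenValues` are hypothetical names, and the input is assumed to be a tightly packed run of fixed-width values.

```rust
/// One heap allocation per value, roughly the pattern described above for
/// the non-arrow path (illustrative only, not the real reader code).
fn decode_per_value(data: &[u8], byte_width: usize) -> Vec<Vec<u8>> {
    data.chunks_exact(byte_width).map(|v| v.to_vec()).collect()
}

/// Hypothetical alternative: one contiguous buffer, with each value handed
/// out as a borrowed slice into it.
struct FixedLenValues {
    bytes: Vec<u8>,
    byte_width: usize,
}

impl FixedLenValues {
    fn new(data: &[u8], byte_width: usize) -> Self {
        assert!(byte_width > 0 && data.len() % byte_width == 0);
        // A single allocation regardless of how many values there are.
        Self { bytes: data.to_vec(), byte_width }
    }

    fn len(&self) -> usize {
        self.bytes.len() / self.byte_width
    }

    fn value(&self, i: usize) -> &[u8] {
        &self.bytes[i * self.byte_width..(i + 1) * self.byte_width]
    }
}
```

Both forms expose the same bytes; the second simply avoids allocating once per value, which is the kind of difference a profile of the non-arrow path would tend to surface.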
I could definitely be confused -- is there any low hanging fruit left for reading FixedSizeBinary in the ArrowReader that you know of? |
Not that immediately springs to mind, but it has been almost 2 years so I could just have forgotten |
Unless I'm mistaken those are PRs for the non-arrow reader and some newly added unreleased functionality |
Correct, but #6222 is at least in the same module as this issue 😅. While I'm working on that I can take a look at the other decoders and see if there's any low hanging fruit (although tbh I'm not seeing any at the moment). |
It seems like the majority of time is spent converting
|
I did a quick test of FIXED_LEN_BYTE_ARRAY(16)/Decimal128 vs unannotated FIXED_LEN_BYTE_ARRAY(16), and the latter was much faster.
I don't think FLBA decoding in the arrow reader is the culprit. |
Pardon a little more spam on this, but as I dig deeper into the arrow-rs/arrow-array/src/array/primitive_array.rs Line 1318 in a693f0f
Buffer, while creating a null buffer as it goes. But in FixedLenByteArrayReader::consume_batch, we already have a null buffer in binary
I'm wondering if it would make sense here to (in the cases where we're converting from
ArrowType::Decimal128(p, s) => {
    let nb = binary.take_nulls();
    let decimal = binary
        .iter()
        .map(|o| match o {
            Some(b) => i128::from_be_bytes(sign_extend_be(b)),
            None => i128::default(),
        });
    let decimal = Decimal128Array::from_iter_values_with_nulls(decimal, nb)
        .with_precision_and_scale(*p, *s)?;
    Arc::new(decimal)
}
Am I missing something subtle (not out of the question...this is my 5th attempt or so) that would break this? In particular, is it safe to assume the null buffer in |
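The idea in the snippet above can be sketched without any arrow types. This is a rough, std-only illustration under stated assumptions: `sign_extend_be_16` is a hypothetical stand-in for the `sign_extend_be` helper, a plain `&[bool]` stands in for the null buffer, and null slots are assumed to still occupy `byte_width` bytes in the data.

```rust
/// Hypothetical stand-in for the `sign_extend_be` helper referenced above:
/// widens a big-endian two's-complement value of at most 16 bytes to 16 bytes.
fn sign_extend_be_16(b: &[u8]) -> [u8; 16] {
    assert!(b.len() <= 16, "value wider than 16 bytes");
    // Pad with 0xFF when the sign bit is set, 0x00 otherwise.
    let fill = if b.first().map_or(false, |v| v & 0x80 != 0) { 0xFF } else { 0x00 };
    let mut out = [fill; 16];
    out[16 - b.len()..].copy_from_slice(b);
    out
}

/// Decode every slot in one pass, writing a default for null slots. The
/// validity mask is only read, so the caller can reuse it unchanged instead
/// of building a second one while iterating.
fn decode_decimal128(data: &[u8], byte_width: usize, validity: &[bool]) -> Vec<i128> {
    data.chunks_exact(byte_width)
        .zip(validity)
        .map(|(bytes, &valid)| {
            if valid {
                i128::from_be_bytes(sign_extend_be_16(bytes))
            } else {
                i128::default()
            }
        })
        .collect()
}
```

Whether the existing null buffer can really be reused this way in the reader is exactly the open question in the comment; the sketch only shows that the values themselves do not depend on rebuilding it.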
It sounds plausible, but I've not spent much time with this code beyond what was necessary to allow ripping out the legacy ComplexObjectArrayReader, and that was almost 2 years ago. There is almost certainly some low hanging fruit when it comes to reading Decimal128Array and IntervalArray, that is part of why I was quite so confused to see this issue which suggests the opposite. |
Thanks. I'll clean up what I have and submit a PR in the next day or two then. I agree that the initial report was likely a misunderstanding. |
I think we can plausibly claim that we have accomplished this ticket: Looking into optimizing the reads. So marking it as closed. Let's open other tickets as we find other ways to improve things |
|
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We have anecdotal evidence in DataFusion (see @samuelcolvin's ticket apache/datafusion#11170) that reading 16-byte UUID values from Decimal128 is much faster than FixedSizeBinary, despite seemingly very little difference between the two.
@appletreeisyellow found that most of the time is spent reading parquet: apache/datafusion#11170 (comment)
@etseidl also noted the slowness when working with FixedSizeBinary here #6159 (comment)
Describe the solution you'd like
Look into improving the parquet reader so that reading FixedSizeBinary is faster
Describe alternatives you've considered
Additional context