ARROW-10927: [Rust][Parquet] Add Decimal to ArrayBuilderReader #8880
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8880 +/- ##
==========================================
- Coverage 83.25% 83.24% -0.01%
==========================================
Files 196 196
Lines 48116 48177 +61
==========================================
+ Hits 40059 40106 +47
- Misses 8057 8071 +14
Continue to review full report at Codecov.
|
|
@seddonm1 Thanks for mentioning this issue - I was not aware of it. Could you point me in the direction of some modules / tests I could use to get started with DataFusion to help with it?
The TPC-H queries are fairly basic, but they do decimal operations like these, which I think all operate on the decimal type:

sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc
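Expressions like l_extendedprice * (1 - l_discount) multiply decimal values, which changes the result's scale. A minimal, self-contained sketch of that scale rule, using a hypothetical Dec type rather than the arrow crate's actual decimal support:

```rust
// Hypothetical fixed-point decimal: an unscaled i128 plus a scale
// (number of fractional digits). Not the arrow crate's type.
#[derive(Debug, PartialEq)]
struct Dec {
    value: i128,
    scale: u32,
}

// Multiplying two decimals multiplies the unscaled values and adds the scales.
fn mul(a: &Dec, b: &Dec) -> Dec {
    Dec {
        value: a.value * b.value,
        scale: a.scale + b.scale,
    }
}

fn main() {
    let price = Dec { value: 10050, scale: 2 }; // 100.50
    let one_minus_disc = Dec { value: 95, scale: 2 }; // 0.95
    let disc_price = mul(&price, &one_minus_disc);
    // 10050 * 95 = 954750 at scale 4, i.e. 95.4750
    assert_eq!(disc_price, Dec { value: 954750, scale: 4 });
    println!("ok");
}
```

An aggregation such as sum(...) over these results keeps the scale but needs a wider precision to avoid overflow, which is part of what makes decimal support in query engines non-trivial.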
@sweb, datafusion is built on top of the arrow crate, and relies on both the array types and the … Different nodes in the logical plan require different … Here is an example of a test that does not check IO with CSV, only the execution in memory. I do not know if we have DecimalType on the files … Essentially, almost everything that has a …
@seddonm1 @jorgecarleitao thank you for providing context! I will have a look and try to come up with a first implementation.
Please open a separate JIRA 😃 The Parquet spec supports reading decimal data from a few other types (…). I think reading from …
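For context: per the Parquet format specification, DECIMAL can annotate INT32, INT64, BYTE_ARRAY, or FIXED_LEN_BYTE_ARRAY, and the binary encodings store the unscaled value as big-endian two's complement. A minimal sketch of that byte decoding, using a hypothetical helper rather than this PR's actual code:

```rust
// Hypothetical helper: interpret a big-endian two's-complement byte slice
// (as stored in a Parquet FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY decimal)
// as an i128 unscaled value.
fn decimal_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!(bytes.len() <= 16, "only up to 128-bit decimals supported");
    // Sign-extend: fill with 0xFF if the value is negative, 0x00 otherwise.
    let fill = if bytes.first().map_or(false, |b| b & 0x80 != 0) {
        0xFFu8
    } else {
        0x00u8
    };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    // Two bytes encoding 12345; at scale 2 this would be 123.45.
    assert_eq!(decimal_bytes_to_i128(&[0x30, 0x39]), 12345);
    // A single 0xFF byte is -1 in two's complement.
    assert_eq!(decimal_bytes_to_i128(&[0xFF]), -1);
    println!("ok");
}
```

Reading decimals backed by INT32/INT64 would skip this step entirely, since the unscaled value is already an integer.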
nevi-me left a comment
LGTM
@nevi-me thank you for your review! I have opened a new issue with a subtask specific to this PR. I will try to provide implementations for missing types (I only considered fixed size binary, since this is what I get from Spark ;)) and add the corresponding writers as well.
That's interesting. I tried reading a fixed size binary today (interval) but Spark complained that it doesn't support reading fixed size binary. So I thought decimal might be backed by something else.
@nevi-me Here is the link to the Spark Parquet docs, which talk about "legacy" mode - I think that is what you are bumping into: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#configuration
Interesting - now I need to find out why we are writing files in legacy mode ;)
Thanks @sweb. I would like to assign the JIRA to you but need to add you to the contributor role there first. There are multiple JIRA accounts with your name - could you let me know which one is yours?
Thanks @andygrove.
@andygrove @sweb we've got CI failures, I suspect we merged too many changes without rebasing.
Thanks @nevi-me I am looking now
I will revert this PR for now
The reason for the failure is not obvious to me. @sweb we'll need to rebase the changes after we reopen this PR, so we can see what the issue is.
@nevi-me @andygrove I will take a look at it and try to resolve it once master is stable and this PR is reopened. |
Force-pushed from 84220b7 to 41fee88
@nevi-me @andygrove I fixed the issue. There is still a CI failure, though I think that it is unrelated and just a hiccup. Is there a way to restart the CI job without a whitespace commit?
I've restarted the job, we can merge this after CI is all green. Thanks @sweb |

This PR introduces capabilities to construct a DecimalArray from Parquet files.

This is done by adding a new Converter<Vec<Option<ByteArray>>, DecimalArray> in parquet::arrow::converter. It is then used in ArrayBuilderReader, using a match guard to differentiate it from regular fixed size binaries.

A test was added that uses a parquet file from PARQUET_TEST_DATA to check whether loading and casting to DecimalArray works. I did not find a corresponding json file with the correct values, so I used pyarrow to extract the correct values and hard coded them in the test.

I thought that this PR would require #8784 to be merged, but they are independent. I used the same JIRA issue here - I hope that this is okay.
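The overall shape of the change can be sketched as follows. All names here (ByteArray, DecimalValue, the type enums) are simplified stand-ins so the example runs on its own; they are not the real parquet/arrow types or this PR's actual code:

```rust
// Stand-in for parquet::data_type::ByteArray.
#[derive(Debug)]
struct ByteArray(Vec<u8>);

// Stand-in for a decimal array element.
#[derive(Debug, PartialEq)]
struct DecimalValue(i128);

// Mirrors the idea of Converter<Vec<Option<ByteArray>>, DecimalArray>:
// turn optional byte arrays into optional decimal values.
fn convert(source: Vec<Option<ByteArray>>) -> Vec<Option<DecimalValue>> {
    source
        .into_iter()
        .map(|opt| {
            opt.map(|b| {
                // Parquet stores the unscaled value big-endian, two's complement.
                let fill = if b.0[0] & 0x80 != 0 { 0xFF } else { 0x00 };
                let mut buf = [fill; 16];
                buf[16 - b.0.len()..].copy_from_slice(&b.0);
                DecimalValue(i128::from_be_bytes(buf))
            })
        })
        .collect()
}

// Stand-ins for the physical and logical types the reader inspects.
enum PhysicalType {
    FixedLenByteArray,
    ByteArray,
}
enum DataType {
    Decimal(usize, usize),
    FixedSizeBinary(i32),
}

// The match guard routes fixed-size-binary columns to the decimal path
// only when the requested Arrow type is Decimal.
fn reader_kind(physical: &PhysicalType, arrow: &DataType) -> &'static str {
    match physical {
        PhysicalType::FixedLenByteArray if matches!(arrow, DataType::Decimal(_, _)) => "decimal",
        PhysicalType::FixedLenByteArray => "fixed-size-binary",
        PhysicalType::ByteArray => "binary",
    }
}

fn main() {
    let out = convert(vec![Some(ByteArray(vec![0x04, 0xD2])), None]);
    assert_eq!(out[0], Some(DecimalValue(1234))); // 12.34 at scale 2
    assert_eq!(out[1], None);
    assert_eq!(reader_kind(&PhysicalType::FixedLenByteArray, &DataType::Decimal(5, 2)), "decimal");
    assert_eq!(reader_kind(&PhysicalType::FixedLenByteArray, &DataType::FixedSizeBinary(16)), "fixed-size-binary");
    println!("ok");
}
```

The guard is needed because both decimal and plain fixed-size-binary columns share the same Parquet physical type, so the requested Arrow type is what decides which converter to build.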