Support Dictionary in Parquet Metadata Statistics #11145
Comments
Yes - the code summarizing the max and min isn't working correctly for a Dictionary. In the test case, the max_value or min_value is a StringArray that needs to be mapped to the appropriate Dictionary type before being passed into the update_batch methods. I will have a go at fixing it but can't do that until tomorrow morning, so someone else should feel free to pick it up if they need a fix before then.
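To make the mapping described in this comment concrete, here is a minimal, self-contained sketch. It does not use the Arrow or DataFusion APIs: `DataType`, `Scalar`, `MaxAccumulator`, and `wrap_for_column` are all hypothetical stand-ins, and the accumulator's behavior of ignoring values of the wrong type is an assumption made only to mimic the reported NULL result.

```rust
// Simplified stand-ins for illustration only; these are NOT the Arrow or
// DataFusion types, and the dictionary key type is omitted.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Dictionary(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
    Dictionary(Box<Scalar>),
}

impl Scalar {
    fn data_type(&self) -> DataType {
        match self {
            Scalar::Utf8(_) => DataType::Utf8,
            Scalar::Dictionary(inner) => DataType::Dictionary(Box::new(inner.data_type())),
        }
    }

    /// Returns the underlying string, looking through a dictionary wrapper.
    fn as_str(&self) -> Option<&str> {
        match self {
            Scalar::Utf8(s) => s.as_deref(),
            Scalar::Dictionary(inner) => inner.as_str(),
        }
    }
}

/// Hypothetical helper: wrap a plain value so it matches a dictionary column type.
fn wrap_for_column(value: Scalar, column_type: &DataType) -> Scalar {
    match column_type {
        DataType::Dictionary(inner) if value.data_type() == **inner => {
            Scalar::Dictionary(Box::new(value))
        }
        _ => value,
    }
}

/// Toy max accumulator typed by the column it summarizes.
struct MaxAccumulator {
    column_type: DataType,
    max: Option<String>,
}

impl MaxAccumulator {
    fn new(column_type: DataType) -> Self {
        Self { column_type, max: None }
    }

    fn update(&mut self, value: &Scalar) {
        // Assumption: a value of the wrong type is ignored, which mimics the
        // NULL result reported in this issue.
        if value.data_type() != self.column_type {
            return;
        }
        if let Some(s) = value.as_str() {
            if self.max.as_deref().map_or(true, |m| s > m) {
                self.max = Some(s.to_owned());
            }
        }
    }

    fn evaluate(&self) -> Scalar {
        wrap_for_column(Scalar::Utf8(self.max.clone()), &self.column_type)
    }
}
```

In this toy model, feeding plain `Scalar::Utf8` values into an accumulator created for a dictionary column leaves the result at `Dictionary(Utf8(None))`, while passing each value through `wrap_for_column` first produces the expected maximum.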
In fact, it seems to me that the mapping to the correct dictionary type should probably be performed here?
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 452 to 454 in 7e49ccf
Well, I don't think it can be easily modified at the source, and that may not be the right thing to do anyway. So it is probably best to just address it in
take
Describe the bug
When a column has a Dictionary data type, the parquet metadata statistics return Exact(Dictionary(Int32, Utf8(NULL))) for the min and max values.
To Reproduce
Run the test below in this file:
datafusion/datafusion/core/src/datasource/file_format/parquet.rs
Line 1363 in 8216e32
Expected behavior
Expect statistics to show the min and max values. For the reproducer given above, I'm expecting to get:
max_value: Exact(Dictionary(Int32, Utf8("d")))
min_value: Exact(Dictionary(Int32, Utf8("a")))
Additional context
The underlying statistics extraction code should have no problems extracting statistics from Dictionary columns.
The code is here:
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 452 to 454 in 7e49ccf
And the tests are here:
datafusion/datafusion/core/tests/parquet/arrow_statistics.rs
Lines 1729 to 1768 in 7e49ccf
I wonder if something about the code that summarizes the statistics across row groups doesn't handle dictionaries correctly 🤔
datafusion/datafusion/core/src/datasource/file_format/parquet.rs
Lines 468 to 495 in 7e49ccf
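For readers unfamiliar with what "summarizing across row groups" involves, the following is a rough sketch of the idea using only the standard library. `ColumnStats` and `summarize_row_groups` are hypothetical names, not the DataFusion code referenced above, and treating a bound as unknown when any row group lacks it is an assumption of this sketch.

```rust
/// Hypothetical min/max statistics for one string column in one row group.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<String>,
    max: Option<String>,
}

/// Fold per-row-group statistics into file-level statistics.
/// Assumption: if any row group is missing a bound, the file-level bound is
/// reported as unknown (None) rather than guessed.
fn summarize_row_groups(groups: &[ColumnStats]) -> ColumnStats {
    // Collecting into Option<Vec<_>> yields None as soon as one row group
    // has no value for the bound.
    let min = groups
        .iter()
        .map(|g| g.min.as_deref())
        .collect::<Option<Vec<&str>>>()
        .and_then(|v| v.into_iter().min())
        .map(str::to_owned);
    let max = groups
        .iter()
        .map(|g| g.max.as_deref())
        .collect::<Option<Vec<&str>>>()
        .and_then(|v| v.into_iter().max())
        .map(str::to_owned);
    ColumnStats { min, max }
}
```

The comparison here works on the decoded string values, so in this model it makes no difference whether the column was dictionary-encoded. The bug described in this issue would sit in the step that matches those values to the column's declared type, not in the min/max comparison itself.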