fix: Support dictionary type in parquet metadata statistics. #11169
Thank you @efredine -- I think this PR fixes the bug 🙏
I left some comments about how to improve the tests -- let me know what you think. We could also improve the tests in a follow-on PR.
Thank you for fixing it! @efredine 💯
I left comments in the test to avoid confusion.
assert_eq!(c_dic_stats.null_count, Precision::Exact(0));
assert_eq!(
    c_dic_stats.max_value,
    Precision::Exact(Utf8(Some("c".into())))
Suggested change:
- Precision::Exact(Utf8(Some("c".into())))
+ Precision::Exact(Utf8(Some("d".into())))
To avoid any confusion: with the new dictionary keys, the max value is "d".
Thanks @alamb @appletreeisyellow - I should have reviewed the tests more closely. Thanks for the feedback; I will make the adjustments now.
Thanks again @efredine and @appletreeisyellow
…11169)
* fix: Support dictionary type in parquet metadata statistics.
* Simplify tests.
---------
Co-authored-by: Eric Fredine <[email protected]>
Which issue does this PR close?
Closes #11145.
Rationale for this change
What changes are included in this PR?
Modifies create_max_min_accs to instantiate accumulators for the unpacked data type, i.e. the value DataType of the Dictionary. This bug is very similar to a previous bug that impacted the Min/Max aggregate functions: #1235
In addition, the min_max_aggregate_data_type fn is copied from datafusion/datafusion/physical-expr/src/aggregate/min_max.rs (lines 63 to 73 in 8216e32).
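For readers without the source open, here is a minimal sketch of what such a helper does: it maps a dictionary type to its value type and leaves every other type unchanged. The `DataType` enum below is a simplified, hypothetical stand-in for arrow's `DataType` so that the example compiles on its own; it is an illustration, not the code at the referenced lines.

```rust
/// Simplified, hypothetical stand-in for arrow's `DataType`;
/// only the variants needed for this illustration are included.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Min/max over dictionary-encoded input produces one plain value,
/// so the accumulator is created for the dictionary's value type.
fn min_max_aggregate_data_type(input_type: DataType) -> DataType {
    if let DataType::Dictionary(_, value_type) = input_type {
        // Unpack: Dictionary(Int32, Utf8) becomes Utf8.
        *value_type
    } else {
        // Non-dictionary types pass through unchanged.
        input_type
    }
}
```

Under this sketch, `Dictionary(Int32, Utf8)` resolves to `Utf8`, which is the type the min/max accumulators would be instantiated with.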
I'm unsure whether copying the function is the right way to avoid coupling between the crates, or whether it should be moved to some core crate. It also seems like the dedicated min and max functions are to be refactored into user-defined functions?
Are these changes tested?
Yes - a new test has been added.
Are there any user-facing changes?
Currently implemented so the column statistics are returned as an unpacked type. So for DataType::Dictionary(Int32, Utf8) the min or max value is returned as Exact(Utf8("a")). Would it be better to return it as Exact(Dictionary(Int32, Utf8("a")))? I'm unsure what the previous implementation would have returned and whether or not it was correct. But it's possible that returning it as an unpacked type would be a breaking change if it previously returned it as a Dictionary type.

Happy to change the implementation to return a Dictionary type. I'm just unsure which offers the best experience.
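To make the two candidate shapes concrete, here is a rough sketch using simplified, hypothetical stand-ins for DataFusion's `ScalarValue` and `Precision` (the real types have many more variants, and the `&'static str` key type name is only a placeholder). The function names `unpacked_max`, `dictionary_max`, and `as_str` are invented for this illustration.

```rust
/// Simplified, hypothetical stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    /// Dictionary(key type name, wrapped value); the key type is a
    /// plain string here purely to keep the sketch small.
    Dictionary(&'static str, Box<ScalarValue>),
}

/// Simplified, hypothetical stand-in for `Precision`.
#[derive(Debug, Clone, PartialEq)]
enum Precision {
    Exact(ScalarValue),
}

/// Option 1 (what this PR currently does): report the plain value.
fn unpacked_max(value: &str) -> Precision {
    Precision::Exact(ScalarValue::Utf8(Some(value.to_string())))
}

/// Option 2: wrap the value so it mirrors the column's dictionary type.
fn dictionary_max(value: &str) -> Precision {
    Precision::Exact(ScalarValue::Dictionary(
        "Int32",
        Box::new(ScalarValue::Utf8(Some(value.to_string()))),
    ))
}

/// A consumer that wants the string must peel the wrapper under option 2.
fn as_str(stat: &Precision) -> Option<&str> {
    match stat {
        Precision::Exact(ScalarValue::Utf8(Some(s))) => Some(s.as_str()),
        Precision::Exact(ScalarValue::Dictionary(_, inner)) => match inner.as_ref() {
            ScalarValue::Utf8(Some(s)) => Some(s.as_str()),
            _ => None,
        },
        _ => None,
    }
}
```

Both shapes carry the same string, but they do not compare equal, so any caller or test that matches on one shape would need updating if the other were returned. That is the possible breaking change raised above.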