ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns #8311
  }

  DCHECK(lhs.is_array());
  if (lhs.type()->id() == Type::DICTIONARY && rhs.type()->id() == Type::DICTIONARY) {
@wesm What do you think about adding kernels to scalar_compare.cc which do this inside compute:: ?
Yes, this sounds fine, can you open a JIRA issue about it?
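For readers following the thread, here is a minimal sketch of what a decode-then-compare workaround for two dictionary-encoded columns could look like. It assumes the comparison is done on decoded values; `DictArray`, `Decode`, and `EqualDecoded` are hypothetical stand-ins written against the standard library only, not Arrow's types or the code in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a dictionary-encoded string column: each row stores an
// index into a table of distinct values. (Hypothetical type, not Arrow's.)
struct DictArray {
  std::vector<std::string> dictionary;  // distinct values
  std::vector<int32_t> indices;         // one entry per row
};

// Materialize the plain values by looking up every index.
// Simple, but it allocates a full decoded copy of the column.
std::vector<std::string> Decode(const DictArray& arr) {
  std::vector<std::string> out;
  out.reserve(arr.indices.size());
  for (int32_t i : arr.indices) {
    out.push_back(arr.dictionary.at(static_cast<std::size_t>(i)));
  }
  return out;
}

// Element-wise equality of two dictionary-encoded columns. The two
// dictionaries may differ, so the indices cannot be compared directly;
// both sides are decoded first.
std::vector<bool> EqualDecoded(const DictArray& lhs, const DictArray& rhs) {
  const std::vector<std::string> l = Decode(lhs);
  const std::vector<std::string> r = Decode(rhs);
  std::vector<bool> mask(l.size() < r.size() ? l.size() : r.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    mask[i] = (l[i] == r[i]);
  }
  return mask;
}
```

The cost of this approach is the decoded copy of each column, which is why a dedicated comparison kernel that works on the encoded form would be preferable in the long run.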
jorisvandenbossche left a comment
For me the non-performant way of decoding is fine for now (certainly because the array+scalar case will be more common).
But should there be some more tests added?
We could also use the small reproducer from the issue (my comment) as a Python test.
  }

  auto maybe_min = min->CastTo(field->type());
  auto maybe_max = max->CastTo(field->type());
Does this change behaviour? For a dictionary with string values, is field->type() string or dictionary?
StatisticsAsScalars returns scalars whose types are the correct physical type, so even if the column was dictionary(string), min and max would be just string before this cast.
(I.e., it only changes behavior in cases where the physical type wasn't appropriate.)
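As background for this thread, a toy model of how min/max statistics can let a filter skip a whole row group may help. The statistics hold plain values even when the column itself is dictionary-encoded, so they have to be brought to a type that can be compared with the filter value. `RowGroupStats` and `MayContainEqual` are hypothetical names for illustration, not part of Arrow or Parquet.

```cpp
#include <string>

// Hypothetical min/max statistics for one row group of a string column.
// For a dictionary-encoded column these describe the decoded values.
struct RowGroupStats {
  std::string min;
  std::string max;
};

// Returns false only when the statistics prove that no row can equal
// `value`, so the row group can be skipped without being read.
bool MayContainEqual(const RowGroupStats& stats, const std::string& value) {
  return !(value < stats.min) && !(stats.max < value);
}
```

A row group with statistics `{"b", "d"}` has to be read for the filter `== "c"`, but can be skipped for `== "a"` or `== "e"`.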
Parquet row group statistics did not respect dict encoding. Also added a workaround to support filtering a dictionary-encoded column.