API: Add Aggregate expression evaluation#6405
@huaxingao, I was looking at #6252 and I wanted to try out implementing aggregation in either the core or API modules so that the majority of the logic could be shared rather than needing to implement it in every processing engine. Could you please take a look at this and see if it seems reasonable? The basic idea is to use
Then this also adds
This is based on #6252, but tries to keep as much logic as possible in core/API. What do you think? Could we incorporate this into #6252?
    return safeGet(map, key, null);
  }

  <V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
Should this belong in a util class? Or is it possible that null isn't allowed here?
@rdblue Thank you very much for the PR! I will pull your code locally and work on integrating my changes into yours.
    if (count < 0) {
      return null;
    }
Curious: when would this ever be negative? Or is this just to make the logic defensive against bad metadata?
Some imported Avro files had incorrect metadata several versions ago. I don't think it is widespread, but it is good to handle it.
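The defensive check discussed above could be sketched roughly as follows. This is a minimal, standalone illustration, not the code in this PR: the class `DefensiveCountSketch` and its methods are hypothetical names, and it assumes that a missing or negative per-file count should make the whole aggregate result unknown (null).

```java
// Hypothetical sketch: accumulate per-file counts, but give up (return null)
// as soon as one file reports a missing or negative count.
class DefensiveCountSketch {
  private long total = 0L;
  private boolean valid = true;

  // Returns the count if it is usable, or null if it is missing or negative.
  static Long validate(Long count) {
    if (count == null || count < 0) {
      return null;
    }
    return count;
  }

  // Adds one file's count; a bad count invalidates the whole aggregate.
  void update(Long fileCount) {
    Long checked = validate(fileCount);
    if (checked == null) {
      valid = false;
    } else {
      total += checked;
    }
  }

  // Returns the total, or null when any file had unusable metadata.
  Long result() {
    return valid ? Long.valueOf(total) : null;
  }
}
```

Returning null rather than a partial sum lets the caller fall back to scanning data when the metadata cannot be trusted.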
    List<Types.NestedField> resultFields = Lists.newArrayList();
    for (int pos = 0; pos < aggregates.size(); pos += 1) {
      BoundAggregate<?, ?> aggregate = aggregates.get(pos);
      aggregatorsBuilder.add(aggregates.get(pos).newAggregator());
I guess here we could also reuse aggregate?
      return null;
    }

    return result();
Do you mean return current();?
  @Override
  protected Long countFor(DataFile file) {
    // NaN value counts were not required in v1 and were included in value counts
    return safeAdd(safeGet(file.valueCounts(), fieldId), safeGet(file.nanValueCounts(), fieldId, 0L));
Shall we subtract the nullValueCounts?
Yes, you're right. The value count will include NaN and null values. From the description of value counts:
"Map from column id to number of values in the column (including null and NaN values)"
That means we should actually not add the NaN count.
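One way to read the conclusion of this thread: since the value count already includes null and NaN values, a non-null count would subtract the null count and leave the NaN count out entirely. The sketch below illustrates that with plain maps. It is not the PR's implementation: `NonNullCountSketch`, `safeSubtract`, and `nonNullCount` are hypothetical names, and treating a missing null count as "unknown" is an assumption made for this example.

```java
import java.util.Map;

// Hypothetical sketch: derive a non-null count from per-column metadata maps.
class NonNullCountSketch {
  // Returns the map value for key, or defaultValue if the map or entry is missing.
  static <V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
    if (map != null) {
      V value = map.get(key);
      return value != null ? value : defaultValue;
    }
    return defaultValue;
  }

  // Subtracts right from left, or returns null if either side is unknown.
  static Long safeSubtract(Long left, Long right) {
    if (left != null && right != null) {
      return left - right;
    }
    return null;
  }

  // valueCounts includes null and NaN values, so only nulls are subtracted.
  // Assumption: a missing null count means the result cannot be determined.
  static Long nonNullCount(
      Map<Integer, Long> valueCounts, Map<Integer, Long> nullValueCounts, int fieldId) {
    Long total = safeGet(valueCounts, fieldId, null);
    Long nulls = safeGet(nullValueCounts, fieldId, null);
    return safeSubtract(total, nulls);
  }
}
```

For example, a column with a value count of 10 and a null count of 3 yields 7, and any missing input yields null so the caller knows the metadata alone is not enough.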
@rdblue Thank you very much for the PR! The changes are much cleaner and more generic now. These can be wrapped cleanly in Spark. Once your PR is in, I will make Spark changes on top of your changes. Thanks a lot!
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This PR has classes for implementing aggregation expressions in the API module.