API: Add Aggregate expression evaluation #6405

Closed
rdblue wants to merge 1 commit into apache:main from rdblue:agg-evaluation

Conversation

@rdblue
Contributor

@rdblue rdblue commented Dec 12, 2022

This PR has classes for implementing aggregation expressions in the API module.

@github-actions github-actions bot added the API label Dec 12, 2022
@rdblue
Contributor Author

rdblue commented Dec 12, 2022

@huaxingao, I was looking at #6252 and I wanted to try out implementing aggregation in either the core or API modules so that the majority of the logic could be shared rather than needing to implement it in every processing engine.

Could you please take a look at this and see if it seems reasonable?

The basic idea is to use BoundAggregate to do two things:

  1. Extract a value to aggregate in eval(StructLike) or eval(DataFile), which is similar to how eval is used for other expressions
  2. Create an Aggregator that keeps track of the aggregate state

This PR also adds AggregateEvaluator, which operates on a list of aggregate expressions:

  • aggEval = AggregateEvaluator.create(tableSchema, expressions) binds the expressions and creates an aggregator for each one
  • aggEval.update(StructLike) and aggEval.update(DataFile) update each expression's aggregator
  • aggEval.result() returns a StructLike with the aggregated values
  • aggEval.resultType() returns a StructType for the aggregated values

This is based on #6252, but tries to keep as much logic as possible in core/API. What do you think? Could we incorporate this into #6252?
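The two-step design described above (extract a value per row, then feed it into a per-expression aggregator that keeps running state) can be sketched in plain Java. This is a minimal illustration of the pattern, not the actual Iceberg API: the names Aggregator, max, and evaluate are stand-ins, and rows are modeled as int[] instead of StructLike for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class Sketch {
  // Holds the running state for one aggregate expression.
  interface Aggregator<R> {
    void update(int[] row); // rows modeled as int[] for brevity
    R result();
  }

  // MAX over the column at the given position.
  static Aggregator<Integer> max(int pos) {
    return new Aggregator<Integer>() {
      private Integer current = null;

      @Override
      public void update(int[] row) {
        int value = row[pos];
        if (current == null || value > current) {
          current = value;
        }
      }

      @Override
      public Integer result() {
        return current;
      }
    };
  }

  // Fans each row out to every aggregator, like AggregateEvaluator.update,
  // then collects each aggregator's result, like AggregateEvaluator.result.
  static List<Object> evaluate(List<Aggregator<?>> aggs, int[][] rows) {
    for (int[] row : rows) {
      for (Aggregator<?> agg : aggs) {
        agg.update(row);
      }
    }
    List<Object> results = new ArrayList<>();
    for (Aggregator<?> agg : aggs) {
      results.add(agg.result());
    }
    return results;
  }
}
```

The point of the split is that the extraction step can read either live rows or precomputed file metrics, while the state-keeping step stays identical across engines.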

return safeGet(map, key, null);
}

<V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
Contributor

Should this belong in some util class, or is null possibly not allowed?
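For reference, the two safeGet overloads visible in the diff might look like the sketch below. The method bodies are assumptions reconstructed from the diff context, not the actual patch; only the signatures appear in the excerpt above.

```java
import java.util.Map;

class MetricsUtil {
  // Overload seen in the diff: fall back to null when the value is missing.
  static <V> V safeGet(Map<Integer, V> map, int key) {
    return safeGet(map, key, null);
  }

  // Look up a metrics value, tolerating both a null map and a missing key.
  static <V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
    if (map == null) {
      return defaultValue;
    }
    return map.getOrDefault(key, defaultValue);
  }
}
```

The null-tolerant lookup matters here because file metrics maps (value counts, null counts) are optional and may be absent entirely.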

@huaxingao
Contributor

@rdblue Thank you very much for the PR! I will get your code to my local and work on integrating my changes into yours.

Comment on lines +38 to +40
if (count < 0) {
return null;
}
Contributor

Curious: when would this ever be negative? Or is this just being defensive against bad metadata?

Contributor Author

Some imported Avro files had incorrect metadata several versions ago. I don't think it is widespread, but it is good to handle it.

List<Types.NestedField> resultFields = Lists.newArrayList();
for (int pos = 0; pos < aggregates.size(); pos += 1) {
BoundAggregate<?, ?> aggregate = aggregates.get(pos);
aggregatorsBuilder.add(aggregates.get(pos).newAggregator());
Contributor

I guess here we could also reuse aggregate?
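The suggestion amounts to using the aggregate local instead of a second aggregates.get(pos) lookup. A simplified sketch of the loop with that change applied; BoundAggregate here is a stand-in returning a String, since the real types aren't shown in full in the excerpt:

```java
import java.util.ArrayList;
import java.util.List;

class LoopSketch {
  // Stand-in for the real BoundAggregate; the real newAggregator()
  // returns an Aggregator, simplified to String here.
  interface BoundAggregate {
    String newAggregator();
  }

  static List<String> buildAggregators(List<BoundAggregate> aggregates) {
    List<String> aggregators = new ArrayList<>();
    for (int pos = 0; pos < aggregates.size(); pos += 1) {
      BoundAggregate aggregate = aggregates.get(pos);
      aggregators.add(aggregate.newAggregator()); // reuse the local, one lookup
    }
    return aggregators;
  }
}
```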

return null;
}

return result();
Contributor

Do you mean return current();?

@Override
protected Long countFor(DataFile file) {
// NaN value counts were not required in v1 and were included in value counts
return safeAdd(safeGet(file.valueCounts(), fieldId), safeGet(file.nanValueCounts(), fieldId, 0L));
Contributor

Shall we subtract the nullValueCounts?

Contributor Author

Yes, you're right. This will include NaN and null values:

Map from column id to number of values in the column (including null and NaN values)

That means we should actually not add the NaN count.
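To make the agreed fix concrete: because valueCounts already includes null and NaN values, adding nanValueCounts would double count, and a count of non-null values would instead subtract nullValueCounts. A hedged sketch of that arithmetic; the helper name is hypothetical, not the actual Iceberg code:

```java
class CountSketch {
  // Count of non-null values for a column from file-level metrics.
  // valueCount includes null and NaN values per the metrics spec quoted above,
  // so NaN counts must NOT be added; null counts are subtracted instead.
  static Long nonNullCount(Long valueCount, Long nullValueCount) {
    if (valueCount == null || nullValueCount == null) {
      return null; // metrics missing: cannot produce an exact count
    }
    return valueCount - nullValueCount;
  }
}
```

Returning null when either metric is absent mirrors the defensive handling discussed earlier in the thread: an inexact answer is worse than no answer.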

@huaxingao
Contributor

@rdblue Thank you very much for the PR! The changes are much cleaner and more generic now. These can be wrapped cleanly in Spark. Once your PR is in, I will make Spark changes on top of your changes. Thanks a lot!


@PraveenNanda124 PraveenNanda124 left a comment

Looks good

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 24, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 31, 2024