API: Add Aggregate expression evaluation #6405

Closed
rdblue wants to merge 1 commit into apache:main from rdblue:agg-evaluation

Conversation

@rdblue
Contributor

@rdblue rdblue commented Dec 12, 2022

This PR has classes for implementing aggregation expressions in the API module.

@github-actions github-actions bot added the API label Dec 12, 2022
@rdblue
Contributor Author

rdblue commented Dec 12, 2022

@huaxingao, I was looking at #6252 and I wanted to try out implementing aggregation in either the core or API modules so that the majority of the logic could be shared rather than needing to implement it in every processing engine.

Could you please take a look at this and see if it seems reasonable?

The basic idea is to use BoundAggregate to do two things:

  1. Extract a value to aggregate in eval(StructLike) or eval(DataFile), which is similar to how eval is used for other expressions
  2. Create an Aggregator that keeps track of the aggregate state

This PR also adds AggregateEvaluator, which operates on a list of aggregate expressions:

  • aggEval = AggregateEvaluator.create(tableSchema, expressions) binds the expressions and creates an aggregator for each one
  • aggEval.update(StructLike) and aggEval.update(DataFile) update each expression's aggregator
  • aggEval.result() returns a StructLike with the aggregated values
  • aggEval.resultType() returns a StructType for the aggregated values

This is based on #6252, but tries to keep as much logic as possible in core/API. What do you think? Could we incorporate this into #6252?
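The two-step design described above (extract a value per row, then feed it into a per-expression aggregator that keeps running state) can be sketched in plain Java. This is a minimal illustration of the pattern, not the actual Iceberg API: the names Aggregator, max, and evaluate are stand-ins, and rows are modeled as int[] instead of StructLike for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class Sketch {
  // Holds the running state for one aggregate expression.
  interface Aggregator<R> {
    void update(int[] row); // rows modeled as int[] for brevity
    R result();
  }

  // MAX over the column at the given position.
  static Aggregator<Integer> max(int pos) {
    return new Aggregator<Integer>() {
      private Integer current = null;

      @Override
      public void update(int[] row) {
        int value = row[pos];
        if (current == null || value > current) {
          current = value;
        }
      }

      @Override
      public Integer result() {
        return current;
      }
    };
  }

  // Fans each row out to every aggregator, like AggregateEvaluator.update,
  // then collects each aggregator's result, like AggregateEvaluator.result.
  static List<Object> evaluate(List<Aggregator<?>> aggs, int[][] rows) {
    for (int[] row : rows) {
      for (Aggregator<?> agg : aggs) {
        agg.update(row);
      }
    }
    List<Object> results = new ArrayList<>();
    for (Aggregator<?> agg : aggs) {
      results.add(agg.result());
    }
    return results;
  }
}
```

The point of the split is that the extraction step can read either live rows or precomputed file metrics, while the state-keeping step stays identical across engines.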

return safeGet(map, key, null);
}

<V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
Contributor

Should this belong in some util class, or is null possibly not allowed?
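For reference, the two safeGet overloads visible in the diff might look like the sketch below. The method bodies are assumptions reconstructed from the diff context, not the actual patch; only the signatures appear in the excerpt above.

```java
import java.util.Map;

class MetricsUtil {
  // Overload seen in the diff: fall back to null when the value is missing.
  static <V> V safeGet(Map<Integer, V> map, int key) {
    return safeGet(map, key, null);
  }

  // Look up a metrics value, tolerating both a null map and a missing key.
  static <V> V safeGet(Map<Integer, V> map, int key, V defaultValue) {
    if (map == null) {
      return defaultValue;
    }
    return map.getOrDefault(key, defaultValue);
  }
}
```

The null-tolerant lookup matters here because file metrics maps (value counts, null counts) are optional and may be absent entirely.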

@huaxingao
Contributor

@rdblue Thank you very much for the PR! I will get your code to my local and work on integrating my changes into yours.

Comment on lines +38 to +40
if (count < 0) {
return null;
}
Contributor

Curious: when would this ever be negative? Or is this just being defensive against bad metadata?

Contributor Author

Some imported Avro files had incorrect metadata several versions ago. I don't think it is widespread, but it is good to handle it.

List<Types.NestedField> resultFields = Lists.newArrayList();
for (int pos = 0; pos < aggregates.size(); pos += 1) {
BoundAggregate<?, ?> aggregate = aggregates.get(pos);
aggregatorsBuilder.add(aggregates.get(pos).newAggregator());
Contributor

I guess here we could also reuse aggregate?
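The suggestion amounts to using the aggregate local instead of a second aggregates.get(pos) lookup. A simplified sketch of the loop with that change applied; BoundAggregate here is a stand-in returning a String, since the real types aren't shown in full in the excerpt:

```java
import java.util.ArrayList;
import java.util.List;

class LoopSketch {
  // Stand-in for the real BoundAggregate; the real newAggregator()
  // returns an Aggregator, simplified to String here.
  interface BoundAggregate {
    String newAggregator();
  }

  static List<String> buildAggregators(List<BoundAggregate> aggregates) {
    List<String> aggregators = new ArrayList<>();
    for (int pos = 0; pos < aggregates.size(); pos += 1) {
      BoundAggregate aggregate = aggregates.get(pos);
      aggregators.add(aggregate.newAggregator()); // reuse the local, one lookup
    }
    return aggregators;
  }
}
```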

return null;
}

return result();
Contributor

Do you mean return current();?

@Override
protected Long countFor(DataFile file) {
// NaN value counts were not required in v1 and were included in value counts
return safeAdd(safeGet(file.valueCounts(), fieldId), safeGet(file.nanValueCounts(), fieldId, 0L));
Contributor

Shall we subtract the nullValueCounts?

Contributor Author

Yes, you're right. This will include NaN and null values:

Map from column id to number of values in the column (including null and NaN values)

That means we should actually not add the NaN count.
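To make the agreed fix concrete: because valueCounts already includes null and NaN values, adding nanValueCounts would double count, and a count of non-null values would instead subtract nullValueCounts. A hedged sketch of that arithmetic; the helper name is hypothetical, not the actual Iceberg code:

```java
class CountSketch {
  // Count of non-null values for a column from file-level metrics.
  // valueCount includes null and NaN values per the metrics spec quoted above,
  // so NaN counts must NOT be added; null counts are subtracted instead.
  static Long nonNullCount(Long valueCount, Long nullValueCount) {
    if (valueCount == null || nullValueCount == null) {
      return null; // metrics missing: cannot produce an exact count
    }
    return valueCount - nullValueCount;
  }
}
```

Returning null when either metric is absent mirrors the defensive handling discussed earlier in the thread: an inexact answer is worse than no answer.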

@huaxingao
Contributor

@rdblue Thank you very much for the PR! The changes are much cleaner and more generic now. These can be wrapped cleanly in Spark. Once your PR is in, I will make Spark changes on top of your changes. Thanks a lot!


@PraveenNanda124 PraveenNanda124 left a comment

Looks good

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 24, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 31, 2024