@jackylee-ch
Contributor

We're using Aggregate to generate per-column statistics from data files (metrics such as value counts, min/max, and null counts), but discovered that null counts (CountNull) aren't currently supported. This PR adds null count support for data files, so any other consumer can collect null counts the same way.

@github-actions github-actions bot added the API label Sep 3, 2025
@jackylee-ch
Contributor Author

@huaxingao PTAL

@LuciferYang

Could you help review this PR when you have time? @huaxingao Thanks ~

@huaxingao
Contributor

> should we also add to SparkAggregates?

I believe CountNull is not supported by Spark because it's not standard SQL, which is why I didn't add it at first. However, it's very useful for data-file statistics, so I think it's reasonable to add support for it.
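
For illustration, here is a minimal row-level sketch of what the three counts mean. It assumes a single in-memory column; the class and method names are hypothetical and are not part of Iceberg or Spark. In standard SQL the same null count is usually written as `COUNT(*) - COUNT(col)`, which is the identity the last line checks.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper (not Iceberg or Spark code): row-level meaning of the counts.
public class CountSemantics {
  // Count(*): every row, whether or not the value is null.
  public static long countStar(List<?> column) {
    return column.size();
  }

  // Count(field): only rows whose value is non-null.
  public static long countNonNull(List<?> column) {
    return column.stream().filter(Objects::nonNull).count();
  }

  // CountNull(field): only rows whose value is null.
  public static long countNull(List<?> column) {
    return column.stream().filter(Objects::isNull).count();
  }

  public static void main(String[] args) {
    // Arrays.asList is used because it permits null elements.
    List<Integer> column = Arrays.asList(1, null, 3, null, null);
    System.out.println(countStar(column));    // 5
    System.out.println(countNonNull(column)); // 2
    System.out.println(countNull(column));    // 3
    // CountNull(field) == Count(*) - Count(field)
    System.out.println(countStar(column) - countNonNull(column) == countNull(column)); // true
  }
}
```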

@kevinjqliu
Contributor

I wonder if we need to update the docs too:

/**
* The aggregate functions that can be pushed and evaluated in Iceberg. Currently only three
* aggregate functions Max, Min and Count are supported.
*/

We have since added CountStar, CountNonNull, and now CountNull.

@kevinjqliu
Contributor

I wonder if this is considered a "spec change", or at least a clarification.

From https://iceberg.apache.org/spec/#data-file-fields,
I see value_counts, null_value_counts, and nan_value_counts.
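
As a rough sketch of how these per-file fields could answer the aggregates from metadata alone: the `FileMetrics` class below is a hypothetical stand-in whose field names mirror the spec, not Iceberg's actual `DataFile` API. It assumes `value_counts` includes nulls, and that a missing metric means the aggregate cannot be answered from metadata.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: summing per-file metrics to answer CountNull(field).
public class FileMetricsSketch {
  // Stand-in for one data file's metrics, keyed by field id (assumed shape).
  public static final class FileMetrics {
    final long recordCount;                   // rows in the file
    final Map<Integer, Long> valueCounts;     // field id -> values, nulls included
    final Map<Integer, Long> nullValueCounts; // field id -> null values

    public FileMetrics(long recordCount, Map<Integer, Long> valueCounts,
                       Map<Integer, Long> nullValueCounts) {
      this.recordCount = recordCount;
      this.valueCounts = valueCounts;
      this.nullValueCounts = nullValueCounts;
    }
  }

  // Returns the total null count, or null when any file lacks the metric.
  public static Long countNull(List<FileMetrics> files, int fieldId) {
    long total = 0L;
    for (FileMetrics file : files) {
      Long nulls = file.nullValueCounts.get(fieldId);
      if (nulls == null) {
        return null; // metric missing: cannot be answered from metadata
      }
      total += nulls;
    }
    return total;
  }

  // Count(field) for comparison: value count minus null count, per file.
  public static Long countNonNull(List<FileMetrics> files, int fieldId) {
    long total = 0L;
    for (FileMetrics file : files) {
      Long values = file.valueCounts.get(fieldId);
      Long nulls = file.nullValueCounts.get(fieldId);
      if (values == null || nulls == null) {
        return null;
      }
      total += values - nulls;
    }
    return total;
  }

  public static void main(String[] args) {
    Map<Integer, Long> values = new HashMap<>();
    values.put(1, 10L);
    Map<Integer, Long> nulls = new HashMap<>();
    nulls.put(1, 4L);
    List<FileMetrics> files = Arrays.asList(
        new FileMetrics(10L, values, nulls),
        new FileMetrics(10L, values, nulls));
    System.out.println(countNull(files, 1));    // 8
    System.out.println(countNonNull(files, 1)); // 12
    System.out.println(countNull(files, 2));    // null (no metric for field 2)
  }
}
```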

@huaxingao
Contributor

In the original Javadoc, Count covers both Count(*) and Count(field), but since we now also support CountNull, let's update the Javadoc and spell out each of them. Something like this:

 /** 
  * The aggregate functions that can be evaluated in Iceberg. Supported aggregates include
  * Min(field), Max(field), Count(*), Count(field) and CountNull(field)
  */ 

@huaxingao huaxingao left a comment

LGTM

@huaxingao huaxingao merged commit ade1263 into apache:main Sep 17, 2025
43 checks passed
@huaxingao
Contributor

Thanks @jackylee-ch for the PR! Thanks @kevinjqliu for the review!

@jackylee-ch
Contributor Author

Thanks for your review. @huaxingao @kevinjqliu
