Expose nan_count in $partitions metadata table #10709

Merged
losipiuk merged 3 commits into trinodb:master from findinpath:iceberg-partition-nan-count
Apr 4, 2022
@findinpath
Contributor

@findinpath findinpath commented Jan 20, 2022

Description

Expose nan_count partition metadata information in the $partitions metadata table.

Is this change a fix, improvement, new feature, refactoring, or other?

This is a new feature, added for consistency with the information exposed by the Iceberg partitions table.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

This change is targeted primarily at the Iceberg connector.
However, because this change requires exposing new information from the trino-orc module, it may (although it shouldn't) affect other Trino functionality that reads or writes ORC files.

How would you describe this change to a non-technical end user or system administrator?

This change adds new metadata information about REAL and DOUBLE columns to the Iceberg $partitions metadata table.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Iceberg
* Expose `nan_count` in `$partitions` metadata table. ({pr}`10709`)

@cla-bot cla-bot bot added the cla-signed label Jan 20, 2022
@findinpath findinpath marked this pull request as draft January 20, 2022 13:32
@findinpath findinpath force-pushed the iceberg-partition-nan-count branch 2 times, most recently from b5575a8 to 7761c00 Compare February 21, 2022 08:47
@findinpath findinpath force-pushed the iceberg-partition-nan-count branch 2 times, most recently from 3adbbe7 to dca26c4 Compare February 23, 2022 04:42
@findinpath findinpath marked this pull request as ready for review February 23, 2022 08:50
@findinpath findinpath force-pushed the iceberg-partition-nan-count branch from dca26c4 to 6064f0e Compare February 23, 2022 08:52
@findinpath findinpath requested a review from dain February 23, 2022 09:49
Member

@losipiuk losipiuk left a comment

LGTM

@findinpath findinpath requested a review from homar March 8, 2022 12:42
@findinpath findinpath force-pushed the iceberg-partition-nan-count branch from 6064f0e to 4c44a5f Compare March 8, 2022 12:42
@findinpath
Contributor Author

Rebased on master due to conflicts.

NaN values relate solely to double statistics.
When NaN values are present, no range statistics can be
provided for the data. Therefore a new
field `nanValueCount` has been introduced in `ColumnStatistics`
to handle this situation.
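
To make the idea above concrete, here is a minimal, self-contained sketch of tracking the NaN count separately from the min/max range. It is an illustration under assumptions, not the trino-orc code: the class name `NanAwareDoubleStatsSketch`, the nested `Range` holder, and all method names are hypothetical.

```java
import java.util.Optional;

// Hypothetical illustration; these are not the trino-orc classes.
final class NanAwareDoubleStatsSketch
{
    // Minimal holder for a min/max range (hypothetical).
    static final class Range
    {
        final double min;
        final double max;

        Range(double min, double max)
        {
            this.min = min;
            this.max = max;
        }
    }

    private long nonNullValueCount;
    private long nanValueCount;
    private double minimum = Double.POSITIVE_INFINITY;
    private double maximum = Double.NEGATIVE_INFINITY;

    void addValue(double value)
    {
        nonNullValueCount++;
        if (Double.isNaN(value)) {
            // NaN is counted but never folded into min/max
            nanValueCount++;
            return;
        }
        minimum = Math.min(minimum, value);
        maximum = Math.max(maximum, value);
    }

    // Range statistics are withheld when no values or any NaN were seen
    Optional<Range> buildRange()
    {
        if (nonNullValueCount == 0 || nanValueCount > 0) {
            return Optional.empty();
        }
        return Optional.of(new Range(minimum, maximum));
    }

    // The NaN count stays available even when the range is withheld
    long getNanValueCount()
    {
        return nanValueCount;
    }
}
```

Keeping the counter outside the range object means a consumer can still report how many NaN values a column holds when no range is available.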
@losipiuk losipiuk force-pushed the iceberg-partition-nan-count branch from 4c44a5f to a9586b3 Compare March 25, 2022 13:19
BooleanStatistics booleanStatistics,
IntegerStatistics integerStatistics,
DoubleStatistics doubleStatistics,
Long numberOfNanValues,
Member

Why not put it inside DoubleStatistics?

Member

Feels like you could keep min/max as null if there are NaNs but keep the NaN count inside the object. Would that not work?

Contributor Author

That was my initial intention as well.
However, the DoubleStatisticsBuilder doesn't allow dealing with NaNs:

private Optional<DoubleStatistics> buildDoubleStatistics()
{
    // if there are NaN values we cannot say anything about the data
    if (nonNullValueCount == 0 || hasNan) {
        return Optional.empty();
    }
    return Optional.of(new DoubleStatistics(minimum, maximum));
}

cc @dain
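
For comparison, the alternative raised in the review (leaving min/max unset when NaNs are present, but keeping the NaN count inside the statistics object) could look roughly like the sketch below. This is only an illustration of the suggestion; the thread does not show it being adopted. The class `DoubleStatsWithNanCountSketch` and its methods are hypothetical, and `OptionalDouble.empty()` stands in for the "null" min/max mentioned in the comment.

```java
import java.util.OptionalDouble;

// Hypothetical illustration of the reviewer's suggestion; not Trino code.
final class DoubleStatsWithNanCountSketch
{
    private final OptionalDouble min;
    private final OptionalDouble max;
    private final long nanCount;

    private DoubleStatsWithNanCountSketch(OptionalDouble min, OptionalDouble max, long nanCount)
    {
        this.min = min;
        this.max = max;
        this.nanCount = nanCount;
    }

    static DoubleStatsWithNanCountSketch of(double... values)
    {
        long nanCount = 0;
        long comparableCount = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double value : values) {
            if (Double.isNaN(value)) {
                nanCount++;
                continue;
            }
            comparableCount++;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        // With any NaN (or no comparable values) the range is left empty,
        // but the object itself, including the NaN count, is still returned
        if (nanCount > 0 || comparableCount == 0) {
            return new DoubleStatsWithNanCountSketch(OptionalDouble.empty(), OptionalDouble.empty(), nanCount);
        }
        return new DoubleStatsWithNanCountSketch(OptionalDouble.of(min), OptionalDouble.of(max), nanCount);
    }

    OptionalDouble getMin()
    {
        return min;
    }

    OptionalDouble getMax()
    {
        return max;
    }

    long getNanCount()
    {
        return nanCount;
    }
}
```

The trade-off in this sketch is that the statistics object always exists, so every reader of min/max has to handle the empty case, whereas the quoted builder signals "no usable statistics" by returning `Optional.empty()` for the whole object.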

@findinpath
Contributor Author

@losipiuk CPTAL ?

@losipiuk losipiuk merged commit 576955f into trinodb:master Apr 4, 2022
@github-actions github-actions bot added this to the 376 milestone Apr 4, 2022