
Fix reading Iceberg $manifests table when contains_nan is NULL #11241

Merged
findepi merged 1 commit into trinodb:master from findinpath:iceberg-contains-nan
Mar 25, 2022

Conversation

@findinpath
Contributor

@findinpath findinpath commented Mar 1, 2022

Description

This change affects queries on $manifests Iceberg metadata tables.

Is this change a fix, improvement, new feature, refactoring, or other?

This PR fixes the handling of contains_nan field for partition summaries in Iceberg manifests table.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

This change affects the Iceberg connector.

How would you describe this change to a non-technical end user or system administrator?

Correctly read the manifests of Iceberg files written with an older Iceberg version that did not yet populate the contains_nan field.

Related issues, pull requests, and links

Fixes #11237

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

Comment on lines 140 to 146
Member

Optional may be handy here

            Optional.ofNullable(summary.containsNaN()).ifPresentOrElse(
                    containsNan -> BOOLEAN.writeBoolean(rowBuilder, containsNan),
                    rowBuilder::appendNull);
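
A minimal, self-contained sketch of the suggested pattern, assuming a plain `List<Boolean>` as a hypothetical stand-in for the row builder (it is not Trino's `BlockBuilder` API). The class and method names are made up for illustration, and `ifPresentOrElse` requires Java 9 or later:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for writing a nullable boolean column; not Trino code.
final class NullableBooleanWriter
{
    private NullableBooleanWriter() {}

    // Appends the value when it is present, otherwise appends an explicit null.
    static void writeNullableBoolean(List<Boolean> row, Boolean value)
    {
        Optional.ofNullable(value).ifPresentOrElse(
                row::add,
                () -> row.add(null));
    }

    // Writes three sample values, the middle one missing.
    static List<Boolean> demo()
    {
        List<Boolean> row = new ArrayList<>();
        writeNullableBoolean(row, Boolean.TRUE);
        writeNullableBoolean(row, null);
        writeNullableBoolean(row, Boolean.FALSE);
        return row;
    }
}
```

The null case is spelled out at the call site, so the writer itself never has to guess what a missing value means.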

Contributor Author

I was also thinking about something similar. The same change would be beneficial for unboxing other types as well.

Alternatively, this could be handled behind the scenes within BOOLEAN if we pass Boolean instead of boolean.

Member

I'd love to avoid nullable reference types where possible. Quite easy to miss.

The writer being "smarter" means it's easy to introduce mistakes.

Contributor

Hi all

The field is nullable per the spec. It's generally only missing for older data files, so it's rarely encountered (which is why it's not easy to reproduce). Older tables won't have this field populated, and that is the source of the issue.

In my opinion it is easy to run into NPEs, so I typically prefer to handle things with Optionals. The field is usually present, though (if that informs your thinking about performance etc.) 👍
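
To illustrate how such an NPE can arise, here is a small sketch assuming the value arrives as a boxed `Boolean` that may be null. The class and method names are hypothetical and not taken from Trino or Iceberg:

```java
import java.util.Optional;

// Hypothetical illustration; not taken from the Trino or Iceberg code bases.
final class UnboxingPitfall
{
    private UnboxingPitfall() {}

    // Returns true when auto-unboxing the argument throws NullPointerException.
    static boolean unboxingThrows(Boolean value)
    {
        try {
            boolean primitive = value; // implicit call to value.booleanValue()
            return false; // reached only when unboxing succeeded
        }
        catch (NullPointerException e) {
            return true;
        }
    }

    // Null-safe alternative: make absence explicit instead of unboxing blindly.
    static String render(Boolean value)
    {
        return Optional.ofNullable(value)
                .map(v -> v ? "true" : "false")
                .orElse("NULL");
    }
}
```

Passing a nullable `Boolean` to a method that takes a primitive `boolean` compiles without any warning, which is what makes this failure easy to miss in review.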

Member

@losipiuk losipiuk left a comment

Is there a way to test it?

@findinpath
Contributor Author

findinpath commented Mar 2, 2022

I tried setting write.metadata.metrics.default to none so that no metrics are written, but contains_nan was still set to false in my test case.

The snippet below shows that the containsNan field can't be NULL:

https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/PartitionSummary.java#L77-L81

@vincentpoon can you point out how to reproduce reading NULL for contains_nan? Were you reading from Iceberg files written with an older version of Iceberg?

@vincentpoon
Member

@findinpath I'm not sure. We had tables and data from an older Trino version, 363. Reading $manifests was problematic for that data, but newer tables that we just created don't have this problem.

@kbendick
Contributor

kbendick commented Mar 3, 2022

Hi all!

Stopping in as I have some knowledge in this area and have contributed to it upstream in the Iceberg repo.

TLDR - It's almost certainly related to older data. The partition summary field contains_nan is handled by the writers. An extra column was added to support it, and so for backwards compatibility reasons, it's considered nullable. It also sometimes doesn't show up in ORC.

Older versions of Iceberg didn't set this field in partition summaries, so older data will be missing it.

To support NaN metrics, a new "metadata row" that counts NaNs was added to the writer. This lower-level data is missing in some formats, but the top-level partition summary is almost always populated.

For the manifest summary of a partition field, it is now set for all table formats that I tested for V1 (which was all of them). Technically, though, writing it is not required: for backwards compatibility it may be skipped if it would be too expensive to write, and the spec declares it nullable. So formats that don't contain metrics aren't required to have it, and cases such as in-place file imports may not support it at some later date, given what the spec allows.

So it should be treated as nullable. Another reason to keep it nullable: writing NaN as a bucketed partition field may not have been handled correctly in certain engines, and the compatibility concerns make it much more practical to keep the field nullable.

However, in practice, contains_nan is pretty much always populated at the partition summary level in newer versions of Iceberg. In older versions, the field isn't written to the underlying summaries, so it won't be present.
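
One way a reader could treat the nullable field is sketched below. The helper name and the conservative policy (a missing value means "unknown", so NaN cannot be ruled out) are assumptions made for illustration, not behavior confirmed for Trino or Iceberg:

```java
// Hypothetical helper; the name and the conservative policy are assumptions.
final class ContainsNanPolicy
{
    private ContainsNanPolicy() {}

    // A missing value means "unknown", so only an explicit false rules NaN out.
    static boolean mayContainNan(Boolean containsNan)
    {
        return containsNan == null || containsNan;
    }
}
```

Under this policy, summaries written by older versions are never read as "definitely no NaN"; they just lose whatever shortcut an explicit `false` would allow.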

@findinpath findinpath requested a review from losipiuk March 8, 2022 17:27
@findepi findepi changed the title from "Add null handling for contains_nan field in partition summaries" to "Fix reading Iceberg $manifests table when contains_nan is NULL" Mar 15, 2022
@findepi
Member

findepi commented Mar 15, 2022

@findinpath can you please rename commit to something like Fix reading Iceberg $manifests table when contains_nan is NULL?

@findinpath findinpath force-pushed the iceberg-contains-nan branch from 2d62430 to 0e0735f Compare March 15, 2022 11:18
@findepi findepi merged commit cf92324 into trinodb:master Mar 25, 2022
@findepi findepi added the no-release-notes This pull request does not require release notes entry label Mar 25, 2022
@github-actions github-actions bot added this to the 375 milestone Mar 25, 2022

Labels

cla-signed no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

Query against $manifests Iceberg table fails with NullPointerException

6 participants