Fix reading Iceberg $manifests table when contains_nan is NULL#11241
Fix reading Iceberg $manifests table when contains_nan is NULL#11241findepi merged 1 commit intotrinodb:masterfrom
Conversation
There was a problem hiding this comment.
Optional may be handy here
Optional.ofNullable(summary.containsNaN()).ifPresentOrElse(
containsNan -> BOOLEAN.writeBoolean(rowBuilder, containsNan),
rowBuilder::appendNull);There was a problem hiding this comment.
I was thinking now also about something similar. A similar change would be beneficial for unboxing other types as well.
Alternatively this could be handled behind the scenes within BOOLEAN if we pass Boolean instead of boolean
There was a problem hiding this comment.
I'd love to avoid nullable reference types where possible. Quite easy to miss.
The writer being "smarter" means it's easy to introduce mistakes.
There was a problem hiding this comment.
Hi all
The field is nullable per the spec. It's generally only missing for older data files, so it's usually encountered (and why it's not super reproducible). But it is nullable per the spec. Older tables won't have this field populated and so that is a source of issue.
It is easy to encounter NPEs in my opinion so I typically prefer to handle things with optionals, but it is usually present (if that informs your thinking at all for performance etc) 👍
There was a problem hiding this comment.
For reference, here is the equivalent code for the scan in Spark: https://github.com/apache/iceberg/blob/68091037944ff7e9de91e7b619f313a8e98c1adc/core/src/main/java/org/apache/iceberg/ManifestsTable.java#L102-L123
Note that the column is nullable: https://github.com/apache/iceberg/blob/68091037944ff7e9de91e7b619f313a8e98c1adc/core/src/main/java/org/apache/iceberg/ManifestsTable.java#L41
|
I have tried with In the snippet below can be seen that the @vincentpoon can you point out on how to reproduce reading |
|
@findinpath I'm not sure, we had tables and data from an older Trino version, 363. Reading the |
|
Hi all! Stopping in as I have some knowledge in this area and have contributed in it upstream in the Iceberg repo TLDR - It's almost certainly related to older data. The partition summary field Older versions of Iceberg didn't set this in partition summaries field. So older data will be missing this field. To support nan metrics a new "metadata row" was added to the writer which counts nan's. This lower level data is missing in some formats, but the top level partition summary is almost always populated For the manifest summary of a partition field, it is now set for all table formats that I tested for V1 (which was all of them), but it's not a requirement to be written technically for backwards compatibility purposes if it would be too expensive to write and it's also technically nullable in the spec, so formats that don't contain metrics aren't required to have it and cases such as in place file imports may not support it at some later date because of what is in the spec. So it should be treated as nullable. Other reasons to keep it nullable would be things like writing NaN as a bucketed partition field may not have been handled correctly in certain engines and the compatibility concerns make it much more practical to keep as nullable. However, in practice, at the partition summary level |
|
@findinpath can you please rename commit to something like |
2d62430 to
0e0735f
Compare
Description
This change affects queries on
$manifestsIceberg metadata tables.This PR fixes the handling of
contains_nanfield for partition summaries in Iceberg manifests table.This change affects the Iceberg connector.
Add handling for reading correctly the manifests of Iceberg files written with older Iceberg version (that didn't contain yet
contains_nanfield handling).Related issues, pull requests, and links
Fixes #11237
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: