Document special tables exposed by Iceberg #10514
Thank you very much.
Looks like #10480 will introduce more to be documented with $properties.
The Iceberg connector maintains several hidden tables that provide metadata for a specific table. You can query each metadata table by appending the metadata table name to the table name::
SELECT * FROM "test_table$data"
kbendick left a comment
This looks really good and is an important addition.
I've gone over most of the types and they all look good to me. I'll do another pass, but I left some initial nits / questions. Please feel free to resolve any comments that aren't relevant etc, as I'm new to this review and less familiar with the Trino codebase (coming mostly from the core Iceberg side).
I'd also mention that the Iceberg docs provide at least one query that shows how to make use of the metadata tables, such as the query that joins the history table with the snapshots table.
Probably the most important metadata table (in my opinion) is the $files table, as many users want to inspect their underlying storage to see / count files, but that can be incorrect given that we retain older data. In a follow up, that might be something to emphasize.
As a follow up to this, I think it would be good to include some example queries like that to help users use these metadata tables. That's also something we need to work on in the iceberg docs themselves, so I'd be happy to collaborate with somebody on that or help them find the right people to do so. 😄
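For illustration, two such queries in Trino syntax might look like the sketch below. It reuses the `test_table` name from the docs snippet above. The column names (`made_current_at`, `is_current_ancestor`, `operation`, `record_count`, `file_size_in_bytes`) are assumptions based on the Trino Iceberg connector's metadata tables and should be checked against the connector version in use.

```sql
-- Sketch only: join the history table with the snapshots table to see
-- which operation produced each snapshot that was made current.
-- Column names are assumed, not taken from this PR.
SELECT
    h.made_current_at,
    s.operation,
    h.snapshot_id,
    h.is_current_ancestor
FROM "test_table$history" AS h
JOIN "test_table$snapshots" AS s
    ON h.snapshot_id = s.snapshot_id
ORDER BY h.made_current_at;

-- Sketch only: count the data files referenced by the current snapshot
-- instead of listing the storage location, which may still hold files
-- retained for older snapshots.
SELECT
    count(*) AS file_count,
    sum(record_count) AS total_records,
    sum(file_size_in_bytes) AS total_bytes
FROM "test_table$files";
```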
Please feel free to reach out to me anytime on the Iceberg slack or the Trino slack if I can be of help with anything or help finding additional points of contact.
Nit: Within Iceberg core, we typically refer to these as just Metadata Tables. Is there a concept within Trino that conflicts naming-wise, which makes it better to avoid that terminology at the top level?
If not, I would maybe make this heading Metadata Tables to be consistent. Or maybe Special Metadata Tables if you're looking to avoid conflicts.
@mosabua do you know why the metadata tables are called special tables in the Hive connector?
Nit: The phrase "provides general metadata information about the table" is possibly redundant unless it also clarifies that the metadata covers user-supplied (or generated) tags for the table in addition to configured table properties (some of which are specific to Iceberg).
Maybe: "The $properties table provides access to general information about Iceberg table configuration and any additional metadata key/value pairs that users or engines might have tagged the table with."?
Alternatively, it might be OK to simply mention that this is similar to Hive's TBLPROPERTIES?
For reference I found the following in Hive's documentation:
The TBLPROPERTIES clause allows you to tag the table definition with your own
metadata key/value pairs. Some predefined table properties also exist,
such as last_modified_user and last_modified_time which are automatically
added and managed by Hive.
Iceberg doesn't supply last_modified_time as a tblproperty itself, but we also don't specifically try to stop HMS from adding these.
I just noticed there is another PR specifically related to $properties, so I'll likely move this comment there if appropriate =)
"I just noticed there is another PR specifically related to $properties"
#10480 is already merged.
I'm updating the docs here accordingly.
Question: is contains_nan not included from within Trino?
Nit: Would it be worth mentioning for these that the integer keys are the field IDs used by Iceberg?
I don't have a good way at the moment to express that (especially without looking at the rest of the existing docs). This could be a follow up item, if anything.
Good point.
I have adapted this to:
Mapping between the Iceberg column ID and its corresponding size within the columnar file
So this is an odd one, as it's not currently used in open source (though it's coming and a high priority on the roadmap).
The Javadoc for this interface (which is a light wrapper around a ByteBuffer) refers to this as metadata about an encrypted data file's encryption key. More specifically, it likely refers to the location of the encryption key, but that's probably being iterated on to be more efficient, so I wouldn't state that for now.
Maybe "Metadata about the encryption key used to encrypt this file, if applicable"?
I should mention that the Avro doc comment for this simply says "Encryption key metadata blob".
Might consider referring to this as an enum.
Can you suggest an appropriate way to bring up the enum aspect in the description?
I would just describe it as a list of valid options or available values, and define what each one means.
losipiuk left a comment
Some nits.
Let me know when this is ready to merge.
Follow-up PR for exposing further fields in the $manifests table: #10809
mosabua left a comment
Thanks for all the updates. Great addition to the docs.