-
Notifications
You must be signed in to change notification settings - Fork 3k
Adding documentation for metadata tables #3159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this!
| @@ -0,0 +1,134 @@ | |||
| # Metadata Tables | |||
|
|
|||
| This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prefer to change line on full sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the definitions page
|
|
||
| ## Table Schema | ||
|
|
||
| ### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the actual names of the tables, instead of the class name, like files, manifests, entries, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use of this HTML tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Jack on the section names.
| | _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent | | ||
| | | _required_ | **`sequence-number`** | A monotonically increasing long that tracks the order of changes to a table | | ||
| | _required_ | _required_ | **`timestamp-ms`** | A timestamp when the snapshot was created, used for garbage collection and table inspection | | ||
| | _optional_ | _required_ | **`manifest-list`** | The location of a manifest list for this snapshot that tracks manifest files with additional meadata | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
| +----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+ | ||
| ``` | ||
|
|
||
| ### Metadata Table Schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what other people think, I think I would prefer having the schema all in that metadata.md page, and we can remove this section in Spark and only provide a SQL example for the syntax to query the table + a link to that page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much value in having these schemas here. Seems like they'll just get out of date.
| ### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable` | ||
|
|
||
| | Column name | Required | Data type | Description | | ||
| |-----------------------|-----------|-------------------|-------------| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing a Column ID column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need IDs? I'm not sure those are valuable to users.
| | equality_ids | | `list<int>` | Equality comparison field IDs | ||
| | sort_order_id | | int | Sort order ID | ||
|
|
||
| ### <a id="AllEntriesTable"></a> 2. `AllEntriesTable` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anchors are generated automatically, so no need to add them in the markdown.
| @@ -0,0 +1,134 @@ | |||
| # Metadata Tables | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this documentation? Much of this is already covered in the Spark queries page, which this links to for docs on how to query the metadata tables.
Who is the intended audience for these docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is not a Spark only feature, the intention is to make it a top level documentation, and have proper tables for the metadata table schema instead of showing Java code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a pretty good argument against adding this, though. Documenting what tables exist without an engine is confusing to anyone looking for how to view metadata tables. Just saying that these tables exist and what their schemas are doesn't help a user coming to the docs.
If this were part of a document on the API that explained how to use the metadata tables and was targeted at engine developers, I think it would be more valuable. But I think placing it under tables will just cause confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree that is a confusion point. So do you recommend having one section for each engine around system tables? My current thought is to have this page introducing the Iceberg schema, linking to related sections in each engine page for examples of using system tables.
One of my intention to add it was because this is a very important feature not exist in other similar products, and it provides huge benefits for users to build data management capabilities around such information. Tables like manifests and files also support optimizations like predicate pushdown and file pruning, which essentially solves the big metadata issue. So I feel it's a pity to hide such information too deeply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think for now we should include this for each engine. They do expose the tables differently (like Trino's $ syntax) so I don't see a lot of value in splitting docs into common and engine-specific. That just makes it harder to find what you're looking for.
| - Performance: performance.md | ||
| - Reliability: reliability.md | ||
| - Schemas: schemas.md | ||
| - Table evolution: evolution.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert the reordering here.
|
Sounds like we agree that we would rather document this in each engine to avoid confusion. I'm going to close this PR, but if there is still disagreement feel free to reopen it. |
Issue: #757