Skip to content

Conversation

@BACtaki
Copy link

@BACtaki BACtaki commented Sep 20, 2021

Issue: #757

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this!

@@ -0,0 +1,134 @@
# Metadata Tables

This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer to change line on full sentence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the definitions page


## Table Schema

### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the actual names of the tables, instead of the class name, like files, manifests, entries, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use of this HTML tag?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jack on the section names.

| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
| | _required_ | **`sequence-number`** | A monotonically increasing long that tracks the order of changes to a table |
| _required_ | _required_ | **`timestamp-ms`** | A timestamp when the snapshot was created, used for garbage collection and table inspection |
| _optional_ | _required_ | **`manifest-list`** | The location of a manifest list for this snapshot that tracks manifest files with additional meadata |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
```

### Metadata Table Schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what other people think, I think I would prefer having the schema all in that metadata.md page, and we can remove this section in Spark and only provide a SQL example for the syntax to query the table + a link to that page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much value in having these schemas here. Seems like they'll just get out of date.

### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable`

| Column name | Required | Data type | Description |
|-----------------------|-----------|-------------------|-------------|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a Column ID column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need IDs? I'm not sure those are valuable to users.

| equality_ids | | `list<int>` | Equality comparison field IDs
| sort_order_id | | int | Sort order ID

### <a id="AllEntriesTable"></a> 2. `AllEntriesTable`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anchors are generated automatically, so no need to add them in the markdown.

@@ -0,0 +1,134 @@
# Metadata Tables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this documentation? Much of this is already covered in the Spark queries page, which this links to for docs on how to query the metadata tables.

Who is the intended audience for these docs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is not a Spark only feature, the intention is to make it a top level documentation, and have proper tables for the metadata table schema instead of showing Java code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a pretty good argument against adding this, though. Documenting what tables exist without an engine is confusing to anyone looking for how to view metadata tables. Just saying that these tables exist and what their schemas are doesn't help a user coming to the docs.

If this were part of a document on the API that explained how to use the metadata tables and was targeted at engine developers, I think it would be more valuable. But I think placing it under tables will just cause confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree that is a confusion point. So do you recommend having one section for each engine around system tables? My current thought is to have this page introducing the Iceberg schema, linking to related sections in each engine page for examples of using system tables.

One of my intention to add it was because this is a very important feature not exist in other similar products, and it provides huge benefits for users to build data management capabilities around such information. Tables like manifests and files also support optimizations like predicate pushdown and file pruning, which essentially solves the big metadata issue. So I feel it's a pity to hide such information too deeply.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think for now we should include this for each engine. They do expose the tables differently (like Trino's $ syntax) so I don't see a lot of value in splitting docs into common and engine-specific. That just makes it harder to find what you're looking for.

- Performance: performance.md
- Reliability: reliability.md
- Schemas: schemas.md
- Table evolution: evolution.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert the reordering here.

@rdblue
Copy link
Contributor

rdblue commented Sep 26, 2021

Sounds like we agree that we would rather document this in each engine to avoid confusion. I'm going to close this PR, but if there is still disagreement feel free to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants