Open
Changes from 17 commits
50 commits
8c5d276
add materialzied view to view spec
JanKaul Aug 21, 2024
afc4a0d
add uuid to source table
JanKaul Aug 29, 2024
b2d0b68
Update view-spec.md
JanKaul Sep 5, 2024
27783c7
improve refresh-state description
JanKaul Sep 7, 2024
e85ab16
remove identifier from refresh-state
JanKaul Sep 19, 2024
cff9596
fix comments
JanKaul Oct 4, 2024
a8b52b2
incorporate comments
JanKaul Nov 21, 2024
521477f
fix MV introduction
JanKaul Nov 25, 2024
ed85e95
Update format/view-spec.md
JanKaul Dec 4, 2024
3bc583c
fix comments
JanKaul Dec 4, 2024
49d5da8
fix spelling
JanKaul Dec 10, 2024
d18c9da
fix view-version-id in refresh-state
JanKaul Dec 15, 2024
eb7d71b
Update format/view-spec.md
JanKaul Jan 10, 2025
8ffff63
Update format/view-spec.md
JanKaul Jan 10, 2025
6b065f5
fix introduction wording
JanKaul Jan 10, 2025
0e17881
rename full identifier to table identifier
JanKaul Feb 12, 2025
7e9dc11
fix comments
JanKaul Feb 18, 2025
efed628
Update format/view-spec.md
JanKaul May 13, 2025
0673113
Update format/view-spec.md
JanKaul May 13, 2025
476aced
Update format/view-spec.md
JanKaul May 13, 2025
a02ff98
Merge branch 'apache:main' into materialized-view-spec
JanKaul Jul 6, 2025
9a377e0
clarify storage "fresh", "stale" and "invalid"
JanKaul Jul 6, 2025
3fec943
clarify that refresh-state is set on every storage table snapshot
JanKaul Jul 6, 2025
413ceb4
Add reference to from table to MV spec
Jul 6, 2025
295035e
Remove optional catalog field from storage table identifier
Jul 6, 2025
25adf5b
fix typo
Jul 6, 2025
ae0b005
Update format/view-spec.md
stevenzwu Jul 28, 2025
95740e0
Update format/view-spec.md
JanKaul Aug 20, 2025
9b492c9
fix refresh metadata
JanKaul Aug 20, 2025
878b66b
fix comment
JanKaul Aug 20, 2025
fe2dec7
fix duplication
JanKaul Aug 21, 2025
a2ca4b2
fix view-version-id
JanKaul Oct 22, 2025
6dffbee
update refresh-state
JanKaul Oct 22, 2025
ec2ac6a
fix comments
JanKaul Oct 29, 2025
48c1553
add max-staleness
JanKaul Nov 26, 2025
d154f8d
fix lint errors
JanKaul Nov 26, 2025
25d70dd
Add the case for max-staleness being null
JanKaul Nov 26, 2025
e07b7ef
remove last line
JanKaul Nov 26, 2025
9cc611f
Fix max-staleness description
JanKaul Dec 4, 2025
e2041f8
Add storage table to Source table description
JanKaul Dec 4, 2025
b83f1fd
oncorporate comments
JanKaul Dec 6, 2025
40d1fb7
fix other inconsistencies
JanKaul Dec 6, 2025
72f2725
update matview spec
JanKaul Dec 17, 2025
a6bc657
incorporate comments
JanKaul Dec 18, 2025
6378991
clarify coarse-grained freshness evaluation
JanKaul Dec 18, 2025
9e06580
clarify freshness
JanKaul Dec 18, 2025
bd0f6c3
remove trust point
JanKaul Dec 18, 2025
8131d61
fix spelling
JanKaul Dec 18, 2025
0fd8dc3
add materialized view to source state description
JanKaul Dec 18, 2025
f543a34
incorporate comments
JanKaul Dec 18, 2025
70 changes: 70 additions & 0 deletions format/view-spec.md
@@ -42,12 +42,28 @@ An atomic swap of one view metadata file for another provides the basis for maki

Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location.

### Materialized Views

Materialized views are a type of view whose query results are precomputed and stored as a table.
When queried, engines may return the precomputed data for the materialized view, shifting the cost of query execution to the precomputation step.

Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, known as the storage table, which stores the precomputed data.
The metadata for a materialized view extends the Iceberg view metadata, adding a pointer to the precomputed data and refresh information to determine if the data is still fresh.
The refresh information is composed of data about the so-called "source tables", which are the tables referenced in the query definition of the materialized view.
The storage table is in one of three states, "fresh", "stale" or "invalid", determined as follows:
* **fresh** -- The `snapshot-id`s recorded by the last refresh operation match the current `snapshot-id`s of the source tables.
* **stale** -- The `snapshot-id`s do not match, indicating that a refresh operation needs to be performed to capture the latest source table changes.
Contributor

Should we consider the case where the current materialized view references another materialized view? If we determine that the current materialized view is not stale, we should also ensure that any referenced materialized views are not marked as invalid.

Contributor

it seems that from the rest of the spec, nested view is indeed a valid use case, for materialized view referring other materialized views or even common views. However it seems to me that the rest of the spec didn't mention one of these states being recorded anywhere in the spec, which I would also agree to not include since it feels very challenging to ensure this info could be trustworthy especially when there are multiple engines doing reads/writes on the same data. With this, do we mention these fresh/stale/invalid here mostly for establishing a nomenclature for view users? If so I wonder if it's worth explicitly mentioning that to avoid the user expecting one of these three states to be recorded in their view metadata.

Author

Thanks @chenjian2664 and @yyanyy for your comments. These are great points. The current definition supports nested views in that it fully expands the query tree at refresh time and stores all directly and indirectly referenced tables in the refresh-state. This can include multiple levels of views.
This currently would also apply to materialized views. However, I think you're making a good point. When referencing another materialized view it would be easier to just store the snapshot_id of its storage table instead of tracking all table references in the materialized view.

You're correct in assuming that fresh/stale/invalid are introduced to have some nomenclature; these do not occur in the metadata. I will try to make this clearer.

Contributor

I put together a doc on the two options regarding source MVs with pros and cons. please take a look.
https://docs.google.com/document/d/1_StBW5hCQhumhIvgbdsHjyW0ED3dWMkjtNzyPp9Sfr8/edit?tab=t.0

Author

Which option do you prefer? And should I include this in the spec?

Contributor

We discussed the framework for state and lineage information in this doc. I understand the conclusion is:

  • Lineage information is on the view side. It maps immediate children of a view to their UUIDs.
  • Refresh state information is on the table side. It maps deeply nested children of the materialized view (using their UUID primarily) to snapshot IDs/version IDs.

Now to the point of this discussion: if a child happens to be an MV, then it is conceptually still a view. The above framework would also naturally capture that: View version of the view aspect of the MV will be captured, and underlying table snapshot IDs would also be captured, since we are storing deeply nested state information.

So to summarize, I prefer to handle MVs as views because:

  • They are actually views (tables is just implementation detail of MV).
  • This framing blends well with previously set lineage and state information; it does not introduce new language or treatment, so keeps things simple.

Contributor

@wmoustafa make sure I understand you correctly. If we are using the two options described in this doc, you prefer option 1.

@wmoustafa I think we discussed that storing the lineage inside the view was unnecessary and that having the refresh state in the storage table was enough. The engine that queries the view will have to parse, validate and check staleness in order to replace the view with the storage table.

Contributor

@stevenzwu, yes I prefer option 1.
@bennychow, thanks for the clarification. I agree that storing lineage information in the view is a nice-to-have.

Contributor

I personally don't have first-hand experience working with MVs, but here's my 2 cents: I'm not completely sure we need to couple this with the lineage discussion, since I feel they may serve different purposes. From the doc Steven shared, I think the main purpose and advantage of option 2 is to help the engine determine whether the MV is stale. Obviously the exact criteria for determining this are engine specific, but as a high-level guess: when someone creates MV1 from MV2 and MV3, most likely this user/engine would expect MV1 to be refreshed based on MV2's and MV3's data most of the time, instead of recursively obtaining the most deeply nested source tables and starting from there. Otherwise there's not much point in creating MV2 and MV3 as materialized views and using them as the children of MV1; normal views would be enough. Because of this, I think the fact that it's a nested MV makes the "materialized" part of the child views more interesting for engine processing.

* **invalid** -- The `current-version-id` of the materialized view does not match the `view-version-id` of the refresh state.
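
The three states above can be sketched as a small classification function. This is an illustrative sketch only, not part of the proposed spec text: the function name, its parameters, and the choice to check "invalid" before comparing snapshots are assumptions, and the mappings from source table UUID to snapshot ID are hypothetical inputs that an engine would have to assemble itself.

```python
from typing import Mapping


def storage_table_state(
    current_view_version_id: int,
    refresh_view_version_id: int,
    refreshed_snapshot_ids: Mapping[str, int],
    current_snapshot_ids: Mapping[str, int],
) -> str:
    """Classify the storage table as "fresh", "stale" or "invalid".

    Both mappings are keyed by source table UUID (hypothetical inputs).
    """
    # Assumption: a view version mismatch takes precedence over snapshot checks.
    if current_view_version_id != refresh_view_version_id:
        return "invalid"
    # Fresh only if every source table is still at the refreshed snapshot.
    if dict(refreshed_snapshot_ids) == dict(current_snapshot_ids):
        return "fresh"
    return "stale"


state = storage_table_state(3, 3, {"table-a": 10}, {"table-a": 11})
print(state)
```

As noted in the discussion, these states are nomenclature only and are not recorded in the metadata, so an engine would compute them on demand.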

## Specification

### Terms

* **Schema** -- Names and types of fields in a view.
* **Version** -- The state of a view at some point in time.
* **Storage table** -- Iceberg table that stores the precomputed data of the materialized view.
* **Source table** -- A table reference that occurs in the query definition of the materialized view. The materialized view depends on the data from the source tables.
* **Source view** -- A view reference that occurs in the query definition of the materialized view. The materialized view depends on the definitions from the source views.

### View Metadata

@@ -82,9 +98,12 @@ Each version in `versions` is a struct with the following fields:
| _required_ | `representations` | A list of [representations](#representations) for the view definition |
| _optional_ | `default-catalog` | Catalog name to use when a reference in the SELECT does not contain a catalog |
| _required_ | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier |
| _optional_ | `storage-table` | A [storage table identifier](#storage-table-identifier) of the storage table |

When `default-catalog` is `null` or not set, the catalog in which the view is stored must be used as the default catalog.

When `storage-table` is `null` or not set, the entity is a common view; otherwise it is a materialized view.
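
A minimal sketch of that rule, assuming the view version has already been parsed from JSON into a Python dict. The helper name `is_materialized_view` and the example values are hypothetical.

```python
def is_materialized_view(view_version: dict) -> bool:
    """True if the version carries a non-null `storage-table` field."""
    # A missing key and an explicit null (None) both mean "common view".
    return view_version.get("storage-table") is not None


# Hypothetical view versions, reduced to the fields relevant here.
common_view = {"default-namespace": ["default"]}
materialized_view = {
    "default-namespace": ["default"],
    "storage-table": {"namespace": ["default"], "name": "mv_storage"},
}

print(is_materialized_view(common_view), is_materialized_view(materialized_view))
```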

#### Summary

Summary is a string to string map of metadata about a view version. Common metadata keys are documented here.
@@ -160,6 +179,57 @@ Each entry in `version-log` is a struct with the following fields:
| _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
| _required_ | `version-id` | ID that `current-version-id` was set to |

#### Storage Table Identifier

The table identifier for the storage table that stores the precomputed results.

| Requirement | Field name | Description |
|-------------|----------------|-------------|
| _optional_ | `catalog` | A string specifying the name of the catalog. If set to `null`, the catalog is the same as the view's catalog |
| _required_ | `namespace` | A list of strings for namespace levels |
| _required_ | `name` | A string specifying the name of the table/view |

In this case (Storage Table Identifier), this is always a table and never a view, right?

Suggested change
| _required_ | `name` | A string specifying the name of the table/view |
| _required_ | `name` | A string specifying the name of the table |


### Storage table metadata

This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
The property `refresh-state` is set on the table [snapshot summary](https://iceberg.apache.org/spec/#snapshots) and is used to determine the freshness of the precomputed data of the storage table.

| Requirement | Field name | Description |
|-------------|-----------------|-------------|
| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |

#### Refresh state

The refresh state record captures the state of all source tables and source views in the fully expanded query tree of the materialized view, including indirect references. Indirect references are the tables/views that are not directly referenced in the query but are nested within other views. The refresh state has the following fields:

| Requirement | Field name | Description |
|-------------|----------------|-------------|
| _required_ | `view-version-id` | The `version-id` of the materialized view when the refresh operation was performed |
| _required_ | `source-table-states` | A list of [source table](#source-table) records for all tables that are directly or indirectly referenced in the materialized view query |
| _required_ | `source-view-states` | A list of [source view](#source-view) records for all views that are directly or indirectly referenced in the materialized view query |
| _required_ | `refresh-start-timestamp-ms` | Timestamp when the refresh operation was started (ms from epoch) |

#### Source table

A source table record captures the state of a source table at the time of the last refresh operation.

| Requirement | Field name | Description |
|-------------|----------------|-------------|
| _required_ | `uuid` | The UUID of the source table |
| _required_ | `snapshot-id` | The `snapshot-id` of the source table at the time of the last refresh operation |
| _optional_ | `ref` | Branch name of the source table being referenced in the view query |

When `ref` is `null` or not set, it defaults to "main".

#### Source view

A source view record captures the state of a source view at the time of the last refresh operation.

| Requirement | Field name | Description |
|-------------|----------------|-------------|
| _required_ | `uuid` | The UUID of the source view |
| _required_ | `version-id` | The `version-id` of the source view at the time of the last refresh operation |
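
To illustrate how the records above fit together, the following sketch builds a refresh state, stores it as a JSON-encoded string under `refresh-state` in a snapshot summary, and reads it back. The helper names and all UUIDs, IDs and timestamps are made-up placeholders, and the summary is modeled as a plain dict rather than through any Iceberg library API.

```python
import json


def encode_refresh_state(view_version_id, source_table_states,
                         source_view_states, refresh_start_timestamp_ms):
    """Return the refresh state record as a JSON-encoded string."""
    return json.dumps({
        "view-version-id": view_version_id,
        "source-table-states": source_table_states,
        "source-view-states": source_view_states,
        "refresh-start-timestamp-ms": refresh_start_timestamp_ms,
    })


def source_table_ref(source_table: dict) -> str:
    """Resolve `ref`, defaulting to "main" when null or not set."""
    ref = source_table.get("ref")
    return "main" if ref is None else ref


# Placeholder values for illustration only.
summary = {
    "refresh-state": encode_refresh_state(
        view_version_id=1,
        source_table_states=[
            {"uuid": "00000000-0000-0000-0000-000000000001", "snapshot-id": 123},
            {"uuid": "00000000-0000-0000-0000-000000000002", "snapshot-id": 456,
             "ref": "audit"},
        ],
        source_view_states=[
            {"uuid": "00000000-0000-0000-0000-000000000003", "version-id": 2},
        ],
        refresh_start_timestamp_ms=1700000000000,
    )
}

decoded = json.loads(summary["refresh-state"])
refs = [source_table_ref(t) for t in decoded["source-table-states"]]
print(refs)
```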

## Appendix A: An Example

The JSON metadata file format is described using an example below.