Materialized View Spec #11041
62ba549 to
b986cf0
Compare
bennychow
left a comment
Hi @JanKaul
I left some minor comments around wording. Otherwise, I believe your changes here capture everything we need for a minimum MV spec.
The mailing list did talk about including the partial identifiers for the source table and source view records to improve usability. While not absolutely necessary, I think it's a pretty good addition to include too.
https://lists.apache.org/thread/9lc3t4k0hw4d0hn07lgy9t2vgp2fm0om
Thanks
90f517e to
2790b0d
Compare
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
szehon-ho
left a comment
I think this is the right direction, thanks @JanKaul. I think we should reopen. Also left some suggestions.
2790b0d to
af76391
Compare
szehon-ho
left a comment
Thanks, one more minor suggestion
af76391 to
521477f
Compare
|
When materialized view is created, two entities would be added to the catalog - a view and a storage table. From the engine perspective it is important to expose it as a single object during listing. Are there any rules how catalog implementors should deal with these objects? E.g., shall we expose the view via |
|
Yes correct, the idea is to filter-out the storage-table for the |
format/view-spec.md
Outdated
| | Requirement | Field name | Description | | ||
| |-------------|----------------|-------------| | ||
| | _required_ | `namespace` | A list of strings for namespace levels | | ||
| | _required_ | `name` | A string specifying the name of the table/view | |
In this case (Storage Table Identifier), this is always a table and never a view, right?
| | _required_ | `name` | A string specifying the name of the table/view | | |
| | _required_ | `name` | A string specifying the name of the table | |
format/view-spec.md
Outdated
| Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. The storage table is an Iceberg table with additional materialized view refresh state metadata. | ||
| Refresh metadata contains information about the "source tables" and/or "source views", which are the tables/views referenced in the query definition of the materialized view. | ||
| During read time, a materialized view (storage table) can be interpreted as "fresh", "stale" or "invalid", depending on the following situations: | ||
| * **fresh** -- The `snapshot_id`s of the last refresh operation match the current `snapshot_id`s of all the source tables, OR all source table snapshots that differ from the last refresh have timestamps within a configured staleness window. |
I think during the last meeting we agreed to use the "delayed view" definition/semantics, as it doesn't mandate optimizations and allows for false negatives (it allows claiming that the view is stale when the freshness logic is naive). The wording above mandates a very specific optimization, namely the additional checks needed if the refresh is outside of the staleness window.
Proposal:
- Invalid: The current version_id of the materialized view does not match the view-version-id recorded in its refresh state. A read operation cannot proceed using the materialized view's data.
- Fresh: The materialized view is valid and the stored data represents the result set that would have been retrieved if the underlying view query had been executed at some point during the defined staleness window.
- Stale: The materialized view is valid but not fresh.
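A minimal sketch of how a consumer might apply the three proposed states, for illustration only. The function name, parameter names, and the `MVState` enum are hypothetical. The freshness test is the naive timestamp check (refresh started inside the staleness window), which can report a view as stale even when its data is still correct. Treating a `null` window as "always fresh" is an assumption taken from the draft note on `max-staleness-ms` quoted in this review.

```python
from enum import Enum
from typing import Optional


class MVState(Enum):
    INVALID = "invalid"
    FRESH = "fresh"
    STALE = "stale"


def classify(current_version_id: int,
             refresh_view_version_id: int,
             refresh_start_ms: int,
             now_ms: int,
             max_staleness_ms: Optional[int]) -> MVState:
    """Classify a materialized view using only view and refresh-state metadata."""
    # Invalid: the view definition changed after the last refresh.
    if current_version_id != refresh_view_version_id:
        return MVState.INVALID
    # Assumption from the draft note: no window configured means always fresh.
    if max_staleness_ms is None:
        return MVState.FRESH
    # Naive check: the refresh started inside [now - max_staleness_ms, now].
    if now_ms - max_staleness_ms <= refresh_start_ms <= now_ms:
        return MVState.FRESH
    # Valid but not provably fresh (may be a false negative).
    return MVState.STALE
```

An engine with more sophisticated logic could still report `FRESH` in cases where this naive check returns `STALE`; the proposal permits the check to be conservative.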
Thank you very much for your improvements! I've updated the text.
format/view-spec.md
Outdated
| 1. The number of versions to retain is controlled by the view property: `version.history.num-entries`. | ||
| 2. Properties are used for metadata such as `comment` and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata. | ||
| 3. The `max-staleness-ms` field only applies to materialized views and must be set to `null` for common views. When `max-staleness-ms` is not `null`, the query engine may return data directly from the `storage-table` without refreshing if all source table snapshots that differ from those used in the last refresh have timestamps within `max-staleness-ms` of the current time. When `max-staleness-ms` is `null` for a materialized view, the data in the `storage-table` is always considered fresh. |
This staleness definition is different from the one on line 54 (in its current form it seems to require NOT implementing the optimization that is mandated above on line 54).
Proposal:
The max-staleness-ms field defines the staleness window for determining the freshness state; see the section "### Materialized Views" above.
PS
The same problem occurs on line 82 (a contradictory definition instead of a reference).
talatuyarer
left a comment
I captured these comments from the latest community sync, cc @stevenzwu
format/view-spec.md
Outdated
| The refresh state record captures the state of source tables, views, and materialized views at refresh time. | ||
| * Source view states are stored in `source-view-states`. It includes indirect references — views nested within other views (excluding MVs). |
I think in the meeting we were converging on including MVs here and below. Let's revisit during the next meeting to make sure there is no ambiguity.
format/view-spec.md
Outdated
| During read time, a materialized view (storage table) can be interpreted as "fresh", "stale" or "invalid", depending on the following situations: | ||
| * **invalid** -- The current `version_id` of the materialized view does not match the `view-version-id` recorded in its refresh state. A read operation cannot proceed using the materialized view's data. |
This definition creates interesting artifacts:
BASE_TABLE1 => MV1 (SELECT a FROM BASE_TABLE1) => MV2 (SELECT a FROM MV1)
- We refreshed both MV1 and MV2.
- We ran an ALTER MV statement and added a new column to the MV1 definition.
- At this point MV1 is invalid while MV2 is valid (this seems strange/inconsistent).
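The scenario can be reproduced with a small model, sketched here under the assumption that validity is checked per materialized view by comparing only its own version ids, as the draft definition states. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MaterializedView:
    name: str
    current_version_id: int        # current version of the view definition
    refresh_view_version_id: int   # view-version-id recorded in the refresh state

    def is_invalid(self) -> bool:
        # Draft definition: invalid when the two version ids differ.
        return self.current_version_id != self.refresh_view_version_id


# Both views were refreshed at version 1.
mv1 = MaterializedView("MV1", current_version_id=1, refresh_view_version_id=1)
mv2 = MaterializedView("MV2", current_version_id=1, refresh_view_version_id=1)

# ALTER on MV1 creates a new view version; MV2's own definition is untouched.
mv1.current_version_id = 2
```

After the ALTER, `mv1.is_invalid()` is true while `mv2.is_invalid()` is false, even though MV2 reads from MV1. Nothing in the per-view check propagates the change downstream.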
I personally find the invalid, fresh and stale statuses too generic. We could instead describe the three high-level ways a consumer would validate freshness:
1. Use refresh-start-timestamp-ms.
2. Use source-table-states and source-view-states, optionally validating the nested MV refresh-state.
3. Advanced validation through view SQL expansion and refresh-state comparison.
For 3, the consumer may determine that affected partitions in the table are still up to date after pushing down filters.
|
Benny,
Are you proposing to simplify the states from tri-state (fresh, stale, invalid) down to two states (fresh, stale), with the various flavors suggested above implementing the freshness determination?
Personally I would prefer the simplification (I am not certain I interpreted your comment correctly).
PS
If we were to pursue this option, we would need to make some definitive statements about MVs that have not been initialized.
…On Fri, Dec 12, 2025 at 8:46 PM Benny Chow ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In format/view-spec.md
<#11041 (comment)>:
> +When `ref` is `null` or not set, it defaults to "main".
+
+#### Source view
+
+A source view record captures the state of a source view at the time of the last refresh operation.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `uuid` | The uuid of the source view |
+| _required_ | `version-id` | Version-id of when the last refresh operation was performed |
+
+#### Status Interpretation
+
+During read time, a materialized view (storage table) can be interpreted as "fresh", "stale" or "invalid", depending on the following situations:
+
+* **invalid** -- The current `version_id` of the materialized view does not match the `view-version-id` recorded in its refresh state. A read operation cannot proceed using the materialized view's data.
|
I guess what I am proposing is to just remove the "Status Interpretation" section. It's just not possible to define a "fresh" status because different producers and consumers have different tolerances for refresh. Consumer A may be very strict, so that "fresh" means all source tables and views are the latest. Consumer B, on the other hand, may define "fresh" as referenced MVs having been refreshed within the max-staleness window. |
bennychow
left a comment
@JanKaul Thanks for updating this based on our 12/16 community sync. I made some suggestions for clarifying the refresh-state structure with a dependency graph.
format/view-spec.md
Outdated
| #### Freshness | ||
| A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale. |
Suggestion:
A materialized view is considered fresh when its precomputed data is usable by consumers. As tables and views referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the materialized view's dependency graph. When this occurs, the materialized view (storage table) is considered stale.
A materialized view is considered fresh when its precomputed data is usable by consumers.
This definition doesn't capture what should be considered fresh. The later part on consumer responsibilities is effectively what defines what fresh means.
Suggestion:
A materialized view is considered fresh when its precomputed data is usable by consumers. As tables and views referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the materialized view's dependency graph. When this occurs, the materialized view (storage table) is considered stale.
For me, "reflect the materialized view's dependency graph" doesn't have the same meaning. In many cases the graph will be the same and only the data at the nodes will have changed. In my opinion that's not clear in your formulation.
I agree with Jan that "reflect the materialized view's dependency graph" isn't quite accurate.
format/view-spec.md
Outdated
| | _required_ | `versions` | A list of known [versions](#versions) of the view [1] | | ||
| | _required_ | `version-log` | A list of [version log](#version-log) entries with the timestamp and `version-id` for every change to `current-version-id` | | ||
| | _optional_ | `properties` | A string to string map of view properties [2] | | ||
| | _optional_ | `max-staleness-ms` | The maximum time interval in milliseconds during which changed source table snapshots are considered fresh enough to skip refreshing [3] | |
The wording here isn't super clear about how this config should be used. It also doesn't capture the delayed view semantics that @igorbelianski-cyber mentioned.
We should remove the "to skip refreshing" part, as consumers can have other fallback behaviors, like failing or treating the MV as a logical view.
I've added the delayed view semantics to the Notes section below
format/view-spec.md
Outdated
| Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, the "storage-table", which stores the precomputed data. | ||
| Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. The storage table is an Iceberg table with additional materialized view refresh state metadata. | ||
| Refresh metadata contains information about the "source tables" and/or "source views", which are the tables/views referenced in the query definition of the materialized view. |
This reads as if "source materialized view" was excluded from the list on purpose (I remember us agreeing that source materialized views should be included, with a precise spelling-out of what that means).
| #### Refresh state | ||
| The refresh state record captures the unique dependencies in the materialized view's dependency graph. These dependencies include source Iceberg tables, views, and nested materialized views that allow a consumer to determine the freshness of the materialized view. |
"Unique dependencies" seems to indicate that either the diamond pattern of queries is not supported, or there is a policy for choosing which instance of a dependency is selected if the same table is queried multiple times (indirectly, through different upstream MVs at different points in time).
To properly support this, we probably need to store an optional query path in each entry of the refresh state.
| **Producer responsibilities:** | ||
| - The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation. | ||
| - The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). |
The source states list may be empty if the source state cannot be determined, or if the producer's freshness policy doesn't require it for staleness determination.
I think the intent to always treat the data as fresh should be specified by setting max-staleness-ms to null, if that is what you mean by "doesn't require staleness determination".
The issue is that with an empty list, the behavior for "cannot be determined" and "doesn't need to be determined" is different. In the case of "cannot be determined", the consumer should use the coarse-grained check. But in the case of "doesn't need to be determined", the consumer should just treat the data as fresh.
The scenario is "freshness needs to be determined" and "the coarse-grained check is deemed sufficient" (the list is superfluous in this case).
format/view-spec.md
Outdated
| - The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`. A materialized view is fresh if `refresh-start-timestamp-ms` is within the window `[now - max-staleness-ms, now]`. | ||
| - The consumer may additionally compare the `source-states` list against the states loaded from the catalog. If this evaluation determines the materialized view is fresh, it overrides the coarse-grained evaluation result. | ||
| - The consumer may parse the view definition to implement a more sophisticated policy. | ||
| - When a materialized view is considered stale, the consumer can fail, refresh inline, or treat the materialized view as a logical view. The consumer must not consume from the storage table when the materialized view is stale. |
must not consume from the storage table when the materialized view is stale.
We discussed false negatives a lot (the view is fresh but the consumer cannot reliably detect it).
Maybe something like:
must not consume from the storage table when the materialized view doesn't meet the freshness criteria.
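The two evaluation levels in the draft text can be sketched as follows. This is an illustration, not the spec's algorithm: the function names are hypothetical, source states are simplified to a mapping from object uuid to snapshot id, and `max-staleness-ms` is assumed to be non-null.

```python
from typing import Mapping, Optional


def coarse_fresh(refresh_start_ms: int, max_staleness_ms: int, now_ms: int) -> bool:
    """Coarse-grained check: refresh started within [now - max_staleness_ms, now]."""
    return now_ms - max_staleness_ms <= refresh_start_ms <= now_ms


def meets_freshness_criteria(refresh_start_ms: int,
                             max_staleness_ms: int,
                             now_ms: int,
                             recorded_states: Optional[Mapping[str, int]] = None,
                             catalog_states: Optional[Mapping[str, int]] = None) -> bool:
    """Return True when the consumer may read from the storage table."""
    # Optional fine-grained check: every recorded source still points at the
    # same snapshot in the catalog. A positive result overrides the coarse check.
    if recorded_states and catalog_states is not None:
        unchanged = all(catalog_states.get(uuid) == snapshot_id
                        for uuid, snapshot_id in recorded_states.items())
        if unchanged:
            return True
    # An empty or missing list gives nothing to compare, so fall back to the
    # required coarse-grained evaluation.
    return coarse_fresh(refresh_start_ms, max_staleness_ms, now_ms)
```

A `False` result here only means the consumer could not establish freshness with these checks, which is the false-negative case raised in the comment above.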
| | Requirement | Field name | Description | | ||
| |-------------|----------------|-------------| | ||
| | _required_ | `type` | A string that must be set to `table` | |
As I mentioned above, we may need a path here, or we need to formulate some policy (e.g., the producer must store the oldest snapshot id for duplicate objects).
I think it's a great idea to store the oldest snapshot. This would get us around needing to add machinery to track the query graph path.
While it is simpler, I am wondering if it can be implemented correctly. How can it move forward? Without the lineage path, when an MV refresh detects that a nested MV has been refreshed, how does the producer know whether the older snapshot is from this lineage path or from another one? Without that knowledge, the producer won't know if it can move the state forward or not.
This discussion leads me to think about an orthogonal question. What if the snapshots referenced by the refresh-state have been purged due to retention? Do we need to spell out the consumer behavior in this case? Maybe fall back to the coarse-grained evaluation with refresh-start-timestamp-ms and max-staleness-ms, which is a required
Regarding your first point: the purpose of the refresh-state is to determine the freshness of an MV. If an MV depends on two separate snapshots of the same table through other nested MVs, it will always be the older snapshot that is critical for max-staleness-ms and that renders the MV stale. The producer of a storage table has to parse the SQL anyway and can determine the lineage itself if needed.
Your second point is a major issue. Maybe we should be using the sequence-number of the snapshot instead of the snapshot-id.
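The "store the oldest snapshot" policy for a diamond-shaped dependency graph could look roughly like this on the producer side. It is a sketch under assumptions: the `Observed` record and `oldest_per_table` function are hypothetical, and "oldest" is decided by snapshot timestamp here, although the thread also raises sequence numbers as an alternative.

```python
from typing import Dict, Iterable, NamedTuple


class Observed(NamedTuple):
    table_uuid: str
    snapshot_id: int
    timestamp_ms: int


def oldest_per_table(observations: Iterable[Observed]) -> Dict[str, Observed]:
    """Collapse duplicate observations of a table to the oldest snapshot seen."""
    oldest: Dict[str, Observed] = {}
    for obs in observations:
        kept = oldest.get(obs.table_uuid)
        # Keep the observation with the smallest snapshot timestamp per table.
        if kept is None or obs.timestamp_ms < kept.timestamp_ms:
            oldest[obs.table_uuid] = obs
    return oldest


# Table "t" is reached through two nested MVs refreshed at different times.
seen = [
    Observed("t", snapshot_id=10, timestamp_ms=100),
    Observed("t", snapshot_id=11, timestamp_ms=200),
    Observed("u", snapshot_id=7, timestamp_ms=50),
]
states = oldest_per_table(seen)
```

This yields one entry per table without recording the query path, but it does not by itself answer the question of how the producer advances the state on a later refresh.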
igorbelianski-cyber
left a comment
Minor suggestions (mostly originating from my recollection of the meeting discussion).
This PR implements the Iceberg Materialized View Proposal #10043 by adding a section for Materialized Views to the View spec. It follows the design of the proposal document.
The idea is to resolve any remaining questions before starting the voting process on the dev list.