
Conversation

@wmoustafa (Contributor) commented Feb 29, 2024

Spec

This patch adds support for materialized views in Iceberg and integrates the implementation with Spark SQL. It reuses the current spec of Iceberg views and tables by leveraging table properties to capture materialized view metadata. Those properties can be added to the Iceberg spec to formalize materialized view support.

Below is a summary of all metadata properties introduced or utilized by this patch, classified based on whether they are associated with a table or a view, along with their purposes:

Properties on a View:

  1. iceberg.materialized.view:

    • Type: View property
    • Purpose: This property is used to mark whether a view is a materialized view. If set to true, the view is treated as a materialized view. This helps in differentiating between virtual and materialized views within the catalog and dictates specific handling and validation logic for materialized views.
  2. iceberg.materialized.view.storage.table:

    • Type: View property
    • Purpose: Specifies the identifier of the storage table associated with the materialized view. This property is used for linking a materialized view with its corresponding storage table, enabling data management and query execution based on the stored data freshness.

Properties on a Table:

  1. iceberg.base.snapshot.[UUID]:

    • Type: Table property
    • Purpose: These properties store the snapshot IDs of the base tables at the time the materialized view's data was last updated. Each key is the prefix iceberg.base.snapshot. followed by the UUID of the base table. They are used to determine whether the materialized view's data is up to date by comparing the stored snapshot IDs with the current snapshot IDs of the base tables: if every base table's current snapshot ID matches the one stored in these properties, the materialized view's data is considered fresh.
  2. iceberg.materialized.view.version:

    • Type: Table property
    • Purpose: This property tracks the parent view version ID when the storage table is created (or refreshed). The table is usable only when the view version ID property matches the current parent view version ID.
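
For illustration, here is a rough sketch of how these properties might appear on a view and its storage table (all identifier, UUID, and snapshot values below are invented; only the keys come from this patch):

import java.util.Map;
import com.google.common.collect.ImmutableMap;

// Sketch only: keys are the properties introduced by this patch; values are invented.
Map<String, String> viewProperties =
    ImmutableMap.of(
        "iceberg.materialized.view", "true",
        "iceberg.materialized.view.storage.table", "db.daily_totals_storage");

Map<String, String> storageTableProperties =
    ImmutableMap.of(
        "iceberg.base.snapshot.9c36e9c1-7a1f-4bce-bd59-3a0a8f2b8c11", "6817994072542951101",
        "iceberg.materialized.view.version", "1");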

Spark SQL

This patch introduces support for materialized views in the Spark module by adding the Spark SQL CREATE MATERIALIZED VIEW command and materialized view handling for the DROP VIEW DDL command. When a CREATE MATERIALIZED VIEW command is executed, the patch interprets it to create a new materialized view: it registers the view's metadata (including marking it as a materialized view with the appropriate properties), sets up a corresponding storage table to hold the materialized data, and records the base tables' current snapshot IDs at creation time. The storage table identifier is passed via a new STORED AS '...' clause; if no STORED AS clause is specified, a default storage table identifier is assigned. When a DROP VIEW command is issued for a materialized view, the patch ensures that both the materialized view's metadata and its associated storage table are removed from the catalog. Support for REFRESH MATERIALIZED VIEW is left as a future enhancement.
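
To make the flow concrete, a minimal sketch of the DDL round trip described above (the catalog/namespace/table names and the query are invented for illustration):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();

// Registers the view metadata, creates the storage table named in STORED AS, and
// records the base tables' current snapshot IDs on the storage table.
spark.sql(
    "CREATE MATERIALIZED VIEW db.daily_totals "
        + "STORED AS 'db.daily_totals_storage' "
        + "AS SELECT dt, SUM(amount) AS total FROM db.orders GROUP BY dt");

// Removes both the materialized view metadata and its storage table from the catalog.
spark.sql("DROP VIEW db.daily_totals");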

Spark Catalog

This patch enhances the SparkCatalog to intelligently decide whether to return the view text metadata for a materialized view or the data from its associated storage table based on the freshness of the materialized view. Within the loadTable method, the patch first checks if the requested table corresponds to a materialized view by loading the view from the Iceberg catalog. If the identified view is marked as a materialized view (using the iceberg.materialized.view property), the patch then assesses its freshness. If it is fresh, the loadTable method proceeds to load and return the storage table associated with the materialized view, allowing users to query the pre-computed data directly. However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.
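
A minimal sketch of that loadTable decision, assuming hypothetical helpers (tryLoadIcebergView, isFresh, loadStorageTable, loadIcebergTable) rather than the patch's actual method names:

import org.apache.iceberg.view.View;
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;

@Override
public Table loadTable(Identifier ident) throws NoSuchTableException {
  View view = tryLoadIcebergView(ident); // hypothetical: null if not an Iceberg view
  if (view != null
      && "true".equalsIgnoreCase(view.properties().get("iceberg.materialized.view"))) {
    if (isFresh(view)) {
      // Fresh: serve the pre-computed data directly from the storage table.
      return loadStorageTable(view.properties().get("iceberg.materialized.view.storage.table"));
    }
    // Stale: fail the table lookup so Spark falls back to loadView(), which returns
    // the view text and recomputes from the current state of the base tables.
    throw new NoSuchTableException(ident);
  }
  return loadIcebergTable(ident); // hypothetical: the normal table-loading path
}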

Notes

  • This patch intentionally avoids introducing new Iceberg or engine object APIs. The intention is to start a discussion on whether such APIs are required and which objects would best model them; there are a number of trade-offs with each choice.
  • The InMemoryCatalog has been extended to use a test LocalFileIO due to an existing gap in the pure InMemoryCatalog (with InMemoryFileIO) when working with data files, which the storage table requires. Extending the InMemoryCatalog to use LocalFileIO ended up promoting a couple of methods to public, but the intention, again, is to start a discussion about the best way to address the current gap.

ViewMetadata internalApply() {
  // Replacing a materialized view is not supported because the old storage location
  // will wrongly transfer to the new version if not handled properly.
Contributor:

I think we will want to allow this by adding the view version ID to the metadata in the table. If you load the view, then load the table, and the versions don't match, then the table cannot be used.

Contributor Author:

Agreed; just keeping it out of the scope of this PR. Let me know if we should keep it in this PR's scope. (Also agree that we can change the comment in case we leave it out of scope.)

Contributor Author:

Since this relates to the spec, I have decided to add this support. There is now a new view version property, materialized.view.version.id, that is tracked at the table level and factored into the freshness evaluation.
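
A rough sketch of what that combined freshness evaluation could look like; the property keys follow this patch, while isFresh and extractBaseTables are illustrative stand-ins, not the patch's exact code:

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.view.View;

// Sketch only: the storage table is fresh when it was built from the current view
// version and every base table still sits on its recorded snapshot.
boolean isFresh(View view, Table storageTable) {
  Map<String, String> props = storageTable.properties();
  if (!String.valueOf(view.currentVersion().versionId())
      .equals(props.get("iceberg.materialized.view.version"))) {
    return false; // the view definition changed since the last refresh
  }
  for (Table base : extractBaseTables(view)) { // hypothetical helper
    String recorded = props.get("iceberg.base.snapshot." + base.uuid());
    long current =
        base.currentSnapshot() == null ? 0 : base.currentSnapshot().snapshotId();
    if (!String.valueOf(current).equals(recorded)) {
      return false; // a base table moved to a new snapshot
    }
  }
  return true;
}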

        MaterializedViewUtil.MATERIALIZED_VIEW_BASE_SNAPSHOT_PROPERTY_KEY_PREFIX))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
List<Table> baseTables = MaterializedViewUtil.extractBaseTables(view.sqlFor("spark").sql());


Suppose view A references view B and you materialize view A. Do the base tables include view B? A change in the SQL definition of view B would also invalidate the MV. I'm not sure whether baseTables also includes views in the plan.

Contributor Author:

Base tables do not include view B, but rather its (leaf) base tables. That is a good point; we should track view B's version as well. The current patch tracks the parent view version ID, but we should do the same for the child views (if any).

"iceberg.base.snapshot.";
public static final String MATERIALIZED_VIEW_VERSION_PROPERTY_KEY =
"iceberg.materialized.view.version";
private static final String MATERIALIZED_VIEW_STORAGE_TABLE_IDENTIFIER_SUFFIX = ".storage.table";
Member:

Does this work with SparkSessionCatalog, which requires a single-part namespace?


override protected def run(): Seq[InternalRow] = {
  catalog.loadTable(ident) match {
    catalog
Member:

Redundant change

        icebergBaseTable.currentSnapshot() == null
            ? 0
            : icebergBaseTable.currentSnapshot().snapshotId());
    if (!baseTableSnapshotsProperties


This could also be optimized a bit to not count base table snapshot changes where the DataOperations is REPLACE.
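
A hedged sketch of that optimization: walk the base table's snapshot ancestry from the current snapshot back to the recorded one, ignoring snapshots whose operation is REPLACE (e.g., compaction). Method and variable names are illustrative:

import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.util.SnapshotUtil;

// Sketch only: returns true if the base table has seen no data-changing commits
// since the snapshot recorded on the storage table.
boolean onlyReplacesSince(Table baseTable, long recordedSnapshotId) {
  if (baseTable.currentSnapshot() == null) {
    return recordedSnapshotId == 0; // matches the 0 sentinel used at creation time
  }
  for (Snapshot s :
      SnapshotUtil.ancestorsOf(baseTable.currentSnapshot().snapshotId(), baseTable::snapshot)) {
    if (s.snapshotId() == recordedSnapshotId) {
      return true; // reached the recorded snapshot without seeing a data change
    }
    if (!DataOperations.REPLACE.equals(s.operation())) {
      return false; // a non-REPLACE commit means the view data may be stale
    }
  }
  return false; // recorded snapshot is not an ancestor (e.g., rollback): treat as stale
}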

Contributor:

Was wondering, if it's not too much, does it make sense to make the freshness check configurable? This would give the view creator more control over the freshness check and avoid unnecessary re-loading just because the current snapshot ID changed.
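
For illustration, one opt-in shape this could take; the iceberg.materialized.view.max.staleness.ms property and the baseSnapshotsMatch helper below are invented purely to sketch the idea, they are not part of this patch:

import org.apache.iceberg.Table;
import org.apache.iceberg.view.View;

// Sketch only: with the default of 0, only an exact snapshot match counts as fresh,
// as in the current patch; a positive value lets the view creator tolerate staleness
// up to that age and skip re-loading.
boolean isFreshEnough(View view, Table storageTable) {
  long maxStalenessMs =
      Long.parseLong(
          view.properties().getOrDefault("iceberg.materialized.view.max.staleness.ms", "0"));
  boolean withinTolerance =
      maxStalenessMs > 0
          && storageTable.currentSnapshot() != null
          && System.currentTimeMillis() - storageTable.currentSnapshot().timestampMillis()
              <= maxStalenessMs;
  return baseSnapshotsMatch(view, storageTable) || withinTolerance; // hypothetical helper
}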

@wmoustafa mentioned this pull request Mar 28, 2024
Comment on lines +643 to +645
    .get(
        MaterializedViewUtil.MATERIALIZED_VIEW_BASE_SNAPSHOT_PROPERTY_KEY_PREFIX
            + icebergBaseTable.uuid())
Contributor:

Per the spec, it looks like the table UUID was optional for v1; how do we handle that here?

Looks like it was added via https://github.com/apache/iceberg/pull/264/files

@singhpk234 (Contributor) left a comment:

> However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.

1/ Was wondering if auto-refresh of the MV on staleness detection should be an opt-in feature?
2/ Any ideas/plans for incremental refresh?

@wmoustafa (Contributor Author):

> However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.
>
> 1/ Was wondering if auto-refresh of the MV on staleness detection should be an opt-in feature? 2/ Any ideas/plans for incremental refresh?

These are very good questions. To me it looks like if there is an external process that guarantees freshness, then the current implementation still holds: a manual REFRESH will boil down to a no-op, and isFresh will always return true.

For (2): We have not discussed incremental refresh plans in the Iceberg community, but there is some relevant work here. You can review some of the test cases here.

@singhpk234 (Contributor):

> For (2): We have not discussed incremental refresh plans in the Iceberg community, but there is some relevant work here. You can review some of the test cases here.

@wmoustafa, read this today; was wondering if there is something we can utilize from a CDC perspective (considering Iceberg has support for that)? How expensive are refreshes of PB-sized tables, and what is the ideal frequency of updates in this model, if you can share some data points? The rewrite to get incremental refresh, by computing deltas between the snapshots, joining them with the other deltas, and taking the union of those, does seem user-friendly though.

@wmoustafa (Contributor Author):

> @wmoustafa, read this today; was wondering if there is something we can utilize from a CDC perspective (considering Iceberg has support for that)? How expensive are refreshes of PB-sized tables, and what is the ideal frequency of updates in this model, if you can share some data points? The rewrite to get incremental refresh, by computing deltas between the snapshots, joining them with the other deltas, and taking the union of those, does seem user-friendly though.

It really depends on the query, the size of the delta relative to the whole table, etc. There is an extension of that work currently taking place to get an idea of the cost of some basic queries (e.g., a few joins/aggregations plus filters and projections) and to come up with a reasonable cost model (including choosing not to perform an incremental refresh at all if it is deemed more expensive).

@github-actions:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 21, 2024
@github-actions:

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 28, 2024