Spark: Add support for Iceberg views #9332
Conversation
Force-pushed from 0cb895d to d5ae7ca.
```scala
/**
 * Resolve views in CREATE VIEW and ALTER VIEW AS plans and convert them to logical plans.
 */
case class CreateViewAnalysis(spark: SparkSession)
```
the logic is almost identical to the CreateViewAnalysis in apache/spark#44197, with a few minor tweaks around using the correct case class
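For orientation, a minimal skeleton of such an extension-injected analysis rule (the `CreateViewAnalysisSkeleton` name and the elided match arms are illustrative, not the actual extension code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative skeleton only: the real CreateViewAnalysis mirrors the rule in
// apache/spark#44197 and rewrites the parsed CREATE VIEW / ALTER VIEW AS plans
// into plans that target a v2 ViewCatalog instead of the session catalog.
case class CreateViewAnalysisSkeleton(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    // The actual rule matches the view-related plans here and swaps in the
    // extension's case classes; all other plans pass through unchanged.
    case other => other
  }
}
```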
```scala
protected lazy val catalogManager: CatalogManager = spark.sessionState.catalogManager

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case u @ UnresolvedRelation(nameParts, _, _) if catalogManager.v1SessionCatalog.isTempView(nameParts) =>
```
these first 2 match cases are from apache/spark#44197 Analyzer#ViewSubstitution
```scala
  .map(createViewRelation(parts.quoted, _))
  .getOrElse(u)

case ShowCreateTable(ResolvedV2View(_, ident, view), _, _) =>
```
all of the other cases are handled in ResolveCatalogs in apache/spark#44197, but we can't override that behavior, so we handle those cases here
```scala
// look at `AnalysisContext.catalogAndNamespace` when resolving relations with single-part name.
// If `AnalysisContext.catalogAndNamespace` is non-empty, analyzer will expand single-part names
// with it, instead of current catalog and namespace.
def resolveViews(plan: LogicalPlan): LogicalPlan = plan match {
```
in apache/spark#44197, resolving views happens in the Analyzer
```scala
if (catalog.viewExists(ident)) {
  catalog.dropView(ident)
}
// FIXME: replaceView API doesn't exist in Spark 3.5
```
currently the ViewCatalog in Spark doesn't have a replaceView(..) API. That feature was added with apache/spark#43677, but using it requires a new Spark release + version bump
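A sketch of the resulting workaround, assuming only the `viewExists`/`dropView` methods of the v2 `ViewCatalog`; the `create` thunk stands in for the actual `createView(..)` call:

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, ViewCatalog}

// Emulating CREATE OR REPLACE VIEW while ViewCatalog has no replaceView(..):
// drop the existing view, then create the new definition. Note this is not
// atomic: a concurrent reader can observe the view as missing in between.
def createOrReplace(catalog: ViewCatalog, ident: Identifier)(create: => Unit): Unit = {
  if (catalog.viewExists(ident)) {
    catalog.dropView(ident)
  }
  create // the actual catalog.createView(..) call with the view's definition
}
```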
Force-pushed from d5ae7ca to 23cc43f.
Force-pushed from 23cc43f to c53e4e4.
```scala
if (!isResolvingView || isReferredTempViewName(nameParts)) return nameParts

if (nameParts.length == 1) {
  AnalysisContext.get.catalogAndNamespace :+ nameParts.head
```
Is AnalysisContext maintained by this code?
I doubt we want to use AnalysisContext. If I remember correctly from the original Netflix rules, we avoided context by rewriting unresolved references in the table to use fully-qualified identifiers. Can we do that instead?
this code was copied 1:1 from Spark's Analyzer here so that we can plug in the lookup of views.
I should be able to take a look this week.
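A sketch of the fully-qualified-identifier approach suggested above, assuming the view's default catalog and namespace are available from its metadata (`qualifyReferences` and `viewCatalogAndNamespace` are stand-in names):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Instead of threading state through AnalysisContext, expand each single-part
// identifier in the parsed view body to a fully-qualified one using the
// catalog and namespace that were stored with the view when it was created.
def qualifyReferences(
    viewBody: LogicalPlan,
    viewCatalogAndNamespace: Seq[String]): LogicalPlan = {
  viewBody resolveOperators {
    case u @ UnresolvedRelation(nameParts, _, _) if nameParts.length == 1 =>
      u.copy(multipartIdentifier = viewCatalogAndNamespace :+ nameParts.head)
  }
}
```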
```scala
case _ =>
  false

private def isReferredTempViewName(nameParts: Seq[String]): Boolean = {
```
I don't think this is needed for Iceberg views because Iceberg views cannot reference anything other than tables or other Iceberg views. Here's the Javadoc for referredTempViewNames:
All the temp view names referred by the current view we are resolving. It's used to make sure the relation resolution is consistent between view creation and view resolution. For example, if `t` was a permanent table when the current view was created, it should still be a permanent table when resolving the current view, even if a temp view `t` has been created.
Because we know that all references must be tables, the set of names that can be resolved as temporary views should always be empty.
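In code, the simplification that reasoning allows would be essentially this (a sketch, not the merged implementation):

```scala
// Iceberg views can only reference tables or other Iceberg views, so while
// resolving a view body no name can legitimately refer to a temp view and
// the referred-temp-view check can collapse to a constant.
def isReferredTempViewName(nameParts: Seq[String]): Boolean = false
```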
```scala
private def lookupTempView(identifier: Seq[String]): Option[TemporaryViewRelation] = {
  // We are resolving a view and this name is not a temp view when that view was created. We
  // return None earlier here.
  if (isResolvingView && !isReferredTempViewName(identifier)) return None
```
Because an Iceberg view must not reference a temporary view, this check isn't needed and I don't think we will need isResolvingView either since that references the AnalysisContext.
I think that the only time this should resolve temporary views is when the original query (as parsed) references a temporary view. Any time there is a single-part identifier in a view, it should be resolved using the view's default catalog and namespace.
In addition, we need to ensure that there is no conflicting temporary view definition when a view is created. We'll need to check that any single-part identifier in the view SQL does not represent a temporary view at creation time.
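A sketch of that creation-time check, assuming an `isTempView` lookup that delegates to the session catalog (the function name and error wording are illustrative):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// At CREATE VIEW time, reject the statement if any single-part identifier in
// the parsed view body currently resolves to a temporary view; otherwise the
// stored view could silently capture or shadow the temp view's definition.
def verifyNoTempViewReferences(
    parsedViewBody: LogicalPlan,
    isTempView: Seq[String] => Boolean): Unit = {
  parsedViewBody foreach {
    case UnresolvedRelation(nameParts, _, _) if isTempView(nameParts) =>
      throw new AnalysisException(
        s"Cannot create view referencing temporary view: ${nameParts.mkString(".")}")
    case _ => // other plan nodes are fine
  }
}
```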
```scala
    v1Table.v1Table.tableType == CatalogTableType.VIEW =>
  val v1Ident = v1Table.catalogTable.identifier
  val v2Ident = Identifier.of(v1Ident.database.toArray, v1Ident.identifier)
  analysis.ResolvedPersistentView(catalog, v2Ident, v1Table.catalogTable.schema)
```
I don't think that we need this. The rules here only need to detect and resolve Iceberg views. We don't want to resolve v1 views because that logic may change and we don't want to alter the behavior of existing views.
```scala
    identifier: Seq[String],
    viewOnly: Boolean = false): Option[LogicalPlan] = {
  lookupTempView(identifier).map { tempView =>
    ResolvedTempView(identifier.asIdentifier, tempView.tableMeta.schema)
```
We don't want to resolve a temp view. We only need to skip it and let the resolution logic in Spark handle resolution.
```scala
private def lookupView(catalog: CatalogPlugin, ident: Identifier): Option[LogicalPlan] =
  loadView(catalog, ident).map {
    case view if CatalogV2Util.isSessionCatalog(catalog) =>
```
Why is there a different path for the session catalog? I think it should only matter that the view was a v2 view.
```scala
import org.apache.spark.sql.connector.catalog.View
import org.apache.spark.sql.connector.catalog.ViewCatalog

case class ResolvedV2View(
```
Is this resolved? The view hasn't been parsed yet.
```scala
throw new AnalysisException(
  s"Cannot move view between catalogs: from=$oldCatalog and to=$newCatalog")

def lookupView(catalog: CatalogPlugin, ident: Identifier): Option[LogicalPlan] =
```
Is this a copy of lookupView from the extensions?
```scala
// look at `AnalysisContext.catalogAndNamespace` when resolving relations with single-part name.
// If `AnalysisContext.catalogAndNamespace` is non-empty, analyzer will expand single-part names
// with it, instead of current catalog and namespace.
def resolveViews(plan: LogicalPlan): LogicalPlan = plan match {
```
Is there anywhere we ensure the view isn't stale, and fail if it is? For example, if the output schema of the view does not match the computed schema of the query. This can happen if the table's schema evolves but the view is not updated to reflect that.
I'll make sure to handle this in #9340 and also add a test where the schema of the table has evolved.
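For reference, a sketch of such a staleness check, assuming the view's schema was captured at creation time (`checkNotStale` and the error message are illustrative):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.StructType

// After re-analyzing the view's stored SQL, compare the computed schema of the
// query against the schema recorded when the view was created; fail if they
// diverged, e.g. because an underlying table's schema evolved.
def checkNotStale(viewName: String, storedSchema: StructType, analyzedQuery: LogicalPlan): Unit = {
  val computed = analyzedQuery.schema
  if (computed != storedSchema) {
    throw new AnalysisException(
      s"View $viewName is stale: stored schema $storedSchema does not match " +
        s"computed query schema $computed")
  }
}
```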
We discussed internally splitting this PR into smaller pieces. I've created #9340 to split out creation + reading of views. Additional PRs will follow.
Closing this as all of the functionality has been merged individually |
This is currently a WIP and still requires tests and a few improvements.
Most code is from apache/spark#44197, but a few workarounds were required to make this work without actually requiring changes in the Spark repo.
This PR can be considered an umbrella PR. I've created separate smaller PRs for easier reviewing: