
Conversation

@jzhuge (Member) commented Feb 23, 2022

What changes were proposed in this pull request?

This PR adds support to load, create, alter, and drop views in DataSource V2 catalogs.

This is an umbrella PR combining the following commits, which should probably be merged one by one:

  • View catalog interface
  • View substitution rule
  • Create view DDL
  • View SQL DDLs
  • Caching for ViewCatalog

Why are the changes needed?

Support views stored in DataSourceV2 catalogs. Details in SPIP.

Does this PR introduce any user-facing change?

  • Support views from DataSource V2 catalogs in SQL
  • Support views from DataSource V2 catalogs in DataFrame API

How was this patch tested?

New unit tests

Regression

  • DDLParserSuite
  • PlanResolutionSuite
  • DataSourceV2SQLSuite

@github-actions github-actions bot added the SQL label Feb 23, 2022
@xkrogen (Contributor) commented Mar 1, 2022

@jzhuge The PR as-is seems to contain only the view catalog interface, not the other facets you mentioned. Is this intentional?

@jzhuge (Member, Author) commented Mar 1, 2022 via email

@melin commented Mar 11, 2022

Is PR #28147 still available? Why submit a new PR?

@jzhuge jzhuge force-pushed the SPARK-31357-2 branch 2 times, most recently from 2ae6f03 to b468d01, on March 17, 2022
@jzhuge (Member, Author) commented Mar 24, 2022

@melin #28147 was marked stale and closed.

@jzhuge (Member, Author) commented Mar 31, 2022

Retest please

Contributor:

Is this change needed? It looks suspicious.

Member Author:

Compile error without the change:

[ERROR] /Users/jzhuge/Repos/upstream-spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1002: reference to View is ambiguous;
it is imported twice in the same scope by
import org.apache.spark.sql.connector.catalog._
and import org.apache.spark.sql.catalyst.plans.logical._

Contributor:

Got it. This avoids a conflicting import. Thanks!

Contributor:

Why does this change NoSuchNamespaceException? Can't the caller pass cause = None?

Member Author:

+1, we should be able to do without it.

Member Author:

Actually, this is necessary for Java code to call new NoSuchNamespaceException(namespace) without a cause.

The following constructor (with a default value of None for cause) will not work from Java:

  def this(namespace: Array[String], cause: Option[Throwable] = None)

How about having these two constructors?

  def this(namespace: Array[String], cause: Throwable)
  def this(namespace: Array[String])
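The interop issue above can be reproduced outside Spark: a Scala default argument does not produce a separate no-arg overload that Java call sites can use, so an explicit second constructor is needed. A minimal Java-side sketch of the two-constructor pattern (the class name is a hypothetical stand-in, not Spark's actual exception class):

```java
// Hypothetical stand-in for NoSuchNamespaceException, showing the
// two-overload pattern discussed above: Java callers cannot rely on
// Scala default arguments, so a no-cause constructor is spelled out.
public class NoSuchNamespaceDemo extends Exception {
    public NoSuchNamespaceDemo(String[] namespace, Throwable cause) {
        super("Namespace " + String.join(".", namespace) + " not found", cause);
    }

    // Overload that Java code can call without supplying a cause.
    public NoSuchNamespaceDemo(String[] namespace) {
        this(namespace, null);
    }

    public static void main(String[] args) {
        NoSuchNamespaceDemo e = new NoSuchNamespaceDemo(new String[]{"db", "ns"});
        System.out.println(e.getMessage());   // prints "Namespace db.ns not found"
        System.out.println(e.getCause() == null);  // prints "true"
    }
}
```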

Contributor:

I think this API should behave like the TableCatalog API. That is, the create/replace/createOrReplace methods should return a View object.

Contributor:

Could this be a mix-in for atomic replace or atomic createOrReplace? The logic here is not atomic, so I don't see much value in having this method. I would rather the caller drop and create the view instead.

Member Author:

Removed them.

Contributor:

I think this should return a View.
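The suggestion that creation methods return the created View, mirroring TableCatalog.createTable returning a Table, can be sketched with a trimmed-down in-memory catalog. All names and signatures here are hypothetical simplifications, not the actual proposed interface:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, trimmed-down view catalog: createView returns the created
// View, mirroring how TableCatalog.createTable returns a Table.
interface View {
    String name();
    String sql();   // the view's query text
}

interface ViewCatalog {
    View createView(String ident, String sql);
    View loadView(String ident);
}

class InMemoryViewCatalog implements ViewCatalog {
    private final Map<String, View> views = new HashMap<>();

    @Override
    public View createView(String ident, String sql) {
        if (views.containsKey(ident)) {
            throw new IllegalStateException("View already exists: " + ident);
        }
        View v = new View() {
            public String name() { return ident; }
            public String sql() { return sql; }
        };
        views.put(ident, v);
        return v;  // return the created view, as suggested in the review
    }

    @Override
    public View loadView(String ident) {
        return views.get(ident);
    }
}

public class ViewCatalogDemo {
    public static void main(String[] args) {
        ViewCatalog catalog = new InMemoryViewCatalog();
        View v = catalog.createView("db.v1", "SELECT * FROM t");
        System.out.println(v.name());                      // prints "db.v1"
        System.out.println(catalog.loadView("db.v1").sql());  // prints "SELECT * FROM t"
    }
}
```

Returning the created object lets callers chain metadata lookups without a second round trip to the catalog.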

@frankliee (Contributor) commented:

Does this PR need to update ResolveSessionCatalog? Currently, CreateView only supports the V1 catalog.

@jzhuge (Member, Author) commented May 10, 2022

Likely. This PR will support creating v2 views.

@frankliee (Contributor) commented:

> Likely. This PR will support creating v2 views.

Thanks. What about other commands, such as DROP VIEW and ALTER VIEW? They also do not support v2 views currently.

@jzhuge (Member, Author) commented May 11, 2022

Yes, many view DDL commands will be implemented; you can find the complete list in the SPIP.

// The relation is a view, so we wrap the relation by:
// 1. Add a [[View]] operator over the relation to keep track of the view desc;
// 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view.
SubqueryAlias(name, View(desc, false, qualifiedChild))
Reviewer:
Hi @jzhuge, I found that with the original Hive session catalog, the resolved plan of the view select * from default.t is like:

'SubqueryAlias spark_catalog.default.test_view
+- View (`default`.`test_view`, ['intCol,'structCol,'boolCol])
   +- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, intCol, 0, 1), IntegerType) AS intCol#6, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, boolCol, 0, 1), BooleanType) AS boolCol#8]
      +- 'Project [*]
         +- 'UnresolvedRelation [default, t], [], false

Looks like the project node:

   +- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, intCol, 0, 1), IntegerType) AS intCol#6, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, boolCol, 0, 1), BooleanType) AS boolCol#8]

is added based on the schema stored in the view's desc (CatalogTable). So that the schema info (casing and nullability) is preserved.

But in the current implementation, there is no such project node based on the view's schema. Do you think we need to add a similar project node when the view's schema is provided?

Reviewer:

I think we should add this node for two reasons:
(1) Feature parity with the built-in Hive external catalog implementation.
(2) We have seen cases where resolving a view based only on its text loses casing and nullability information from the underlying tables.
This project node should leverage the schema() method of the View API.

Member Author:

Sounds good.

Could you point me to the code for the "built-in Hive external catalog implementation"?

@ljfgem commented Sep 29, 2022:

Hi @jzhuge, the built-in Hive external catalog adds the project node according to the schema in this method. I would suggest adding a similar project node here when the provided schema of the view is not null, like:

      if (desc.schema != null) {
        val projectList = {
          val viewColumnNames = desc.viewQueryColumnNames
          val nameToCounts = viewColumnNames.groupBy(identity).mapValues(_.length)
          val nameToCurrentOrdinal = scala.collection.mutable.HashMap.empty[String, Int]
          val viewDDL = {
            val viewName = desc.identifier
            val viewText = view.sql
            val userSpecifiedColumns =
              if (desc.schema.fieldNames.toSeq == desc.viewQueryColumnNames) {
                ""
              } else {
                s"(${desc.schema.fieldNames.mkString(", ")})"
              }
            Some(s"CREATE OR REPLACE VIEW $viewName $userSpecifiedColumns AS $viewText")
          }

          viewColumnNames.zip(desc.schema).map { case (name, field) =>
            val count = nameToCounts(name)
            val ordinal = nameToCurrentOrdinal.getOrElse(name, 0)
            nameToCurrentOrdinal(name) = ordinal + 1
            val col = GetViewColumnByNameAndOrdinal(desc.identifier, name, ordinal, count, viewDDL)
            Alias(UpCast(col, field.dataType), field.name)(explicitMetadata = Some(field.metadata))
          }
        }
        SubqueryAlias(name, View(desc, isTempView = false, Project(projectList, qualifiedChild)))
      } else {
        SubqueryAlias(name, View(desc, isTempView = false, qualifiedChild))
      }

case u @ UnresolvedRelation(nameParts, _, _) if v1SessionCatalog.isTempView(nameParts) =>
  u
case u @ UnresolvedRelation(
    parts @ NonSessionCatalogAndIdentifier(catalog, ident), _, _) if !isSQLOnFile(parts) =>
@ljfgem commented Aug 24, 2022:
Could we use CatalogAndIdentifier here instead of limiting it to non-session catalogs? There might be a use case where spark_catalog is set to a custom table & view catalog, like CoralSparkViewCatalog.


* @return the view description
* @throws NoSuchViewException If the view doesn't exist or is a table
*/
View loadView(Identifier ident) throws NoSuchViewException;
@wmoustafa commented Aug 29, 2022:
Should this API take a list of options (similar to how tables can be queried with a list of options)? For example, I can imagine cases where time travel can be applied to views (especially when integrating views with other data sources like Iceberg).

Member Author:

Similar to these loadTable variants? Sure:

  default Table loadTable(Identifier ident, String version)
  default Table loadTable(Identifier ident, long timestamp)
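The same default-method pattern could carry over to view time travel: overloads default to "unsupported" so existing implementations keep compiling. A hedged sketch with hypothetical names and a String stand-in for the real View return type:

```java
// Hypothetical time-travel overloads for a view catalog, modeled on the
// loadTable variants above. Defaults throw so that implementations that
// do not support time travel need not change.
interface TimeTravelViewCatalog {
    String loadView(String ident);  // stand-in for the real View return type

    default String loadView(String ident, String version) {
        throw new UnsupportedOperationException("Version lookup not supported");
    }

    default String loadView(String ident, long timestamp) {
        throw new UnsupportedOperationException("Timestamp lookup not supported");
    }
}

public class TimeTravelDemo {
    public static void main(String[] args) {
        // Lambda implements the single abstract method; defaults are inherited.
        TimeTravelViewCatalog c = ident -> "view:" + ident;
        System.out.println(c.loadView("db.v1"));  // prints "view:db.v1"
        try {
            c.loadView("db.v1", 12345L);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());   // prints "Timestamp lookup not supported"
        }
    }
}
```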


@jzhuge (Member, Author) commented Aug 29, 2022

Thanks @ljfgem @wmoustafa for the comments. Looking ...

I may also add a few unit tests to illustrate some scenarios.

@jzhuge (Member, Author) commented Oct 11, 2022 via email


override val schema: StructType = view.schema

override val viewText: Option[String] = None
Reviewer:

Isn't it supposed to be Option(view.sql)? Or how does it differ from the val sql below? I think we only need to keep one for callers.

Member Author:

You are right! And V2ViewDescription.sql can be removed.

Member Author:

How about changing ShowCreateV2View to take (Identifier, View) instead of V2ViewDescription?

Reviewer:

What is the benefit of such a modification? By the way, I found that the output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used. Maybe we can remove this parameter.

Member Author:

ShowCreateV2View is for v2 views, while V2ViewDescription is an adaptor between v1 and v2, so it makes more sense for ShowCreateV2View not to use V2ViewDescription and instead rely solely on v2 interfaces.

Member Author:

> I found that the output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used.

It is needed by the abstract class QueryPlan.

@ljfgem commented Nov 6, 2022:

Thanks for explaining. Do you think we also need to modify ShowViewPropertiesExec and DescribeViewExec, which use V2ViewDescription? It looks like using V2ViewDescription would bring some convenience, especially for DescribeViewExec. I am good with both approaches.

Member Author:

Indeed, V2ViewDescription is an API adaptor for easy cloning of v1 SQL commands and for adding missing APIs like comment or owner (stored in properties on disk), etc.

So keep using V2ViewDescription?

Reviewer:

Yeah, I am fine with keeping V2ViewDescription; I don't think the change is necessary.

@jzhuge jzhuge changed the title [SPARK-31357][SQL][WIP] Catalog API for view metadata [SPARK-31357][SQL][WIP] Support DataSource V2 View Catalog Jan 25, 2023
@github-actions bot commented May 6, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 6, 2023
@github-actions github-actions bot closed this May 7, 2023
@melin commented Oct 13, 2023

@jzhuge Do you have a plan to complete this PR?

@jzhuge (Member, Author) commented Oct 13, 2023 via email

7 participants