
Conversation

@jzhuge (Member) commented Feb 23, 2022

What changes were proposed in this pull request?

This PR adds support to load, create, alter, and drop views in DataSource V2 catalogs.

This is an umbrella PR combining the following commits, which should probably be merged one by one:

  • View catalog interface
  • View substitution rule
  • Create view DDL
  • View SQL DDLs
  • Caching for ViewCatalog

Why are the changes needed?

Support views stored in DataSourceV2 catalogs. Details in SPIP.

Does this PR introduce any user-facing change?

  • Support views from DataSource V2 catalogs in SQL
  • Support views from DataSource V2 catalogs in DataFrame API

How was this patch tested?

New unit tests

Regression

  • DDLParserSuite
  • PlanResolutionSuite
  • DataSourceV2SQLSuite

@github-actions github-actions bot added the SQL label Feb 23, 2022
@xkrogen (Contributor) commented Mar 1, 2022

@jzhuge The PR as-is seems to contain only the view catalog interface, not the other facets you mentioned. Is this intentional?

@jzhuge (Member, Author) commented Mar 1, 2022 via email

@melin commented Mar 11, 2022

Is PR #28147 still available? Why submit a new PR?

@jzhuge jzhuge force-pushed the SPARK-31357-2 branch 2 times, most recently from 2ae6f03 to b468d01, on March 17, 2022
@jzhuge (Member, Author) commented Mar 24, 2022

@melin #28147 was marked stale and closed.

@jzhuge (Member, Author) commented Mar 31, 2022

Retest please

Contributor:

Is this change needed? It looks suspicious.

Member Author:

Compile error without the change:

[ERROR] /Users/jzhuge/Repos/upstream-spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1002: reference to View is ambiguous;
it is imported twice in the same scope by
import org.apache.spark.sql.connector.catalog._
and import org.apache.spark.sql.catalyst.plans.logical._

Contributor:

Got it. This avoids a conflicting import. Thanks!

Contributor:

Why does this change NoSuchNamespaceException? Can't the caller pass cause = None?

Member Author:

+1, we should be able to do without it.

Member Author:

Actually, this is necessary for Java code to call new NoSuchNamespaceException(namespace) without a cause.

The following constructor (with a default value of None for cause) will not work from Java:

  def this(namespace: Array[String], cause: Option[Throwable] = None)

How about having these two constructors?

  def this(namespace: Array[String], cause: Throwable)
  def this(namespace: Array[String])
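The interop issue above can be reproduced outside Spark: a Scala default argument does not produce a separate no-arg overload that Java call sites can use, so an explicit second constructor is needed. A minimal Java-side sketch of the two-constructor pattern (the class name is a hypothetical stand-in, not Spark's actual exception class):

```java
// Hypothetical stand-in for NoSuchNamespaceException, showing the
// two-overload pattern discussed above: Java callers cannot rely on
// Scala default arguments, so a no-cause constructor is spelled out.
public class NoSuchNamespaceDemo extends Exception {
    public NoSuchNamespaceDemo(String[] namespace, Throwable cause) {
        super("Namespace " + String.join(".", namespace) + " not found", cause);
    }

    // Overload that Java code can call without supplying a cause.
    public NoSuchNamespaceDemo(String[] namespace) {
        this(namespace, null);
    }

    public static void main(String[] args) {
        NoSuchNamespaceDemo e = new NoSuchNamespaceDemo(new String[]{"db", "ns"});
        System.out.println(e.getMessage());   // prints "Namespace db.ns not found"
        System.out.println(e.getCause() == null);  // prints "true"
    }
}
```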

Contributor:

I think this API should behave like the TableCatalog API. That is, the create/replace/createOrReplace methods should return a View object.

Contributor:

Could this be a mix-in for atomic replace or atomic createOrReplace? The logic here is not atomic, so I don't see much value in having this method. I would rather the caller drop and create the view instead.

Member Author:

Removed them.

Contributor:

I think this should return a View.
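The suggestion that creation methods return the created View, mirroring TableCatalog.createTable returning a Table, can be sketched with a trimmed-down in-memory catalog. All names and signatures here are hypothetical simplifications, not the actual proposed interface:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, trimmed-down view catalog: createView returns the created
// View, mirroring how TableCatalog.createTable returns a Table.
interface View {
    String name();
    String sql();   // the view's query text
}

interface ViewCatalog {
    View createView(String ident, String sql);
    View loadView(String ident);
}

class InMemoryViewCatalog implements ViewCatalog {
    private final Map<String, View> views = new HashMap<>();

    @Override
    public View createView(String ident, String sql) {
        if (views.containsKey(ident)) {
            throw new IllegalStateException("View already exists: " + ident);
        }
        View v = new View() {
            public String name() { return ident; }
            public String sql() { return sql; }
        };
        views.put(ident, v);
        return v;  // return the created view, as suggested in the review
    }

    @Override
    public View loadView(String ident) {
        return views.get(ident);
    }
}

public class ViewCatalogDemo {
    public static void main(String[] args) {
        ViewCatalog catalog = new InMemoryViewCatalog();
        View v = catalog.createView("db.v1", "SELECT * FROM t");
        System.out.println(v.name());                      // prints "db.v1"
        System.out.println(catalog.loadView("db.v1").sql());  // prints "SELECT * FROM t"
    }
}
```

Returning the created object lets callers chain metadata lookups without a second round trip to the catalog.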

@frankliee (Contributor) commented:

Does this PR need to update ResolveSessionCatalog? Currently, CreateView only supports the V1 catalog.

@jzhuge (Member, Author) commented May 10, 2022

Likely. This PR will support creating v2 views.

@frankliee (Contributor) commented:

> Likely. This PR will support creating v2 views.

Thanks. What about other commands, such as DROP VIEW and ALTER VIEW? They also do not support v2 views currently.

@jzhuge (Member, Author) commented May 11, 2022

Yes, many view DDL commands will be implemented; you can find the complete list in the SPIP.

// The relation is a view, so we wrap the relation by:
// 1. Add a [[View]] operator over the relation to keep track of the view desc;
// 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view.
SubqueryAlias(name, View(desc, false, qualifiedChild))
Reviewer:
Hi @jzhuge, I found that with the original Hive session catalog, the resolved plan of the view select * from default.t is like:

'SubqueryAlias spark_catalog.default.test_view
+- View (`default`.`test_view`, ['intCol,'structCol,'boolCol])
   +- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, intCol, 0, 1), IntegerType) AS intCol#6, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, boolCol, 0, 1), BooleanType) AS boolCol#8]
      +- 'Project [*]
         +- 'UnresolvedRelation [default, t], [], false

Looks like the project node:

   +- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, intCol, 0, 1), IntegerType) AS intCol#6, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, boolCol, 0, 1), BooleanType) AS boolCol#8]

is added based on the schema stored in the view's desc (CatalogTable). So that the schema info (casing and nullability) is preserved.

But in the current implementation, there is no such project node based on the view's schema. Do you think we need to add a similar project node when the view's schema is provided?

Reviewer:

I think we should add this node for two reasons:
(1) Feature parity with the built-in Hive external catalog implementation.
(2) We have seen cases where resolving a view based only on its text loses casing and nullability information from the underlying tables.
This project node should leverage the schema() method of the View API.

Member Author:

Sounds good.

Could you point me to the code for the "built-in Hive external catalog implementation"?

@ljfgem commented Sep 29, 2022:

Hi @jzhuge, the built-in Hive external catalog adds the project node according to the schema in this method. I would suggest adding a similar project node here when the provided schema of the view is not null, like:

      if (desc.schema != null) {
        val projectList = {
          val viewColumnNames = desc.viewQueryColumnNames
          val nameToCounts = viewColumnNames.groupBy(identity).mapValues(_.length)
          val nameToCurrentOrdinal = scala.collection.mutable.HashMap.empty[String, Int]
          val viewDDL = {
            val viewName = desc.identifier
            val viewText = view.sql
            val userSpecifiedColumns =
              if (desc.schema.fieldNames.toSeq == desc.viewQueryColumnNames) {
                ""
              } else {
                s"(${desc.schema.fieldNames.mkString(", ")})"
              }
            Some(s"CREATE OR REPLACE VIEW $viewName $userSpecifiedColumns AS $viewText")
          }

          viewColumnNames.zip(desc.schema).map { case (name, field) =>
            val count = nameToCounts(name)
            val ordinal = nameToCurrentOrdinal.getOrElse(name, 0)
            nameToCurrentOrdinal(name) = ordinal + 1
            val col = GetViewColumnByNameAndOrdinal(desc.identifier, name, ordinal, count, viewDDL)
            Alias(UpCast(col, field.dataType), field.name)(explicitMetadata = Some(field.metadata))
          }
        }
        SubqueryAlias(name, View(desc, isTempView = false, Project(projectList, qualifiedChild)))
      } else {
        SubqueryAlias(name, View(desc, isTempView = false, qualifiedChild))
      }

case u @ UnresolvedRelation(nameParts, _, _) if v1SessionCatalog.isTempView(nameParts) =>
  u
case u @ UnresolvedRelation(
    parts @ NonSessionCatalogAndIdentifier(catalog, ident), _, _) if !isSQLOnFile(parts) =>
@ljfgem commented Aug 24, 2022:
Could we use CatalogAndIdentifier here instead of limiting it to non-session catalogs? There might be a use case where spark_catalog is set to a custom table & view catalog, like CoralSparkViewCatalog.


* @return the view description
* @throws NoSuchViewException If the view doesn't exist or is a table
*/
View loadView(Identifier ident) throws NoSuchViewException;
@wmoustafa commented Aug 29, 2022:
Should this API take a list of options (similar to how tables can be queried with a list of options)? For example, I can imagine cases where time travel can be applied to views (especially when integrating views with other data sources like Iceberg).

Member Author:

Similar to these loadTable variants? Sure:

  default Table loadTable(Identifier ident, String version)
  default Table loadTable(Identifier ident, long timestamp)
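The same default-method pattern could carry over to view time travel: overloads default to "unsupported" so existing implementations keep compiling. A hedged sketch with hypothetical names and a String stand-in for the real View return type:

```java
// Hypothetical time-travel overloads for a view catalog, modeled on the
// loadTable variants above. Defaults throw so that implementations that
// do not support time travel need not change.
interface TimeTravelViewCatalog {
    String loadView(String ident);  // stand-in for the real View return type

    default String loadView(String ident, String version) {
        throw new UnsupportedOperationException("Version lookup not supported");
    }

    default String loadView(String ident, long timestamp) {
        throw new UnsupportedOperationException("Timestamp lookup not supported");
    }
}

public class TimeTravelDemo {
    public static void main(String[] args) {
        // Lambda implements the single abstract method; defaults are inherited.
        TimeTravelViewCatalog c = ident -> "view:" + ident;
        System.out.println(c.loadView("db.v1"));  // prints "view:db.v1"
        try {
            c.loadView("db.v1", 12345L);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());   // prints "Timestamp lookup not supported"
        }
    }
}
```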


@jzhuge (Member, Author) commented Aug 29, 2022

Thanks @ljfgem @wmoustafa for the comments. Looking ...

I may also add a few unit tests to illustrate some scenarios.

@jzhuge (Member, Author) commented Oct 11, 2022 via email


override val schema: StructType = view.schema

override val viewText: Option[String] = None
Reviewer:

Isn't it supposed to be Option(view.sql)? Or how does it differ from the val sql below? I think we only need to keep one for callers.

Member Author:

You are right! And V2ViewDescription.sql can be removed.

Member Author:

How about changing ShowCreateV2View to take (Identifier, View) instead of V2ViewDescription?

Reviewer:

What is the benefit of such a modification? By the way, I found that the output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used. Maybe we can remove this parameter.

Member Author:

ShowCreateV2View is for v2 views, while V2ViewDescription is an adaptor between v1 and v2, so it makes more sense for ShowCreateV2View not to use V2ViewDescription and instead rely solely on v2 interfaces.

Member Author:

> I found that the output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used.

It is needed by the abstract class QueryPlan.

@ljfgem commented Nov 6, 2022:

Thanks for explaining. Do you think we also need to modify ShowViewPropertiesExec and DescribeViewExec, which use V2ViewDescription? It looks like using V2ViewDescription would bring some convenience, especially for DescribeViewExec. I am good with both approaches.

Member Author:

Indeed, V2ViewDescription is an API adaptor for easy cloning of v1 SQL commands and for adding missing APIs like comment or owner (stored in properties on disk), etc.

So keep using V2ViewDescription?

Reviewer:

Yeah, I am fine with keeping V2ViewDescription; I don't think the change is necessary.

@jzhuge jzhuge changed the title [SPARK-31357][SQL][WIP] Catalog API for view metadata [SPARK-31357][SQL][WIP] Support DataSource V2 View Catalog Jan 25, 2023
@github-actions bot commented May 6, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 6, 2023
@github-actions github-actions bot closed this May 7, 2023
@melin commented Oct 13, 2023

@jzhuge Do you have a plan to complete this PR?

@jzhuge (Member, Author) commented Oct 13, 2023 via email

7 participants