-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31357][SQL][WIP] Support DataSource V2 View Catalog #35636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jzhuge The PR as-is seems to only contain the view catalog interface, not the other 3 facets that you mentioned. Is this intentional? |
|
Sorry, not added yet.
On Tue, Mar 1, 2022 at 3:07 PM Erik Krogen ***@***.***> wrote:
@jzhuge <https://github.com/jzhuge> The PR as-is seems to only contain
the view catalog interface, not the other 3 facets that you mentioned. Is
this intentional?
—
Reply to this email directly, view it on GitHub
<#35636 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAOL5JBTAQ7T2PPHR73HSFLU52PJLANCNFSM5PFSWFAQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
John Zhuge
|
|
#28147 PR available? Why submit a new PR? |
2ae6f03 to
b468d01
Compare
|
Retest please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change needed? It looks suspicious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compile error without the change:
[ERROR] /Users/jzhuge/Repos/upstream-spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1002: reference to View is ambiguous;
it is imported twice in the same scope by
import org.apache.spark.sql.connector.catalog._
and import org.apache.spark.sql.catalyst.plans.logical._
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. This avoids a conflicting import. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this change NoSuchNamespaceException? Can't the caller add cause = None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, we should be able to do without.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is necessary for Java code to call new NoSuchNamespaceException(namespace) without a cause.
The following constructor (default value None to cause) will not work for Java:
def this(namespace: Array[String], cause: Option[Throwable] = None)
How about having these 2 constructors?
def this(namespace: Array[String], cause: Throwable)
def this(namespace: Array[String])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this API should behave like the TableCatalog API. That is, the create/replace/createOrReplace methods should return a View object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a mix-in for atomic replace or atomic createOrReplace? The logic here is not atomic and so I don't see much value in having this method. I would rather the caller drop and create the view instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should return a View.
|
Does this PR need to update |
|
Likely. This PR will support creating v2 views. |
Thanks. |
|
Yes, many view DDL commands will be implemented, you can find a complete list in SPIP. |
| // The relation is a view, so we wrap the relation by: | ||
| // 1. Add a [[View]] operator over the relation to keep track of the view desc; | ||
| // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view. | ||
| SubqueryAlias(name, View(desc, false, qualifiedChild)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jzhuge, I found that with the original Hive session catalog, the resolved plan of the view select * from default.t is like:
'SubqueryAlias spark_catalog.default.test_view
+- View (`default`.`test_view`, ['intCol,'structCol,'boolCol])
+- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, intCol, 0, 1), IntegerType) AS intCol#6, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, boolCol, 0, 1), BooleanType) AS boolCol#8]
+- 'Project [*]
+- 'UnresolvedRelation [default, t], [], false
Looks like the project node:
+- 'Project [upcast(getviewcolumnbynameandordinal(`default`.`test_view`, intCol, 0, 1), IntegerType) AS intCol#6, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, structCol, 0, 1), StructField(doubleCol,DoubleType,true), StructField(stringCol,StringType,true)) AS structCol#7, upcast(getviewcolumnbynameandordinal(`default`.`test_view`, boolCol, 0, 1), BooleanType) AS boolCol#8]
is added based on the schema stored in the view's desc (CatalogTable). So that the schema info (casing and nullability) is preserved.
But in the current implementation, there is not such a project node based on the view's schema, do you think we need to add the similar project node if the view's schema is provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add this node for two reasons:
(1) For feature parity with the built-in Hive external catalog implementation.
(2) We have seen cases where merely resolving view based on text loses some casing and nullability information of underlying tables.
This project node should leverage the schema() method of the View API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
Could you point me to the code for "built-in Hive external catalog implementation"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jzhuge, built-in Hive external catalog adds the project node according to the schema in this method
I would suggest to add the similar project node here if the provided schema of the view is not null, like:
if (desc.schema != null) {
val projectList = {
val viewColumnNames = desc.viewQueryColumnNames
val nameToCounts = viewColumnNames.groupBy(identity).mapValues(_.length)
val nameToCurrentOrdinal = scala.collection.mutable.HashMap.empty[String, Int]
val viewDDL = {
val viewName = desc.identifier
val viewText = view.sql
val userSpecifiedColumns =
if (desc.schema.fieldNames.toSeq == desc.viewQueryColumnNames) {
""
} else {
s"(${desc.schema.fieldNames.mkString(", ")})"
}
Some(s"CREATE OR REPLACE VIEW $viewName $userSpecifiedColumns AS $viewText")
}
viewColumnNames.zip(desc.schema).map { case (name, field) =>
val count = nameToCounts(name)
val ordinal = nameToCurrentOrdinal.getOrElse(name, 0)
nameToCurrentOrdinal(name) = ordinal + 1
val col = GetViewColumnByNameAndOrdinal(desc.identifier, name, ordinal, count, viewDDL)
Alias(UpCast(col, field.dataType), field.name)(explicitMetadata = Some(field.metadata))
}
}
SubqueryAlias(name, View(desc, isTempView = false, Project(projectList, qualifiedChild)))
} else {
SubqueryAlias(name, View(desc, isTempView = false, qualifiedChild))
}
| case u @ UnresolvedRelation(nameParts, _, _) if v1SessionCatalog.isTempView(nameParts) => | ||
| u | ||
| case u @ UnresolvedRelation( | ||
| parts @ NonSessionCatalogAndIdentifier(catalog, ident), _, _) if !isSQLOnFile(parts) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use CatalogAndIdentifier here instead of limiting it to the non-session catalog? There might be the use case that spark_catalog is set to be custom table & view catalog, like CoralSparkViewCatalog
| // The relation is a view, so we wrap the relation by: | ||
| // 1. Add a [[View]] operator over the relation to keep track of the view desc; | ||
| // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view. | ||
| SubqueryAlias(name, View(desc, false, qualifiedChild)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add this node for two reasons:
(1) For feature parity with the built-in Hive external catalog implementation.
(2) We have seen cases where merely resolving view based on text loses some casing and nullability information of underlying tables.
This project node should leverage the schema() method of the View API.
| * @return the view description | ||
| * @throws NoSuchViewException If the view doesn't exist or is a table | ||
| */ | ||
| View loadView(Identifier ident) throws NoSuchViewException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this API take a list of options (similar to how tables can be queried with a list of options)? For example, I can imagine cases where time travel can be be applied to views (especially with integrating views with other data sources like Iceberg).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to these loadTable variants? Sure
default Table loadTable(Identifier ident, String version)
default Table loadTable(Identifier ident, long timestamp)
Should this API take a list of options (similar to how tables can be queried with a list of options)? For example, I can imagine cases where time travel can be be applied to views (especially with integrating views with other data sources like Iceberg).
|
Thanks @ljfgem @wmoustafa for the comments. Looking ... Also may add a few unit tests to illustrate some scenarios. |
|
Hi Melin,
I opened a new PR because #28147
<#28147> was closed and it was way out
of sync from the current main branch.
Thanks,
…On Fri, Mar 11, 2022 at 1:02 AM melin ***@***.***> wrote:
#28147 <#28147> PR available? Why
submit a new PR?
—
Reply to this email directly, view it on GitHub
<#35636 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAOL5JCGJGNK3D4FROQQWBTU7MDZLANCNFSM5PFSWFAQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
John Zhuge
|
|
|
||
| override val schema: StructType = view.schema | ||
|
|
||
| override val viewText: Option[String] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it supposed to be Option(view.sql)? Or how does it differ from the val sql below? I think we just need to keep one for callers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! And V2ViewDescription.sql can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing ShowCreateV2View to take (Identifier, View) instead of V2ViewDescription?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of doing such a modification? And by the way, I found output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used? Maybe we can remove this parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShowCreateV2View is for v2 view while V2ViewDescription is an adaptor between v1 and v2, so it makes more sense for ShowCreateV2View not to use V2ViewDescription, instead, relying on V2 interfaces solely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used?
It is needed by the abstract class QueryPlan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. Do you think we also need to modify ShowViewPropertiesExec and DescribeViewExec which uses V2ViewDescription? Looks like using V2ViewDescription would bring some convenience especially for DescribeViewExec. I am good with both approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, V2ViewDescription is an API adaptor for easy cloning of v1 SQL commands, and adding missing APIs like comment or owner (stored in properties on disk), etc.
So keep using V2ViewDescription?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I am fine with keep using V2ViewDescription, I don't think the change is necessary.
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
@jzhuge Have a plan to complete this PR? |
|
Yes
…On Thu, Oct 12, 2023 at 9:41 PM melin ***@***.***> wrote:
@jzhuge <https://github.com/jzhuge> Have a plan to complete this PR?
—
Reply to this email directly, view it on GitHub
<#35636 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAOL5JE3Q65KVSAQLBHQMW3X7DA6ZANCNFSM5PFSWFAQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
What changes were proposed in this pull request?
This PR adds support to load, create, alter, and drop views in DataSource V2 catalogs.
This is an umbrella PR that combines the following commits that probably should be merged one by one.
Why are the changes needed?
Support views stored in DataSourceV2 catalogs. Details in SPIP.
Does this PR introduce any user-facing change?
How was this patch tested?
New unit tests
Regression